I'm using Dask to read two Parquet files, compute them, and merge them. I'd like to run this on a Ray cluster instead, in the hope of speeding up the computation. Here is my current code:

import dask.dataframe as dd

def read_and_merge_parquets():
    # Load each Parquet file and materialize it as a pandas DataFrame
    df1 = dd.read_parquet(path='parquet1.parquet').compute()
    df2 = dd.read_parquet(path='parquet2.parquet').compute()
    # Left-join the two frames on their shared "id" column
    merged_df = df2.merge(df1, on="id", how="left")
    print(merged_df)

I've tried the following naive approach:

import ray

ray.init(address='auto')  # connect to the running Ray cluster

def read_and_merge_parquets():
    # Try to read each Parquet file as a Ray Dataset
    df1 = ray.data.read_parquet(paths='parquet1.parquet').compute()
    df2 = ray.data.read_parquet(paths='parquet2.parquet').compute()
    merged_df = df2.merge(df1, on="id", how="left")
    print(merged_df)

This fails with: AttributeError: 'Dataset' object has no attribute 'compute'. I've also tried the enable_dask_on_ray() function, but the script runs only on the head node and never uses the worker nodes.
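For reference, this is roughly the Dask-on-Ray variant I tried (a minimal sketch; the file names and the "id" join column are the same as above, and enable_dask_on_ray() comes from ray.util.dask):

import ray
import dask.dataframe as dd
from ray.util.dask import enable_dask_on_ray

ray.init(address='auto')  # connect to the existing Ray cluster
enable_dask_on_ray()      # route Dask's task graph through Ray's scheduler

def read_and_merge_parquets():
    # Same logic as the original Dask version; the .compute() calls
    # should now be executed by Ray rather than Dask's own scheduler
    df1 = dd.read_parquet(path='parquet1.parquet').compute()
    df2 = dd.read_parquet(path='parquet2.parquet').compute()
    merged_df = df2.merge(df1, on="id", how="left")
    print(merged_df)

read_and_merge_parquets()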

To summarize: I'd like to use a Ray cluster to read the two Parquet files, compute them, and merge them in parallel. I currently have access to three worker nodes. How should I change the code to achieve this?
