I'm using Dask to read two Parquet files, compute them, and merge them. I'd like to run this on a Ray cluster instead, in the hope of speeding up the computation. Here is my current code:

import dask.dataframe as dd

def read_and_merge_parquets():
    # Load each Parquet file and materialize it as a pandas DataFrame
    df1 = dd.read_parquet(path='parquet1.parquet').compute()
    df2 = dd.read_parquet(path='parquet2.parquet').compute()
    # Left-join the two frames on their shared "id" column
    merged_df = df2.merge(df1, on="id", how="left")
    print(merged_df)

I've tried the following naive approach:

import ray

ray.init(address='auto')  # connect to the running Ray cluster

def read_and_merge_parquets():
    # Try to read each Parquet file as a Ray Dataset
    df1 = ray.data.read_parquet(paths='parquet1.parquet').compute()
    df2 = ray.data.read_parquet(paths='parquet2.parquet').compute()
    merged_df = df2.merge(df1, on="id", how="left")
    print(merged_df)

This fails with: AttributeError: 'Dataset' object has no attribute 'compute'. I've also tried the enable_dask_on_ray() function, but the script runs only on the head node and never uses the worker nodes.
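For reference, this is roughly the Dask-on-Ray variant I tried (a minimal sketch; the file names and the "id" join column are the same as above, and enable_dask_on_ray() comes from ray.util.dask):

import ray
import dask.dataframe as dd
from ray.util.dask import enable_dask_on_ray

ray.init(address='auto')  # connect to the existing Ray cluster
enable_dask_on_ray()      # route Dask's task graph through Ray's scheduler

def read_and_merge_parquets():
    # Same logic as the original Dask version; the .compute() calls
    # should now be executed by Ray rather than Dask's own scheduler
    df1 = dd.read_parquet(path='parquet1.parquet').compute()
    df2 = dd.read_parquet(path='parquet2.parquet').compute()
    merged_df = df2.merge(df1, on="id", how="left")
    print(merged_df)

read_and_merge_parquets()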

To summarize: I'd like to use a Ray cluster to read the two Parquet files, compute them, and merge them in parallel. I currently have access to three worker nodes. How should I change the code to achieve this?
