I have a Dask script that converts a sas7bdat file to Parquet, using dask-yarn to deploy to a YARN cluster and dask_sas_reader for the conversion. dask_sas_reader in turn depends on pyreadstat to parse the file.
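As far as I can tell, pyreadstat only accepts local filesystem paths, not URLs; for reference, a minimal local sketch of the kind of call it exposes (the local path here is hypothetical, just to illustrate):

import pyreadstat

# Plain pyreadstat read of a local copy of the file; the path is
# hypothetical and only illustrates the call dask_sas_reader builds on.
df, meta = pyreadstat.read_sas7bdat("/tmp/airline.sas7bdat")
print(df.shape)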
In my script, I try to read the sas7bdat file directly from HDFS. I have verified the path through hdfs dfs -ls <path>, and it is correct.
from dask_sas_reader import sas
from dask.distributed import Client
from dask_yarn import YarnCluster
# Deploying to YARN cluster
cluster = YarnCluster(environment='demo.tar.gz')
cluster.scale(2)
# Connect to the cluster
client = Client(cluster)
dd_df = sas.dask_sas_reader("hdfs://dev/data/airline.sas7bdat"), blocksize=800000)
dd_df.compute().to_parquet("/root/airline.parquet")
client.shutdown()
cluster.shutdown()
However, when I run this, I keep getting a pyreadstat error:
24/02/22 15:31:20 INFO impl.YarnClientImpl: Submitted application application_1705685221030_296652
**ERROR: `/root/hdfs:/dev/data/airline.sas7bdat` is not a sas file or directory of sas files**
Traceback (most recent call last):
  File "dummy.py", line 19, in <module>
    dd_df = sas.dask_sas_reader("hdfs://dev/data/airline.sas7bdat", blocksize=800000)
  File "/root/miniconda3/envs/demo/lib/python3.8/site-packages/dask_sas_reader/sas.py", line 98, in dask_sas_reader
    ddf = dd.from_delayed(dfs)
  File "/root/miniconda3/envs/demo/lib/python3.8/site-packages/dask/dataframe/io/io.py", line 596, in from_delayed
    parent_meta = delayed(make_meta_util)(dfs[0]).compute()
IndexError: list index out of range
Why is the file automatically being looked up under my root directory? Is there any way I can read it directly from HDFS instead? Any help is appreciated!
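For reference, Dask's built-in readers do accept hdfs:// URLs through fsspec, so I would expect something similar to be possible here; a minimal sketch of that pattern, assuming a hypothetical CSV on the same cluster:

import dask.dataframe as dd

# Built-in Dask readers resolve hdfs:// URLs via fsspec (with pyarrow
# installed); the CSV path is hypothetical and only shows the pattern.
df = dd.read_csv("hdfs://dev/data/example.csv")
print(df.head())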