I have a Dask script that converts a sas7bdat file to Parquet, using dask-yarn to deploy to a YARN cluster and dask_sas_reader for the conversion. dask_sas_reader in turn depends on pyreadstat to parse the file.
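As far as I can tell, pyreadstat only accepts local filesystem paths, not URLs; for reference, a minimal local sketch of the kind of call it exposes (the local path here is hypothetical, just to illustrate):

import pyreadstat

# Plain pyreadstat read of a local copy of the file; the path is
# hypothetical and only illustrates the call dask_sas_reader builds on.
df, meta = pyreadstat.read_sas7bdat("/tmp/airline.sas7bdat")
print(df.shape)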
In my script, I try to read the sas7bdat file directly from HDFS. I have verified the path through hdfs dfs -ls <path>, and it is correct.
from dask_sas_reader import sas
from dask.distributed import Client
from dask_yarn import YarnCluster
# Deploying to YARN cluster
cluster = YarnCluster(environment='demo.tar.gz')
cluster.scale(2)
# Connect to the cluster
client = Client(cluster)
dd_df = sas.dask_sas_reader("hdfs://dev/data/airline.sas7bdat"), blocksize=800000)
dd_df.compute().to_parquet("/root/airline.parquet")
client.shutdown()
cluster.shutdown()
However, when I run this, I keep getting a pyreadstat error:
24/02/22 15:31:20 INFO impl.YarnClientImpl: Submitted application application_1705685221030_296652
**ERROR: `/root/hdfs:/dev/data/airline.sas7bdat` is not a sas file or directory of sas files**
Traceback (most recent call last):
  File "dummy.py", line 19, in <module>
    dd_df = sas.dask_sas_reader("hdfs://dev/data/airline.sas7bdat", blocksize=800000)
  File "/root/miniconda3/envs/demo/lib/python3.8/site-packages/dask_sas_reader/sas.py", line 98, in dask_sas_reader
    ddf = dd.from_delayed(dfs)
  File "/root/miniconda3/envs/demo/lib/python3.8/site-packages/dask/dataframe/io/io.py", line 596, in from_delayed
    parent_meta = delayed(make_meta_util)(dfs[0]).compute()
IndexError: list index out of range
Why is the file automatically being looked up under my root directory? Is there any way I can read it directly from HDFS instead? Any help is appreciated!
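For reference, Dask's built-in readers do accept hdfs:// URLs through fsspec, so I would expect something similar to be possible here; a minimal sketch of that pattern, assuming a hypothetical CSV on the same cluster:

import dask.dataframe as dd

# Built-in Dask readers resolve hdfs:// URLs via fsspec (with pyarrow
# installed); the CSV path is hypothetical and only shows the pattern.
df = dd.read_csv("hdfs://dev/data/example.csv")
print(df.head())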