I created a Parquet dataset partitioned as follows:
2019-taxi-trips/
    month=1/
        data.parquet
    month=2/
        data.parquet
    ...
    month=12/
        data.parquet
This layout follows the Hive-style partitioning convention for Parquet datasets (key=value directory names). The partitions were generated by hand, so there is no _metadata file anywhere in the directory tree.
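For context, here is a minimal sketch of how a layout like this might be produced by hand; the trips DataFrame and its "month" column are hypothetical stand-ins, not part of the actual dataset:

import os

# Hypothetical sketch: write one file per month into Hive-style
# month=<n>/ directories. Assumes a pandas DataFrame `trips` with a
# "month" column; the variable and column names are illustrative.
for month, group in trips.groupby("month"):
    partition_dir = f"2019-taxi-trips/month={month}"
    os.makedirs(partition_dir, exist_ok=True)
    group.drop(columns=["month"]).to_parquet(
        os.path.join(partition_dir, "data.parquet"),
        engine="fastparquet",
        index=False,
    )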
I would like to now read this dataset into Dask.
With data located on local disk, the following code works:
import dask.dataframe as dd

dd.read_parquet(
    "/Users/alekseybilogur/Desktop/2019-taxi-trips/*/data.parquet",
    engine="fastparquet"
)
I copied these files to an S3 bucket (via aws s3 sync; the partition folders are top-level keys in the bucket) and attempted to read them from cloud storage using the same basic function:
import dask.dataframe as dd

dd.read_parquet(
    "s3://2019-nyc-taxi-trips/*/data.parquet",
    storage_options={
        "key": "...",
        "secret": "..."
    },
    engine="fastparquet"
)
This raises IndexError: list index out of range. Full stack trace here.
Is it not currently possible to read such a dataset directly from AWS S3?
There is currently a bug in fastparquet that prevents this code from working; see Dask GH#6713 for details. In the meantime, until that bug is resolved, one easy workaround is to use the pyarrow backend instead.
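For example, a minimal sketch of the workaround, reusing the bucket and credential placeholders from the question (this assumes pyarrow and s3fs are installed in the environment):

import dask.dataframe as dd

# Same call as before, but with the pyarrow engine instead of fastparquet.
dd.read_parquet(
    "s3://2019-nyc-taxi-trips/*/data.parquet",
    storage_options={
        "key": "...",
        "secret": "..."
    },
    engine="pyarrow"
)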