Where is the flowers Parquet dataset in Databricks?

I am working through this notebook: https://databricks.com/notebooks/simple-aws/petastorm-spark-converter-pytorch.html

I tried running the first line

df = spark.read.parquet("/databricks-datasets/flowers/parquet") \
  .select(col("content"), col("label_index")) \
  .limit(1000)

However, I got this error:


 Path does not exist: dbfs:/databricks-datasets/flowers/parquet;

Where can I find the Parquet version of the flowers dataset on Databricks? FYI, I am working on the Community Edition.

There are 2 answers

Alex Ott (best answer)

This dataset was converted into Delta format, so the path is now /databricks-datasets/flowers/delta instead of /databricks-datasets/flowers/parquet, and you need to read it with the corresponding code:

df = spark.read.format('delta').load('/databricks-datasets/flowers/delta')

P.S. You can always use the %fs ls <path> command to see which files exist at a given path.

P.P.S. I'll ask for that notebook to be fixed, if possible.
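
As a rough sketch, the same check can be done from Python instead of the %fs magic: dbutils.fs.ls returns entries whose name ends in "/" for directories, so the listing can be used to pick whichever format is currently published. The helper names choose_format and load_flowers are hypothetical, and the sketch assumes the notebook-provided spark and dbutils objects.

```python
def choose_format(entry_names):
    """Pick a reader format from the names found in a directory listing.

    Hypothetical helper: prefers 'delta' over 'parquet' when both exist.
    """
    # Directory names from dbutils.fs.ls carry a trailing slash, e.g. "delta/".
    names = {name.rstrip("/") for name in entry_names}
    for fmt in ("delta", "parquet"):
        if fmt in names:
            return fmt
    raise FileNotFoundError("no delta/ or parquet/ directory in the listing")


def load_flowers(spark, dbutils, base="/databricks-datasets/flowers"):
    """Read the flowers dataset in whichever format is present under `base`."""
    fmt = choose_format(f.name for f in dbutils.fs.ls(base))
    return spark.read.format(fmt).load(f"{base}/{fmt}")
```

In a notebook this would be called as df = load_flowers(spark, dbutils).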

Taras

The label_index column was removed from the dataset. You can recreate it as follows:

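from pyspark.ml.feature import StringIndexer
# Map each distinct string in "label" to a numeric index in "label_index"
indexer = StringIndexer(inputCol="label", outputCol="label_index")
# df is the DataFrame read from /databricks-datasets/flowers/delta
indexed = indexer.fit(df).transform(df)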
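
Putting the two answers together, the failing cell from the question could look roughly like the sketch below. The function name load_flowers_sample and its parameters are hypothetical; it assumes the Delta table has the content and label columns mentioned above, and it only builds label_index when the column is missing. The indexer is fitted before limit so the index mapping is based on the full dataset rather than on the sample.

```python
def load_flowers_sample(spark, path="/databricks-datasets/flowers/delta", n=1000):
    """Read the Delta flowers table and return content + label_index for n rows."""
    df = spark.read.format("delta").load(path)

    if "label_index" not in df.columns:
        # Imported only when the column has to be rebuilt.
        from pyspark.ml.feature import StringIndexer

        indexer = StringIndexer(inputCol="label", outputCol="label_index")
        df = indexer.fit(df).transform(df)

    return df.select("content", "label_index").limit(n)
```

In the notebook, df = load_flowers_sample(spark) would then replace the original spark.read.parquet(...) statement.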