Where is flowers parquet dataset in Databricks

Question

394 views Asked by Sajeed At 10 December 2020 at 21:00

I am working through this notebook: https://databricks.com/notebooks/simple-aws/petastorm-spark-converter-pytorch.html

I tried running the first statement:

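from pyspark.sql.functions import col

# `spark` is the SparkSession that Databricks notebooks predefine
df = spark.read.parquet("/databricks-datasets/flowers/parquet") \
  .select(col("content"), col("label_index")) \
  .limit(1000)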

However, I got this error:

Path does not exist: dbfs:/databricks-datasets/flowers/parquet;

Where can I find the Parquet version of the flowers dataset on Databricks? FYI, I am working on the Community Edition.

There are 2 answers

Taras On 04 August 2022 at 14:17

The label_index column has been removed from the dataset. You can recreate it as follows:

from pyspark.ml.feature import StringIndexer

# Map each distinct string in "label" to a numeric index in "label_index"
indexer = StringIndexer(inputCol="label", outputCol="label_index")
indexed = indexer.fit(df).transform(df)

**Alex Ott** · Accepted Answer · 2021-01-03T13:20:31+00:00

This dataset was converted to Delta format, so the path is now /databricks-datasets/flowers/delta instead of /databricks-datasets/flowers/parquet, and you need to read it as Delta:

df = spark.read.format('delta').load('/databricks-datasets/flowers/delta')
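Combined with the other answer's note about label_index, the original notebook statement could be adapted roughly as below. This is only a sketch: load_flowers is a hypothetical helper name, and it assumes the Delta table has a string "label" column, as the StringIndexer snippet in the other answer implies.

```python
FLOWERS_DELTA_PATH = "/databricks-datasets/flowers/delta"  # path given in this answer


def load_flowers(spark, limit=1000, path=FLOWERS_DELTA_PATH):
    """Read the Delta copy of the flowers dataset and return content + label_index."""
    df = spark.read.format("delta").load(path)
    if "label_index" not in df.columns:
        # Assumption: a string "label" column exists (see the other answer).
        # Imported lazily so pyspark.ml is only needed when the column is missing.
        from pyspark.ml.feature import StringIndexer

        indexer = StringIndexer(inputCol="label", outputCol="label_index")
        df = indexer.fit(df).transform(df)
    return df.select("content", "label_index").limit(limit)
```

In a Databricks notebook this would be called as df = load_flowers(spark). Note that StringIndexer writes label_index as a double, so it may need a cast if later cells expect an integer label.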

P.S. You can always use the %fs ls <path> command to see which files exist at a given path.
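The same check can be done from Python before reading. A minimal sketch, assuming the listing function raises an exception for a missing path (which is how dbutils.fs.ls behaves in a Databricks notebook, as far as I know); first_listable is a hypothetical helper name, not a Databricks API.

```python
def first_listable(paths, ls):
    """Return the first path that `ls` can list, or None if none can be listed."""
    for path in paths:
        try:
            ls(path)  # expected to raise if the path does not exist
            return path
        except Exception:
            continue
    return None
```

In a notebook, something like first_listable(["/databricks-datasets/flowers/parquet", "/databricks-datasets/flowers/delta"], dbutils.fs.ls) would show which of the two locations is actually present.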

P.P.S. I'll ask for that notebook to be fixed, if possible.
