I'm currently implementing an ETL pipeline using Databricks Delta Live Tables. I specified the storage location as a folder in ADLS. When I run the pipeline and look at the files, the .snappy.parquet files that get saved to ADLS appear to contain garbled Unicode characters when I open them. I am using very small CSV files (around 5 rows each) that don't have any null values or special characters. Has anyone run into this issue / does anyone know how to solve this?
What I've tried:
Saving to a different ADLS location
- This still resulted in corrupt files in ADLS
Reading the Delta Live Table into a spark dataframe, then writing to ADLS
- This still resulted in corrupt files in ADLS
Changing cluster configuration
- This resulted in an Azure quota exceeded error
When I tried to view the underlying files of the Delta table directly, I encountered the same issue: the contents appear as unreadable Unicode characters.
The data appears as Unicode characters because of how it is stored. According to this, the underlying data of a Delta table is stored in the compressed Parquet file format, i.e., as .snappy.parquet files.
As per this, Parquet is a binary (rather than text-based) file format optimized for machines, so Parquet files aren't directly human-readable. That is likely why the data appears as Unicode characters above. So, if we want to view the data of a .snappy.parquet file, we can read it in Databricks using the code below:
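A minimal PySpark sketch, assuming a Databricks notebook where `spark` is already defined; the abfss:// path and file name are placeholders, so substitute your own container, storage account, and part file:

```python
# Read a single .snappy.parquet part file from ADLS with PySpark.
# Placeholder path: replace <container>, <storage_account>, and the file name.
parquet_path = (
    "abfss://<container>@<storage_account>.dfs.core.windows.net/"
    "dlt/tables/my_table/part-00000-xxxx.snappy.parquet"
)

df = spark.read.parquet(parquet_path)
df.show(truncate=False)  # rows are readable once Spark decodes the binary Parquet format
```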
Then we can view the data of the Delta table as shown below:
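A hedged example of reading the whole Delta table from the pipeline's storage location (again with a placeholder abfss:// path); Spark resolves the Delta transaction log and decodes the Parquet files for you:

```python
# Read the Delta table from its storage location.
# Point the placeholder path at the table folder, not at a single parquet file.
delta_path = (
    "abfss://<container>@<storage_account>.dfs.core.windows.net/dlt/tables/my_table"
)

delta_df = spark.read.format("delta").load(delta_path)
delta_df.show(truncate=False)

# Or, if the DLT pipeline registered the table in the metastore (table name is a placeholder):
# spark.sql("SELECT * FROM my_catalog.my_schema.my_table").show()
```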
Alternatively, read the file using Parquet reading tools or upload it to an online Parquet viewer as shown below:
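For a quick local check outside Databricks, a small sketch using pandas (which delegates to pyarrow or fastparquet for decoding); the file name is a placeholder for a part file downloaded from ADLS:

```python
import pandas as pd

# Placeholder file name: download one of the .snappy.parquet part files from ADLS first.
local_file = "part-00000-xxxx.snappy.parquet"

# pandas uses a Parquet engine (pyarrow/fastparquet) to decode the binary format.
df = pd.read_parquet(local_file)
print(df.head())
```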