How to write data in the dataframe into single .parquet file(both data & metadata in single file) in HDFS?

9.2k views Asked by At

How to write data in the dataframe into single .parquet file(both data & metadata in single file) in HDFS?

df.show() --> 2 rows
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa| null| [3, 9, 15, 20]| | Ben| red| []|
+------+--------------+----------------+

df.rdd.getNumPartitions() - it has 1 partition

>>> df.rdd.getNumPartitions()

1

df.write.save("/user/hduser/data_check/test.parquet", format="parquet")

If I use the above command to create parquet file in HDFS, it is creating directory "payloads.parquet" in HDFS and inside that directory multiple files .parquet file, metadata file are getting saved.

Found 4 items

-rw-r--r-- 3 bimodjoul biusers 0 2017-03-15 06:47 
/user/hduser/data_check/test.parquet/_SUCCESS 
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_metadata 
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 06:47
/user/hduser/data_check/test.parquet/part-r-00000-f83a2ffd-38bb-4c76-9f4c-357e43d9708b.gz.parquet

How to write data in the dataframe into single .parquet file(both data & metadata in single file) in HDFSrather than folder with multiple files?

Help would be much appreciated.

2

There are 2 answers

1
sande On

Use coalesce(1) after write. it will solve your issue

df.coalesce(1).write
0
Hemlata Chelwani On

This should solve the problem.

df.coalesce(1).write.parquet(parquet_file_path)
df.write.mode('append').parquet("/tmp/output/people.parquet")