I am trying to insert data into a partitioned Hive table using Spark by executing SQL that looks like 'insert into table_a (select *, to_date(created_at) from table_b)'.

Both tables are in JSON format, and the source data is GZIP-compressed and 1.2 GB in size. But the output table has grown to 34 GB, because Spark is not writing compressed files.
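A quick way to see this (a sketch, using the same Spark session that is built below) is to list the files backing the output table; uncompressed output shows up as plain part files with no .gz suffix:

from pyspark.sql.functions import input_file_name

# Quick check: list a few of the files that back the output table.
# Compressed JSON output would normally carry a .gz extension.
(sparkSession.table("temp.serv_test_out")
             .select(input_file_name().alias("file"))
             .distinct()
             .show(5, truncate=False))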

I have already tried the following configuration, and nothing seems to work:

from pyspark.sql import SparkSession

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .config("hive.exec.dynamic.partition", "true")
                .config("hive.exec.dynamic.partition.mode", "nonstrict")
                .config("spark.hadoop.mapred.output.compress", "true")
                .config("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
                .config("spark.hadoop.mapred.output.compression.type", "BLOCK")
                .config("spark.sql.json.compression.codec","gzip")
                .getOrCreate())
sparkSession.sql("set json.compression=GZIP")
sparkSession.sql("set spark.sql.json.compression.codec=GZIP")
df_load = sparkSession.sql("insert into temp.serv_test_out (select *,to_date(from_utc_timestamp(from_unixtime(time_stamp DIV 1000), 'IST')), lpad(hour(from_utc_timestamp(from_unixtime(time_stamp DIV 1000), 'IST')),2,'0') from temp.serv_test)")
df_load.show()
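For comparison, the JSON DataFrame writer does take an explicit compression option, and I would expect something like the sketch below to produce gzipped part files. But it sidesteps the INSERT into the partitioned Hive table that I actually need (the output path and the partition column name "dt" here are placeholders, not my real schema):

# Sketch of the DataFrame-writer path with explicit gzip compression.
df = sparkSession.sql("""
    select *,
           to_date(from_utc_timestamp(from_unixtime(time_stamp DIV 1000), 'IST')) as dt
    from temp.serv_test
""")

(df.write
   .mode("append")
   .partitionBy("dt")
   .option("compression", "gzip")   # JSON datasource option: none, gzip, bzip2, snappy, ...
   .json("/tmp/serv_test_out_json"))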

The source data I have is huge, and I don't want my partitioned table to be 30 times bigger. Any help would be appreciated.
