Hive insert creating different parquet than Spark insert


I am using HDP 2.6.5.0-292.

In this environment, data is processed with both Hive and pyspark.

An insert using Hive SQL such as:

INSERT INTO table PARTITION(run_date='{run_date}') (...)  VALUES(...)

creates parquet files written with parquet-mr 1.8.1, while a pyspark dataframe insert:

df.write.mode("append").insertInto(table)

produces files written with parquet-mr 1.6.0.
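
For completeness, a minimal, self-contained sketch of the pyspark write path (the dataframe and column names are placeholders, not my real schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("parquet-writer-version") \
    .enableHiveSupport() \
    .getOrCreate()

# Placeholder dataframe; insertInto matches columns by position,
# so the partition column (run_date) goes last.
df = spark.createDataFrame(
    [("a", 1, "2020-12-21")],
    ["col1", "col2", "run_date"],
)

df.write.mode("append").insertInto("abc.xyz")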

How can I make both write with parquet-mr 1.8.1? These are the creator strings I see in the footers of the two sets of parquet files:

creator:          parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
creator:          parquet-mr version 1.6.0 (build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb)
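
A minimal sketch of how I check the creator string, assuming pyarrow is available locally (the file path is just a placeholder for a part file copied out of HDFS):

import pyarrow.parquet as pq

# Read the parquet footer and print the writer version.
# The "created_by" field holds the same "creator" string shown above.
meta = pq.ParquetFile("/tmp/part-00000.parquet").metadata
print(meta.created_by)  # e.g. parquet-mr version 1.8.1 (build ...)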

Please help to fix this issue.

This is what my SHOW CREATE TABLE output looks like:

CREATE EXTERNAL TABLE abc.xyz( ...)
PARTITIONED BY (run_date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://...'
TBLPROPERTIES (
  'spark.sql.create.version'='2.2 or prior',
  'spark.sql.sources.schema.numPartCols'='1',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{...}',
  'spark.sql.sources.schema.partCol.0'='run_date',
  'transient_lastDdlTime'='1608546154')
