I am using HDP 2.6.5.0-292. In this environment, data is processed using both Hive and PySpark.
An insert through Hive SQL such as:

INSERT INTO table PARTITION (run_date='{run_date}') (...) VALUES (...)

produces files written by parquet-mr 1.8.1, while a PySpark DataFrame insert:

df.write.mode("append").insertInto(table)

produces files written by parquet-mr 1.6.0.
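For context, the full PySpark write path is roughly the following. This is a minimal sketch: the SparkSession setup is an assumption, and the placeholder DataFrame (an empty slice of the target table) stands in for the real upstream processing; the table name abc.xyz comes from the DDL further down.

from pyspark.sql import SparkSession

# Hive support is required so that insertInto resolves the metastore table.
spark = (SparkSession.builder
         .appName("parquet-version-repro")
         .enableHiveSupport()
         .getOrCreate())

# Placeholder: the real df comes from upstream processing and must match
# the table schema, with the partition column (run_date) as the last column.
df = spark.table("abc.xyz").limit(0)

# The insert exactly as used in our jobs; dynamic-partition settings
# (hive.exec.dynamic.partition.mode) may also be needed in some setups.
df.write.mode("append").insertInto("abc.xyz")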
How can I make both write paths produce parquet-mr 1.8.1?
The respective file footers show:

Hive insert:    creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
PySpark insert: creator: parquet-mr version 1.6.0 (build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb)
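For anyone wanting to reproduce this check: the creator string sits in the Parquet file footer. One way to read it, assuming pyarrow is available and a part file has been copied locally (e.g. via hdfs dfs -get; the file name here is a placeholder):

import pyarrow.parquet as pq

# created_by holds the writer identification string from the footer,
# e.g. "parquet-mr version 1.8.1 (build ...)".
meta = pq.ParquetFile("part-00000.parquet").metadata
print(meta.created_by)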
Please help to fix this issue.
This is what my SHOW CREATE TABLE output looks like:
CREATE EXTERNAL TABLE abc.xyz( ...)
PARTITIONED BY (run_date string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://...'
TBLPROPERTIES (
  'spark.sql.create.version'='2.2 or prior',
  'spark.sql.sources.schema.numPartCols'='1',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{...}',
  'spark.sql.sources.schema.partCol.0'='run_date',
  'transient_lastDdlTime'='1608546154')