I am trying to create a Hive external table on top of some Avro files that are generated using Spark/Scala. I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.
I created the Hive external table, and the DDL ran successfully. But when I query the data I am getting NULL for all the columns.
My problem is similar to this
Upon some research, I found out it might be a problem with the schema, but I couldn't find a schema file for these Avro files in that location.
I am pretty new to the Avro file type. Can someone please help me out here?
Below is my Spark code snippet where I have saved the data as Avro:
df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
Below is my Hive external table create statement:
create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';
Below is the result I am getting when I query the data:
select * from prod_order_avro
At the same time, when I read these Avro files into a DataFrame using Spark/Scala and print them, I get the proper result.
Below is the Spark code I used to read the data:
val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
My questions are:

- While creating these Avro files, do I need to change my Spark code to create schema files separately, or will the schema be embedded in the files? If it needs to be separate, how do I achieve that?
- If not, how do I create the Hive table so that the schema is retrieved from the files automatically? I read that in later versions Hive takes care of this by itself if the schema is present in the files.

Kindly help me out here.


Resolved this: it was a schema issue. The schema was not embedded in the Avro files, so I had to extract the schema using avro-tools and pass it while creating the table. It is working now. I followed the steps below:

1. Extracted a small amount of data from the Avro files stored in HDFS into a file on the local system. Below is the command used:
sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt
2. Used the avro-tools getschema command to extract the schema from this data:
avro-tools getschema /path/temp.txt
3. Copied the resulting schema (it will be in the form of JSON) into a new file with a .avsc extension and uploaded it to HDFS.
4. While creating the Hive external table, added the below property to it:
TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')