I am trying to create a Hive external table on top of some Avro files generated using Spark (Scala). I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.
I created the Hive external table, and the statement ran successfully. But when I query the data, I get NULL for all the columns.
My problem is similar to this question. Upon some research, I found out it might be a problem with the schema, but I couldn't find a schema file for these Avro files in that location. I am pretty new to the Avro file type. Can someone please help me out here?
Below is the Spark code snippet where I saved the DataFrame as Avro:
df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
Below is my Hive external table create statement:
create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';
Below is the query I ran; every column comes back NULL:
select * from prod_order_avro
At the same time, when I read these Avro files into a DataFrame using Spark (Scala) and print them, I get the proper result. Below is the Spark code I used to read the data:
val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
My questions are:

- While creating these Avro files, do I need to change my Spark code to create schema files separately, or will the schema be embedded in the files? If it needs to be separate, how do I achieve that?
- If not, how do I create the Hive table so that the schema is retrieved from the files automatically? I read that in the latest versions Hive takes care of this issue by itself if the schema is present in the files.

Kindly help me out here.
Resolved this. It was a schema issue: the schema was not embedded in the Avro files, so I had to extract it using avro-tools and pass it while creating the table. It is working now. I followed the steps below:
1. Extracted a few records from the Avro files stored in HDFS into a file on the local system. Below is the command used for the same:

   sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt
2. Used the avro-tools getschema command to extract the schema from this data:

   avro-tools getschema /path/temp.txt
3. Copied the resulting schema (it will be in the form of JSON) into a new file with an .avsc extension and uploaded it into HDFS.
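For reference, a minimal sketch of what the extracted .avsc might look like (only two fields shown; the record name and the nullable-string types are assumptions, and the real field list comes from whatever avro-tools prints):

```json
{
  "type": "record",
  "name": "prod_order",
  "fields": [
    {"name": "ProductID",   "type": ["null", "string"], "default": null},
    {"name": "ProductName", "type": ["null", "string"], "default": null}
  ]
}
```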
4. While creating the Hive external table, added the below property to it:

   TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')
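Putting the steps together, the revised create statement would look something like the sketch below (the schema path is a placeholder). With avro.schema.url set, the column list can be omitted entirely, since the AvroSerDe derives the columns from the schema file:

```sql
CREATE EXTERNAL TABLE prod_order_avro
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs://path/schema.avsc');
```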