I am looking for a way, through Hive, to take the following AVSC file content and externalize the nested record schema "RENTALRECORDTYPE" so it can be reused elsewhere.
{
  "type": "record",
  "name": "EMPLOYEE",
  "namespace": "",
  "doc": "EMPLOYEE is a person that works here",
  "fields": [
    {
      "name": "RENTALRECORD",
      "type": {
        "type": "record",
        "name": "RENTALRECORDTYPE",
        "namespace": "",
        "doc": "Rental record is a record that is kept on every item rented",
        "fields": [
          {
            "name": "due_date",
            "doc": "The date when item is due",
            "type": "int"
          }
        ]
      }
    },
    {
      "name": "hire_date",
      "doc": "Employee date of hire",
      "type": "int"
    }
  ]
}
This method of defining the schema works fine. I am able to issue the following HiveQL statement and the table is created successfully.
CREATE EXTERNAL TABLE employee
STORED AS AVRO
LOCATION '/user/dtom/store/data/employee'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/employee.avsc');
However, I want to be able to reference an existing schema instead of duplicating the record definition across multiple schemas. For example, instead of a single schema file, two AVSC files would be generated: rentalrecord.avsc and employee.avsc.
rentalrecord.avsc
{
  "type": "record",
  "name": "RENTALRECORD",
  "namespace": "",
  "doc": "A record that is kept for every rental",
  "fields": [
    {
      "name": "due_date",
      "doc": "The date on which the rental is due back to the store",
      "type": "int"
    }
  ]
}
employee.avsc
{
  "type": "record",
  "name": "EMPLOYEE",
  "namespace": "",
  "doc": "EMPLOYEE is a person that works for the VIDEO STORE",
  "fields": [
    {
      "name": "rentalrecord",
      "doc": "A rental record is a record on every rental",
      "type": "RENTALRECORD"
    },
    {
      "name": "hire_date",
      "doc": "Employee date of hire",
      "type": "int"
    }
  ]
}
In the above scenario, I want to externalize the RENTALRECORD schema definition and reuse it in employee.avsc and elsewhere.
When attempting to import the schema using the following two HiveQL statements, it fails…
CREATE EXTERNAL TABLE rentalrecord
STORED AS AVRO
LOCATION '/user/dtom/store/data/rentalrecord'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/rentalrecord.avsc');
CREATE EXTERNAL TABLE employee
STORED AS AVRO
LOCATION '/user/dtom/store/data/employee'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/employee.avsc');
The rentalrecord.avsc is imported successfully, but employee.avsc fails on the first field definition, the field of type "RENTALRECORD". Hive outputs the following error…
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered exception determining schema. Returning signal schema to indicate problem: "RENTALRECORD" is not a defined name. The type of the "rentalrecord" field must be a defined name or a {"type": ...} expression.)
My research tells me that the Avro format does support this form of schema reuse. So either I'm missing something, or this is something that is not supported through Hive.
Any help would be greatly appreciated.
I have defined an AVDL file with all the references and then used the avro-tools jar with the idl2schemata option to generate the AVSC files. The generated AVSC worked like a charm with Hive!!
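For reference, the AVDL approach might look something like the sketch below (the protocol name `Store` and the file name `store.avdl` are hypothetical; the records mirror the AVSC files above). Both records live in one IDL file, so `RENTALRECORD` can be referenced by name inside `EMPLOYEE`:

```
// store.avdl — hypothetical Avro IDL file defining both records once
protocol Store {
  /** A record that is kept for every rental */
  record RENTALRECORD {
    /** The date on which the rental is due back to the store */
    int due_date;
  }

  /** EMPLOYEE is a person that works for the VIDEO STORE */
  record EMPLOYEE {
    /** A rental record is a record on every rental */
    RENTALRECORD rentalrecord;
    /** Employee date of hire */
    int hire_date;
  }
}
```

Running `java -jar avro-tools-<version>.jar idl2schemata store.avdl <output-dir>` emits one .avsc file per named record, and each generated schema is self-contained (the full RENTALRECORD definition is inlined into the generated employee.avsc), which is presumably why Hive's AvroSerDe accepts it where the hand-written cross-file reference failed.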