I need to automate a JSON-to-ORC conversion process. I almost got there using Apache's ORC-tools package, except that JsonReader doesn't handle the Map type and throws an exception. So the following works, but only for schemas that don't contain a Map type.
Path hadoopInputPath = new Path(input);
try (RecordReader recordReader = new JsonReader(hadoopInputPath, schema, hadoopConf)) { // throws when schema contains Map type
    try (Writer writer = OrcFile.createWriter(new Path(output), OrcFile.writerOptions(hadoopConf).setSchema(schema))) {
        VectorizedRowBatch batch = schema.createRowBatch();
        while (recordReader.nextBatch(batch)) {
            writer.addRowBatch(batch);
        }
    }
}
So I started looking into using Hive classes for the JSON-to-ORC conversion, which has the added advantage that in the future I could convert to other formats, such as Avro, with minor code changes. However, I am not sure what the best way to do this with Hive classes is. Specifically, it's not clear how to write an HCatRecord to a file, as shown below.
// Serializing a single record to an ORC Writable works
// (hCatRecord is a record deserialized as in the loop further down):
HCatRecordSerDe hCatRecordSerDe = new HCatRecordSerDe();
SerDeUtils.initializeSerDe(hCatRecordSerDe, conf, tblProps, null);
OrcSerde orcSerde = new OrcSerde();
SerDeUtils.initializeSerDe(orcSerde, conf, tblProps, null);
Writable orcOut = orcSerde.serialize(hCatRecord, hCatRecordSerDe.getObjectInspector());
assertNotNull(orcOut);

// Reading the Snappy-compressed JSON input line by line
// (jsonSerDe is initialized elsewhere):
InputStream input = getClass().getClassLoader().getResourceAsStream("test.json.snappy");
SnappyCodec compressionCodec = new SnappyCodec();
try (CompressionInputStream inputStream = compressionCodec.createInputStream(input)) {
    LineReader lineReader = new LineReader(new InputStreamReader(inputStream, Charsets.UTF_8));
    String jsonLine = null;
    while ((jsonLine = lineReader.readLine()) != null) {
        Writable jsonWritable = new Text(jsonLine);
        DefaultHCatRecord hCatRecord = (DefaultHCatRecord) jsonSerDe.deserialize(jsonWritable);
        // TODO: how do I write the ORC output to a file?
    }
}
Any ideas on how to complete the code above, or on simpler ways of doing JSON-to-ORC conversion, would be greatly appreciated.
Here is what I ended up doing, using Spark libraries per cricket_007's suggestion:
Maven dependency (with some exclusions to keep maven-duplicate-finder-plugin happy):
Java code synopsis:
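A minimal sketch of the Spark approach, not the exact code used here. It assumes the spark-sql artifact is on the classpath and a Spark version with the built-in ORC data source (2.3 or later; older releases may need spark-hive and enableHiveSupport()). The class name JsonToOrc, the method names, and the two-column schema with a Map column are hypothetical and chosen only for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class JsonToOrc {

    // Hypothetical schema for illustration: one scalar column and one Map column.
    static StructType exampleSchema() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("id", DataTypes.LongType, true),
            DataTypes.createStructField("attributes",
                DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true)
        });
    }

    // Reads line-delimited JSON with an explicit schema and writes it out as ORC.
    static void convert(SparkSession spark, StructType schema, String input, String output) {
        Dataset<Row> rows = spark.read().schema(schema).json(input);
        rows.write().mode(SaveMode.Overwrite).orc(output);
    }

    public static void main(String[] args) {
        // Local master is an assumption for a standalone run; args are input and output paths.
        SparkSession spark = SparkSession.builder()
            .appName("json-to-orc")
            .master("local[*]")
            .getOrCreate();
        try {
            convert(spark, exampleSchema(), args[0], args[1]);
        } finally {
            spark.stop();
        }
    }
}
```

Passing the schema explicitly is what makes a JSON object load as a Map column; with schema inference, Spark would treat it as a struct. Writing another format should mostly be a matter of swapping the final write call, for example format("avro").save(output) with the separate spark-avro module available.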
Hope this helps somebody to get started.