I am trying to read data from json files in S3 into my Hive table. If the column names and json keys are same its all loading properly.
But now I want to read data in such a way that the nested json values goes into specific columns (For eg: for json

{"data1": {"key1": "value1"}}

I want the data1.key1 value to go into column named data1_key1; which I understand is achievable with SERDEPROPERTIES.

My next problem is there can be multiple json keys and I want the key names to be column values in my Hive table.

Also, depending upon those keys, the keys that go into other columns will also change.

For eg my json files will be either:

{"data1" : {"key1":"value1"}}


{"data2" : { "key2" : "value2"}}

This need to create a table as below:

col1 col2
data1 value1
data2 value2

Is this possible? If so how should it be done?

1 Answers

leftjoin On

You can do it using regular expressions. Define json column as string in table DDL and use regexp to parse it. Tested on your data example:


with your_table as ( --Replace this CTE with your table
select stack(2,
             '{"data1": {"key1": "value1"}}',
             '{"data2" : { "key2" : "value2"}}'
             ) as json

select regexp_extract(json,'^\\{ *\\"(\\w+)\\" *:', 1)    as col1, --capturing group 1 in a parenthesis START{spaces"(word)"spaces:
       regexp_extract(json,': *\\"(.+)\\" *\\} *\\}$', 1) as col2 --:spaces"(value characters)"spaces}spaces}END
  from your_table;

Result (see http://demo.gethue.com/hue/editor?editor=312993):


Read the comments in the code please. You can adjust this solution to fit your JSON. This approach allows to extract keys and values from JSON not knowing their names. json_tuple and get_json_object are not applicable in this case.

Alternatively you can use regexSerDe to do the same in the table DDL like in this answer: https://stackoverflow.com/a/47944328/2700344. For the RegexSerDe solution you need to write more complex single regexp containing one capturing group (in parenthesis) for each column.