How to convert a large jsonl file with unknown json properties into csv using Apache Beam, google dataflow and java
Here is my scenario:
- A large jsonl file is in google storage
- Json properties are unknown, so using Apache Beam's Schema can not be defined in Beam's pipeline.
- Use Apache beam, google dataflow and java to convert jsonl to csv
- Once transformation is done, store csv in google storage (same bucket where jsonl is stored)
- Notify by some means, like transformation_done=true if possible (rest api or event)
Any help or guidance would be helpful, as I am new to Apache beam, though I am reading the doc from Apache Beam.
I have edited the question with an example JSONL data
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
{"Name":"May", "Session":"2012B", "Score":"14", "Completed":"false"}
{"Name":"Deloise", "Session":"2012A", "Score":"19", "Completed":"true"}
While json key's are there in an input file but it's not known while transforming. I'll explain that by an example, suppose I have three clients and each got it's own google storage, so each upload their own jsonl file with different json properties.
Client 1: Input Jsonl File
{"city":"Mumbai", "pincode":"2012A"}
{"city":"Delhi", "pincode":"2012N"}
Client 2: Input Jsonl File
{"Relation":"Finance", "Code":"2012A"}
{"Relation":"Production", "Code":"20XXX"}
Client 3: Input Jsonl File
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
Question: How could I write A Generic beam pipeline which transforms all three as shown below
Client 1: Output CSV file
["city", "pincode"]
["Mumbai","2012A"]
["Delhi", "2012N"]
Client 2: Output CSV file
["Relation", "Code"]
["Finance", "2012A"]
["Production","20XXX"]
Client 3: Output CSV file
["Name", "Session", "Score", "true"]
["Gilbert", "2013", "24", "true"]
["Alexa", "2013", "29", "true"]
Edit: Removed the previous ans as questions have been modified with examples.
There is no generic way provided by anyone to achieve such result. You have to write the logic yourself depending on your requirements and how you are handling the pipeline.
Below there are some examples but you need to verify these for your case as I have only tried these on a small JSONL file.
TextIO
Approach 1
If you can collect the header value of the output csv then it will be much easier. But getting the header beforehand itself another challenge.
probable ways to get the header line(Only assumptions may not work in your case) :
Approach 2
This is a workaround I found for small JsonFiles(~10k lines). This below example may not work for large files.
FileIO
As mentioned by Saransh in comments by using FileIO all you have to do is read the JSONL line by line manually and then convert those into comma separated format.EG:
In the above examples I have used a
getCsvLinemethod(created for code usability) which takes a single line from the file and converts it into a comma separated format.To parse the JSON object I have used GSON.