Load a large JSON file using multi threading/

3.3k views Asked by At

I am trying to load a large 3 GB JSON file. Currently, with JQ utility I can load the entire file in nearly 40 mins. Now, I want to know how I can use parallelism/multi threading approach in JQ in order to complete the process in less amount of time. I am using v1.5

Command Used:

JQ.exe -r -s "map(.\"results\" | map({\"ID\": (((.\"body\"?.\"party\"?.\"xrefs\"?.\"xref\"//[] | map(select(ID))[]?.\"id\"?))//null), \"Name\": (((.\"body\"?.\"party\"?.\"general-info\"?.\"full-name\"?))//null)} | [(.\"ID\"//\"\"|tostring), (.\"Name\"//\"\"|tostring)])) | add[] | join(\"~\")" "\C:\InputFile.txt" >"\C:\OutputFile.txt"

My data:

{
  "results": [
    {
      "_id": "0000001",
      "body": {
        "party": {
          "related-parties": {},
          "general-info": {
            "last-update-ts": "2011-02-14T08:21:51.000-05:00",
            "full-name": "Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades",
            "status": "ACTIVE",
            "last-update-user": "TS42922",
            "create-date": "2011-02-14T08:21:51.000-05:00",
            "classifications": {
              "classification": [
                {
                  "code": "PENS"
                }
              ]
            }
          },
          "xrefs": {
            "xref": [
              {
                "type": "LOCCU1",
                "id": "X00893X"
              },
              {
                "type": "ID",
                "id": "1012227139"
              }
            ]
          }
        }
      }
    },
    {
      "_id": "000002",
      "body": {
        "party": {
          "related-parties": {},
          "general-info": {
            "last-update-ts": "2015-05-21T15:10:45.174-04:00",
            "full-name": "Innova Capital Sp zoo",
            "status": "ACTIVE",
            "last-update-user": "jw74592",
            "create-date": "1994-08-31T00:00:00.000-04:00",
            "classifications": {
              "classification": [
                {
                  "code": "CORP"
                }
              ]
            }
          },
          "xrefs": {
            "xref": [
              {
                "type": "ULTDUN",
                "id": "144349875"
              },
              {
                "type": "AVID",
                "id": "6098743"
              },
              {
                "type": "LOCCU1",
                "id": "1001210218"
              },
              {
                "type": "ID",
                "id": "1001210218"
              },
              {
                "type": "BLMBRG",
                "id": "10009050"
              },
              {
                "type": "REG_CO",
                "id": "0000068508"
              },
              {
                "type": "SMCI",
                "id": "13159"
              }
            ]
          }
        }
      }
    }
  ]
}

Can someone please help me which command I need to use in v1.5 in order to achieve parallelism/multithreading.

2

There are 2 answers

0
Eric Hartford On

For a file of this size, you need to stream the file in and process one item at a time. First seek to '"results": [' then use a function called something like 'readItem' that uses a stack to match braces, until your opening brace closes, appending each character to your buffer, then deserializes the item once the closing brace is found.

I recommend node.js + lodash for implementation language.

0
jq170727 On

Here is a streaming approach which assumes your 3GB data file is in data.json and the following filter is in filter1.jq:

  select(length==2)
| . as [$p, $v]
| {r:$p[1]}
| if   $p[2:6] == ["body","party","general-info","full-name"]       then .name = $v
  elif $p[2:6] == ["body","party","xrefs","xref"] and $p[7] == "id" then .id   = $v
  else  empty
  end      

When you run jq with

$ jq -M -c --stream -f filter1.jq data.json

jq will produce a stream of results with minimal details you need

{"r":0,"name":"Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades"}
{"r":0,"id":"X00893X"}
{"r":0,"id":"1012227139"}
{"r":1,"name":"Innova Capital Sp zoo"}
{"r":1,"id":"144349875"}
{"r":1,"id":"6098743"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"10009050"}
{"r":1,"id":"0000068508"}
{"r":1,"id":"13159"}

which you can convert to your desired format by using a second filter2.jq:

foreach .[] as $i (
     {c: null, r:null, id:null, name:null}

   ; .c = $i
   | if .r != .c.r then .id=null | .name=null | .r=.c.r else . end   # control break
   | .id   = if .c.id == null   then .id   else .c.id   end
   | .name = if .c.name == null then .name else .c.name end

   ; [.id, .name]
   | if contains([null]) then empty else . end
   | join("~")
)

which consumes the output of the first filter when run with

$ jq -M -c --stream -f filter1.jq data.json | jq -M -s -r -f filter2.jq

and produces

X00893X~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
1012227139~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
144349875~Innova Capital Sp zoo
6098743~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
10009050~Innova Capital Sp zoo
0000068508~Innova Capital Sp zoo
13159~Innova Capital Sp zoo

This might be all you need using just two jq processes. If you need more parallelism you could use the record number (r) as to partition the data and process the partitions in parallel. For example, if you save the intermediate output into a temp.json file

$ jq -M -c --stream -f filter1.jq data.json > temp.json

then you could process temp.json in parallel with filters such as

$ jq -M 'select(0==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result0.out &
$ jq -M 'select(1==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result1.out &
$ jq -M 'select(2==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result2.out &

and concatenate your partitions into a single result at the end if necessary. This example uses 3 partitions but you could easily extend this approach to any number of partitions if you need more parallelism.

GNU parallel is also a good option. As mentioned in the JQ Cookbook, jq-hopkok's parallelism folder has some good examples