jq: insert new objects while reading inputs from json file and bash stdout


I want to insert new JSON objects between existing JSON objects using bash-generated UUIDs.

Input JSON file test.json:

{"name":"a","type":1}
{"name":"b","type":2}
{"name":"c","type":3}

Input bash command: uuidgen -r

Target output JSON:

{"id": "7e3ca7b0-48f1-41fe-9a19-092a62cba0dc"}
{"name":"a","type":1}
{"id": "3f793fdd-ec3b-4306-8153-12f3f9faf2c1"}
{"name":"b","type":2}
{"id": "cbcd759a-37e7-4da7-b7fe-7572f474ec31"}
{"name":"c","type":3}

A basic jq program to insert new objects ({"id"} is jq shorthand for {"id": .id}, which yields null here since the inputs have no id field):

jq -c '{"id"}, .' test.json

Output:

{"id":null}
{"name":"a","type":1}
{"id":null}
{"name":"b","type":2}
{"id":null}
{"name":"c","type":3}

My attempt at a jq program to insert UUIDs generated from bash:

jq -c '{"id" | input}, .' test.json < <(uuidgen)

I'm unsure how to handle the two inputs: the bash command used to create a value for the new object, and the input file to be transformed (with a new object inserted before each existing object).

I want to process both small and large JSON files, up to a few gigabytes each.

I would greatly appreciate a well-designed solution that scales to large files and performs the operations quickly and efficiently.

Thanks in advance.


There are 4 answers

peak (best answer):

If the input file is already well-formed JSONL, then a simple bash solution would be:

while IFS= read -r line; do
  printf '{"id": "%s"}\n' "$(uuidgen)"
  printf '%s\n' "$line"
done < test.json

This might well be the best trivial solution if test.json is very large and known to be valid JSONL.

If the input file is not already JSONL, you could still use the above approach by piping in the output of jq -c . test.json. And if read is too slow, you could apply the same text-processing approach with awk, as sketched below.
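
A minimal sketch of that awk variant, assuming the input is valid JSONL; note that awk still shells out to uuidgen once per line via getline, so the per-line process-spawn cost remains:

awk '{
  # draw a fresh UUID for this line
  cmd = "uuidgen"
  cmd | getline uuid
  close(cmd)   # close the pipe so the next line re-runs uuidgen
  printf "{\"id\": \"%s\"}\n", uuid
  print        # pass the original JSONL line through unchanged
}' test.json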

For the record, a single-call-to-jq solution along the lines you have in mind could be constructed as follows:

jq -n -c -R --slurpfile objects test.json '
  $objects[] | {"id": input}, .' <(while true ; do uuidgen ; done)

Obviously you cannot "slurp" the unbounded stream of uuidgen values; less obviously perhaps, if you were simply to pipe in the stream, the process will hang.

Charles Duffy:

Since @peak has already covered the jq side of the problem, I'm going to take a shot at doing this more efficiently using Python, still wrapped so it can be called in a shell script.

This assumes that your input is JSONL, with one document per line. If it isn't, consider piping through jq -c . before piping into the below.

#!/usr/bin/env bash

py_prog=$(cat <<'EOF'
import json, sys, uuid

for line in sys.stdin:
    print(json.dumps({"id": str(uuid.uuid4())}))
    sys.stdout.write(line)
EOF
)

python -c "$py_prog" <in.json >out.json
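
If the input is not already one document per line, a hedged variant of the invocation above normalizes with jq -c first, as suggested (in.json and out.json are the answer's placeholder names):

jq -c . in.json | python -c "$py_prog" >out.json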
peak:

If the input was not known in advance to be valid JSONL, one of the following bash+jq solutions might make sense since the overhead of counting the number of objects would be relatively small.

If the input is small enough to fit in memory, you could go with a simple solution:

n=$(jq -n 'reduce inputs as $in (0; .+1)' test.json)

for ((i=0; i < $n; i++)); do uuidgen ; done |
jq -n -c -R --slurpfile objects test.json '
  $objects[] | {"id": input}, .'

Otherwise, that is, if the input is very large, then one could avoid slurping it as follows:

n=$(jq -n 'reduce inputs as $in (0; .+1)' test.json)
jq -nc --rawfile ids <(for ((i=0; i < $n; i++)); do uuidgen ; done) '
  $ids | split("\n") as $ids
  | foreach inputs as $in (-1; .+1; {id: $ids[.]}, $in)
' test.json 
Charles Duffy:

Here's another approach, where jq handles its input as raw strings that have already been muxed by a separate copy of bash.

while IFS= read -r line; do
  uuidgen
  printf '%s\n' "$line"
done < test.json | jq -Rrc '({ "id": . }, input)'

It still has all the performance overhead of calling uuidgen once per input line (plus some extra overhead because bash's read operates one byte at a time) -- but it operates in a fixed amount of memory without needing Python.