Multi-line JSON files to NDJSON


Is there a Linux/macOS command to combine multiple multi-line JSON files into a single NDJSON file?

item1.json

{
  "type": "Feature",
  "version": "1.0.0",
  "id": "item1"
}

item2.json

{
  "type": "Feature",
  "version": "1.0.0",
  "id": "item2"
}

Wanted result: items.json

{"type": "Feature", "version": "1.0.0", "id": "item1"}
{"type": "Feature", "version": "1.0.0", "id": "item2"}

1 answer

Answered by pmf

In a valid JSON document,

  • whitespace characters that are used outside of strings are not significant to the data, and
  • control characters (from \u0000 to \u001f) that are used inside of strings must be escaped

Newline characters (\n, \u000a) fall into both categories, as do carriage return characters (\r, \u000d), if that matters. Inside strings they cannot appear in their plain form, and outside strings they are insignificant. You can therefore safely remove every occurrence with any capable tool, including JSON-agnostic ones, to bring a valid JSON file down to a single line.
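To illustrate the point, here is a minimal sketch using a hypothetical sample.json (not one of the files from the question) whose string value contains an escaped newline. The file name and the temporary directory are assumptions made for the demo only. Deleting the raw line feeds and carriage returns leaves the escaped \n inside the string untouched:

```shell
# Hypothetical sample file: the string value holds an escaped newline,
# i.e. a backslash followed by "n", not a raw line feed.
demo_dir=$(mktemp -d)
printf '{\n  "note": "line1\\nline2"\n}\n' > "$demo_dir/sample.json"

# Delete raw LF and CR characters; the two-character escape \n survives.
one_line=$(tr -d '\n\r' < "$demo_dir/sample.json")
printf '%s\n' "$one_line"
# prints: {  "note": "line1\nline2"}
```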

To create an NDJSON file out of many multi-line JSON files, a straightforward approach is a for loop over the JSON files that uses tr to delete the line breaks from each one, followed by a plain echo that emits the delimiting line break after each document:

for f in item*.json; do tr -d '\n' < "$f"; echo; done
{  "type": "Feature",  "version": "1.0.0",  "id": "item1"}
{  "type": "Feature",  "version": "1.0.0",  "id": "item2"}
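If the result should land in a file rather than on the terminal, the whole loop can be redirected. The sketch below makes two assumptions that are not part of the answer: it also deletes carriage returns, in case some inputs use CRLF line endings, and it writes to a hypothetical combined.ndjson. A target named items.json would itself match the item*.json glob, so it could be picked up as an input on this or a later run; choosing a name outside the pattern avoids that.

```shell
# Demo inputs in a temporary directory (stand-ins for the question's files).
demo_dir=$(mktemp -d)
printf '{\n  "type": "Feature",\n  "version": "1.0.0",\n  "id": "item1"\n}\n' > "$demo_dir/item1.json"
# Second file deliberately uses CRLF line endings.
printf '{\r\n  "type": "Feature",\r\n  "version": "1.0.0",\r\n  "id": "item2"\r\n}\r\n' > "$demo_dir/item2.json"

# Flatten each file to one line, then emit exactly one newline per document.
# The output name does not match item*.json, so it is never read as input.
for f in "$demo_dir"/item*.json; do
  tr -d '\r\n' < "$f"
  echo
done > "$demo_dir/combined.ndjson"

cat "$demo_dir/combined.ndjson"
```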

An easier approach that produces the same result is paste in serial mode (-s), which joins all lines of each input file into a single line using a delimiter. Set the delimiter to the empty string with -d '' to override the default TAB:

paste -sd '' item*.json
{  "type": "Feature",  "version": "1.0.0",  "id": "item1"}
{  "type": "Feature",  "version": "1.0.0",  "id": "item2"}
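A portability note, offered as a hedged sketch rather than a tested claim about every platform: POSIX defines the escape \0 in the delimiter list as the way to request an empty string, whereas the behavior of a literally empty list is left open, and some implementations (BSD/macOS paste, as far as I know) may reject -d ''. The variant below uses '\0' instead. The shortened sample files and the combined.ndjson output name are assumptions made for the demo.

```shell
# Demo inputs in a temporary directory (shortened stand-ins for the question's files).
demo_dir=$(mktemp -d)
printf '{\n  "id": "item1"\n}\n' > "$demo_dir/item1.json"
printf '{\n  "id": "item2"\n}\n' > "$demo_dir/item2.json"

# -s joins the lines of each file into one output line per file;
# '\0' in the delimiter list stands for the empty string (not a NUL byte).
paste -s -d '\0' "$demo_dir"/item*.json > "$demo_dir/combined.ndjson"

cat "$demo_dir/combined.ndjson"
```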

If you also want to compact the JSON documents by removing regular space characters, be aware that spaces fall only into the whitespace category, not the control character category. Only the spaces outside of strings (such as those used for indentation) can be deleted without logically altering the document. You are therefore better off using a proper JSON parser, which can reliably determine whether a given character in the representation is part of an encoded string. One such JSON-parsing CLI tool is jq, which comes with a dedicated --compact-output (or -c) flag:

jq -c . item*.json
{"type":"Feature","version":"1.0.0","id":"item1"}
{"type":"Feature","version":"1.0.0","id":"item2"}
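The difference from blindly deleting spaces shows up once a string value itself contains spaces. The following sketch assumes jq is installed and uses hypothetical sample files with a "title" field that does not appear in the question. Indentation is dropped while the spaces inside the string are kept:

```shell
# Demo inputs: one string value contains runs of spaces that must be preserved.
demo_dir=$(mktemp -d)
printf '{\n  "id": "item1",\n  "title": "two  spaces  inside"\n}\n' > "$demo_dir/item1.json"
printf '{\n  "id": "item2",\n  "title": "plain"\n}\n' > "$demo_dir/item2.json"

# jq parses each document and re-serializes it compactly, one per line.
# The output name does not match item*.json, so reruns never read it as input.
jq -c . "$demo_dir"/item*.json > "$demo_dir/combined.ndjson"

cat "$demo_dir/combined.ndjson"
# prints:
# {"id":"item1","title":"two  spaces  inside"}
# {"id":"item2","title":"plain"}
```

Because jq parses its input, it also fails with an error on a file that is not valid JSON instead of silently passing it through.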

Demo for jq on jqplay.org