Given a file with line-delimited JSON records:
{"id": 1, "position": 1234}
{"id": 2, "position": 23}
{"id": 3, "position": 43}
{"id": 1, "position": 223}
I would like to compact such a file by keeping only the last record for each id. For the example above, the output should be:
{"id": 2, "position": 23}
{"id": 3, "position": 43}
{"id": 1, "position": 223}
tl;dr: Is there a uniq that works with line-delimited JSON (and is fast)?
The input files might contain 1 billion records, of which maybe 10-20% can be thrown away.
I tried various approaches:
Seen ids
Keep a set of "seen" ids in memory. This runs out of memory.
Sort and unique
Sort the file by "id" first (with a stable sort, so order is preserved), then run through the file again and keep only the last record of each id. This is like the usual Unix sort | uniq approach. Sorting is expensive here and maybe just too much work.
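A sketch of what this could look like with standard tools, assuming the id is regex-extractable and that decorating each line with its id is acceptable (the perl/awk glue and the file name are mine, not from the question):

perl -ne 'print "$1\t$_" if /"id":\s*(\d+)/' input.ldj \
  | sort -s -k1,1n \
  | awk -F'\t' '$1 != prev { if (NR > 1) print line } { prev = $1; line = $0 } END { if (NR) print line }' \
  | cut -f2-

The stable sort keeps records with the same id in file order, so the awk pass can print the last one of each run; sorting around a billion decorated lines is where the cost is, and the output ends up ordered by id rather than by original position.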
Extract offset and length information
Extract the offset, length, and id of every record from the file, e.g.
id  offset  length
1   0       27
2   27      25
3   52      25
1   77      26
and figure out which records belong to the compacted set. Then seek into the file and read those records out. The extraction of this information is reasonably fast, but the millions of seeks and reads needed to pull out the records slow this approach down.
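For concreteness, a hedged sketch of both halves (the file names, the awk/dd glue and the assumption of ASCII data with one-byte line endings are all mine; keep.tsv is assumed to already hold the offset/length pairs of the records that survive compaction):

# build the id / offset / length table; extraction like this is reasonably fast
awk 'BEGIN { OFS = "\t"; off = 0 }
     { len = length($0) + 1                      # +1 for the newline
       match($0, /"id": *[0-9]+/)
       id = substr($0, RSTART, RLENGTH); gsub(/[^0-9]/, "", id)
       print id, off, len
       off += len }' input.ldj > table.tsv

# pull out the surviving records: one dd call (and one seek) per record,
# and millions of those seeks are exactly what makes this slow
while read -r off len; do
    dd if=input.ldj bs=1 skip="$off" count="$len" status=none
done < keep.tsv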
What could be a better, faster (or fastest) approach? Are there existing tools that solve this kind of problem?
This problem can be solved by a three-step process: tac and sort -u are used to keep only the relevant lines. The overall process is quite efficient: steps 1 and 2 are parallelizable, and step 3 can be made fast.
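The steps are not spelled out above, so the following is only one plausible reading, assuming step 1 extracts id/line-number pairs, step 2 is the tac plus sort -u pass, and step 3 filters the original file by line number; the file names and the perl/awk glue are assumptions, only tac and sort -u come from the answer itself:

# step 1: annotate each record with its id and line number (id<TAB>lineno)
perl -ne 'print "$1\t$.\n" if /"id":\s*(\d+)/' input.ldj > ids.tsv

# step 2: reverse, keep the first entry per id (= the last occurrence in the
# original file), then collect the line numbers to keep in ascending order
tac ids.tsv | sort -s -u -k1,1n | cut -f2 | sort -n > keep.txt

# step 3: a single sequential pass over the original file, printing only kept lines
awk 'NR == FNR { keep[$1] = 1; next } FNR in keep' keep.txt input.ldj > compacted.ldj

With the example from the question this yields the records for ids 2, 3 and 1 (position 223) in their original relative order.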