Use jq to count on multiple levels


We've discovered some domain names tied to infections. We now have a list of DNS lookups in a .json file, and I'd like to produce a summarized output showing, for each user (machine), the unique domains they visited and the total count. Bonus points if I can also get the count per domain name.

Here is a sample of the file:

{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071870}
{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071875}
{"machine": "possible_victim01", "domain": "soevil.com", "timestamp":1435071877}
{"machine": "possible_victim02", "domain": "bad.com", "timestamp":1435071877}
{"machine": "possible_victim03", "domain": "soevil.com", "timestamp":1435071879}

Ideally, I would like the output to be something like:

{"possible_victim01": "total": 3, {"evil.com": 2, "soevil.com": 1}}
{"possible_victim02": "total": 1, {"bad.com": 1}}
{"possible_victim03": "total": 1, {"soevil.com": 1}}

I would gladly settle for:

{"possible_victim01": "total": 3, ["evil.com", "soevil.com"]}
{"possible_victim02": "total": 1, ["bad.com"]}
{"possible_victim03": "total": 1, ["soevil.com"]}

I can get a total count of records per user, but I lose the list of domains:

cat sample.json | jq -s 'group_by(.machine) | map({machine:.[0].machine,domain:.[0].domain, count:length}) '
[{"machine": "possible_victim01", "domain": "evil.com", "count": 3},  
{"machine": "possible_victim02", "domain": "bad.com", "count": 1},
{"machine": "possible_victim03", "domain": "soevil.com", "count": 1}]

The post "JQ Aggregations and Crosstabs" describes how to solve the second half of the problem. I haven't yet found anything that describes the first half, that is, getting to:

{"machine": "possible_victim01", "domain": "evil.com", "count":2}
{"machine": "possible_victim01", "domain": "soevil.com", "count":1}
{"machine": "possible_victim02", "domain": "bad.com", "count":1}
{"machine": "possible_victim03", "domain": "soevil.com", "count":1}

There are 3 answers

Jimmy On BEST ANSWER

You need to do group_by twice: once to group by the machine name, and then a sub-grouping within each machine to get the sub-counts for each domain.

jq query:

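# Run with -s so that all input lines are slurped into one array.
group_by(.machine) | map({
    "machine": .[0].machine,
    "total": length,
    # Sub-group this machine's records by domain to get per-domain counts.
    "domains": (group_by(.domain) | map({
        "key": .[0].domain,
        "value": length}) | from_entries
    )
}) | .[]   # unwrap the array so each machine is emitted as its own object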

Example output:

{
  "machine": "possible_victim01",
  "total": 3,
  "domains": {
    "evil.com": 2,
    "soevil.com": 1
  }
}
{
  "machine": "possible_victim02",
  "total": 1,
  "domains": {
    "bad.com": 1
  }
}
{
  "machine": "possible_victim03",
  "total": 1,
  "domains": {
    "soevil.com": 1
  }
}
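
The flat per-machine, per-domain records mentioned in the question can be sketched in the same style by grouping on a composite key instead of nesting two group_by calls. This is only a minimal sketch: sample.json is the file name used in the question, pairs.jq is a hypothetical file name chosen for illustration, and the here-documents just recreate the sample so the snippet is self-contained.

```shell
# Recreate the sample data from the question (file name as used there).
cat > sample.json <<'EOF'
{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071870}
{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071875}
{"machine": "possible_victim01", "domain": "soevil.com", "timestamp":1435071877}
{"machine": "possible_victim02", "domain": "bad.com", "timestamp":1435071877}
{"machine": "possible_victim03", "domain": "soevil.com", "timestamp":1435071879}
EOF

# pairs.jq is a hypothetical file name for this sketch.
cat > pairs.jq <<'EOF'
# Group on the pair [machine, domain]; each group is one distinct pair.
group_by([.machine, .domain])
| map({machine: .[0].machine, domain: .[0].domain, count: length})
| .[]
EOF

# -s slurps the lines into one array, -c prints one compact object per line.
jq -c -s -f pairs.jq sample.json
```

With the sample data this should print four lines, one per distinct machine/domain pair, with evil.com counted twice for possible_victim01.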
jq170727 On

Here is a solution using reduce, getpath and setpath:

reduce .[] as $o (
  {}
; [$o.machine, "total"] as $p1
| [$o.machine, "domains", $o.domain] as $p2
| setpath($p1; 1+getpath($p1))
| setpath($p2; 1+getpath($p2))
)

If filter.jq contains this filter and data.json contains the sample data then the command

$ jq -M -s -f filter.jq data.json

produces

{
  "possible_victim01": {
    "total": 3,
    "domains": {
      "evil.com": 2,
      "soevil.com": 1
    }
  },
  "possible_victim02": {
    "total": 1,
    "domains": {
      "bad.com": 1
    }
  },
  "possible_victim03": {
    "total": 1,
    "domains": {
      "soevil.com": 1
    }
  }
}
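
The reduce filter in this answer never initializes a counter. That appears to rely on two jq behaviours: getpath returns null for a path that does not exist yet, and adding a number to null yields the number. A quick check using only jq builtins (the path names are just the ones from the sample data):

```shell
# getpath on a missing path yields null rather than an error.
jq -n -c '{} | getpath(["possible_victim01", "total"])'

# null acts as the identity for addition, so the first increment gives 1.
jq -n -c '1 + null'

# setpath creates the intermediate objects it needs along the way.
jq -n -c '{} | setpath(["possible_victim01", "total"]; 1 + getpath(["possible_victim01", "total"]))'
```

The three commands print null, 1 and {"possible_victim01":{"total":1}} respectively, which is why the first record seen for a machine or domain starts its count at 1.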
peak On

Using group_by in the manner described is fine, but it requires the whole input to be slurped into memory first. If you have a very large number of lines (i.e. JSON entities) to read, as the sample provided suggests, then you may run into performance issues and/or capacity constraints.

These issues can be resolved very effectively in any version of jq with the "inputs" builtin (e.g. jq 1.5rc1), which lets the program consume the entities one at a time.

Please note that when using "inputs" you would invoke jq with the -n option, like this:

jq -n -f program.jq data.json
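
The reason -n matters can be seen with a small counting filter. This is an illustrative sketch, not part of the solution; it assumes data.json holds the five sample records from the question, which the here-document recreates.

```shell
# Recreate the sample data (five records) so the snippet is self-contained.
cat > data.json <<'EOF'
{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071870}
{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071875}
{"machine": "possible_victim01", "domain": "soevil.com", "timestamp":1435071877}
{"machine": "possible_victim02", "domain": "bad.com", "timestamp":1435071877}
{"machine": "possible_victim03", "domain": "soevil.com", "timestamp":1435071879}
EOF

# With -n, nothing is read up front, so inputs yields all five records.
jq -n 'reduce inputs as $line (0; . + 1)' data.json   # prints 5

# Without -n, jq reads the first record as "." before the program runs,
# so inputs only sees the remaining four and the first record is missed.
jq 'reduce inputs as $line (0; . + 1)' data.json      # prints 4
```

In other words, omitting -n would silently drop the first record from the totals.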

Please note also that it is preferable here to produce JSON output, and the following seems to be close to what is wanted:

{"possible_victim01": { "total": 3, "evildoers": {"evil.com": 2, "soevil.com": 1} },
 "possible_victim02": ...}

The following program could be made more concise but the presentation here is intended to make the process transparent, assuming a basic understanding of jq. If there is magic here, it is that one does not have to make a special case of "null".

reduce inputs as $line
  ({};
   . as $in
   | ($line.machine) as $machine
   | ($line.domain) as $domain
   | ($in[$machine].evildoers ) as $evildoers
   | . + { ($machine): {"total": (1 + $in[$machine]["total"]),
                        "evildoers": ($evildoers | (.[$domain] += 1)) }} )

Using the sample input provided, the output is:

{
  "possible_victim01": {
    "total": 3,
    "evildoers": {
      "evil.com": 2,
      "soevil.com": 1
    }
  },
  "possible_victim02": {
    "total": 1,
    "evildoers": {
      "bad.com": 1
    }
  },
  "possible_victim03": {
    "total": 1,
    "evildoers": {
      "soevil.com": 1
    }
  }
}
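
If one line per machine is preferred, as in the question's desired output, the combined object can be split afterwards. The following is a minimal sketch; the summary variable is a hypothetical stand-in holding a cut-down copy of the object shown above, and in practice the output of the reduce program would be piped in instead.

```shell
# Cut-down copy of the summary object, used as stand-in input.
summary='{"possible_victim01":{"total":3,"evildoers":{"evil.com":2,"soevil.com":1}},"possible_victim02":{"total":1,"evildoers":{"bad.com":1}}}'

# to_entries turns the object into key/value pairs; each pair is rebuilt
# as its own single-key object, and -c prints one object per line.
printf '%s\n' "$summary" | jq -c 'to_entries[] | {(.key): .value}'
```

Each output line is then valid JSON on its own, keyed by the machine name.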