hjq lets you run jq programs across large volumes of JSON and Avro data using Hadoop MapReduce.
Consider this sample poker game dataset from the JSON Lines website:
{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
Suppose we want to count the number of wins for each player. With normal jq, we can just run jq '{name, size: .wins | length}' to get the transformation we want.
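Run against the sample data above, that program emits one object per player (shown compactly here; jq pretty-prints by default):

{"name": "Gilbert", "size": 2}
{"name": "Alexa", "size": 2}
{"name": "May", "size": 0}
{"name": "Deloise", "size": 1}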
However, suppose we have a gigantic dataset and running jq on a single machine is going to take too long. Instead, let's use hjq with the same jq program and blast a map-only job across our cluster:
hjq --mapper '{name, size: .wins | length}' --input /PATH/TO/INPUT --output /PATH/TO/OUTPUT
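When the job finishes, the output directory holds the transformed records. Assuming the job writes plain-text part files under the output path (the usual Hadoop convention, though the exact layout isn't spelled out here), you can spot-check them with:

hadoop fs -cat /PATH/TO/OUTPUT/part-*

and you should see the same objects as the plain jq run above.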
If you use a reducer, your mapper must spit out objects with "key" and "value" fields. Your reducer will then see a sequence of arrays of `{key, value}` objects, where every object in a given array shares the same key. For example, suppose we wanted to count the number of times each hand type ("straight", "one pair", etc.) won a game:
hjq --mapper '{key: .wins | .[] | .[0], value: 1}' --reducer '{hand: .[0].key, count: map(.value) | add}' --input /PATH/TO/INPUT --output /PATH/TO/OUTPUT
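To make the key/value contract concrete, here is how that job plays out on the sample data. The mapper emits one record per winning hand:

{"key": "straight", "value": 1}
{"key": "one pair", "value": 1}
{"key": "two pair", "value": 1}
{"key": "two pair", "value": 1}
{"key": "three of a kind", "value": 1}

(May contributes nothing, since her wins array is empty.) The framework then groups these records by key, so the reducer is handed one array per key, such as [{"key": "two pair", "value": 1}, {"key": "two pair", "value": 1}], and produces:

{"hand": "straight", "count": 1}
{"hand": "one pair", "count": 1}
{"hand": "two pair", "count": 2}
{"hand": "three of a kind", "count": 1}

(The order of the output records may vary across reducers.)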
hjq relies on jjq, a Java wrapper around the jq native library. You can build everything in one shot using `make`. This will produce a jar file containing everything you need; nothing needs to be installed on the individual nodes of your cluster.