trck
is a tool to query TrailDBs for aggregate metrics based on individual user behavior.
Typical use cases:
- Calculating behavior-based KPIs (like bounce rates)
- Attributions
- Extracting features from discrete time series data for machine learning
Example queries:
- find number of cases where there is no event of type
page view
within 5 seconds of an event of typeclick
- find all user sessions that "bounced" for campaign X
- find number of user sessions that spend more than 5 minutes on a website
trck
is a domain specific language that defines a finite state machine1 to find patterns in data. These programs are compiled into highly optimized parallel native code.
trck
also includes a number of higher level, data-aware optimizations to make processing as efficient as possible, e.g. by looking ahead at the data and skipping parts that have no chance matching a condition altogether.
The data type trck
programs deal with is a sequence of timestamped events: a log file or a browsing history is a common example. trck
uses TrailDB as a storage engine that is optimized for storing large collections of such histories, in a way that makes it easy to process them one by one.
Of course, one can always use TrailDB API directly to analyze that data; however that is often cumbersome and error prone when you need to match complex patterns in trails. Especially when you need to keep track of current matching state for lots of trails simultaneously, when your data is split across multiple TrailDBs.
trck
approach is to explicitly define a state machine which is then executed for each trail separately. There is no global state you need to keep track of, you only focus on one trail at a time.
A trck
program can be considered a device that receives events one by one and reacts to them by transitioning to a different state.
The output of trck
program is generated by actions attached to state transitions. For example, you can increment a counter every time state machine transitions from state A to state B.
State machines are compiled to efficient machine code to ensure maximum performance. On top of that, trck
compiler applies a few high-level optimizations to skip processing parts of the trails when it has no effect on computation result. With other optimizations, like compressing states to maximize cache efficiency and multicore support, it makes possible for trck
programs to process millions of trails per second.
Matching trail patterns reliably can be very tricky because of a large number of edge cases; that's why trck
has a built-in unit test framework and a quickcheck-style property based testing library.
- Ubuntu / Debian Linux / OSX
- Debian packages
bash python (>=2.7) make python-ply jq libjudy-dev (>=1.0.5-5) libjson-c-dev libcmph-dev libc6-dev libtcmalloc-minimal4
- C compiler, tested with
gcc
andclang
.
On OSX, you'll need to install a few packages from brew
:
brew install homebrew/boneyard/judy msgpack google-perftools clang-omp
You'll also need to install TrailDB. As of time of writing, it is recommended to install TrailDB from source as the brew
version has a multithreading bug.
git submodule update --init --remote --recursive
make
Use trck
command line compiler from bin/
.
Example:
./bin/trck -c myprogram.tr -o matcher-traildb
That will compile your program to a binary matcher-traildb
that will dynamically linked to libtraildb
. That binary accepts TrailDB paths as positional arguments and prints results in JSON format to stdout. You can also compile a static binary by using --static
flag (currently Linux only).
You can specify program parameter values using --params file.json
, JSON file should contain a dictionary specifying values for every parameter. See Parameters section for more details.
You can specify output format using --output-format json|msgpack
. Currently only single result mode is supported for msgpack output; that means that you have to use merged results
mode if you use foreach
loops (see below).
You can also apply filters to traildb to select events a cookies to process. There are two kinds of filters:
- field filters. You can set field filter by passing
--filter
flag to a compiledtrck
program. Filter format TBD. Same filter will be applied to every trail. - time window filters. You can pass a path to a csv file using
--window-file
flag for a compiledtrck
program. Every line of the file contains 3 comma separated items:uuid
,start_timestamp
andend_timestamp
. For every trail with specifieduuid
, events having timestamp that doesn't satisfystart_timestamp <= X <= end_timestamp
are ignored. Trails that don't have an entry in the file are ignored entirely.
trck
programs are naturally highly parallelizable. Programs are compiled with OpenMP automatically, if available.
You can explicitly enable or disable OpenMP support using --use-openmp
and --no-openmp
flags to trck
compiler. You can control the number of OpenMP threads at run time for a compiled trck
program by setting OMP_NUM_THREADS
environment variable, as usual with OpenMP programs, and by default it will use one thread per core.
Matching rules describe state machines in a text form. State machines naturally consist of states and rules describing transitions between states.
Normally state transitions are triggered by events in the trail; therefore transition rules are boolean conditions that match event fields.
In trck
you can also trigger transitions on timeouts; they trigger when machine stays in state X for N seconds, without consuming any events.
The syntax is similar to Erlang/Prolog but with indentation-based block syntax aka off-side rule. If you are familiar with Erlang, matcher program is essentially a series of tail recursive message receiver loops, where "messages" are events in the trail.
trck
program consists of one or more blocks. Block has a name and a body, separated by keyword receive
. Body consists of multiple pattern matching clauses that are evaluated sequentially. Incoming event must match one of the clauses, and the corresponding action is executed.
[block_name] ->
receive
... [condition] -> [action] ...
... [condition] -> [action] ...
...
after [timeout] -> [action]
Example:
main ->
receive
type = "click", campaign_id = "A"-> yield $clicks
* -> repeat
This program calculates the number of "click" events for a campaign. receive
block can also be seen as an infinite loop, for every arriving event we have two pattern matching clauses here. Action here is the yield
statement, that increments a counter variable $clicks
every time event in a trail matches the condition.
repeat
action just proceeds to the next event, and matching process starts again from the first pattern matching clause.
Clauses must be exhaustive: if none of them match, you get a runtime error. Normally you'd have catch-all *->repeat
clause as the last one.
Let's look closer at the matching clause syntax.
...
type = "click", campaign_id = "A"-> yield $clicks, skip_dup
...
The part before the ->
arrow specifies matching conditions. A comma means boolean AND. There is no OR operator as of now, but you can emulate it by having multiple clauses.
Part after ->
arrow is clause action. It may contain yield
statements to produce results (like counting events). It also has to contain information on where to go next: it can be either repeat
to continue with the next event in this block, or it can be a block name like skip_dup
above. In that case, matching will continue with the next event in the trail in the specified block. Another special action is quit
that terminates state machine, in case if you don't want to process any more events in this trail.
You can also limit receive block duration by specifying windows using after
clause. In this example, we'll again count "click" events but will also skip 3 seconds after every click event we encounter, to discard potential duplicates.
main ->
receive
type = "click", campaign_id = "A"-> yield $clicks, skip_dup
* -> repeat
skip_dup ->
receive
* -> repeat
after 3s -> main
Here, after the first clause in main
matches, we yield a result, and then go to skip_dup
block that just loops over until after 3 seconds, then goes back to the main
block.
after
clauses cannot use repeat
as a transition, and naturally there is no matching condition besides time window specification. Time window can be specified as number and time unit (day/hour/minute/second), e.g. 31h
, 3d
, 16m
, 53s
.
You can nest blocks within a parent window
block. This is useful when you need to use multiple blocks to implement a matching pattern, but there is "global" expiration period that needs to cover multiple blocks.
The syntax is window ... after
:
parentblock1 ->
window
block1 ->
... clauses ...
block2 ->
... clauses ...
after 30d -> someblock
Nested blocks can contain transitions to blocks that are outside the window, but not vice versa.
I.e. this is legal, block1
can transition to foo
:
...
foo ->
...
parentblock1 ->
window
block0 ->
... clauses ...
block1 ->
type = "X" -> foo
block2 ->
... clauses ...
after 30d -> someblock
But you can't go from foo
to block1
, this code won't compile:
foo ->
receive
type = "X" -> block1
* -> repeat
parentblock1 ->
window
block0 ->
... clauses ...
block1 ->
... clauses ...
block2 ->
... clauses ...
after 30d -> someblock
Jobs may use parameters that are passed externally to them in matching conditions:
main ->
receive
type = "click", campaign_id = %id -> yield $clicks
type = "click", campaign_id in #foo -> yield $clicks
* -> repeat
%id
and #foo
can be passed at run time to the compiled program. Parameters can be scalars like above, or sets of values. Set parameter names start with #
, scalar names start with %
.
You can only compare parameters to field values, not other parameters or constants. That limitation allows the compiler to apply very aggressive optimization techniques. Note that parameters also have (implicit) field "types", e.g. you cannot compare %id
to campaign_id
field value in one clause and compare it to user_id
field value in another. You can work around this by simply having two separate parameters.
For batch processing, you often need to run the program for a number of values; for example, finding number of clicks for every campaign. That, of course, could be done by making campaign id a parameter and running program N times for every parameter value. However, that may be pretty inefficient especially if you have more than a few values.
That's where foreach
statement is useful. You can create a top level for-loop that loops through values and executes the program for every value, then separate result set is produced for each iteration. That is, in above example, every campaign would get a separate conversion counter. This is the same as if you ran your program N times, once for each campaign, except performance is much better.
Syntax is
foreach %bar,#something,%foo in @param
block1 ->
receive
....
...
Here @param
is specified in parameter config file (see --params
option for compiled binary), and contains a list of tuples, each item type should match the corresponding variable.
In the above example, first and third items of every tuple must be strings and the second item is a set (encoded in JSON as array):
{
"@param" : [
["A", ["item1", "item2"], "B"]
["Z", ["item1", "item3"], "C"]
]
}
If you use quit
action within foreach
statement, it will only stop trail processing for this specific parameter set (tuple).
One can think of foreach
as a state machine cloning operation; without foreach
you have a separate instance of your state machine for each trail, with separate state. With foreach
and N tuples as parameters, you'll instead have N instances of your state machine for every trail. Therefore quit
statement only terminates one of these state machines.
Sometimes you just want to run foreach
over a simple list of all values of some field occurring in the traildb. For example, run your query over all possible campaigns. Since extracting all of them manually and saving to a parameters config file can be a bit tedious -- all required information is already in traildbs anyway -- there is a shortcut for this case:
foreach %cid
block1 ->
receive
campaign_id = %cid, ... -> ...
....
...
You can omit array parameter name in the foreach
clause, in that case, matcher will extract list of possible field values from traildbs and will loop over them.
yield
is used to return results from the program. As discussed in the previous section, a separate result is returned for each iteration of foreach
. Meaning that if you want to run your program over multiple configurations passed in foreach
array, you'll get a separate result for each configuration.
If you want trck
to automatically merge these results, you can add merged results
qualifier to your foreach
loop:
foreach %aeid,%seid in @arr merged results
...
Simplest form of yield
is
yield $counter
Where $counter
is an output variable. This statement would increase counter named $counter
by one. There is no need to declare output variables in advance.
foreach %cid in @arr
start ->
receive
type = "cli", campaign_id = %cid -> yield $clicks
* -> repeat
This example would produce a separate counter for each item in @arr
, e.g.:
[
{"%cid" : "A", "$clicks" : 5},
{"%cid" : "B", "$clicks" : 3}
]
Another form of yield
allows you to return field values as a tuple. Syntax is
yield field1,field2 to #result
Where field1
, field2
are field names and #result
is an output variable. #result
is a set: that is, yielding the same
tuple of fields multiple times will produce only one copy in #result
. In JSON format as returned by trck
, variable #result
would be a list of strings, each string containing field set separated by commas.
Very similar to sets, can be used in the case when you don't only want distinct items but also their counts. Instead of producing an array of distinct items, you get a dictionary that contains items as keys and counts as values.
yield field1,field2 to &result
In JSON format as returned by trck
, variable &result
would look like this:
...
"&result" : { "foo" : 1, "bar" : 2}
...
Information on HLLs can be found here Sets and multisets can also be yielded as an HLL data structure encoded in hex using the syntax:
yield field1,field2 to ^result
In JSON format as returned by trck
, variable ^result
would look like this:
...
"^result": "0e01a7040001028738000101f80f0001049a21000101bc1200"
...
The HLL is encoded as follows:
0e 01 a7040001028738000101f80f0001049a21000101bc1200
-- -- ----------------------------------------------
^ ^ ^
| | |
(1)(2) (3)
1) HLL precision (p), bins = 2^p
2) HLL version:
0 = Empty
1 = Non-empty
3) HLL run-length encoded bins, RLE pairs can be stored in two or three bytes,
if the MSB is set, then two bytes were used to store the length.
foreach %aeid in @arr
start ->
receive
type = "cli", advertisable_eid = %aeid -> yield segment_eid,ad_eid to #ads
type = "imp", advertisable_eid = %aeid -> yield domain to &domains
type = "pxl", advertisable_eid = %aeid -> yield cookie to ^cookies
* -> repeat
As described above, this example produces a separate result for each item in @arr
, e.g.:
[
{"%aeid" : "A", "#ads" : ["seg1,ad1", "seg2,ad2"], "&domains" : {"example.com" : 12, "news.org" : 41}, "^cookies": "0e01a7040001028738000101f80f0001049a21000101bc1200"},
{"%aeid" : "B", "#ads" : ["seg1,ad2"], "&domains" : {"example.com" : 84, "news.org" : 11}, "^cookies": "0e01d128000102ba06000102b63e000101bc1200"}
]
You can call external functions when yield
ing to a set or a multiset; just drop a file with the same name as your .tr
script but with .tr.c
extension and you'll be able to call functions from that file in yield
statements. See test_ffi.tr
in test/tr
directory for an example.
foreach %cid in @arr
start ->
receive
type = "cli", campaign_id = %cid -> yield segment_eid,ad_eid to #ads
type = "imp", campaign_id = %cid -> yield domain to &domains
* -> repeat
As described above, this example produces a separate result for each item in @arr
, e.g.:
[
{"%cid" : "A", "#ads" : ["seg1,ad1", "seg2,ad2"], "&domains" : {"example.com" : 12, "news.org" : 41} },
{"%cid" : "B", "#ads" : ["seg1,ad2"], "&domains" : {"example.com" : 84, "news.org" : 11}}
]
If you used merged results
mode, you'll get a single result dictionary:
foreach %cid in @arr merged results
start ->
receive
type = "cli", campaign_id = %cid -> yield segment_eid,ad_eid to #ads
type = "imp", campaign_id = %cid -> yield domain to &domains
* -> repeat
Produces following output instead:
{"#ads": ["seg1,ad1", "seg2,ad2", "seg1,ad2"],
"&domains": {"example.com": 96, "news.org": 52}}
In addition to returning values of traildb fields in yield ... to
statements, you can also return special variables:
timestamp
contains current event timestamp in the trailcookie
contains current cookie as a hexadecimal stringstart_timestamp
contains timestamp when currentreceive
block was entered. Right now it is only possible to use this variable if that block has anafter
clause, i.e. block cannot be infinite.start_timestamp[label]
contains timestamp when parentwindow
block with the namelabel
was entered. It is only possible to use this variable within the referred block.
A program first parsed and converted to intermediate JSON-based format. trparser.py
does this and it is fairly independent of the rest of the pipeline.
Second, fsm2c.py
produces a C module and a header from that JSON representation. That contains state machine itself. That part is technically independent of the underlying storage and can be used with a storage engine other than traildb or even online event processing system that doesn't store entire trails.
Then C module produced by fsm2c.py
is compiled with match_traildb.c
which provides necessary primitive functions that are called by generated code to access data from traildbs and also implements efficient foreach
block execution, setting parameters, collecting results and printing them.
1 arguably not a finite state machine in a strict sense, but close enough