ananis25/seqmatcher

seqmatcher

seqmatcher provides a DSL to match and edit sequences of events. Just as regular expressions match patterns in text (a stream of characters), the same idea can be applied to a collection of sequences (streams of events). This is a total ripoff of the work done here by Mikhail Panko.
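
To make the regex analogy concrete, here is a minimal sketch that uses only Python's standard `re` module. It is not seqmatcher's DSL or API: the event names, the one-character `alphabet` mapping, and the `find_matches` helper are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical event sequences: each one is a list of event names.
sequences = [
    ["login", "search", "add_to_cart", "checkout"],
    ["login", "search", "search", "logout"],
]

# Assumption: map every event name to a single character, so that a
# sequence of events becomes a string a regular expression can scan.
alphabet = {
    "login": "L",
    "search": "S",
    "add_to_cart": "A",
    "checkout": "C",
    "logout": "O",
}


def encode(seq):
    """Turn a list of event names into a string of one-character codes."""
    return "".join(alphabet[event] for event in seq)


def find_matches(pattern, seqs):
    """Return (sequence index, start, stop) for every match of the pattern."""
    found = []
    for i, seq in enumerate(seqs):
        for m in re.finditer(pattern, encode(seq)):
            found.append((i, m.start(), m.end()))
    return found


# "One or more searches, immediately followed by add_to_cart".
matches = find_matches(r"S+A", sequences)
```

Each match is reported as a position range inside a sequence, so the matched events can be recovered with `sequences[i][start:stop]`.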

The original notebooks introduce the semantics for the regex-like syntax and implement it as JavaScript code over lists of objects. Without a JIT like V8, it would be pretty slow to execute the same code in Python. So instead, we:

  • persist the dataset in the parquet format so it can be read quickly.
  • read it using the awkward library, which supports jagged arrays and optional datatypes.
  • compile the pattern matching routines at runtime using the numba library and run them against the awkward array data.

Performance wins:

  • Numba compiles Python functions to machine code through LLVM, so the compiled code runs pretty quickly.
  • Awkward arrays are immutable and store all attributes, including nested ones, in contiguous buffers. So matching and extracting subsequences copies very little data; it just records slices of the original arrays to use as output.
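
The "record slices instead of copying" idea can be sketched with plain standard-library buffers. This is only an illustration of the layout, not how seqmatcher or awkward are implemented: the integer event codes, the `offsets` buffer, and the `match_pair` helper are hypothetical.

```python
from array import array

# All events of all sequences live in one contiguous buffer of integer
# event codes (hypothetical codes: 0=login, 1=search, 2=buy, 3=logout).
event_codes = array("i", [0, 1, 2, 0, 1, 1, 3])

# Sequence i spans event_codes[offsets[i]:offsets[i + 1]].
# Here: sequence 0 = [0, 1, 2], sequence 1 = [0, 1, 1, 3].
offsets = array("i", [0, 3, 7])


def match_pair(first, second):
    """Find `first` directly followed by `second` within a sequence.

    Returns (start, stop) index ranges into the flat buffer, so no
    event data is copied while matching.
    """
    hits = []
    for s in range(len(offsets) - 1):
        lo, hi = offsets[s], offsets[s + 1]
        for i in range(lo, hi - 1):  # never pair events across sequences
            if event_codes[i] == first and event_codes[i + 1] == second:
                hits.append((i, i + 2))
    return hits


hits = match_pair(0, 1)

# A memoryview slice is a zero-copy window onto the original buffer.
view = memoryview(event_codes)
first_match = view[hits[0][0]:hits[0][1]]
```

The output of a match is just a list of index pairs; the events themselves are only touched when a view such as `first_match` is actually read.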

Things that are tricky:

  • Numba requires static variable types for compilation, so that constrains us to a consistent schema across all sequences and events.
  • A columnar data layout also makes modifying the matched sequences tricky (TODO: still gotta implement it in jitted code).
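
As a rough illustration of the consistent-schema constraint, the sketch below coerces loosely structured events into one fixed set of typed fields before they would be stored. The `SCHEMA` fields and the `normalize` helper are hypothetical and not part of seqmatcher; missing values become `None`, mirroring the optional datatypes mentioned above.

```python
# Hypothetical schema: every event must expose exactly these fields.
SCHEMA = {"name": str, "timestamp": int, "amount": float}


def normalize(event):
    """Return a copy of `event` that has exactly the schema's fields.

    Missing fields are filled with None, present values are coerced to
    the declared type, and fields outside the schema are dropped.
    """
    out = {}
    for field, typ in SCHEMA.items():
        value = event.get(field)
        if value is not None and not isinstance(value, typ):
            value = typ(value)
        out[field] = value
    return out


raw_events = [
    {"name": "buy", "timestamp": "5", "amount": 3},
    {"name": "login", "timestamp": 7, "referrer": "ad"},
]
events = [normalize(e) for e in raw_events]
```

After this step every event has the same keys and value types, which is the kind of uniformity a statically typed compiled routine needs.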

Installation

Install this library using pip:

$ pip install seqmatcher

Usage

Please refer to the example notebook on how to specify and match patterns.

Development

To contribute to this library, first check out the code in a new virtual environment.

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
