A CPU and memory performant tag database for Exlir/erlang, featuring fast query times, the ability to represent complex tag hierarchies and implication rules, query JITing and optimization, and a (comparitavely) small memory footprint.
The graph of tags and the implication rules between them can contain any number of arbitrary cycles, allowing structures such as parent/child hierarchies, and tag aliases, all of which remain performant when querying for.
The bulk of AllTheTags is implemented as a C++ library with no knowledge
of Elixir or Erlang, and thus can be extracted out of the Elxir project with
minimal effort. The native code is located in c_src
, with native tests in
c_src/tests
and Elixir facing tests (stress testing things like concurrent
access to the database) are located under test
.
AllTheTags, when used from Elixir/erlang, is threadsafe, and database instances can be queried simultaniously from multiple threads. The NIF API layer will take care of any synchronization when database writes occur.
If the C++ project is used standalone, the user will have to guarentee that
reads and writes are synchronized; take a look at erl_api.cc
as an example
of how to achieve this.
While the code is tested against a modestly sized test suite, use it at your own risk (or even better, report bugs!), as any C++ bugs have the potential to crash the Erlang VM or corrupt memory.
For starters, to learn about implementing Erlang NIFs.
Secondly, the Erlang VM is very powerful, but not great at tasks that require "bare metal" performance. The lack of mutable state is also a fantastic feature of the language, but means that very finely grained memory performance is difficult. Because of this, the bulk of AllTheTags is an erlang NIF implemented in C++, which has been fairly heavily tested under valgrind and stress tested under multithreading conditions. The query engine uses various advanced techniques for optimizing complex queries, such as just-in-time compilation and query tree rearrangement/deduping to get the most speed possible.
The main Elixir module exposed is AllTheTags
:
# Create a new instance of an AllTheTags database
{:ok, db} = AllTheTags.new
# Create some entities in the database (which are represented as integers)
# An explicit entity ID can be passed in
{:ok, e1} = AllTheTags.new_entity(db, 0)
{:ok, e2} = AllTheTags.new_entity(db)
# e1 == 0
# e2 == (some other integer)
# If the same entity is created twice, :error is returned
:error = AllTheTags.new_entity(db, 0)
# Tags are created the same way, and are represented by integer IDs
# An optional explicit ID can be passed to set the ID of the created tag
{:ok, mytag} = AllTheTags.new_tag(db, 10)
# mytag == 10
# Same as with entities, if a tag is registered with the same ID twice, :error
# will be returned
:error = AllTheTags.new_tag(db, 10)
# Create some tags in the database
@a = 0; @b = 1; @c = 2; @d = 3
[@a, @b, @c, @d] |> Enum.each(fn(tag) ->
AllTheTags.new_tag(db, tag)
end)
# Create some implication rules between the tags
AllTheTags.imply_tag(@a, @b) # Anything tagged with 'a' will act as if it's also tagged with 'b'
# Make some more implications, to form a directed implication graph as follows (flows downwards):
# a
# / \
# b c
# \ /
# d
# |
# e
AllTheTags.imply_tag(db, @a, @c)
AllTheTags.imply_tag(db, @b, @d)
AllTheTags.imply_tag(db, @c, @d)
AllTheTags.imply_tag(db, @d, @e)
# Add the tag @b on e1, @c on e2
AllTheTags.add_tag(db, e1, @b)
AllTheTags.add_tag(db, e2, @c)
# Querying the database
AllTheTags.do_query(db, @a)
# => {:ok, []}
AllTheTags.do_query(db, @b)
# => {:ok, [e1]}
AllTheTags.do_query(db, @c)
# => {:ok, [e2]}
AllTheTags.do_query(db, @d)
# => {:ok, [e1, e2]}
AllTheTags.do_query(db, {:or, @b, @c})
# => {:ok, [e1, e2]}
# Remove implication edges with `unimply`
AllTheTags.unimply_tag(db, @a, @c)
# resulting implication graph:
# a
# /
# b c
# \ /
# d
# |
# e
Queries are executed by calling AllTheTags.do_query(database, <query>)
,
where query can be anything with this structure:
- a literal integer ID of a tag
- a tuple of
{:and, clause1, clause2}
or{:or, clause1, clause2}
, whereclause1
andclause2
are a any of these possible nodes - a tuple of
{:not, clause}
, which inverts the query such that matching enties don't matchclause
nil
, which will match all entities (the empty query)
For instance, to query for all entities which have tag @foo = 1
and also lack tag @bar = 2
:
{:and, @foo, {:not, @bar}}
All entities which have tag @foo or @bar or @baz:
{:or, @foo, {:or, @bar, @baz}}
Queries for all entities:
nil
Queries for none of the entities:
{:not, nil}
num_tags/1
the number of tags in the databasenum_entities/1
the number of entities in the databaseget_implies/2
get all the tags that a tag impliesget_implied_by/2
get all the tags that imply a given tag
These module methods should be further documented, but here they are in brief:
entity_tags/2
get the tags on an entity, both direct and impliedunimply_tag/3
remove implication edge between two tags, called likeunimply_tag(db, implier, implied)
remove_tag/3
remove tag from an entity, called likeremove_tag(db, entity, tag)
mark_dirty/1
marks the database as "dirty", it'll have to one recalculation before executing a query, but if marked dirty adding implication edges between tags is a cheap operationis_dirty/1
returntrue
orfalse
if the database is in a dirty state
For now, take a look at all_the_tags_test.exs
for usage of the above
Warning: implementation details follow
AllTheTags internally stores a metagraph (which is a directed acyclic graph) of all the tag implication rules as a means to quickly test if an entity has a given tag - or any tag implied by a tag - belong to it. The metagraph is built up incrementally with every implication edge added (and is generally a very quick operation), and assumes that all edges will not be added in a single go with no query calls inbetween (such as would be the case for a long running server only occassionally having its implication rules changed).
The algorithm for removing an edge between to tags that are mutually implicative
(i.e. belong to the same metanode), and recalculating the metagraph incrementally,
is an overly complex operation, and thus the entire metagraph is recalculated before
a query in this case. This is an O(|V|+|E|)
(linear) operation and only needs to
happen once before all subsequent queries.
mark_dirty/1
can be called to manually mark the database as being in a dirty state.
While the graph is dirty, the metagraph is not incrementally updated, so when
initially loading the edges into the database (such as on a full server restart),
mark_dirty/1
should be called to speed up edge insertion.
gotta add benchmarks for combinations of no query optimization, huffman code tree optimization, JIT optimization, and combinations thereof.