Hypex is a fast HyperLogLog implementation in Elixir which provides an easy way to count unique values with a small memory footprint. This library is based on the paper documenting the algorithm written by Philippe Flajolet et al.
Hypex is available on Hex. You can install the package via:
- Add hypex to your list of dependencies in
mix.exs
:
def deps do
[{ :hypex, "~> 1.1" }]
end
- Ensure hypex is started before your application:
def application do
[applications: [:hypex]]
end
Hypex is extremely straightforward to use, you simply create a new Hypex instance and start adding values to it:
iex> hypex = Hypex.new(4)
{Hypex.Array, 4, {:array, 16, 0, 0, 100}}
iex> hypex = Hypex.update(hypex, "my term")
{Hypex.Array, 4,
{:array, 16, 0, 0,
{10, {0, 2, 0, 0, 0, 0, 0, 0, 0, 0}, 10, 10, 10, 10, 10, 10, 10, 10, 10}}}
iex> hypex |> Hypex.cardinality |> round
1
The 4
being passed to Hypex.new/1
is the width which determines the underlying memory structure of a Hypex instance. This value can be within the range 4 <= width <= 16
, per the HyperLogLog algorithm. If you don't provide a width, it defaults to 16
. Be aware that you should typically scale this number higher based upon the more unique values you expect to see.
For any other examples of how to use Hypex, please read the documentation.
As of v1.1.0
, the default implementation has moved from a Bitstring to an Erlang Array. This is mainly due to Arrays performing faster on all operations when compared with Bitstrings. However in the case that you're operating in a low-memory environment (or simply want predictable memory usage), you might still wish to use the Bitstring implementation. You can do this by simply using Hypex.new(4, Bitstring)
when creating a Hypex.
A rough memory estimate (in bytes) for a Bitstring Hypex can be calculated using the formula ((2 ^ width) * width) / 8
- although this will only include the memory of the registers and not the rest of the tuple structure (which should be minimal). This means that using the highest width available of 16
, your memory usage will still only be 131,072
bytes.
At this point I don't know of a good way to measure the size of the Array implementation, but a rough estimate would suggest that it's probably within the range of 6-8 times more memory (if anyone can help measure, I'd appreciate it). Still, this amount of memory shouldn't pose an issue for most systems, and the throughput likely matters more to most users.
Below are some rough benchmarks for Hypex instances with the different underlying structures. Note that the update/2
tests are inserting a unique value - in the case a duplicate value is inserted, the operation is typically constant across widths at under 0.5 µs/op
.
These tests use a width of 4, so it should be noted that larger widths will have slower performance. However, these benchmarks are for reference only and you should gauge which widths work best for the data you're operating with, rather than the performance shown below.
## Array Hypex
Array Hypex.new/1 0.53 µs/op
Array Hypex.update/2 2.13 µs/op
Array Hypex.cardinality/1 6.87 µs/op
Array Hypex.merge/2 16.61 µs/op
## Bitstring Hypex
Bitstring Hypex.new/1 0.46 µs/op
Bitstring Hypex.update/2 2.13 µs/op
Bitstring Hypex.cardinality/1 6.70 µs/op
Bitstring Hypex.merge/2 8.69 µs/op
If you feel something can be improved, or have any questions about certain behaviours or pieces of implementation, please feel free to file an issue. Proposed changes should be taken to issues before any PRs to avoid wasting time on code which might not be merged upstream.
If you do make changes to the codebase, please make sure you test your changes thoroughly, and include any unit tests alongside new or changed behaviours. Hypex currently uses the excellent excoveralls to track code coverage.
$ mix test
$ mix coveralls
$ mix coveralls.html && open cover/excoveralls.html