Make HLL serializable #21

Open. Wants to merge 2 commits into base: master
17 changes: 9 additions & 8 deletions src/main/java/net/agkn/hll/HLL.java
@@ -16,6 +16,7 @@
* limitations under the License.
*/

+ import java.io.Serializable;
import java.util.Arrays;

import it.unimi.dsi.fastutil.ints.Int2ByteOpenHashMap;
@@ -34,12 +35,12 @@

/**
* A probabilistic set of hashed <code>long</code> elements. Useful for computing
- * the approximate cardinality of a stream of data in very small storage.<p/>
+ * the approximate cardinality of a stream of data in very small storage.
*
* A modified version of the <a href="http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf">
* 'HyperLogLog' data structure and algorithm</a> is used, which combines both
* probabilistic and non-probabilistic techniques to improve the accuracy and
- * storage requirements of the original algorithm.<p/>
+ * storage requirements of the original algorithm.
*
* More specifically, initializing and storing a new {@link HLL} will
* allocate a sentinel value symbolizing the empty set ({@link HLLType#EMPTY}).
@@ -48,7 +49,7 @@
* be sacrificed for memory footprint: the values in the sorted list are
* "promoted" to a "{@link HLLType#SPARSE}" map-based HyperLogLog structure.
* Finally, when enough registers are set, the map-based HLL will be converted
- * to a bit-packed "{@link HLLType#FULL}" HyperLogLog structure.<p/>
+ * to a bit-packed "{@link HLLType#FULL}" HyperLogLog structure.
*
* This data structure is interoperable with the implementations found at:
* <ul>
@@ -59,7 +60,7 @@
*
* @author timon
*/
- public class HLL implements Cloneable {
+ public class HLL implements Cloneable, Serializable {
// minimum and maximum values for the log-base-2 of the number of registers
// in the HLL
public static final int MINIMUM_LOG2M_PARAM = 4;
@@ -156,7 +157,7 @@ public class HLL implements Cloneable {
* @param expthresh tunes when the {@link HLLType#EXPLICIT} to
* {@link HLLType#SPARSE} promotion occurs,
* based on the set's cardinality. Must be at least -1 and at most 18.
- * <table>
+ * <table summary="">
* <thead><tr><th><code>expthresh</code> value</th><th>Meaning</th></tr></thead>
* <tbody>
* <tr>
@@ -238,7 +239,7 @@ public HLL(final int log2m, final int regwidth, final int expthresh, final boole
}

/**
- * Construct an empty HLL with the given {@code log2m} and {@code regwidth}.<p/>
+ * Construct an empty HLL with the given {@code log2m} and {@code regwidth}.
*
* This is equivalent to calling <code>HLL(log2m, regwidth, -1, true, HLLType.EMPTY)</code>.
*
@@ -596,7 +597,7 @@ public long cardinality() {
// Clear
/**
* Clears the HLL. The HLL will have cardinality zero and will act as if no
- * elements have been added.<p/>
+ * elements have been added.
*
* NOTE: Unlike {@link #addRaw(long)}, <code>clear</code> does NOT handle
* transitions between {@link HLLType}s - a probabilistic type will remain
@@ -938,7 +939,7 @@ public byte[] toBytes(final ISchemaVersion schemaVersion) {

/**
* Deserializes the HLL (in {@link #toBytes(ISchemaVersion)} format) serialized
- * into <code>bytes</code>.<p/>
+ * into <code>bytes</code>.
*
* @param bytes the serialized bytes of new HLL
* @return the deserialized HLL. This will never be <code>null</code>.
Original file line number Diff line number Diff line change
@@ -22,13 +22,13 @@
* a low bit in a byte. However, a high byte in a word is written at a lower index
* in the array than a low byte in a word. The first word is written at the lowest
* array index. Each serializer is one time use and returns its backing byte
- * array.<p/>
+ * array.
*
* This encoding was chosen so that when reading bytes as octets in the typical
* first-octet-is-the-high-nibble fashion, an octet-to-binary conversion
- * would yield a high-to-low, left-to-right view of the "short words".<p/>
+ * would yield a high-to-low, left-to-right view of the "short words".
*
- * Example:<p/>
+ * Example:
*
* Say short words are 5 bits wide. Our word sequence is the values
* <code>[31, 1, 5]</code>. In big-endian binary format, the values are
Original file line number Diff line number Diff line change
@@ -29,7 +29,7 @@ public interface IWordDeserializer {
long readWord();

/**
- * Returns the number of words that could be encoded in the sequence.<p/>
+ * Returns the number of words that could be encoded in the sequence.
*
* NOTE: the sequence that was encoded may be shorter than the value this
* method returns due to padding issues within bytes. This guarantees
@@ -39,4 +39,4 @@ public interface IWordDeserializer {
* @return the maximum number of words that could be read from the sequence.
*/
int totalWordCount();
- }
+ }
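
Review note: the word-packing rule described in the word serializer Javadoc above (high bit first within a byte, first word at the lowest array index) can be sketched as follows. This is a minimal illustration, not the library's serializer. The class `ShortWordPackingSketch` and both of its methods are hypothetical, and the `totalWordCount` formula (the number of whole words that fit in the byte array) is an assumption drawn from the padding note in `IWordDeserializer`.

```java
public class ShortWordPackingSketch {
    /**
     * Packs each word into wordLength bits, most significant bit first,
     * filling bytes from index 0 upward. Unused trailing bits stay zero.
     * Hypothetical helper, not part of the library.
     */
    public static byte[] pack(final long[] words, final int wordLength) {
        final int totalBits = words.length * wordLength;
        final byte[] bytes = new byte[(totalBits + 7) / 8];
        int bitIndex = 0;
        for (final long word : words) {
            for (int b = wordLength - 1; b >= 0; b--) {
                if (((word >>> b) & 1L) != 0) {
                    // 0x80 is the high bit of a byte; shift right to the target position
                    bytes[bitIndex / 8] |= (byte) (0x80 >>> (bitIndex % 8));
                }
                bitIndex++;
            }
        }
        return bytes;
    }

    /**
     * Assumed definition: how many whole words fit in byteCount bytes.
     * This can exceed the number of words actually written because of padding.
     */
    public static int totalWordCount(final int byteCount, final int wordLength) {
        return (byteCount * 8) / wordLength;
    }

    public static void main(final String[] args) {
        // 5-bit words [31, 1, 5] -> 11111 00001 00101 plus one padding bit
        final byte[] packed = pack(new long[] {31, 1, 5}, 5);
        System.out.printf("%02X %02X%n", packed[0] & 0xFF, packed[1] & 0xFF); // prints "F8 4A"

        // Three 4-bit words occupy 12 bits, so 2 bytes, which could hold 4 words
        final byte[] padded = pack(new long[] {1, 2, 3}, 4);
        System.out.println(totalWordCount(padded.length, 4)); // prints "4"
    }
}
```

The second call shows why a reader cannot rely on `totalWordCount()` alone to know how many words were written: the byte array has room for one more 4-bit word than was encoded.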
Original file line number Diff line number Diff line change
@@ -91,7 +91,7 @@ public class SerializationUtil {
* List of registered schema versions, indexed by their version numbers. If
* an entry is <code>null</code>, then no such schema version is registered.
* Similarly, registering a new schema version simply entails assigning an
- * {@link ISchemaVersion} instance to the appropriate index of this array.<p/>
+ * {@link ISchemaVersion} instance to the appropriate index of this array.
*
* By default, only {@link SchemaVersionOne} is registered. Note that version
* zero will always be reserved for internal (e.g. proprietary, legacy) schema
@@ -172,7 +172,7 @@ public static byte packVersionByte(final int schemaVersion, final int typeOrdina
* If 'auto' is chosen, this value should be <code>63</code>.
* </li>
* <li>
- * If a cutoff of 2<sup>n</sup> is desired, for <code>0 <= n < 31</code>,
+ * If a cutoff of 2<sup>n</sup> is desired, for <code>0 &lt;= n &lt; 31</code>,
* this value should be <code>n + 1</code>.
* </li>
* </ul>
@@ -190,7 +190,7 @@ public static byte packCutoffByte(final int explicitCutoff, final boolean sparse
/**
* Generates a byte that encodes the parameters of a
* {@link HLLType#FULL} or {@link HLLType#SPARSE}
- * HLL.<p/>
+ * HLL.
*
* The top 3 bits are used to encode <code>registerWidth - 1</code>
* (range of <code>registerWidth</code> is thus 1-9) and the bottom 5
8 changes: 5 additions & 3 deletions src/main/java/net/agkn/hll/util/BitVector.java
@@ -18,14 +18,16 @@

import net.agkn.hll.serialization.IWordSerializer;

+ import java.io.Serializable;
+
/**
* A vector (array) of bits that is accessed in units ("registers") of <code>width</code>
* bits which are stored as 64bit "words" (<code>long</code>s). In this context
* a register is at most 64bits.
*
* @author rgrzywinski
*/
- public class BitVector implements Cloneable {
+ public class BitVector implements Cloneable, Serializable {
// NOTE: in this context, a word is 64bits

// rather than doing division to determine how a bit index fits into 64bit
@@ -172,7 +174,7 @@ public LongIterator registerIterator() {
/**
* Sets the value of the specified index register if and only if the specified
* value is greater than the current value in the register. This is equivalent
- * to but much more performant than:<p/>
+ * to but much more performant than:
*
* <pre>vector.setRegister(index, Math.max(vector.getRegister(index), value));</pre>
*
@@ -259,4 +261,4 @@ public BitVector clone() {
System.arraycopy(words, 0, copy.words, 0, words.length);
return copy;
}
- }
+ }
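
Review note: marking `HLL` and `BitVector` as `Serializable` lets them pass through standard Java object streams, provided every non-transient field reachable from them is itself serializable. The round trip below is a minimal sketch of what that enables. It does not use the real classes: `RegisterBlock` is a hypothetical stand-in with a `long[]` field, and the explicit `serialVersionUID` is an assumption for illustration that this diff does not add.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

public class SerializableRoundTripSketch {
    /** Hypothetical stand-in for a class that backs its state with 64-bit words. */
    public static final class RegisterBlock implements Cloneable, Serializable {
        // Assumption for this sketch: pin the stream version explicitly
        private static final long serialVersionUID = 1L;

        public final long[] words;

        public RegisterBlock(final long[] words) {
            this.words = words.clone();
        }
    }

    /** Writes the value to an in-memory object stream and reads it back. */
    public static <T extends Serializable> T roundTrip(final T value) {
        try {
            final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                @SuppressWarnings("unchecked")
                final T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            // Wrapped so callers of this sketch need no checked-exception handling
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(final String[] args) {
        final RegisterBlock original = new RegisterBlock(new long[] {1L, 2L, 3L});
        final RegisterBlock copy = roundTrip(original);
        System.out.println(copy != original && Arrays.equals(copy.words, original.words));
    }
}
```

A test along these lines against the real `HLL` (one per `HLLType`) would confirm that every field the class holds survives the stream, which the `implements` clause alone does not guarantee.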
8 changes: 4 additions & 4 deletions src/main/java/net/agkn/hll/util/HLLUtil.java
@@ -148,7 +148,7 @@ public static long pwMaxMask(final int registerSizeInBits) {
* The cutoff for using the "small range correction" formula, in the
* HyperLogLog algorithm.
*
- * @param m the number of registers in the HLL. <em>m<em> in the paper.
+ * @param m the number of registers in the HLL. <em>m</em> in the paper.
* @return the cutoff for the small range correction.
* @see #smallEstimator(int, int)
*/
@@ -161,7 +161,7 @@ public static double smallEstimatorCutoff(final int m) {
* appropriate if both the estimator is smaller than <pre>(5/2) * m</pre> and
* there are still registers that have the zero value.
*
- * @param m the number of registers in the HLL. <em>m<em> in the paper.
+ * @param m the number of registers in the HLL. <em>m</em> in the paper.
* @param numberOfZeroes the number of registers with value zero. <em>V</em>
* in the paper.
* @return a corrected cardinality estimate.
@@ -174,7 +174,7 @@ public static double smallEstimator(final int m, final int numberOfZeroes) {
* The cutoff for using the "large range correction" formula, from the
* HyperLogLog algorithm, adapted for 64 bit hashes.
*
- * @param log2m log-base-2 of the number of registers in the HLL. <em>b<em> in the paper.
+ * @param log2m log-base-2 of the number of registers in the HLL. <em>b</em> in the paper.
* @param registerSizeInBits the size of the HLL registers, in bits.
* @return the cutoff for the large range correction.
* @see #largeEstimator(int, int, double)
@@ -189,7 +189,7 @@ public static double largeEstimatorCutoff(final int log2m, final int registerSiz
* for 64 bit hashes. Only appropriate for estimators whose value exceeds
* the return of {@link #largeEstimatorCutoff(int, int)}.
*
- * @param log2m log-base-2 of the number of registers in the HLL. <em>b<em> in the paper.
+ * @param log2m log-base-2 of the number of registers in the HLL. <em>b</em> in the paper.
* @param registerSizeInBits the size of the HLL registers, in bits.
* @param estimator the original estimator ("E" in the paper).
* @return a corrected cardinality estimate.
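
Review note: for readers without the paper at hand, the small range correction that the `HLLUtil` Javadoc refers to can be sketched as below. The method names and parameter lists mirror the hunk headers in this diff, but the bodies are not shown there, so they are assumptions based on the HyperLogLog paper: the cutoff is `(5/2) * m`, and the corrected estimate is linear counting, `m * ln(m / V)`. The class name `SmallRangeCorrectionSketch` is hypothetical.

```java
public class SmallRangeCorrectionSketch {
    /** Cutoff below which the small range correction applies: (5/2) * m. */
    public static double smallEstimatorCutoff(final int m) {
        return (m * 5.0) / 2.0;
    }

    /**
     * Linear counting estimate m * ln(m / V), where V is the number of
     * registers still at zero. Only meaningful when numberOfZeroes > 0.
     */
    public static double smallEstimator(final int m, final int numberOfZeroes) {
        return m * Math.log((double) m / numberOfZeroes);
    }

    public static void main(final String[] args) {
        final int m = 16; // log2m = 4, the minimum allowed by MINIMUM_LOG2M_PARAM
        System.out.println(smallEstimatorCutoff(m));
        // All registers empty: m / V = 1, so the estimate is zero
        System.out.println(smallEstimator(m, m));
        // Half the registers empty
        System.out.println(smallEstimator(m, m / 2));
    }
}
```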