Count-Min Sketch Support #2496

jonathanc-n · 2024-08-19T08:27:46Z

jonathanc-n
Aug 19, 2024

References: #2425

Overview

Count-Min Sketch is a probabilistic data structure used to estimate the frequency of elements in a data stream. It is particularly useful in scenarios where memory efficiency is important, such as network traffic analysis, real-time analytics, etc.

The core idea behind Count-Min Sketch is to use multiple hash functions to map each element in a data stream to different positions in a set of arrays. Each bucket in these arrays holds a count that represents the estimated frequency of the elements mapped to it.

Hashing: When an element appears in the data stream, it is passed through multiple hash functions. Each hash function produces an index corresponding to a position in a different array. The count at each of these positions is then incremented.

Estimating Frequency: To estimate the frequency of an element, the element is passed through the same hash functions, and the counts at the corresponding positions in the arrays are retrieved. The minimum value among these counts is taken as the estimate for the frequency of that element. This approach helps mitigate overestimation caused by hash collisions.

Structure of Count-Min Sketch

Array of Counters: The Count-Min Sketch maintains a 2-D array, where each row corresponds to a different hash function, and each column represents a possible index generated by these hash functions. The size of the array is determined by the desired accuracy and confidence level. The more rows and columns, the more accurate the results are at the cost of increased memory usage.

Updating the Sketch: When a new element is encountered, it is hashed multiple times, each hash directing the element to a different row of counters. The counts at the corresponding columns are incremented.

Querying: To estimate the frequency of an element, the same hash functions are applied, and the counts from the respective positions are retrieved.

Supported Commands on Redis

CMS.INCRBY - Increases the count of one or more items by increment.
CMS.INFO - Returns information about a sketch.
CMS.INITBYDIM - Initializes a Count-Min Sketch to dimensions specified by the user.
CMS.INITBYPROB - Initializes a Count-Min Sketch to accommodate requested tolerances.
CMS.MERGE - Merges several sketches into one sketch.
CMS.QUERY - Returns the count for one or more items in a sketch.

Implementation

I was looking into Redis' count min sketch implementation and noticed these things:

Hash Function Choice: Redis uses a modified version of the MurmurHash2 function to generate hash values for the items.

Storage Format
CMS Array: The CMS data structure is stored as a 2D array (flattened into a 1D array) of uint32_t integers. Each element in this array represents the count for a particular combination of hash function and item. The size of the array is determined by the width (number of columns) and depth (number of rows) of the CMS.

Storing the entire array with a single serialized key, with possibility of subkeys if parts of the array are being

String Representation: In Redis, data structures like strings and bitmaps are generally treated as "string" types. However, for CMS, the structure is specifically tailored to handle frequency counts. The memory is made to be manipulated easily. Counts will be stored as uin32_t in binary format making it super compact.

Metadata: Metadata like width, depth, and counter can be stored at the beginning of the serialized value if using a single key-value pair.

Tuning
Redis' CMS implementation follows a fixed configuration for the depth and width based on error and probability parameters. However, tuning these parameters can significantly impact the performance and accuracy of the CMS:

Depth and Width Calculation: The width and depth of the CMS can be derived from the desired error rate (error) and the probability of failure (delta). The function CMS_DimFromProb calculates the width as ceil(2 / error) and the depth as ceil(log10f(delta) / log10f(0.5)). This fixed-size approach simplifies the usage and ensures consistent performance.

Merging CMS Structures: The function CMS_Merge allows combining multiple CMS instances into one. This is particularly useful in distributed systems where frequency data from different nodes need to be aggregated. The function takes care of summing the counts from multiple sources while checking for potential overflows to prevent data corruption.

I will be looking into how to properly integrate this with the storage layer, and support the commands on top of it, I will share my solution here once I have properly mapped it out. If anyone has any insights please feel free to share

Note: will be adding complexities

Resources
http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
https://github.com/RedisBloom/RedisBloom/blob/master/docs/docs/count-min-sketch.md
https://redis.io/docs/latest/develop/data-types/probabilistic/count-min-sketch/
https://redis.io/docs/latest/commands/?group=cms

ty @mapleFU for the format outline :)

mapleFU · 2024-08-19T08:33:47Z

mapleFU
Aug 19, 2024
Collaborator

CMS Array: The CMS data structure is stored as a 2D array (flattened into a 1D array) of uint32_t integers. Each element in this array represents the count for a particular combination of hash function and item. The size of the array is determined by the width (number of columns) and depth (number of rows) of the CMS.
String Representation: In Redis, data structures like strings and bitmaps are generally treated as "string" types. However, for CMS, the structure is specifically tailored to handle frequency counts. The memory is made to be manipulated easily.

Sorry but I don't know how would you like to storing the CMS Array? With "List" or self encoding int32? Is it in one key-value? Or having subfields? What would the metadata like?

9 replies

jonathanc-n Aug 20, 2024
Author

@mapleFU Can I clarify that streams in KVrocks aren't from live data streams (kafka, etc.), but are from command driven processes? Is that what is done for hyperloglog? (same as redis)

mapleFU Aug 20, 2024
Collaborator

Can I clarify that streams in KVrocks aren't from live data streams (kafka, etc.), but are from command driven processes?

I don't understand what you mean here. Flink has a redis connector[1] and Kafka also having one[2]. kvrocks will not "pull" from these datasource. However, in the contract, these datasource can "push" to kvrocks with commands. We don't have something like "Kafka Consumer" [3]

[1] https://nightlies.apache.org/flink/flink-docs-release-1.1/apis/streaming/connectors/redis.html
[2] https://docs.confluent.io/kafka-connectors/redis/current/overview.html
[3] https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

jonathanc-n Aug 20, 2024
Author

Oh yeah that was what I was thinking. Thank you for clearing that up

jonathanc-n Sep 5, 2024
Author

@mapleFU For count min sketch, do you think that i would even need cms files and could just stick to having redis_cms files?

mapleFU Sep 5, 2024
Collaborator

For me keep the style is simple and a good practice

jonathanc-n · 2024-08-28T23:03:47Z

jonathanc-n
Aug 28, 2024
Author

Sorry, I haven't been able to work on this for a week. For the functionality, does it just need to be a data structure that can persist in the rocksdb storage layer and have elements added to it using the cmsketch's incrby command?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Count-Min Sketch Support #2496

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 9 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Count-Min Sketch Support #2496

jonathanc-n Aug 19, 2024

Overview

Structure of Count-Min Sketch

Supported Commands on Redis

Implementation

Replies: 2 comments · 9 replies

mapleFU Aug 19, 2024 Collaborator

jonathanc-n Aug 20, 2024 Author

mapleFU Aug 20, 2024 Collaborator

jonathanc-n Aug 20, 2024 Author

jonathanc-n Sep 5, 2024 Author

mapleFU Sep 5, 2024 Collaborator

jonathanc-n Aug 28, 2024 Author

jonathanc-n
Aug 19, 2024

Replies: 2 comments 9 replies

mapleFU
Aug 19, 2024
Collaborator

jonathanc-n Aug 20, 2024
Author

mapleFU Aug 20, 2024
Collaborator

jonathanc-n Aug 20, 2024
Author

jonathanc-n Sep 5, 2024
Author

mapleFU Sep 5, 2024
Collaborator

jonathanc-n
Aug 28, 2024
Author