statistics: refine comments to FMSketch (#53601)
Rustin170506 authored May 28, 2024
1 parent d2d1257 commit 772db20
Showing 1 changed file with 31 additions and 16 deletions.
47 changes: 31 additions & 16 deletions pkg/statistics/fmsketch.go
@@ -46,18 +46,31 @@ var fmSketchPool = sync.Pool{
 // TODO: add this attribute to PB and persist it instead of using a fixed number(executor.maxSketchSize)
 const MaxSketchSize = 10000

-// FMSketch (Flajolet–Martin Sketch) is a probabilistic data structure used for estimating the number of distinct elements in a stream.
-// It uses a hash function to map each element to a binary number and counts the number of trailing zeroes in each hashed value.
-// The maximum number of trailing zeroes observed gives an estimate of the logarithm of the number of distinct elements.
-// This approach allows the FM sketch to handle large streams of data in a memory-efficient way.
-//
-// See https://en.wikipedia.org/wiki/Flajolet%E2%80%93Martin_algorithm
+// FMSketch (Flajolet-Martin Sketch) is a probabilistic data structure that estimates the count of unique elements in a stream.
+// It employs a hash function to convert each element into a binary number and then counts the trailing zeroes in each hashed value.
+// **This variant of the FM sketch uses a set to store unique hashed values and a binary mask to track the maximum number of trailing zeroes.**
+// The estimated count of distinct values is calculated as 2^r * count, where 'r' is the maximum number of trailing zeroes observed and 'count' is the number of unique hashed values.
+// The fundamental idea is that our hash function maps the input domain onto a logarithmic scale.
+// This is achieved by hashing the input value and counting the number of trailing zeroes in the binary representation of the hash value.
+// Each distinct value is mapped to 'i' with a probability of 2^-(i+1).
+// For example, a value is mapped to 0 with a probability of 1/2, to 1 with a probability of 1/4, to 2 with a probability of 1/8, and so on.
+// This is achieved by hashing the input value and counting the trailing zeroes in the hash value.
+// If we have a set of 'n' distinct values, the count of distinct values with 'r' trailing zeroes is n / 2^r.
+// Therefore, the estimated count of distinct values is 2^r * count = n.
+// The level-by-level approach increases the accuracy of the estimation by ensuring a minimum count of distinct values at each level.
+// This way, the final estimation is less likely to be skewed by outliers.
+// For more details, refer to the following papers:
+// 1. https://www.vldb.org/conf/2001/P541.pdf
+// 2. https://algo.inria.fr/flajolet/Publications/FlMa85.pdf
 type FMSketch struct {
 	// A set to store unique hashed values.
 	hashset *swiss.Map[uint64, bool]
 	// A binary mask used to track the maximum number of trailing zeroes in the hashed values.
+	// Also used to track the level of the sketch.
+	// Every time the size of the hashset exceeds the maximum size, the mask will be moved to the next level.
 	mask uint64
-	// The maximum size of the hashset. If the size exceeds this value, the mask size will be doubled and some hashed values will be removed from the hashset.
+	// The maximum size of the hashset. If the size exceeds this value, the mask will be moved to the next level.
+	// And the hashset will only keep the hashed values with trailing zeroes greater than or equal to the new mask.
 	maxSize int
 }

@@ -88,27 +101,29 @@ func (s *FMSketch) NDV() int64 {
 	if s == nil {
 		return 0
 	}
-	// The size of the mask (incremented by one) is 2^r, where r is the maximum number of trailing zeroes observed in the hashed values.
-	// The count of unique hashed values is the number of unique elements in the hashset.
-	// This estimation method is based on the Flajolet-Martin algorithm for estimating the number of distinct elements in a stream.
+	// The estimated count of distinct values is 2^r * count, where 'r' is the maximum number of trailing zeroes observed and 'count' is the number of unique hashed values.
+	// The fundamental idea is that the hash function maps the input domain onto a logarithmic scale.
+	// This is achieved by hashing the input value and counting the number of trailing zeroes in the binary representation of the hash value.
+	// So the count of distinct values with 'r' trailing zeroes is n / 2^r, where 'n' is the number of distinct values.
+	// Therefore, the estimated count of distinct values is 2^r * count = n.
 	return int64(s.mask+1) * int64(s.hashset.Count())
 }

 // insertHashValue inserts a hashed value into the sketch.
 func (s *FMSketch) insertHashValue(hashVal uint64) {
-	// If the hashed value is already in the sketch (determined by bitwise AND with the mask), return without inserting.
-	// This is because the number of trailing zeroes in the hashed value is less than or equal to the mask value.
+	// If the hashed value is already covered by the mask, we can skip it.
+	// This is because the number of trailing zeroes in the hashed value is less than the mask.
 	if (hashVal & s.mask) != 0 {
 		return
 	}
 	// Put the hashed value into the hashset.
 	s.hashset.Put(hashVal, true)
-	// If the count of unique hashed values exceeds the maximum size,
-	// double the mask size and remove any hashed values from the hashset that are now within the mask.
-	// This is to ensure that the mask value is always a power of two minus one (i.e., a binary number of the form 111...),
-	// which allows us to quickly check the number of trailing zeroes in a hashed value by performing a bitwise AND operation with the mask.
+	// We track the unique hashed values level by level to ensure a minimum count of distinct values at each level.
+	// This way, the final estimation is less likely to be skewed by outliers.
 	if s.hashset.Count() > s.maxSize {
+		// If the size of the hashset exceeds the maximum size, move the mask to the next level.
 		s.mask = s.mask*2 + 1
+		// Clean up the hashset by removing the hashed values with trailing zeroes less than the new mask.
 		s.hashset.Iter(func(k uint64, _ bool) (stop bool) {
 			if (k & s.mask) != 0 {
 				s.hashset.Delete(k)

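The level-by-level scheme described in the comments can be traced on a small input. The following is a minimal, self-contained sketch of the idea, not TiDB's implementation: it swaps `swiss.Map` for a built-in Go map, and the names `levelSketch`, `newLevelSketch`, `insert`, and `ndv` are hypothetical. To keep the trace readable, the demo feeds the integers 1..16 in directly as if they were hash values. A real sketch would hash its inputs first.

```go
package main

import "fmt"

// levelSketch is a hypothetical, simplified stand-in for FMSketch.
// It uses a built-in map where the original uses swiss.Map.
type levelSketch struct {
	hashset map[uint64]struct{} // unique hashed values kept at the current level
	mask    uint64              // always 2^r - 1, where r is the current level
	maxSize int                 // size threshold that triggers a level increase
}

func newLevelSketch(maxSize int) *levelSketch {
	return &levelSketch{hashset: make(map[uint64]struct{}), maxSize: maxSize}
}

// insert adds an already-hashed value to the sketch.
func (s *levelSketch) insert(hashVal uint64) {
	// A non-zero result means the value has fewer than r trailing zero bits,
	// so it does not belong to the current level.
	if hashVal&s.mask != 0 {
		return
	}
	s.hashset[hashVal] = struct{}{}
	if len(s.hashset) > s.maxSize {
		// Move to the next level: r -> r+1, mask -> 2^(r+1) - 1.
		s.mask = s.mask*2 + 1
		// Drop values that no longer pass the mask check.
		// Deleting from a map while ranging over it is safe in Go.
		for k := range s.hashset {
			if k&s.mask != 0 {
				delete(s.hashset, k)
			}
		}
	}
}

// ndv returns the estimate 2^r * count.
func (s *levelSketch) ndv() int64 {
	return int64(s.mask+1) * int64(len(s.hashset))
}

func main() {
	s := newLevelSketch(4)
	for v := uint64(1); v <= 16; v++ {
		s.insert(v) // values stand in for hashes so the trace is easy to follow
	}
	fmt.Println(s.mask, len(s.hashset), s.ndv()) // prints "3 4 16"
}
```

With `maxSize` set to 4, the mask moves from 0 to 1 when the fifth value arrives and from 1 to 3 at value 10. The set ends up holding the multiples of four {4, 8, 12, 16}, so the estimate is (3+1) * 4 = 16, which matches the 16 distinct inputs. The match is exact only because the stand-in hashes are evenly spread. With real hash values the result is an approximation.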