-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
statistics: add bucket structure. #2993
Conversation
statistics/builder.go
Outdated
col.Numbers[bucketIdx] = i * sampleFactor | ||
col.Repeats[bucketIdx] += sampleFactor | ||
col.Buckets[bucketIdx].Count = (i + 1) * sampleFactor | ||
col.Buckets[bucketIdx].Repeats += sampleFactor | ||
} else if i*sampleFactor-lastNumber <= valuesPerBucket { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be (i+1)*sampleFactor-lastNumber)
.
statistics/builder.go
Outdated
col.Repeats[bucketIdx] = 0 | ||
col.Buckets[bucketIdx].Count = (i + 1) * sampleFactor | ||
col.Buckets[bucketIdx].Value = samples[i] | ||
col.Buckets[bucketIdx].Repeats = sampleFactor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use NDV is more accurate.
statistics/column.go
Outdated
c.Repeats[curBuck] = c.Repeats[i+1] | ||
c.Buckets[curBuck].Count = c.Buckets[i+1].Count | ||
c.Buckets[curBuck].Value = c.Buckets[i+1].Value | ||
c.Buckets[curBuck].Repeats = c.Buckets[i+1].Repeats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not directly copy the whole bucket?
statistics/column.go
Outdated
return greaterThanBucketValueCount, nil | ||
} | ||
return (nextNumber + greaterThanBucketValueCount) / 2, nil | ||
return c.totalRowCount() - lessCount, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greater count should be total count minus less or equal count.
statistics/column.go
Outdated
c.Numbers[curBuck] = c.Numbers[i+1] | ||
c.Values[curBuck] = c.Values[i+1] | ||
c.Repeats[curBuck] = c.Repeats[i+1] | ||
c.Buckets[curBuck].Count = c.Buckets[i+1].Count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c.Bucket[curBuck] = Bucket{
...
}
statistics/builder.go
Outdated
} else { | ||
// The bucket is full, store the item in the next bucket. | ||
lastNumber = col.Numbers[bucketIdx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why delete this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's use less, count is calculate by (i+1) * sampleFactor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you delete this, then when you test if this bucket is full, it will always false, which will lead to too many buckets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you are right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a test to make sure that the number of buckets do not exceed the predefined bucket count.
statistics/builder.go
Outdated
} | ||
valuesPerBucket := t.Count/bucketCount + 1 | ||
|
||
// As we use samples to build the histogram, the bucket number and repeat should multiply a factor. | ||
sampleFactor := t.Count / int64(len(samples)) | ||
ndvFactor := t.Count / ndv | ||
log.Warnf("sample %d ndv %d ndvFact %d", sampleFactor, ndv, ndvFactor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this log necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
LGTM |
statistics/column.go
Outdated
|
||
// bucket is an element of histogram. | ||
// | ||
// A bucket number is the number of items stored in all previous buckets and the current bucket. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the comment.
statistics/builder.go
Outdated
col.Numbers[bucketIdx] = i * sampleFactor | ||
col.Repeats[bucketIdx] += sampleFactor | ||
} else if i*sampleFactor-lastNumber <= valuesPerBucket { | ||
col.Buckets[bucketIdx].Count = (i + 1) * sampleFactor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define a variable for i+1
. Many places use it.
LGTM |
add Buctket structure and adjust the way of estimating.
@shenli @coocood @zimulala @lamxTyler PTAL