statistics: add bucket structure. #2993

Merged
merged 9 commits into from
Apr 6, 2017

Conversation

hanfei1991
Member

Add a Bucket structure and adjust the estimation method.

@shenli @coocood @zimulala @lamxTyler PTAL

col.Numbers[bucketIdx] = i * sampleFactor
col.Repeats[bucketIdx] += sampleFactor
col.Buckets[bucketIdx].Count = (i + 1) * sampleFactor
col.Buckets[bucketIdx].Repeats += sampleFactor
} else if i*sampleFactor-lastNumber <= valuesPerBucket {
Member

This should be (i+1)*sampleFactor-lastNumber.
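
A minimal sketch of the off-by-one this comment points at. The helper name `bucketHasRoom` and the parameter name `lastCount` are hypothetical and not taken from the PR. The sketch assumes `lastCount` holds the cumulative count at the end of the previous bucket.

```go
package main

// bucketHasRoom reports whether the sample at index i still fits into the
// current bucket. Hypothetical helper, not code from the PR.
func bucketHasRoom(i, sampleFactor, lastCount, valuesPerBucket int64) bool {
	// Samples 0..i have been consumed at this point, so the cumulative
	// count covers i+1 samples, not i.
	cumulative := (i + 1) * sampleFactor
	return cumulative-lastCount <= valuesPerBucket
}
```

With sampleFactor = 10, lastCount = 0 and valuesPerBucket = 100, index 9 still fits (100 <= 100) and index 10 does not (110 > 100). Using i instead of i+1 would let index 10 into the bucket as well.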

col.Repeats[bucketIdx] = 0
col.Buckets[bucketIdx].Count = (i + 1) * sampleFactor
col.Buckets[bucketIdx].Value = samples[i]
col.Buckets[bucketIdx].Repeats = sampleFactor
Member

Using the NDV (number of distinct values) would be more accurate.

c.Repeats[curBuck] = c.Repeats[i+1]
c.Buckets[curBuck].Count = c.Buckets[i+1].Count
c.Buckets[curBuck].Value = c.Buckets[i+1].Value
c.Buckets[curBuck].Repeats = c.Buckets[i+1].Repeats
Contributor

Why not directly copy the whole bucket?
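
A sketch of what copying the whole bucket could look like, assuming the bucket is a plain struct with the three fields visible in the diff. The type of `Value` (a string here) and the function `halveBuckets` are assumptions made for illustration and are not the PR's code.

```go
package main

// Bucket mirrors the three fields that appear in the diff. The string type
// of Value is an assumption; the real type is not shown in this thread.
type Bucket struct {
	Count   int64
	Value   string
	Repeats int64
}

// halveBuckets keeps the upper bucket of every adjacent pair and compacts
// the slice in place. Because Count is cumulative, the upper bucket already
// covers the rows of its lower neighbour. Hypothetical helper.
func halveBuckets(buckets []Bucket) []Bucket {
	cur := 0
	for i := 0; i < len(buckets); i += 2 {
		src := i + 1
		if src >= len(buckets) {
			src = i // odd tail: the last bucket has no partner
		}
		// One struct assignment copies Count, Value and Repeats together.
		buckets[cur] = buckets[src]
		cur++
	}
	return buckets[:cur]
}
```

A single assignment also stays correct if a field is later added to the struct, which is not true of three field-by-field copies.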

return greaterThanBucketValueCount, nil
}
return (nextNumber + greaterThanBucketValueCount) / 2, nil
return c.totalRowCount() - lessCount, nil
Contributor

The greater-than count should be the total count minus the less-than-or-equal count.
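
The arithmetic behind this comment, as a small sketch. The function `greaterRowCount` and its parameters are hypothetical. It assumes the less-than and equal-to estimates are already available as separate numbers.

```go
package main

// greaterRowCount estimates how many rows are strictly greater than a value,
// given the total row count and the estimated counts of rows that are less
// than or equal to that value. Hypothetical helper, not code from the PR.
func greaterRowCount(total, lessCount, equalCount int64) int64 {
	// Subtracting only lessCount would also count the rows equal to the value.
	greater := total - (lessCount + equalCount)
	if greater < 0 {
		// The inputs are estimates and may overshoot; clamp at zero.
		greater = 0
	}
	return greater
}
```

For example, with 100 rows in total, 30 rows below the value and 5 rows equal to it, the estimate is 65. Subtracting only the less-than count would give 70.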

c.Numbers[curBuck] = c.Numbers[i+1]
c.Values[curBuck] = c.Values[i+1]
c.Repeats[curBuck] = c.Repeats[i+1]
c.Buckets[curBuck].Count = c.Buckets[i+1].Count
Member

c.Bucket[curBuck] = Bucket{
     ...
}

} else {
// The bucket is full, store the item in the next bucket.
lastNumber = col.Numbers[bucketIdx]
Contributor

Why delete this?

Member Author

It's useless; the count is calculated as (i+1) * sampleFactor.

Contributor

If you delete this, the condition that tests whether this bucket is full will always be false, which will lead to too many buckets.

Member Author

Well, you are right

Contributor

I think we need a test to make sure that the number of buckets does not exceed the predefined bucket count.
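
A rough sketch of the kind of check being asked for. It is a simplified stand-in for the builder, not the PR's implementation: `buildBuckets`, the `int64` value type, and the requirement that the samples arrive sorted are all assumptions. The formulas for `valuesPerBucket` and `sampleFactor` follow the diff in this thread.

```go
package main

// Bucket mirrors the fields in the diff; the int64 Value type is an assumption.
type Bucket struct {
	Count   int64 // rows in this bucket and all previous buckets
	Value   int64 // upper bound of the bucket
	Repeats int64 // estimated occurrences of Value
}

// buildBuckets builds buckets from sorted samples. Hypothetical sketch.
func buildBuckets(samples []int64, totalCount, bucketCount int64) []Bucket {
	if len(samples) == 0 || bucketCount <= 0 {
		return nil
	}
	valuesPerBucket := totalCount/bucketCount + 1
	sampleFactor := totalCount / int64(len(samples))
	var buckets []Bucket
	lastCount := int64(0) // cumulative count at the end of the previous bucket
	for i, v := range samples {
		count := int64(i+1) * sampleFactor
		n := len(buckets)
		switch {
		case n > 0 && buckets[n-1].Value == v:
			// Same value as the bucket's upper bound: always stays in the bucket.
			buckets[n-1].Count = count
			buckets[n-1].Repeats += sampleFactor
		case n > 0 && count-lastCount <= valuesPerBucket:
			// The bucket still has room: v becomes the new upper bound.
			buckets[n-1] = Bucket{Count: count, Value: v, Repeats: sampleFactor}
		default:
			// The bucket is full: remember where it ended, then start a new one.
			if n > 0 {
				lastCount = buckets[n-1].Count
			}
			buckets = append(buckets, Bucket{Count: count, Value: v, Repeats: sampleFactor})
		}
	}
	return buckets
}
```

A test along these lines would build a histogram from, say, 100 distinct samples with totalCount = 100 and bucketCount = 10, and assert that `len(buildBuckets(...))` is at most 10. If the `lastCount` update in the default branch is dropped, the same input produces one bucket per sample once the first bucket fills up, which is the failure described earlier in this thread. This sketch does not guarantee the bound for every combination of sample size and bucket count, which is one more reason to cover it with a test.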

}
valuesPerBucket := t.Count/bucketCount + 1

// As we use samples to build the histogram, the bucket number and repeat should multiply a factor.
sampleFactor := t.Count / int64(len(samples))
ndvFactor := t.Count / ndv
log.Warnf("sample %d ndv %d ndvFact %d", sampleFactor, ndv, ndvFactor)
Contributor

Is this log necessary?

Member Author

removed

Contributor

alivxxx commented Apr 5, 2017

LGTM

hanfei1991 added the status/LGT1 label (indicates that a PR has LGTM 1) on Apr 5, 2017

// bucket is an element of histogram.
//
// A bucket number is the number of items stored in all previous buckets and the current bucket.
Contributor

Update the comment.

col.Numbers[bucketIdx] = i * sampleFactor
col.Repeats[bucketIdx] += sampleFactor
} else if i*sampleFactor-lastNumber <= valuesPerBucket {
col.Buckets[bucketIdx].Count = (i + 1) * sampleFactor
Contributor

Define a variable for i+1. Many places use it.

Contributor

zimulala commented Apr 5, 2017

LGTM

zimulala merged commit b6ff4ad into master on Apr 6, 2017
zimulala deleted the hanfei/stats branch on April 6, 2017 02:15
zimulala added the status/LGT2 label (indicates that a PR has LGTM 2) and removed the status/LGT1 label on Apr 6, 2017