
fix: skip points outside of archive retention on first pass #82

Merged · 19 commits into main from jvp/remove-points-outside-of-retention-on-first-pass · Sep 28, 2023

Conversation

@jesusvazquez (Member) commented Sep 25, 2023

We realized that during the first pass over an archive we were counting points that fall outside of its retention, which can skew some calculations. This change fixes the code to only include points within the expected archive boundaries, i.e. between the archive's min and max timestamps.

Previously, we kept points in a lower archive if that exact timestamp did not exist in a higher-resolution archive. This was incorrect because the lower archive might be a SUM aggregation, which results in spike points amidst the original high-resolution points. Instead, we only take points from an archive if no higher-resolution archive covers that entire time span. Similarly, we do not keep any points in a high-resolution archive if those points are covered by a lower-resolution archive. In effect, we always know which points are valid based on the boundaries between the archives:

   MaxT     MinT
0: [XXXXXXXXXX] 1d, 1m
1: [           XXXXXXXXXXXXXXXXX] 1w, 10m
2: [                            XXXXXXXXXXXXX]  1y, 60m

Based on this new approach, we can greatly reduce the amount of logic in the ReadSamples function and make it much faster and more space-efficient. We also discovered new edge cases and added tests for those.
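
As a minimal sketch of that boundary rule (the names and types here are hypothetical illustrations, not the converter's actual API), each archive's valid window follows directly from the newest timestamp and the archives' retentions:

package converter // hypothetical package, for illustration only

// bounds is the half-open window (MinTs, MaxTs] that one archive is
// authoritative for.
type bounds struct{ MinTs, MaxTs int64 }

// archiveBounds derives each archive's valid window from the newest
// timestamp ("now") and each archive's retention in seconds, with
// archives ordered from highest to lowest resolution.
func archiveBounds(retentions []int64, newestTs int64) []bounds {
	out := make([]bounds, len(retentions))
	upper := newestTs
	for i, ret := range retentions {
		lower := newestTs - ret // oldest timestamp still inside this retention
		out[i] = bounds{MinTs: lower, MaxTs: upper}
		upper = lower // the next, coarser archive takes over below this boundary
	}
	return out
}

For the archives in the diagram, archive 0 would get the most recent day, archive 1 the span from one day to one week back, and archive 2 the span from one week to one year back.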

@jesusvazquez jesusvazquez marked this pull request as ready for review September 25, 2023 17:47
@jesusvazquez jesusvazquez requested a review from a team as a code owner September 25, 2023 17:47
@jesusvazquez jesusvazquez marked this pull request as draft September 25, 2023 17:47
// If we have already seen a point with the same timestamp it means
// we already have a point from an archive with higher precision that
// we want to keep. So we skip this point.
if _, ok := seenTs[p.Timestamp]; ok {
@jesusvazquez (Member, Author) commented Sep 26, 2023

@pstibrany and @ywwg This solves problem 2 that we discussed yesterday where we want to keep points from prior archives because they have higher resolutions.


for _, p := range archivePoints {
	if p.Timestamp < minArchiveTs {
		continue
@jesusvazquez (Member, Author):

@pstibrany and @ywwg This solves problem 1 where we just drop all points that are beyond the retention of the archive but still present in the ring buffer.

// may overlap and have older points
sort.Slice(keptPoints, func(i, j int) bool {
	return keptPoints[i].Timestamp < keptPoints[j].Timestamp
})
@jesusvazquez (Member, Author):

With the two additions above, we now only need to sort once.

@jesusvazquez jesusvazquez marked this pull request as ready for review September 26, 2023 10:25
@fionaliao (Contributor) left a comment:

LGTM, but I'm not too familiar with this code so you probably want another approval from Owen/Peter too.

@pstibrany (Member) commented:

As mentioned privately, I don't think it solves the problem that we're facing. With archives and samples like this:

Archive #0:

218	1663891200	2022-09-23T00:00:00Z
217	1663891320	2022-09-23T00:02:00Z
216	1663891440	2022-09-23T00:04:00Z
214	1663891560	2022-09-23T00:06:00Z
213	1663891680	2022-09-23T00:08:00Z

Archive #1:

651	1663891200	2022-09-23T00:00:00Z
427	1663891500	2022-09-23T00:05:00Z

I believe the end result will still be:

218	1663891200	2022-09-23T00:00:00Z
217	1663891320	2022-09-23T00:02:00Z
216	1663891440	2022-09-23T00:04:00Z
427	1663891500	2022-09-23T00:05:00Z *** we don't want this sample ***
214	1663891560	2022-09-23T00:06:00Z
213	1663891680	2022-09-23T00:08:00Z

Using archives:

 "Archives": [
    {
      "Offset": 40,
      "SecondsPerPoint": 60,
      "Points": 1051200 // 60s * 1051200 points = 730 days
    },
    {
      "Offset": 12614440,
      "SecondsPerPoint": 300,
      "Points": 735840 // 300s * 735840 points = 2555 days
    }
  ]

@jesusvazquez (Member, Author) commented Sep 26, 2023

@pstibrany I've added 647f350 together with a unit test where I take the previous archives' max timestamps into account to skip points in subsequent archives. Let's try again with these changes.

@jesusvazquez jesusvazquez force-pushed the jvp/remove-points-outside-of-retention-on-first-pass branch from 07d14e0 to 647f350 Compare September 26, 2023 11:31
I'm trying to achieve something like this

   MaxT     MinT
0: [XXXXXXXXXX] 1d, 1m
1: [           XXXXXXXXXXXXXXXXX] 1w, 10m
2: [                            XXXXXXXXXXXXX]  1y, 60m

So we process archive 0 first and track archive 0's min ts; then, while
processing archive 1, we discard all samples newer than that min ts and
keep the remaining ones.
Comment on lines 71 to 79
for _, p := range archivePoints {
	// We want to track the max timestamp of the archive because we know
	// it virtually represents now() and we won't have newer points.
	// The min timestamp of the archive is then maxTs minus the archive
	// retention.
	if p.Timestamp > maxArchiveTs {
		maxArchiveTs = p.Timestamp
	}
}
Member:

Why do we do this for each archive again and again?

Don't all archives have the same "now" and they only differ in their retention period?

Member:

Ah, you're right. We had seen the "offset" value in the database dump but that is a byte offset, not a timestamp offset.

pkg/graphite/convert/whisperconverter/whisper.go (outdated, resolved)
// If we are already in the second or subsequent archive and we had
// some points in the prior archives, we want to skip samples that are
// already covered by those prior archives.
maxArchiveTs = lastMinTs
Member:

We update maxArchiveTs here, but this variable isn't used afterwards.

lastMinTs = minArchiveTs

for _, p := range archivePoints {
	if p.Timestamp < minArchiveTs {
Member:

Shall we check if minArchiveTs <= p.Timestamp && p.Timestamp <= maxArchiveTs holds here?

If we do that, we don't need to keep seenTs.

Member:

Actually, the check for minArchiveTs should be exclusive, i.e. minArchiveTs < p.Timestamp must hold.

Member:

Because points are written to all archives at once, all archives will contain points from timestamps that are within their retentions, so we do need a check like this. I think we can do it more efficiently if we pre-sort the points and then do a double-index comparison.
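
For illustration, a minimal sketch of such a double-index (two-pointer) comparison, assuming both slices are already sorted by timestamp; the point type and function name are hypothetical, not the converter's code:

package converter // hypothetical package, for illustration only

type point struct {
	Timestamp int64
	Value     float64
}

// mergeSkipDuplicates merges two timestamp-sorted slices in O(n+m),
// preferring the higher-resolution sample whenever both archives have a
// point at the same timestamp. This removes the need for a seenTs map.
func mergeSkipDuplicates(highRes, lowRes []point) []point {
	out := make([]point, 0, len(highRes)+len(lowRes))
	i, j := 0, 0
	for i < len(highRes) && j < len(lowRes) {
		switch {
		case highRes[i].Timestamp < lowRes[j].Timestamp:
			out = append(out, highRes[i])
			i++
		case highRes[i].Timestamp > lowRes[j].Timestamp:
			out = append(out, lowRes[j])
			j++
		default: // same timestamp: keep the higher-resolution sample
			out = append(out, highRes[i])
			i++
			j++
		}
	}
	out = append(out, highRes[i:]...)
	out = append(out, lowRes[j:]...)
	return out
}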

pkg/graphite/convert/whisperconverter/whisper.go (outdated, resolved)
Comment on lines +259 to +260
name: "test retention when first archives are empty",
metricName: "mymetric",
Member:

This seems like an unrealistic scenario to me. If the first archive (with raw data) is empty, how can there be any aggregations at all?

Member:

An end-to-end test broke, so we added this test to cover that edge case.

Member:

👍

@ywwg ywwg requested a review from pstibrany September 27, 2023 16:46
@ywwg (Member) commented Sep 27, 2023

I think the logic checking to see if the point is a duplicate timestamp is redundant -- by definition we are only looking at points for timestamps where that archive is the highest resolution block. We can be much more efficient -- start with the oldest block and just blast through the points that are within the correct bounds

@ywwg (Member) commented Sep 27, 2023

A much faster algo (sketched below) is:

  • determine maxTs by finding the highest ts in the first archive
  • calculate ts bounds for each archive
  • loop through, starting at lowest resolution block.
    • sort its points
    • only add points to Kept that are within the bounds

done
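
A minimal sketch of those steps, under the same caveat that the types and names (point, archive, readSamples) are hypothetical stand-ins for the converter's actual code:

package converter // hypothetical package, for illustration only

import "sort"

type point struct {
	Timestamp int64
	Value     float64
}

type archive struct {
	Retention int64 // SecondsPerPoint * Points, in seconds
	Points    []point
}

// readSamples derives maxTs from the first (highest-resolution) archive,
// computes each archive's bounds, then walks the archives from coarsest
// to finest so the kept points come out already ordered oldest to newest.
func readSamples(archives []archive) []point {
	if len(archives) == 0 {
		return nil
	}
	// Step 1: maxTs is the highest timestamp in the first archive.
	var maxTs int64
	for _, p := range archives[0].Points {
		if p.Timestamp > maxTs {
			maxTs = p.Timestamp
		}
	}
	var kept []point
	// Steps 2 and 3: per-archive bounds, then keep only in-bounds points.
	for i := len(archives) - 1; i >= 0; i-- {
		lower := maxTs - archives[i].Retention // exclusive lower bound
		upper := maxTs
		if i > 0 {
			// Everything newer than this belongs to the finer archive.
			upper = maxTs - archives[i-1].Retention
		}
		pts := append([]point(nil), archives[i].Points...)
		sort.Slice(pts, func(a, b int) bool { return pts[a].Timestamp < pts[b].Timestamp })
		for _, p := range pts {
			if p.Timestamp > lower && p.Timestamp <= upper {
				kept = append(kept, p)
			}
		}
	}
	return kept
}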

@pstibrany (Member) left a comment:

Thanks, this looks good.

I think we have a tiny mistake in the bounds computation (an archive with 10 points, max timestamp 100, and a resolution of 1 should NOT include point 90).
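
(A worked version of that example, not from the thread itself: minTs = maxTs - points x secondsPerPoint = 100 - 10 x 1 = 90, and the ten valid slots are timestamps 91 through 100, so the lower bound must be exclusive and point 90 falls outside retention.)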

pkg/graphite/convert/whisperconverter/whisper.go (outdated, resolved)
continue
// Don't include any points in this archive that were covered in a higher
// resolution archive.
if p.Timestamp >= lastMinTs {
Member:

Similarly:

Suggested change:
- if p.Timestamp >= lastMinTs {
+ if p.Timestamp > lastMinTs {

Member:

Ahh, I was wondering if I'd got that switched. Thanks.


@ywwg (Member) commented Sep 28, 2023

thanks for the bounds-check fix!

@ywwg (Member) commented Sep 28, 2023

Explaining the new tests I had to write: this is a valid archive:

Meta data:
  aggregation method: sum
  max retention: 2012232704

Archive 0 info:
  offset: 52
  seconds per point: 1
  points: 86400
  retention: 86400
  size: 1036800

Archive 1 info:
  offset: 1036852
  seconds per point: 3600
  points: 840
  retention: 3024000
  size: 10080

Archive 2 info:
  offset: 1046932
  seconds per point: 86400
  points: 73000
  retention: 6307200000
  size: 876000

Archive 0 data:
Archive 1 data:
Archive 2 data:
0: 1517443200,          1
1: 1517529600,          1
2: 1517616000,          1

@pstibrany (Member) left a comment:

Code makes sense to me. Thanks.

@ywwg ywwg merged commit b4b782c into main Sep 28, 2023
@ywwg ywwg deleted the jvp/remove-points-outside-of-retention-on-first-pass branch September 28, 2023 19:29