Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compact: fix objstore bucket operations check to improve flakiness #6064

Merged
merged 5 commits into from
Mar 30, 2023

Conversation

douglascamata
Copy link
Contributor

@douglascamata douglascamata commented Jan 20, 2023

Adding some sleeps to the test caused this assertion to consistently fail. If it relies on timing of things to be correct, it's intermittent if the goal is to be precise. As removing this assertion might not be a good idea, doing a lower boundary check suffices.

This is aimed to fix CI errors like the one below, copied from https://github.com/thanos-io/thanos/actions/runs/3957791998/jobs/6778588621:

=== CONT  TestCompactWithStoreGateway
    compact_test.go:726: compact_test.go:726:
        
         unexpected error: unable to find metrics [thanos_objstore_bucket_operations_total] with expected values after 50 retries. Last error: <nil>. Last values: [631]

If you add a time.Sleep before this assertion and increase the time it will increase thanos_objstore_bucket_operations_total more and more. The Compactor isn't halted and for some reasons the bucket operations keep going up. I don't know if they stop at some point.

Signed-off-by: Douglas Camata [email protected]

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • Assert on thanos_objstore_bucket_operations_total using GreateOrEqual, because if some time passes the actual value will change to be bigger than what is expected.

Verification

Bunch of local tests run with some time.Sleep throughout the code, which failed before, now passed.

Adding some sleeps to the test caused this assertion to consistently fail. If it relies on timing of things to be correct, it's intermittent.

Signed-off-by: Douglas Camata <[email protected]>
@@ -724,7 +724,7 @@ func testCompactWithStoreGateway(t *testing.T, penaltyDedup bool) {
operationMatcher, err := matchers.NewMatcher(matchers.MatchEqual, "operation", "get")
testutil.Ok(t, err)
testutil.Ok(t, c.WaitSumMetricsWithOptions(
e2emon.Equals(573),
e2emon.GreaterOrEqual(573),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea of these assertions is to make sure the compactor is efficient and does not make more loops/API requests than necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the idea. But this is flaky because the number of operations depend heavily on time (could be a bug, I don't know), which becomes a hassle when processes are fighting for CPU time in CI or even if you run all e2e tests locally (because they will run in parallel).

If I put a sleep on line 725, depending on how much time the sleep takes the assertion will fail because thanos_objstore_bucket_operations_total will have a completely different value. This test has to be fixed and made independent of time, otherwise it will stay flaky. But I do not have a clear picture of how to achieve it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. In that case I am not sure what's the best option here, maybe we can use LessThan(1000) or some larger value to make sure the compactor does not go into infinite loops. If we have to use GreaterThan, we might as well remove the assertion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fpetkovski I implemented a custom matcher to use in the assertion, to be able to assert the value is between 0 and 1000, instead of just checking one of the two bounds. This is not present yet in efficientgo/e2emon, so it lives in the e2ethanos package for now.

I've ran the test a few times locally with this configuration and it was so good that I am removing the skip on the penalty compactor test.

PTAL 🙇

fpetkovski
fpetkovski previously approved these changes Mar 29, 2023
Copy link
Contributor

@fpetkovski fpetkovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

matej-g
matej-g previously approved these changes Mar 29, 2023
Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, if this finally fixes #4866 I'm buying you a beer 🍻. From my understanding, we are not able to assert on exact number due to what you described in the description, so having this as the second best seems reasonable.

Apart from failing lint, this looks good!

@@ -0,0 +1,16 @@
package e2ethanos
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add this directly to the upstream? It probably would be useful to other as well 👍

Copy link
Contributor Author

@douglascamata douglascamata Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add there in parallel, but I personally do not want to be blocked here until it's merged there. Honestly the flaky tests are being a massive waste of time and resources (paid or not) on my PRs (and everybody else's I bet), plus we have the penalty dedup test completely skipped since several months.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, PR opened upstream: efficientgo/e2e#65

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can replace it eventually next time we bump E2E version 👍

Signed-off-by: Douglas Camata <[email protected]>
@douglascamata douglascamata dismissed stale reviews from matej-g and fpetkovski via ad56598 March 29, 2023 10:12
Copy link
Member

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing! This has been a pain point for a long time!

Just one tiny question for the new assert method.


// Between is a MetricValueExpectation function for WaitSumMetrics that returns true if given single sum is between
// the lower and upper bounds (non-inclusive, as in `lower < x < upper`).
func Between(lower, upper float64) e2emon.MetricValueExpectation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so the expectation is that whatever metric is being asserted would freeze at a particular value, between lower and upper bounds?

Copy link
Contributor Author

@douglascamata douglascamata Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expectation is precisely the meaning of the mathematical expression lower < x < upper: x can be anything between lower and upper. It's effectively the same as running Greater(lower) and Less(upper).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants