-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compact: fix objstore bucket operations check to improve flakiness #6064
Conversation
Adding some sleeps to the test caused this assertion to consistently fail. If it relies on timing of things to be correct, it's intermittent. Signed-off-by: Douglas Camata <[email protected]>
test/e2e/compact_test.go
Outdated
@@ -724,7 +724,7 @@ func testCompactWithStoreGateway(t *testing.T, penaltyDedup bool) { | |||
operationMatcher, err := matchers.NewMatcher(matchers.MatchEqual, "operation", "get") | |||
testutil.Ok(t, err) | |||
testutil.Ok(t, c.WaitSumMetricsWithOptions( | |||
e2emon.Equals(573), | |||
e2emon.GreaterOrEqual(573), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea of these assertions is to make sure the compactor is efficient and does not make more loops/API requests than necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the idea. But this is flaky because the number of operations depend heavily on time (could be a bug, I don't know), which becomes a hassle when processes are fighting for CPU time in CI or even if you run all e2e tests locally (because they will run in parallel).
If I put a sleep on line 725, depending on how much time the sleep takes the assertion will fail because thanos_objstore_bucket_operations_total
will have a completely different value. This test has to be fixed and made independent of time, otherwise it will stay flaky. But I do not have a clear picture of how to achieve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. In that case I am not sure what's the best option here, maybe we can use LessThan(1000)
or some larger value to make sure the compactor does not go into infinite loops. If we have to use GreaterThan
, we might as well remove the assertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fpetkovski I implemented a custom matcher to use in the assertion, to be able to assert the value is between 0 and 1000, instead of just checking one of the two bounds. This is not present yet in efficientgo/e2emon
, so it lives in the e2ethanos
package for now.
I've ran the test a few times locally with this configuration and it was so good that I am removing the skip on the penalty compactor test.
PTAL 🙇
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
…ctor-test Signed-off-by: Douglas Camata <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, if this finally fixes #4866 I'm buying you a beer 🍻. From my understanding, we are not able to assert on exact number due to what you described in the description, so having this as the second best seems reasonable.
Apart from failing lint, this looks good!
@@ -0,0 +1,16 @@ | |||
package e2ethanos |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add this directly to the upstream? It probably would be useful to other as well 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add there in parallel, but I personally do not want to be blocked here until it's merged there. Honestly the flaky tests are being a massive waste of time and resources (paid or not) on my PRs (and everybody else's I bet), plus we have the penalty dedup test completely skipped since several months.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, PR opened upstream: efficientgo/e2e#65
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can replace it eventually next time we bump E2E version 👍
Signed-off-by: Douglas Camata <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing! This has been a pain point for a long time!
Just one tiny question for the new assert method.
|
||
// Between is a MetricValueExpectation function for WaitSumMetrics that returns true if given single sum is between | ||
// the lower and upper bounds (non-inclusive, as in `lower < x < upper`). | ||
func Between(lower, upper float64) e2emon.MetricValueExpectation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so the expectation is that whatever metric is being asserted would freeze at a particular value, between lower and upper bounds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expectation is precisely the meaning of the mathematical expression lower < x < upper
: x
can be anything between lower
and upper
. It's effectively the same as running Greater(lower)
and Less(upper)
.
Adding some sleeps to the test caused this assertion to consistently fail. If it relies on timing of things to be correct, it's intermittent if the goal is to be precise. As removing this assertion might not be a good idea, doing a lower boundary check suffices.
This is aimed to fix CI errors like the one below, copied from https://github.com/thanos-io/thanos/actions/runs/3957791998/jobs/6778588621:
If you add a
time.Sleep
before this assertion and increase the time it will increasethanos_objstore_bucket_operations_total
more and more. The Compactor isn't halted and for some reasons the bucket operations keep going up. I don't know if they stop at some point.Signed-off-by: Douglas Camata [email protected]
Changes
thanos_objstore_bucket_operations_total
usingGreateOrEqual
, because if some time passes the actual value will change to be bigger than what is expected.Verification
Bunch of local tests run with some
time.Sleep
throughout the code, which failed before, now passed.