-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
logictestccl: Fix regional_by_row_query_behavior flake #87391
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing!
How did you determine the sleep time?
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @msirek, and @yuzefovich)
pkg/ccl/logictestccl/testdata/logic_test/regional_by_row_query_behavior
line 2463 at r1 (raw file):
SET vectorize = "on" statement ok nodeidx=0
why was this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw the following which appears to address similar issues:
cockroach/pkg/ccl/logictestccl/testdata/logic_test/regional_by_row_query_behavior
Lines 6 to 12 in 27b795d
# Set the closed timestamp interval to be short to shorten the amount of time | |
# we need to wait for the system config to propagate. | |
statement ok | |
SET CLUSTER SETTING kv.closed_timestamp.side_transport_interval = '10ms'; | |
statement ok | |
SET CLUSTER SETTING kv.closed_timestamp.target_duration = '10ms'; |
It's using a
10ms
time, so I tried running this as a stress test with 1 second sleep, which passed. I ran into a similar issue recently, which required adding a 5s
sleep for stress tests to pass, but that test was lacking the above settings.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)
pkg/ccl/logictestccl/testdata/logic_test/regional_by_row_query_behavior
line 2463 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
why was this removed?
In case the other node we're logging into has some additional latency in receiving the config. This part is not essential to the test. Node 0
might be the default anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried running stress tests for an extended period, and did hit the error with a 1s
sleep. Retesting with 2s
...
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead use retry
option on a new query that checks that the zone configs are as expected? It might be worth pinging @irfansharif to see whether he has suggestions for the best way to wait until the expected state of system config is reached.
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. Upon further debugging it appears the problem actually stems from logic tests not honoring the !metamorphic
test directive, which I'm currently investigating.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead use retry option on a new query that checks that the zone configs are as expected?
I believe the zone configs are cached, so if we do the lookup before they're available we'll cache a config with no lease preferences. We might need a test directive which retries and purges cached configs.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the fix and description to address both the unavailable zone configs problem and changing query plans in metamorphic mode.
RFAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @msirek, and @rytaft)
-- commits
line 12 at r2:
My instinct is to extend the existing metamorphic
directive of forcing the production batch sizes to also force the usage of the LOS if it is the only metamorphic variable that is causing the flakes. The rationale for this is that we don't lose the test coverage whereas if we skip the test under metamorphic build, we'll be skipping it 80% of the time (since each build is metamorphic with 0.8 probability in tests), so it's more likely that the flakes will be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
My instinct is to extend the existing
metamorphic
directive of forcing the production batch sizes to also force the usage of the LOS if it is the only metamorphic variable that is causing the flakes. The rationale for this is that we don't lose the test coverage whereas if we skip the test under metamorphic build, we'll be skipping it 80% of the time (since each build is metamorphic with 0.8 probability in tests), so it's more likely that the flakes will be merged.
OK, it sounds like you're not in favor of any directive which would allow skipping of a logic test in metamorphic mode. To force the usage of LOS we'd have to reduce the costs of it in the optimizer, so maybe a new session flag could control that or maybe a new flag in eval.TestingKnobs
. The current metamorphic
directive is confusingly-named. Recent changes to the logic tests, in function processConfigs
added a comment saying that flag checks "whether the test works only in non-metamorphic setting", and I too thought at first the directive avoids use of any metamorphic settings or skips the test in metamorphic mode. I changed the name to metamorphic-batch-sizes
. Maybe I need to change the name again if we're going to make this set a "force LOS" setting. Or, maybe a new directive should be created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)
Previously, msirek (Mark Sirek) wrote…
OK, it sounds like you're not in favor of any directive which would allow skipping of a logic test in metamorphic mode. To force the usage of LOS we'd have to reduce the costs of it in the optimizer, so maybe a new session flag could control that or maybe a new flag in
eval.TestingKnobs
. The currentmetamorphic
directive is confusingly-named. Recent changes to the logic tests, in functionprocessConfigs
added a comment saying that flag checks "whether the test works only in non-metamorphic setting", and I too thought at first the directive avoids use of any metamorphic settings or skips the test in metamorphic mode. I changed the name tometamorphic-batch-sizes
. Maybe I need to change the name again if we're going to make this set a "force LOS" setting. Or, maybe a new directive should be created.
OK, I thought it was related to LOS, but it turns out it was really the coldata-batch-size
setting. I've pushed an update.
6c70033
to
59703b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 11 files at r2, 9 of 9 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @msirek, and @rytaft)
-- commits
line 21 at r4:
It might be good to get a quick LGTM on this from @irfansharif.
pkg/config/system.go
line 413 at r3 (raw file):
s.mu.RLock() if len(s.mu.zoneCache) != 0 { s.mu.zoneCache = map[ObjectID]zoneEntry{}
nit: not sure how important this is, but we can optimize the allocations a bit by clearing the old map and reusing it with
for k := range m {
delete(m, k)
}
In other parts of the codebase I've seen comments that this pattern translates to very efficient runtime.mapclear
.
Now that I've typed it out, the optimization is probably not worth it here, so feel free to ignore the comment - I decided to keep this in just as an FYI.
pkg/config/system.go
line 418 at r3 (raw file):
s.mu.shouldSplitCache = map[ObjectID]bool{} } s.mu.RUnlock()
nit: we tend to unlock in a defer
in cases like this.
pkg/sql/logictest/logic.go
line 3929 at r3 (raw file):
if serverArgsCopy.ForceProductionValues { if err := coldata.SetBatchSizeForTests(coldata.DefaultColdataBatchSize); err != nil { panic(errors.Wrapf(err, "Could not set batch size for test."))
nit: I think somewhere in our style guide we say to not use capital letters nor periods for the error messages (unless they contain multiple sentences).
Fixes cockroachdb#86284 This fixes a test flake in regional_by_row_query_behavior which is caused by 2 issues: 1. Zone configs not becoming available in `SystemConfig` between table creation time and query execution time. 2. Metamorphic testing modify the `coldata-batch-size` setting, causing a different number of kv reads to be done, which affects the kv trace results in this test. The flake can be reproduced via the command: ``` ./dev testlogic ccl --files=regional_by_row_query_behavior \ --config=multiregion-9node-3region-3azs --stress ``` For point 1, the `retry` logic test directive is enhanced to purge the zone config cache in the SystemConfig when a failing test is retried, allowing the updated zone config to be read. For point 2, the logic test blocking flag `!metamorphic` which resets batch size related settings that were modified in metamorphic mode back to their default production values, is updated to also cover the `coldata-batch-size` setting. The flag is renamed to `!metamorphic-batch-sizes` to better reflect its actual effect. Release justification: low risk fix for test flake Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @michae2, @rytaft, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
It might be good to get a quick LGTM on this from @irfansharif.
Thanks, added him as a reviewer.
pkg/config/system.go
line 413 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: not sure how important this is, but we can optimize the allocations a bit by clearing the old map and reusing it with
for k := range m { delete(m, k) }
In other parts of the codebase I've seen comments that this pattern translates to very efficient
runtime.mapclear
.Now that I've typed it out, the optimization is probably not worth it here, so feel free to ignore the comment - I decided to keep this in just as an FYI.
OK, thanks. I'll leave as-is then. Thanks for the tip.
pkg/config/system.go
line 418 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: we tend to unlock in a
defer
in cases like this.
Modified as suggested
Code quote:
PurgeZoneConfigCache() {
pkg/sql/logictest/logic.go
line 3929 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I think somewhere in our style guide we say to not use capital letters nor periods for the error messages (unless they contain multiple sentences).
fixed
Code quote:
panic(errors.Wrapf(err, "Could not set batch size for test."))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @michae2, @rytaft, and @yuzefovich)
Build succeeded: |
Fixes #86284
This fixes a test flake in regional_by_row_query_behavior which is
caused by 2 issues:
Zone configs not becoming available in
SystemConfig
betweentable creation time and query execution time.
Metamorphic testing modify the
coldata-batch-size
setting,causing a different number of kv reads to be done, which affects
the kv trace results in this test.
The flake can be reproduced via the command:
For point 1, the
retry
logic test directive is enhanced to purge thezone config cache in the SystemConfig when a failing test is retried,
allowing the updated zone config to be read.
For point 2, the logic test blocking flag
!metamorphic
which resetsbatch size related settings that were modified in metamorphic mode back
to their default production values, is updated to also cover the
coldata-batch-size
setting. The flag is renamed to!metamorphic-batch-sizes
to better reflect its actual effect.Release justification: low risk fix for test flake
Release note: None