upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky #108539
The failing test is A couple of local runs pass just fine. The log message @nvanbenschoten Can you see if these log messages are overly spammy here, or if we're hitting an unexpected condition?
Got a timeout failure after 200 runs, without any "found no intent, but did not error" log messages. Test log: test.log.gz I'm inclined to throw this over to the test owners for deflaking (SQL Foundations), but I'll let @nvanbenschoten have a look at the QueryIntent errors above first.
This log line was unexpected to me. I didn't think we could hit it. It turns out that this is possible to hit if a key or byte-limited request (e.g. This all works correctly and the I'll remove the log line, update the commentary, and add some test cases for this. Otherwise, this is unrelated to the test failure itself, so I'll move this back to SQL Foundations to debug the test failure that @erikgrinaker found.
See cockroachdb#108539 (comment). This commit adds intentional handling to the txnPipeliner for the case where a response is paginated and not all QueryIntent requests were evaluated. Previously, we handled this case, but we logged a warning and had a comment that said it was unexpected. The commit also adds a test for the case. Epic: None Release note: None
108639: kv: intentionally handle paginated responses in txnPipeliner r=arulajmani a=nvanbenschoten

See #108539 (comment).

This commit adds intentional handling to the txnPipeliner for the case where a response is paginated and not all QueryIntent requests were evaluated. Previously, we handled this case, but we logged a warning and had a comment that said it was unexpected. The commit also adds a test for the case.

Epic: None
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
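For readers unfamiliar with the pipeliner, the handling described in this commit might look roughly like the sketch below. This is not the CockroachDB code: `queryIntentResult` and `pruneInFlight` are hypothetical names, and the sketch assumes that a QueryIntent request skipped by pagination comes back with `FoundIntent` unset. The point is only that an unevaluated request is treated as "not yet proven" rather than as an unexpected condition.

```go
package main

import "fmt"

// queryIntentResult is a hypothetical stand-in for the response to one
// QueryIntent request in a batch. FoundIntent stays false when the request
// was never evaluated because the batch hit its key or byte limit first
// (an assumption of this sketch).
type queryIntentResult struct {
	Key         string
	FoundIntent bool
}

// pruneInFlight removes every write that a response proved to exist from the
// in-flight set and returns how many writes are still being tracked.
func pruneInFlight(inFlight map[string]struct{}, results []queryIntentResult) int {
	for _, r := range results {
		if r.FoundIntent {
			// The intent was observed, so the write no longer needs tracking.
			delete(inFlight, r.Key)
		}
		// Otherwise the request is assumed to have been skipped by
		// pagination. Keep tracking the write; do not warn.
	}
	return len(inFlight)
}

func main() {
	inFlight := map[string]struct{}{"a": {}, "b": {}, "c": {}}
	results := []queryIntentResult{
		{Key: "a", FoundIntent: true},
		{Key: "b"}, // not evaluated: the batch was paginated before this request
		{Key: "c"}, // not evaluated
	}
	fmt.Println("writes still in flight:", pruneInFlight(inFlight, results))
}
```

Writes that stay in the set would simply be queried again by a later batch.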
@rafiss Do we understand the variance in runtime for this test? In an isolated environment, it typically takes 2 minutes, but occasionally much longer (more than 15 minutes). I see we increased the machine size, but it seems like it would be worthwhile to understand why this test suddenly sees its runtime increase by ~10x.
Hm just had another thought that it could be related to metamorphic constants. I noticed in the test log that you shared, some settings like
109010: sql,backfill: deflake TestValidationWithProtectedTS r=rafiss a=rafiss

This test was slow enough to the point of flaking. I tracked down the problem to the metamorphic constant `kv-batch-size`. Now we set a testing knob to make sure that the value is always the production value. This also required a fix to plumb the testing knob through in one more place.

fixes #106960
informs #108539

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Seen here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/11266662?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true
I can't actually find the error that made the test fail, but the first strange symptom that stands out is that this is logged 656,000 times. Hoping the KV team can help interpret what can cause this behavior.
Jira issue: CRDB-30513