upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky #108539

rafiss · 2023-08-10T18:00:02Z

Seen here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/11266662?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true

I can't actually find the error that made the test fail, but the first strange symptom that stands out is that this is logged 656,000 times. Hoping the KV team can help interpret what can cause this behavior.

W230810 14:10:42.795920 22530 kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:718  [T1,n1,client=127.0.0.1:56466,hostssl,user=root,intExec=read-members] 3178  QueryIntent(ErrorIfMissing=true) found no intent, but did not error; resp=header:<>

Jira issue: CRDB-30513

The text was updated successfully, but these errors were encountered:

erikgrinaker · 2023-08-11T14:16:03Z

The failing test is TestRoleMembersIDMigration1500Users. It timed out after 15 minutes, but did continue to make progress right up until the timeout. There were lots of QueryIntent log messages, as you point out. Here's the log:
test.log.gz

A couple of local runs pass just fine. TestRoleMembersIDMigration1500Users is skipped under stress, and has been seen to fail previously, e.g. #100158. I'm running 1000 local stress runs just to see if I can get it to fail, after 100 I did see a couple of runs take >400s (it normally completes in ~120s), so they're occasionally slow, but seem to pass. None of these hit the "found no intent" error.

The log message found no intent, but did not error was introduced as part of #98183, for 23.2.

@nvanbenschoten Can you see if these log messages are overly spammy here, or if we're hitting an unexpected condition?

erikgrinaker · 2023-08-11T14:38:42Z

Got a timeout failure after 200 runs, without any "found no intent, but did not error" log messages. Test log: test.log.gz

I'm inclined to throw this over to the test owners for deflaking (SQL Foundations), but I'll let @nvanbenschoten have a look at the QueryIntent errors above first.

nvanbenschoten · 2023-08-11T20:11:37Z

The log message found no intent, but did not error was introduced as part of #98183, for 23.2.

@nvanbenschoten Can you see if these log messages are overly spammy here, or if we're hitting an unexpected condition?

This log line was unexpected to me. I didn't think we could hit it. It turns out that this is possible to hit if a key or byte-limited request (e.g. Scan(/Table/23, /Table/24)) is prefixed with a QueryIntent for a key near the end of the table. If there is a range split in the middle of the table, then the DistSender will split up the request on the range boundary. If the first half of the request then hits the limit, we will evaluate only the (first half of the) Scan and never get to evaluate the QueryIntent.

This all works correctly and the txnPipeliner handles the empty QueryIntentResponse properly, but I can't say I had this case in mind when writing that client-side code. The QueryIntent API was designed for it though, so that's good.

I'll remove the log line, update the commentary, and add some test cases for this.

Otherwise, this is unrelated to the test failure itself, so I'll move this back to SQL Foundations to debug the test failure that @erikgrinaker found.

See cockroachdb#108539 (comment). This commit adds intentional handling to the txnPipeliner for the case where a response is paginated and not all QueryIntent requests were evaluated. Previously, we handled this case, but we logged a warning and had a comment that said it was unexpected. The commit also adds a test for the case. Epic: None Release note: None

108639: kv: intentionally handle paginated responses in txnPipeliner r=arulajmani a=nvanbenschoten See #108539 (comment). This commit adds intentional handling to the txnPipeliner for the case where a response is paginated and not all QueryIntent requests were evaluated. Previously, we handled this case, but we logged a warning and had a comment that said it was unexpected. The commit also adds a test for the case. Epic: None Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>

erikgrinaker · 2023-08-14T17:15:36Z

@rafiss Do we understand the variance in runtime for this test? In an isolated environment, it typically takes 2 minutes, but occasionally much longer (more than 15 minutes). I see we increased the machine size, but it seems like it would be worthwhile to understand why this test suddenly sees its runtime increase by ~10x.

rafiss · 2023-08-15T04:10:02Z

Hm just had another thought that it could be related to metamorphic constants. I noticed in the test log that you shared, some settings like kv-batch-size and datum-row-converter-batch-size were 1, among other similar settings that could affect performance. I'll experiment with those settings.

109010: sql,backfill: deflake TestValidationWithProtectedTS r=rafiss a=rafiss This test was slow enough to the point of flaking. I tracked down the problem to the metamorphic constant `kv-batch-size`. Now we set a testing knob to make sure that the value is always the production value. This also required a fix to plumb the testing knob through in one more place. fixes #106960 informs #108539 Release note: None Co-authored-by: Rafi Shamim <[email protected]>

rafiss added the C-test-failure Broken test (automatically or manually discovered). label Aug 10, 2023

blathers-crl bot added the T-kv KV Team label Aug 10, 2023

erikgrinaker assigned nvanbenschoten Aug 11, 2023

erikgrinaker added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 11, 2023

nvanbenschoten added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-kv KV Team labels Aug 11, 2023

nvanbenschoten removed their assignment Aug 11, 2023

nvanbenschoten changed the title ~~pkg/upgrade/upgrades/upgrades_test is flaky [found no intent, but did not error]~~ pkg/upgrade/upgrades/upgrades_test is flaky Aug 11, 2023

nvanbenschoten changed the title ~~pkg/upgrade/upgrades/upgrades_test is flaky~~ upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky Aug 11, 2023

nvanbenschoten mentioned this issue Aug 11, 2023

kv: intentionally handle paginated responses in txnPipeliner #108639

Merged

erikgrinaker removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 12, 2023

rafiss mentioned this issue Aug 14, 2023

upgrades: deflake TestRoleMembersIDMigration1500Users #108711

Merged

exalate-issue-sync bot assigned rafiss Aug 14, 2023

craig bot closed this as completed in dc2c52d Aug 14, 2023

stevendanna mentioned this issue Aug 15, 2023

backupccl: TestBackupRestoreAppend can hang under race on master #108800

Closed

rafiss mentioned this issue Aug 18, 2023

sql,backfill: deflake TestValidationWithProtectedTS #109010

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky #108539

upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky #108539

rafiss commented Aug 10, 2023 •

edited by cockroach-jira-scripts

Loading

erikgrinaker commented Aug 11, 2023

erikgrinaker commented Aug 11, 2023

nvanbenschoten commented Aug 11, 2023

erikgrinaker commented Aug 14, 2023

rafiss commented Aug 15, 2023

upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky #108539

upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky #108539

Comments

rafiss commented Aug 10, 2023 • edited by cockroach-jira-scripts Loading

erikgrinaker commented Aug 11, 2023

erikgrinaker commented Aug 11, 2023

nvanbenschoten commented Aug 11, 2023

erikgrinaker commented Aug 14, 2023

rafiss commented Aug 15, 2023

rafiss commented Aug 10, 2023 •

edited by cockroach-jira-scripts

Loading