Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky #108539

Closed
rafiss opened this issue Aug 10, 2023 · 5 comments · Fixed by #108711
Closed

upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky #108539

rafiss opened this issue Aug 10, 2023 · 5 comments · Fixed by #108711
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@rafiss
Copy link
Collaborator

rafiss commented Aug 10, 2023

Seen here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/11266662?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true

I can't actually find the error that made the test fail, but the first strange symptom that stands out is that this is logged 656,000 times. Hoping the KV team can help interpret what can cause this behavior.

W230810 14:10:42.795920 22530 kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:718  [T1,n1,client=127.0.0.1:56466,hostssl,user=root,intExec=read-members] 3178  QueryIntent(ErrorIfMissing=true) found no intent, but did not error; resp=header:<> 

Jira issue: CRDB-30513

@rafiss rafiss added the C-test-failure Broken test (automatically or manually discovered). label Aug 10, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 10, 2023
@erikgrinaker
Copy link
Contributor

The failing test is TestRoleMembersIDMigration1500Users. It timed out after 15 minutes, but did continue to make progress right up until the timeout. There were lots of QueryIntent log messages, as you point out. Here's the log:
test.log.gz

A couple of local runs pass just fine. TestRoleMembersIDMigration1500Users is skipped under stress, and has been seen to fail previously, e.g. #100158. I'm running 1000 local stress runs just to see if I can get it to fail, after 100 I did see a couple of runs take >400s (it normally completes in ~120s), so they're occasionally slow, but seem to pass. None of these hit the "found no intent" error.

The log message found no intent, but did not error was introduced as part of #98183, for 23.2.

@nvanbenschoten Can you see if these log messages are overly spammy here, or if we're hitting an unexpected condition?

@erikgrinaker erikgrinaker added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 11, 2023
@erikgrinaker
Copy link
Contributor

Got a timeout failure after 200 runs, without any "found no intent, but did not error" log messages. Test log: test.log.gz

I'm inclined to throw this over to the test owners for deflaking (SQL Foundations), but I'll let @nvanbenschoten have a look at the QueryIntent errors above first.

@nvanbenschoten
Copy link
Member

The log message found no intent, but did not error was introduced as part of #98183, for 23.2.

@nvanbenschoten Can you see if these log messages are overly spammy here, or if we're hitting an unexpected condition?

This log line was unexpected to me. I didn't think we could hit it. It turns out that this is possible to hit if a key or byte-limited request (e.g. Scan(/Table/23, /Table/24)) is prefixed with a QueryIntent for a key near the end of the table. If there is a range split in the middle of the table, then the DistSender will split up the request on the range boundary. If the first half of the request then hits the limit, we will evaluate only the (first half of the) Scan and never get to evaluate the QueryIntent.

This all works correctly and the txnPipeliner handles the empty QueryIntentResponse properly, but I can't say I had this case in mind when writing that client-side code. The QueryIntent API was designed for it though, so that's good.

I'll remove the log line, update the commentary, and add some test cases for this.

Otherwise, this is unrelated to the test failure itself, so I'll move this back to SQL Foundations to debug the test failure that @erikgrinaker found.

@nvanbenschoten nvanbenschoten added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-kv KV Team labels Aug 11, 2023
@nvanbenschoten nvanbenschoten removed their assignment Aug 11, 2023
@nvanbenschoten nvanbenschoten changed the title pkg/upgrade/upgrades/upgrades_test is flaky [found no intent, but did not error] pkg/upgrade/upgrades/upgrades_test is flaky Aug 11, 2023
@nvanbenschoten nvanbenschoten changed the title pkg/upgrade/upgrades/upgrades_test is flaky upgrade/upgrades: TestRoleMembersIDMigration1500Users is flaky Aug 11, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 11, 2023
See cockroachdb#108539 (comment).

This commit adds intentional handling to the txnPipeliner for the case
where a response is paginated and not all QueryIntent requests were
evaluated. Previously, we handled this case, but we logged a warning and
had a comment that said it was unexpected.

The commit also adds a test for the case.

Epic: None
Release note: None
@erikgrinaker erikgrinaker removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 12, 2023
craig bot pushed a commit that referenced this issue Aug 12, 2023
108639: kv: intentionally handle paginated responses in txnPipeliner r=arulajmani a=nvanbenschoten

See #108539 (comment).

This commit adds intentional handling to the txnPipeliner for the case where a response is paginated and not all QueryIntent requests were evaluated. Previously, we handled this case, but we logged a warning and had a comment that said it was unexpected.

The commit also adds a test for the case.

Epic: None
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in dc2c52d Aug 14, 2023
@erikgrinaker
Copy link
Contributor

@rafiss Do we understand the variance in runtime for this test? In an isolated environment, it typically takes 2 minutes, but occasionally much longer (more than 15 minutes). I see we increased the machine size, but it seems like it would be worthwhile to understand why this test suddenly sees its runtime increase by ~10x.

@rafiss
Copy link
Collaborator Author

rafiss commented Aug 15, 2023

Hm just had another thought that it could be related to metamorphic constants. I noticed in the test log that you shared, some settings like kv-batch-size and datum-row-converter-batch-size were 1, among other similar settings that could affect performance. I'll experiment with those settings.

craig bot pushed a commit that referenced this issue Aug 18, 2023
109010: sql,backfill: deflake TestValidationWithProtectedTS r=rafiss a=rafiss

This test was slow enough to the point of flaking. I tracked down the
problem to the metamorphic constant `kv-batch-size`. Now we set a
testing knob to make sure that the value is always the production value.

This also required a fix to plumb the testing knob through in one more
place.

fixes #106960
informs #108539
Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants