Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: unskip test_demo_partitioning interactive test #106722

Conversation

aadityasondhi
Copy link
Collaborator

Fixes #96797.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aadityasondhi aadityasondhi force-pushed the 20230712.partitioning-interactive-test-unskip branch 2 times, most recently from 63825a5 to 709699e Compare July 12, 2023 20:33
@aadityasondhi aadityasondhi force-pushed the 20230712.partitioning-interactive-test-unskip branch from 709699e to daaeec3 Compare July 12, 2023 21:31
@aadityasondhi aadityasondhi force-pushed the 20230712.partitioning-interactive-test-unskip branch from daaeec3 to 131c1a1 Compare July 12, 2023 22:01
@aadityasondhi aadityasondhi marked this pull request as ready for review July 12, 2023 23:34
@aadityasondhi aadityasondhi requested a review from a team as a code owner July 12, 2023 23:34
@aadityasondhi aadityasondhi requested review from rafiss and tbg July 12, 2023 23:34
@tbg
Copy link
Member

tbg commented Jul 13, 2023

The test was skipped due to this. You can it locally once, as far as I can tell. What indication is there that it just won't be flaky again?

The failure mode here was a DistSender-level error leaking up to the client. This is likely to happen again, as I don't think we've fixed it. So either the test needs to be able to handle this (if this bubbling up is expected behavior) or the bubbling up has to be prevented. The former seems really annoying to implement, and the latter seems conducive to higher quality anyway, so that seems to be the better option. But it's also a real little piece of work that likely won't be done as a drive-by.

In terms of what this kind of test failure means to users, it means that CRDB is harder to script against. I think someone had a similar conversation in KV the other day - if, e.g., commands to relocate a range don't reliably work, they are less useful.

@aadityasondhi
Copy link
Collaborator Author

@tbg I see, I overlooked that when the test passed. Thanks for pointing that out! I propose the following (let me know if you disagree):

  1. We add additional retries to this test to make sure it passes without this error. (band-aid fix to get this test unskipped)
  2. Maybe we can handle the distsender issue as part of this kvclient: route requests via followers when leaseholder is unreachable #93503. The solution there will be rewriting a lot of the logic that causes this error in the first place.

cc @andrewbaptist

@tbg
Copy link
Member

tbg commented Jul 17, 2023

We add additional retries to this test to make sure it passes without this error. (band-aid fix to get this test unskipped)

SGTM.

Maybe we can handle the distsender issue as part of this #93503. The solution there will be rewriting a lot of the logic that causes this error in the first place.

it seems unrelated enough to not address this problem automatically. Even if it did, it would be doing so accidentally. The contract about which errors are returned from the KV API (and when) is not well defined currently.

#49868 is a related issue. I would suggest filing, against T-kv, an issue that asks for a clear contract implemented via a doc comment somewhere and a set of checks (maybe using #106508) that reliably catch and report unexpected errors. #49868 and this issue should then be linked against it. #105330 is another one.

This isn't solving the problem, obviously, but then we have a canonical reference that collects instances in which we see unexpected errors reaching KV clients.

@tbg tbg removed their request for review July 25, 2023 10:00
@rafiss rafiss removed request for a team and rafiss July 25, 2023 14:09
@aadityasondhi aadityasondhi force-pushed the 20230712.partitioning-interactive-test-unskip branch from 131c1a1 to f82a596 Compare July 31, 2023 14:29
Copy link
Contributor

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@aadityasondhi aadityasondhi force-pushed the 20230712.partitioning-interactive-test-unskip branch from f82a596 to c175eb7 Compare August 17, 2023 09:25
@aadityasondhi
Copy link
Collaborator Author

bors r=AlexTalks

@craig
Copy link
Contributor

craig bot commented Aug 17, 2023

Build succeeded:

@craig craig bot merged commit af53e9f into cockroachdb:master Aug 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestDockerCLI/test_demo_partitioning.tcl failed
4 participants