storage: Allow the allocator to rebalance replicas away from an incorrect node/store #13288
cc: @jseldess
Meaning this change will not do so, or it would be incorrect for this code to do so? Please be precise.
Reviewed 3 of 3 files at r1.

pkg/storage/allocator.go, line 299 at r1:
random empty line

pkg/storage/allocator_test.go, line 1642 at r1:
this is really more

pkg/storage/allocator_test.go, line 1705 at r1:
nit: tc.constraint.String()?

pkg/storage/allocator_test.go, line 1706 at r1:
why do you need to specify the capacity?

pkg/storage/allocator_test.go, line 1726 at r1:
do you really need the special case?

pkg/storage/allocator_test.go, line 1874 at r1:
this should not have changed.

pkg/storage/rule_solver.go, line 39 at r1:
why did these change? I think you misunderstood the origin of the value 0.5 here.

I think this is inaccurate. We will transfer the lease when we select the leaseholder for removal.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.
b0015b7 to 21a2577
I didn't observe this when I tested it 3 weeks ago. I'll do some more testing to be sure. It might be some other issue as well.
Review status: 0 of 3 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.

pkg/storage/allocator.go, line 299 at r1:
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/allocator_test.go, line 1642 at r1:
Previously, tamird (Tamir Duberstein) wrote…
cleaned this up

pkg/storage/allocator_test.go, line 1705 at r1:
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/allocator_test.go, line 1706 at r1:
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/allocator_test.go, line 1726 at r1:
Previously, tamird (Tamir Duberstein) wrote…
nope

pkg/storage/allocator_test.go, line 1874 at r1:
Previously, tamird (Tamir Duberstein) wrote…
In this case it should change. When shouldRebalance existed, it would act as a filter from calling rebalanceCandidates. Now that filter is removed and we rely on the rules of rebalanceCandidates alone.

pkg/storage/rule_solver.go, line 39 at r1:
Previously, tamird (Tamir Duberstein) wrote…
you're right, they shouldn't have, reverted.
Reviewed 3 of 3 files at r2.

pkg/storage/allocator_test.go, line 1874 at r1:
Previously, BramGruneir (Bram Gruneir) wrote…
There must be something missing in the commit message, because it is surprising that this changed given how the commit message is written.

pkg/storage/allocator_test.go, line 1708 at r2:
unusual to do this for comparing values via pointer, but i guess it's ok

pkg/storage/allocator_test.go, line 1709 at r2:
what does this look like when one of the arguments is nil?
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.

pkg/storage/allocator_test.go, line 1874 at r1:
Previously, tamird (Tamir Duberstein) wrote…
I see what you mean, I've added some details.

pkg/storage/allocator_test.go, line 1708 at r2:
Previously, tamird (Tamir Duberstein) wrote…
yeah, changed it to look at storeID instead.

pkg/storage/allocator_test.go, line 1709 at r2:
Previously, tamird (Tamir Duberstein) wrote…
Pretty ugly, cleaned this up.
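For readers following the pointer-comparison thread, here is a minimal sketch of comparing two possibly-nil results by store ID rather than by pointer. The `StoreDescriptor` type, the `storeIDOrZero` and `sameStore` helpers, and the use of 0 as a "no store" value are all assumptions made for illustration; this is not the code from allocator_test.go.

```go
package main

import "fmt"

// StoreDescriptor is a hypothetical, simplified stand-in for the real
// descriptor type; only the ID matters for this sketch.
type StoreDescriptor struct {
	StoreID int
}

// storeIDOrZero returns the store ID, or 0 when the pointer is nil.
// Treating 0 as "no store" is an assumption of this sketch.
func storeIDOrZero(s *StoreDescriptor) int {
	if s == nil {
		return 0
	}
	return s.StoreID
}

// sameStore compares two possibly-nil descriptors by ID, so two distinct
// pointers to the same store compare equal and a nil argument is handled
// without a dereference.
func sameStore(a, b *StoreDescriptor) bool {
	return storeIDOrZero(a) == storeIDOrZero(b)
}

func main() {
	expected := &StoreDescriptor{StoreID: 2}
	var result *StoreDescriptor // e.g. no rebalance target was returned
	fmt.Println(sameStore(expected, result))
}
```

Comparing IDs also makes a failure message easier to read, since it can print two integers instead of two pointer values.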
21a2577
to
ff3923e
Compare
Reviewed 1 of 1 files at r3.
…rect node/store
This change allows the allocator to determine when there are replicas on nodes or stores that fail a constraint check. This can happen when a zone config is changed after the replicas are already present. To achieve this, the shouldRebalance function that acts as a filter on calls to rebalanceTarget has been removed and the filtering is performed directly in rebalanceCandidates.
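As a rough illustration of the idea in this commit message, the sketch below flags existing replicas whose store no longer satisfies a set of required attributes, as could happen after a zone config change. The `Store` type and the `meetsConstraints` and `replicasToMove` functions are hypothetical simplifications, not the allocator's actual types or the logic in rebalanceCandidates.

```go
package main

import "fmt"

// Store is a hypothetical, simplified store: an ID plus the set of
// attributes it advertises (for example "ssd").
type Store struct {
	ID    int
	Attrs map[string]bool
}

// meetsConstraints reports whether the store advertises every required
// attribute. A nil Attrs map is treated as advertising nothing.
func meetsConstraints(s Store, required []string) bool {
	for _, r := range required {
		if !s.Attrs[r] {
			return false
		}
	}
	return true
}

// replicasToMove returns the IDs of stores that currently hold a replica
// but fail the constraint check, i.e. candidates to rebalance away from.
func replicasToMove(existing []Store, required []string) []int {
	var out []int
	for _, s := range existing {
		if !meetsConstraints(s, required) {
			out = append(out, s.ID)
		}
	}
	return out
}

func main() {
	existing := []Store{
		{ID: 1, Attrs: map[string]bool{"ssd": true}},
		{ID: 2, Attrs: map[string]bool{"hdd": true}},
		{ID: 3, Attrs: map[string]bool{"ssd": true}},
	}
	// Suppose the zone config was changed to require "ssd" after the
	// replicas were placed; store 2 now fails the check.
	fmt.Println(replicasToMove(existing, []string{"ssd"}))
}
```

In this sketch the constraint check runs on the existing replicas directly, with no separate pre-filter deciding whether to look for candidates at all, which mirrors the removal of the filter described above only loosely.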
ff3923e to 0ecdae5
After testing this in a realistic setting, the replicate queue is continuously full, so progress is extremely slow. I'll revisit replicateQueue's shouldQueue and figure out the issue. But this is not ready for merging yet.
Review status: 1 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.

I've performed a good amount of further testing, and while most of the replicas move correctly, the whole system begins to stall once we hit some threshold where most of the replicas have been removed from a store or two, and there is a large amount of thrashing, with all ranges quickly adding and removing replicas from the same store. I tried re-adding shouldRebalance (both as part of rebalanceCandidates and on its own in front of it), but this didn't fix the problem. I'm going to start adding more logging to get some better insight into what's happening.
Review status: 1 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.

After some more investigation, this thrashing occurs when the replica that needs to be removed during a rebalance is the one holding the lease.
Review status: 1 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.
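The leaseholder case described above can be sketched as a small decision function: if the replica chosen for removal holds the lease, hand the lease to another replica first instead of attempting the removal. The `Action` type, its values, and `nextStep` are hypothetical names for illustration only; this is a sketch of the general approach, not the replicate queue's implementation or the fix that eventually landed.

```go
package main

import "fmt"

// Action is a hypothetical label for the next step a rebalancer takes.
type Action string

const (
	RemoveReplica Action = "remove replica"
	TransferLease Action = "transfer lease first"
	Skip          Action = "skip"
)

// nextStep decides how to handle the replica on store `target`, which has
// been selected for removal. `leaseholder` is the store holding the lease
// and `replicas` lists all stores holding a replica of the range.
// The returned int is the store the action applies to (0 for Skip).
func nextStep(target, leaseholder int, replicas []int) (Action, int) {
	if target != leaseholder {
		// The common case: the replica can be removed directly.
		return RemoveReplica, target
	}
	// The target holds the lease: pick any other replica to receive it.
	for _, id := range replicas {
		if id != target {
			return TransferLease, id
		}
	}
	// No other replica exists to take the lease, so do nothing.
	return Skip, 0
}

func main() {
	action, store := nextStep(2, 2, []int{1, 2, 3})
	fmt.Println(action, store)
}
```

Returning an explicit action, rather than retrying the removal, is one way to avoid a loop in which the same replica is repeatedly chosen and the removal never succeeds.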
Replaced by #14106.
Review status: 1 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.
This change allows the allocator to determine when there are replicas on nodes or stores that fail a constraint check. This can happen when a zone config is changed after the replicas are already present. This change cannot remove replicas if they currently hold the lease, so a bit more work is still needed.
Before merging, I'd like to test this out on a real cluster and via allocsim. I'll post any pertinent results to this issue.