
storage: Allow the allocator to rebalance replicas away from an incorrect node/store #13288

Closed
wants to merge 1 commit

Conversation

BramGruneir
Member

@BramGruneir commented Jan 30, 2017

This change allows the allocator to determine when there are replicas on nodes or stores that fail a constraint check. This can happen when a zone config is changed after the replicas are already present. This change cannot remove replicas if they currently hold the lease so a bit more work is still needed.

Before merging, I'd like to test this out on a real cluster and via allocsim. I'll post any pertinent results to this issue.
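
To make the constraint-check idea concrete, here is a minimal standalone sketch (hypothetical names and a simplified attribute model; the real logic lives in pkg/storage/allocator.go and pkg/storage/rule_solver.go):

	package main

	import "fmt"

	// Constraint and StoreDescriptor are simplified stand-ins for the
	// config.Constraint and roachpb.StoreDescriptor types.
	type Constraint struct{ Key, Value string }

	type StoreDescriptor struct {
		StoreID int
		Attrs   map[string]string // e.g. {"region": "us-east"}
	}

	// violatesConstraints reports whether a store fails any required constraint.
	func violatesConstraints(s StoreDescriptor, required []Constraint) bool {
		for _, c := range required {
			if s.Attrs[c.Key] != c.Value {
				return true
			}
		}
		return false
	}

	// storesToRebalanceFrom returns stores whose replicas violate the zone
	// config, e.g. after the config changed underneath existing replicas.
	func storesToRebalanceFrom(existing []StoreDescriptor, required []Constraint) []int {
		var bad []int
		for _, s := range existing {
			if violatesConstraints(s, required) {
				bad = append(bad, s.StoreID)
			}
		}
		return bad
	}

	func main() {
		stores := []StoreDescriptor{
			{StoreID: 1, Attrs: map[string]string{"region": "us-east"}},
			{StoreID: 2, Attrs: map[string]string{"region": "us-west"}},
		}
		// The zone config now requires us-east; store 2's replica must move.
		required := []Constraint{{Key: "region", Value: "us-east"}}
		fmt.Println(storesToRebalanceFrom(stores, required)) // [2]
	}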


This change is Reviewable

@BramGruneir
Member Author

cc: @jseldess

@tamird
Contributor

tamird commented Jan 30, 2017

This change cannot remove replicas if they currently hold the lease so a bit more work is still needed.

Meaning this change will not do so, or it would be incorrect for this code to do so? Please be precise.


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


pkg/storage/allocator.go, line 299 at r1 (raw file):

	rangeID roachpb.RangeID,
) (*roachpb.StoreDescriptor, error) {

random empty line


pkg/storage/allocator_test.go, line 1642 at r1 (raw file):

		constraint       config.Constraint
		existing         []int
		expectedStoreIDs []int

this is really more permittedStoreIDs, isn't it? also, it seems like this wants to be a *int since it's always one ID or none.
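
A standalone sketch of the suggested shape (hypothetical; config.Constraint is replaced by a local stand-in):

	package main

	import "fmt"

	type Constraint struct{ Key, Value string }

	// testCase follows the suggestion: since at most one store ID is expected,
	// a *int (nil meaning "no rebalance target") fits better than a slice.
	type testCase struct {
		constraint      Constraint
		existing        []int
		expectedStoreID *int
	}

	func intPtr(i int) *int { return &i }

	func main() {
		cases := []testCase{
			{Constraint{"region", "us-east"}, []int{1, 2}, intPtr(2)},
			{Constraint{"region", "us-east"}, []int{1}, nil},
		}
		for _, tc := range cases {
			fmt.Println(tc.existing, tc.expectedStoreID)
		}
	}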


pkg/storage/allocator_test.go, line 1705 at r1 (raw file):

	for _, tc := range testCases {
		t.Run(fmt.Sprintf("%s", tc.constraint), func(t *testing.T) {

nit: tc.constraint.String()?


pkg/storage/allocator_test.go, line 1706 at r1 (raw file):

	for _, tc := range testCases {
		t.Run(fmt.Sprintf("%s", tc.constraint), func(t *testing.T) {
			existingReplicas := make([]roachpb.ReplicaDescriptor, len(tc.existing), len(tc.existing))

why do you need to specify the capacity?
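
The point in standalone form: make with two arguments already sets the capacity equal to the length, so the third argument is redundant here.

	package main

	import "fmt"

	func main() {
		n := 3
		a := make([]int, n, n) // explicit capacity: redundant
		b := make([]int, n)    // cap(b) == len(b) == n already
		fmt.Println(len(a), cap(a), len(b), cap(b)) // 3 3 3 3
	}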


pkg/storage/allocator_test.go, line 1726 at r1 (raw file):

				t.Fatal(err)
			}
			if result == nil {

do you really need the special case?


pkg/storage/allocator_test.go, line 1874 at r1 (raw file):

	// | gen | store 0  | store 1  | store 2  | store 3  | store 4  | store 5  | store 6  | store 7  | store 8  | store 9  | store 10 | store 11 | store 12 | store 13 | store 14 | store 15 | store 16 | store 17 | store 18 | store 19 |
	// +-----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
	// |   0 |   1  48% |   0   0% |   0   0% |   1  51% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |   0   0% |

this should not have changed.


pkg/storage/rule_solver.go, line 39 at r1 (raw file):

func rebalanceFromConvergesOnMean(sl StoreList, candidate roachpb.StoreDescriptor) bool {
	return float64(candidate.Capacity.RangeCount) > sl.candidateCount.mean+rebalanceThreshold

why did these change? I think you misunderstood the origin of the value 0.5 here.
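
The thread never spells out the origin of the 0.5, but one plausible reading (an assumption, not confirmed by the discussion) is that it pads a fractional mean so that the store which unavoidably holds ⌈mean⌉ ranges is not flagged as over-full, which would otherwise invite thrashing:

	package main

	import "fmt"

	// overfull is an illustrative stand-in; the rationale for the 0.5 pad is
	// an assumption here, not taken from the actual CockroachDB code.
	func overfull(rangeCount int, mean, pad float64) bool {
		return float64(rangeCount) > mean+pad
	}

	func main() {
		mean := 7.0 / 2.0 // 7 ranges across 2 stores: one store must hold 4
		fmt.Println(overfull(4, mean, 0))   // true: moving a range just shifts the problem
		fmt.Println(overfull(4, mean, 0.5)) // false: within rounding of the mean
	}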


Comments from Reviewable

@petermattis
Collaborator

:lgtm:

This change cannot remove replicas if they currently hold the lease so a bit more work is still needed.

I think this is inaccurate. We will transfer the lease when we select the leaseholder for removal.
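
A sketch of the behavior being described (hypothetical names; the real logic lives in the replicate queue): when the replica selected for removal holds the lease, the lease is transferred first.

	package main

	import (
		"errors"
		"fmt"
	)

	// replica is a simplified stand-in for a range replica.
	type replica struct {
		storeID    int
		holdsLease bool
	}

	// removeReplica removes the replica on the target store, transferring the
	// lease to another replica first if the target is the leaseholder.
	func removeReplica(replicas []replica, target int) ([]replica, error) {
		for i, r := range replicas {
			if r.storeID != target {
				continue
			}
			if r.holdsLease {
				if len(replicas) == 1 {
					return nil, errors.New("no other replica to take the lease")
				}
				dst := (i + 1) % len(replicas)
				replicas[dst].holdsLease = true
				replicas[i].holdsLease = false
			}
			return append(replicas[:i], replicas[i+1:]...), nil
		}
		return nil, fmt.Errorf("store %d has no replica", target)
	}

	func main() {
		rs := []replica{{1, true}, {2, false}, {3, false}}
		rs, _ = removeReplica(rs, 1)
		fmt.Println(rs) // [{2 true} {3 false}]: lease moved before removal
	}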


Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


Comments from Reviewable

@BramGruneir
Member Author

We will transfer the lease when we select the leaseholder for removal.

I didn't observe this when I tested it 3 weeks ago. I'll do some more testing to be sure. It might be some other issue as well.


Review status: 0 of 3 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


pkg/storage/allocator.go, line 299 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

random empty line

Done.


pkg/storage/allocator_test.go, line 1642 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is really more permittedStoreIDs, isn't it? also, it seems like this wants to be a *int since it's always one ID or none.

cleaned this up


pkg/storage/allocator_test.go, line 1705 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: tc.constraint.String()?

Done.


pkg/storage/allocator_test.go, line 1706 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why do you need to specify the capacity?

Done.


pkg/storage/allocator_test.go, line 1726 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

do you really need the special case?

nope


pkg/storage/allocator_test.go, line 1874 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this should not have changed.

In this case it should change. When shouldRebalance existed, it would act as a filter from calling rebalanceCandidates. Now that filter is removed and we rely on the rules of rebalanceCandidates alone.
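
In standalone form, the control-flow change being described (stand-in types and thresholds; only the function names mirror the thread):

	package main

	import "fmt"

	type store struct {
		id         int
		rangeCount int
	}

	// shouldRebalance was a coarse pre-filter; stores failing it never
	// reached the candidate rules at all.
	func shouldRebalance(s store) bool { return s.rangeCount > 10 }

	// rebalanceCandidates stands in for the rule-based selection, which now
	// also rejects stores that fail constraint checks.
	func rebalanceCandidates(s store) *store {
		if s.rangeCount > 5 {
			return &store{id: s.id + 1}
		}
		return nil
	}

	// Before: the pre-filter gated the rules.
	func rebalanceTargetBefore(s store) *store {
		if !shouldRebalance(s) {
			return nil
		}
		return rebalanceCandidates(s)
	}

	// After: the rules are the only filter, so some stores that the
	// pre-filter used to skip now produce moves.
	func rebalanceTargetAfter(s store) *store {
		return rebalanceCandidates(s)
	}

	func main() {
		s := store{id: 1, rangeCount: 7}
		fmt.Println(rebalanceTargetBefore(s), rebalanceTargetAfter(s)) // <nil> &{2 0}
	}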


pkg/storage/rule_solver.go, line 39 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why did these change? I think you misunderstood the origin of the value 0.5 here.

you're right, they shouldn't have, reverted.


Comments from Reviewable

@tamird
Contributor

tamird commented Feb 6, 2017

Reviewed 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/storage/allocator_test.go, line 1874 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

In this case it should change. When shouldRebalance existed, it would act as a filter from calling rebalanceCandidates. Now that filter is removed and we rely on the rules of rebalanceCandidates alone.

There must be something missing in the commit message, because it is surprising that this changed given how the commit message is written.


pkg/storage/allocator_test.go, line 1708 at r2 (raw file):

				t.Fatal(err)
			}
			if !reflect.DeepEqual(tc.expected, actual) {

unusual to do this for comparing values via pointer, but i guess it's ok


pkg/storage/allocator_test.go, line 1709 at r2 (raw file):

			}
			if !reflect.DeepEqual(tc.expected, actual) {
				t.Errorf("rebalancing to the incorrect store, expected %s, got %s", tc.expected, actual)

what does this look like when one of the arguments is nil?


Comments from Reviewable

@BramGruneir
Member Author

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/storage/allocator_test.go, line 1874 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

There must be something missing in the commit message, because it is surprising that this change given how the commit message is written.

I see what you mean, I've added some details.


pkg/storage/allocator_test.go, line 1708 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

unusual to do this for comparing values via pointer, but i guess it's ok

yeah, changed it to look at storeID instead.
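
In standalone form (stand-in type): reflect.DeepEqual does follow pointers, but comparing the StoreID fields states directly what the test cares about and handles the nil case explicitly.

	package main

	import (
		"fmt"
		"reflect"
	)

	type StoreDescriptor struct{ StoreID int }

	// storeIDsEqual compares by StoreID, with nil meaning "no target".
	func storeIDsEqual(expected, actual *StoreDescriptor) bool {
		if expected == nil || actual == nil {
			return expected == nil && actual == nil
		}
		return expected.StoreID == actual.StoreID
	}

	func main() {
		a, b := &StoreDescriptor{StoreID: 3}, &StoreDescriptor{StoreID: 3}
		fmt.Println(reflect.DeepEqual(a, b)) // true: DeepEqual dereferences pointers
		fmt.Println(storeIDsEqual(a, b))     // true: and says why they are equal
	}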


pkg/storage/allocator_test.go, line 1709 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

what does this look like when one of the arguments is nil?

Pretty ugly, cleaned this up.
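
For illustration (hypothetical helper, not from the actual test): %s on a nil pointer without a String method prints verb-error noise, so a small formatter keeps the failure message readable.

	package main

	import "fmt"

	type StoreDescriptor struct{ StoreID int }

	// describe renders a result for failure messages, keeping nil readable.
	func describe(s *StoreDescriptor) string {
		if s == nil {
			return "nil (no target)"
		}
		return fmt.Sprintf("store %d", s.StoreID)
	}

	func main() {
		var missing *StoreDescriptor
		fmt.Printf("%s\n", missing) // %!s(*main.StoreDescriptor=<nil>)
		fmt.Printf("expected %s, got %s\n", describe(&StoreDescriptor{3}), describe(missing))
	}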


Comments from Reviewable

@tamird
Contributor

tamird commented Feb 8, 2017

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

storage: Allow the allocator to rebalance replicas away from an incorrect node/store

This change allows the allocator to determine when there are replicas on nodes or stores that fail a constraint check. This can happen when a zone config is changed after the replicas are already present.
To achieve this, the shouldRebalance function that acts as a filter on calls to rebalanceTarget has been removed and the filtering is performed directly in rebalanceCandidates.
@BramGruneir
Member Author

After testing this in a realistic setting, I found that the replicate queue is continuously full, so progress is extremely slow. I'll revisit replicateQueue's shouldQueue and figure out the issue. But this is not ready for merging yet.


Review status: 1 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@BramGruneir
Member Author

I've performed a good amount of further testing. While most of the replicas move correctly, the whole system begins to stall once we hit some threshold where most of the replicas have been removed from a store or two; at that point there is a large amount of thrashing, with all ranges quickly adding and removing replicas from the same store.

I tried re-adding shouldRebalance (both as part of rebalanceCandidates and on its own in front of it), but this didn't fix the problem. I'm going to start adding more logging to get some better insight into what's happening.


Review status: 1 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@BramGruneir
Member Author

After some more investigation, this thrashing occurs when the replica that needs to be removed during a rebalance is the one holding the lease.
See #13973 for more details.


Review status: 1 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@BramGruneir
Member Author

Replaced by #14106.


Review status: 1 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable
