-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: check that a snapshot is sane before sending it #56596
kvserver: check that a snapshot is sane before sending it #56596
Conversation
This test was performing a configuration change by talking directly to one of the range's replicas. Instead, this change switches it to use an Admin request which gets routed to the leaseholder. It's better if the leaseholder coordinates the configuration change; a non-leaseholder can error out because it might try to send snapshot before it applied the preceding descriptor change locally (see cockroachdb#56596). The test seems to currently work realiably, but, without this patch, becomes flaky with the next commit which restricts who can take the lease for a range with no lease. Release note: None
This test was performing a configuration change by talking directly to one of the range's replicas. Instead, this change switches it to use an Admin request which gets routed to the leaseholder. It's better if the leaseholder coordinates the configuration change; a non-leaseholder can error out because it might try to send snapshot before it applied the preceding descriptor change locally (see cockroachdb#56596). The test seems to currently work realiably, but, without this patch, becomes flaky with the next commit which restricts who can take the lease for a range with no lease. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1348 at r2 (raw file):
// replicas, and those snapshots would be invalid if our stale descriptor // doesn't contain the respective replicas. for {
Can you limit the waiting here to up to, say, 10s? Artificially holding up a replication change can back up the replication queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)
The recipient has the check that a snapshot includes it as a replica, and rejects the snapshot with an assertion error if it doesn't. This patch adds a less-panicked check on the sender side, as it turns out that the situation is possible. Release note: None
490aaac
to
8ec911c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1348 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you limit the waiting here to up to, say, 10s? Artificially holding up a replication change can back up the replication queue.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @andreimatei, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1360 at r4 (raw file):
// TODO(andrei): Find a better way to wait for replication. If we knew the // LAI of the respective command, we could use waitForApplication(). time.Sleep(time.Second)
Can you instead use a retry thing with a max wait. That way it'll pay attention to context cancellation which very much does happen.
8ec911c
to
2caacbd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1360 at r4 (raw file):
Previously, ajwerner wrote…
// TODO(andrei): Find a better way to wait for replication. If we knew the // LAI of the respective command, we could use waitForApplication(). time.Sleep(time.Second)
Can you instead use a retry thing with a max wait. That way it'll pay attention to context cancellation which very much does happen.
Gracias
This patch handles an edge case in config changes: 1. Replica.ChangeReplicas() is called on r1 to add a replica to the range 2. ChangeReplicas runs a transaction that modifies the range descriptor 3. Say that the replica doing this loses the lease. Or it never had it to begin with - although at least most of the time ChangeReplicas is called on a leaseholder. 4. ChangeReplicas attempts to send a snapshot to the newly-added replica (which has been added as a learner, but that doesn’t matter here) 5. If r1 has not applied the descriptor change yet, then the snapshot it produces is invalid because it doesn't contain the recipient in the descriptor (the desc inside the snapshot). We'll then attempt to remove the new replica(s) and fail the whole ChangeReplicas. This patch makes snapshot generation wait until the sender replicas has caught up with the descriptor that is has previously written. At that point, it can send the snapshot just fine even if it isn't the leaseholder. I think this is generally a sane thing to do, but what prompted it is a test that was directly calling r.ChangeReplicas() on a replica that's not necessarily the leaseholder. This test becomes flaky with cockroachdb#55148 because the replica cannot always just take the lease as it did before that change: with cockroachdb#55148 it now cannot take the lease if it's not the leader, and so it ends up executing the ChangeReplicas() without holding a lease. That was triggering the scenario described above. I'm separately improving that one test to direct its request at the leaseholder explicitly, but I'm afraid there might be others. Release note: None
2caacbd
to
d84890f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1360 at r4 (raw file):
Previously, ajwerner wrote…
// TODO(andrei): Find a better way to wait for replication. If we knew the // LAI of the respective command, we could use waitForApplication(). time.Sleep(time.Second)
Can you instead use a retry thing with a max wait. That way it'll pay attention to context cancellation which very much does happen.
done, good idea
bors r+ |
Build succeeded: |
This test was performing a configuration change by talking directly to one of the range's replicas. Instead, this change switches it to use an Admin request which gets routed to the leaseholder. It's better if the leaseholder coordinates the configuration change; a non-leaseholder can error out because it might try to send snapshot before it applied the preceding descriptor change locally (see cockroachdb#56596). The test seems to currently work realiably, but, without this patch, becomes flaky with the next commit which restricts who can take the lease for a range with no lease. Release note: None
The recipient has the check that a snapshot includes it as a replica,
and rejects the snapshot with an assertion error if it doesn't. This
patch adds a less-panicked check on the sender side, as it turns out
that the situation is possible.
Release note: None