-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: update the tscache appropriately after a merge #28793
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/store.go, line 2473 at r1 (raw file):
// difficult. s.tsCache.SetLowWater(rightDesc.StartKey.AsRawKey(), rightDesc.EndKey.AsRawKey(), s.Clock().Now()) }
I'm kind of unsure about all of this. Leases are finicky; is looking at r.mu.state.Lease
like I do here correct? Is my statement about the chain of causality correct? And (as mentioned in the TODO) would it be better to thread the RHS's timestamp at the moment it executed the GetSnapshotForMerge
request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/store.go, line 2473 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
I'm kind of unsure about all of this. Leases are finicky; is looking at
r.mu.state.Lease
like I do here correct? Is my statement about the chain of causality correct? And (as mentioned in the TODO) would it be better to thread the RHS's timestamp at the moment it executed theGetSnapshotForMerge
request?
The code here looks good based on my understanding of merges, but I'd like to understand the interactions a little better before signing off. What invariants does GetSnapshotForMerge
place on requests on the RHS of the merge? I'm pretty sure it blocks all concurrent reads and writes (even those at larger timestamps?), but does it check the tscache or somehow ensure that no reads or writes have been performed earlier at larger timestamp? If not then I think we'll need to do what we're doing here. Also, it's not possible for a Range frozen by a GetSnapshotForMerge
to transfer its lease without aborting the entire merge, right?
There's also an interesting optimization that you alluded to where we don't need to touch the tscache if the RHS replica on the same store also had the lease. I'm not sure whether this would be worth it or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/store.go, line 2473 at r1 (raw file):
What invariants does GetSnapshotForMerge place on requests on the RHS of the merge? I'm pretty sure it blocks all concurrent reads and writes (even those at larger timestamps?), but does it check the tscache or somehow ensure that no reads or writes have been performed earlier at larger timestamp?
So it declares that it reads and writes every key in the range to flush out the command queue before it executes. That flushes out even those commands that are executing at higher timestamps, if I'm understanding things correctly. It doesn't look at the tscache at all, though.
Also, it's not possible for a Range frozen by a GetSnapshotForMerge to transfer its lease without aborting the entire merge, right?
Right, it's not possible for the RHS to transfer its lease away after it's executed a GetSnapshotForMerge
unless the merge is aborted. It is, however, possible for its lease to expire, and then another range can come along and acquire the lease. (See TestStoreRangeMergeRHSLeaseExpiration
if you're curious.) But upon applying this lease, the new leaseholder will notice that a merge was initiated and refuse to serve any commands.
To summarize, I think the following is a true statement: after a GetSnapshotForMerge begins executing, the RHS's tscache will never be updated again unless the merge commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/store.go, line 2473 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
What invariants does GetSnapshotForMerge place on requests on the RHS of the merge? I'm pretty sure it blocks all concurrent reads and writes (even those at larger timestamps?), but does it check the tscache or somehow ensure that no reads or writes have been performed earlier at larger timestamp?
So it declares that it reads and writes every key in the range to flush out the command queue before it executes. That flushes out even those commands that are executing at higher timestamps, if I'm understanding things correctly. It doesn't look at the tscache at all, though.
Also, it's not possible for a Range frozen by a GetSnapshotForMerge to transfer its lease without aborting the entire merge, right?
Right, it's not possible for the RHS to transfer its lease away after it's executed a
GetSnapshotForMerge
unless the merge is aborted. It is, however, possible for its lease to expire, and then another range can come along and acquire the lease. (SeeTestStoreRangeMergeRHSLeaseExpiration
if you're curious.) But upon applying this lease, the new leaseholder will notice that a merge was initiated and refuse to serve any commands.To summarize, I think the following is a true statement: after a GetSnapshotForMerge begins executing, the RHS's tscache will never be updated again unless the merge commits.
If the RHS leaseholder is not colocated with the LHS and served a read at, say, 100 while the LHS leaseholder's clock bumbles around at 90, what makes sure the 100 is absorbed by the LHS' clock first? I think you're saying that we necessarily absorb a timestamp >= 100 when we send the GetSnapshotForMergeRequest (which btw we should probably rename) because the br.Timestamp
will be set to the read timestamp (which will necessarily contain all read timestamps because of having declared all keys).
This seems reasonable, but the test doesn't seem to exercise it because (I think) the mtc shares the clocks across all nodes, so please add a test for this that exercises it more directly (or do anything else that establishes confidence).
I think the comment about causality could also be a little more specific. We've rarely relied on br.Timestamp
for anything, let alone anything critical :-)
Another option is to return the batch timestamp as data from the GetSnapshotForMerge request, to make matters more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/store.go, line 2473 at r1 (raw file):
If the RHS leaseholder is not colocated with the LHS and served a read at, say, 100 while the LHS leaseholder's clock bumbles around at 90, what makes sure the 100 is absorbed by the LHS' clock first? I think you're saying that we necessarily absorb a timestamp >= 100 when we send the GetSnapshotForMergeRequest (which btw we should probably rename) because the br.Timestamp will be set to the read timestamp (which will necessarily contain all read timestamps because of having declared all keys).
Yep, this is about what I was thinking. I left the comment vague in the hopes that someone else could re-derive it without being led astray by subtle flaws in whatever reasoning I laid out.
because the br.Timestamp will be set to the read timestamp (which will necessarily contain all read timestamps because of having declared all keys).
AFAICT it's actually br.Now
that distSender uses to update its clock. But the same reasoning applies.
Another option is to return the batch timestamp as data from the GetSnapshotForMerge request, to make matters more explicit.
Yeah, I'm not sold on this approach, but I was trying to follow what we do for lease transfers, which implicitly rely on causality. If we want to pass timestamps around explicitly, I think we can do better than using the batch timestamp of the GetSnapshotForMerge
: we can use max(tscache.GetMaxRead(RHS), tscache.GetMaxWrite(RHS))
, which is easy to compute when GetSnapshotForMerge
executes.
Thoughts?
And yeah, to your parenthetical: I've got a PR prepared to rename it to SubsumeRequest
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/store.go, line 2473 at r1 (raw file):
we can use max(tscache.GetMaxRead(RHS), tscache.GetMaxWrite(RHS))
But that won't work if the RHS' leaseholder isn't colocated with the LHS', right?
I'm fine relying on the same mechanism we use for leases (do we really only rely on br.Now
there?) but I'd like to see it tested end-to-end (i.e. one that fails if you disable that mechanism, which I don't think the one here will). This test could only test that specific mechanism. Or am I missing some existing test coverage here? You could reference another test that does it. (It's a shame that mtc doesn't let you use individual clocks).
Let's make the comment explicit.
But that won't work if the RHS' leaseholder isn't colocated with the LHS', right?
I meant that we could send that value along in the response to
GetSnapshotForMerge. In which case it’s fine if they’re not collocated.
AFAICT.
For leases we rely on ba.Timestamp because the old leaseholder sends the
lease to the new leaseholder. Here we need br.Now because the new
leaseholder (LHS) is requesting the lease from the old leaseholder (RHS).
So yeah, a little scary to rely on that, I guess.
Definitely agree this needs a test. Just want to decide what approach to
take in the implementation before I write the test.
…On Sun, Aug 19, 2018 at 2:05 PM Tobias Schottdorf ***@***.***> wrote:
***@***.**** approved this pull request.
*Reviewable <https://reviewable.io/reviews/cockroachdb/cockroach/28793>*
status: [image: ] complete! 0 of 0 LGTMs obtained
------------------------------
*pkg/storage/store.go, line 2473 at r1
<https://reviewable.io/reviews/cockroachdb/cockroach/28793#-LKDCgkj5KMtIidylpup:-LKIFUd2BzM303udzmwQ:br05a4q>
(raw file
<https://github.com/cockroachdb/cockroach/blob/404eebce2093bffaddfe8619daf6478ac14aa6a2/pkg/storage/store.go#L2473>):*
we can use max(tscache.GetMaxRead(RHS), tscache.GetMaxWrite(RHS))
But that won't work if the RHS' leaseholder isn't colocated with the LHS',
right?
I'm fine relying on the same mechanism we use for leases (do we really
only rely on br.Now there?) but I'd like to see it tested end-to-end
(i.e. one that fails if you disable that mechanism, which I don't think the
one here will). This test could only test that specific mechanism. Or am I
missing some existing test coverage here? You could reference another test
that does it. (It's a shame that mtc doesn't let you use individual clocks).
Let's make the comment explicit.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#28793 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AA15IEgTJXvmEjPUyQbrp7cUXO4Pq4OBks5uSakFgaJpZM4WCqfC>
.
|
Pretty sure mtc will allow you to specify multiple clocks! At least there
is a comment claiming it is supported.
On Sun, Aug 19, 2018 at 3:06 PM Nikhil Benesch <[email protected]>
wrote:
…
> But that won't work if the RHS' leaseholder isn't colocated with the
LHS', right?
I meant that we could send that value along in the response to
GetSnapshotForMerge. In which case it’s fine if they’re not collocated.
AFAICT.
For leases we rely on ba.Timestamp because the old leaseholder sends the
lease to the new leaseholder. Here we need br.Now because the new
leaseholder (LHS) is requesting the lease from the old leaseholder (RHS).
So yeah, a little scary to rely on that, I guess.
Definitely agree this needs a test. Just want to decide what approach to
take in the implementation before I write the test.
On Sun, Aug 19, 2018 at 2:05 PM Tobias Schottdorf <
***@***.***> wrote:
> ***@***.**** approved this pull request.
>
> *Reviewable <https://reviewable.io/reviews/cockroachdb/cockroach/28793>*
> status: [image: ] complete! 0 of 0 LGTMs obtained
>
> ------------------------------
>
> *pkg/storage/store.go, line 2473 at r1
> <https://reviewable.io/reviews/cockroachdb/cockroach/28793#-LKDCgkj5KMtIidylpup:-LKIFUd2BzM303udzmwQ:br05a4q>
> (raw file
> <https://github.com/cockroachdb/cockroach/blob/404eebce2093bffaddfe8619daf6478ac14aa6a2/pkg/storage/store.go#L2473>):*
>
> we can use max(tscache.GetMaxRead(RHS), tscache.GetMaxWrite(RHS))
>
> But that won't work if the RHS' leaseholder isn't colocated with the
> LHS', right?
>
> I'm fine relying on the same mechanism we use for leases (do we really
> only rely on br.Now there?) but I'd like to see it tested end-to-end
> (i.e. one that fails if you disable that mechanism, which I don't think the
> one here will). This test could only test that specific mechanism. Or am I
> missing some existing test coverage here? You could reference another test
> that does it. (It's a shame that mtc doesn't let you use individual clocks).
>
> Let's make the comment explicit.
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <#28793 (review)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AA15IEgTJXvmEjPUyQbrp7cUXO4Pq4OBks5uSakFgaJpZM4WCqfC>
> .
>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that we could send that value along in the response to GetSnapshotForMerge. In which case it’s fine if they’re not collocated. AFAICT.
Oh, yeah, this works and it's more transparent than taking clock values and relying on some ordering. I like the explicitness of that and it's easy enough to change in the future.
Pretty sure mtc will allow you to specify multiple clocks!
Oh, good!
Reviewable status: complete! 0 of 0 LGTMs obtained
I like the idea of making the timestamp explicit in the response for |
How expensive? Merges are already extremely expensive; my intuition
suggests a few in-memory ops are unlikely to make much of a difference.
But yeah, your proposal is exactly the middle ground I had in mind.
…On Sun, Aug 19, 2018 at 6:28 PM Nathan VanBenschoten < ***@***.***> wrote:
I like the idea of making the timestamp explicit in the response for
GetSnapshotForMerge and not relying on br.Now, but do we need to pull the
time from the timestamp cache directly? That's not a particularly efficient
operation. Can't we just return GetSnapshotForMergeResponse{Now:
hlc.Now()}, which will necessarily be larger than any previous operations
performed on the old leaseholder?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#28793 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AA15IIzbiW6DwqYVchWQJKjQfl6dYgCwks5uSeaWgaJpZM4WCqfC>
.
|
Works for me.
On Mon, Aug 20, 2018 at 3:31 AM Nikhil Benesch <[email protected]>
wrote:
How expensive? Merges are already extremely expensive; my intuition
suggests a few in-memory ops are unlikely to make much of a difference.
But yeah, your proposal is exactly the middle ground I had in mind.
On Sun, Aug 19, 2018 at 6:28 PM Nathan VanBenschoten <
***@***.***> wrote:
> I like the idea of making the timestamp explicit in the response for
> GetSnapshotForMerge and not relying on br.Now, but do we need to pull the
> time from the timestamp cache directly? That's not a particularly
efficient
> operation. Can't we just return GetSnapshotForMergeResponse{Now:
> hlc.Now()}, which will necessarily be larger than any previous operations
> performed on the old leaseholder?
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <
#28793 (comment)
>,
> or mute the thread
> <
https://github.com/notifications/unsubscribe-auth/AA15IIzbiW6DwqYVchWQJKjQfl6dYgCwks5uSeaWgaJpZM4WCqfC
>
> .
>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28793 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AE135JDq8hpbkj2KGDa5bwPk0JRG8ui3ks5uShFzgaJpZM4WCqfC>
.
--
…-- Tobias
|
On the order of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I went with threading the value of Clock.Now()
during GetSnapshotForMerge
execution through the merge trigger per @nvanbenschoten's suggestion. I also managed to concoct a convoluted scenario that verified that clocks are updated properly; see the megacomment on the new test.
FYI I rebased on top of #28865 to avoid some merge skew.
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd ideally adjust replica changes to operate the same way.
Unfortunately the updated descriptor is not currently included in the
ChangeReplicasTrigger, so the migration is rather involved.
Doesn't seem that involved, but at the same time I'd rather not have you drive-by it. We'd add a new field and only populate it when the version is set. At the receiving end we'd use the field if set and fall back to the old code otherwise.
Reviewed 2 of 2 files at r2, 5 of 5 files at r3, 5 of 5 files at r4, 11 of 11 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.proto, line 1294 at r5 (raw file):
uint64 lease_applied_index = 4; // FreezeStart is the time at which the receiving replica promised to stop
Perhaps make this more concrete: FreezeStart is a timestamp that is guaranteed to be greater than the timestamps at which any requests were serviced by the responding replica before it stopped responding to requests altogether (in anticipation of being subsumed). It is suitable ...
pkg/roachpb/data.proto, line 144 at r5 (raw file):
]; // FreezeStart is the time at which the right-hand range promised to stop
Ditto.
pkg/settings/cluster/cockroach_versions.go, line 270 at r4 (raw file):
}, { // VersionRangeMerges is XXX,
You can fill this in now.
pkg/storage/client_merge_test.go, line 472 at r5 (raw file):
// causality information is not communicated through the normal channels. // Suppose two adjacent ranges, A and B, are collocated on S2, S3, and S4. (S1 // is omitted is for consistency with the store numbering in the test itself.)
is omitted is
pkg/storage/client_merge_test.go, line 502 at r5 (raw file):
Great test.
Importantly, S3 must also update its clock from T1 to T3.
This would also lead to inconsistencies without merges, right? The new leaseholder might accept a conflicting write.
pkg/storage/store.go, line 2954 at r5 (raw file):
// reject it now before we reach that point. var err error log.Errorf(ctx, "updating to %v", ba.Timestamp)
Remove
pkg/storage/store.go, line 2996 at r5 (raw file):
// (if timestamp has been forwarded). if ba.Timestamp.Less(br.Timestamp) { log.Errorf(ctx, "updating to %v", br.Timestamp)
Remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.proto, line 1294 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Perhaps make this more concrete: FreezeStart is a timestamp that is guaranteed to be greater than the timestamps at which any requests were serviced by the responding replica before it stopped responding to requests altogether (in anticipation of being subsumed). It is suitable ...
Done.
pkg/roachpb/data.proto, line 144 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Ditto.
Done.
pkg/settings/cluster/cockroach_versions.go, line 270 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
You can fill this in now.
Filled in in #28865. Leaving this as-is for now because it'll get rebased away shortly.
pkg/storage/client_merge_test.go, line 472 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
is omitted is
Done.
pkg/storage/client_merge_test.go, line 502 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Great test.
Importantly, S3 must also update its clock from T1 to T3.
This would also lead to inconsistencies without merges, right? The new leaseholder might accept a conflicting write.
Yeah, but lease transfers are protected by the ba.Timestamp
bump. (We get screwed here because we're pull-based instead of push-based.)
pkg/storage/store.go, line 2954 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Remove
Was already removed in r6; are you seeing differently?
pkg/storage/store.go, line 2996 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Remove
Ditto above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/store.go, line 2954 at r5 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Was already removed in r6; are you seeing differently?
No, but my comment was on r5.
eb85eae
to
8930159
Compare
When applying a merge, teach the leaseholder of the LHS range to update its timestamp cache for the keyspace previously owned by the RHS range appropriately. Release note: None
Whew, ok, this is green. Merging on @tschottdorf's LGTM since this is holding up a fair few other PRs, but @nvanbenschoten please lmk if you have any post-merge feedback! |
bors r=tschottdorf |
27848: sql: quantize the statement counts reported to the reg server r=knz a=knz Fixes #25114. This preserves every count smaller than 10, then for larger values bucket into buckets that are powers of 10. Release note (sql change): CockroachDB now hides more information from the statement statistics reported to Cockroach Labs. 28793: storage: update the tscache appropriately after a merge r=tschottdorf a=benesch @nvanbenschoten I figured this was in your wheelhouse but feel free to defer the review to @tschottdorf! --- When applying a merge, teach the leaseholder of the LHS range to update its timestamp cache for the keyspace previously owned by the RHS range appropriately. Release note: None 28903: closedts/transport: avoid race when using stopper.RunWorker r=tschottdorf a=benesch Stopper.RunWorker cannot be called in code which can run concurrently with Stopper.Stop. Use Stopper.RunAsyncTask instead. Fix #28755. Release note: None Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Nikhil Benesch <[email protected]>
Build succeeded |
29033: backport-2.1: storage: improve consistency checker diff, add debug decode-key r=a-robinson a=tschottdorf Backport 2/2 commits from #29012. /cc @cockroachdb/release --- Avoid spilling non-printable characters and generally garbage into the logs. Release note: None 29082: backport-2.1: one week of merge PRs r=knz,tschottdorf a=benesch Backport: * 1/1 commits from "sql: put rendered setting value in event log entry" (#29014) * 2/2 commits from " sql,settings: make SET CLUSTER SETTING wait for value to update, disallow in txn" (#29063) * 1/1 commits from "storage: discard a unworthwhile merge TODO" (#28885) * 3/3 commits from "storage: gate merges behind a cluster setting " (#28865) * 1/1 commits from "storage: update the tscache appropriately after a merge" (#28793) * 1/1 commits from "storage: avoid stopping twice in merge test" (#28902) * 1/1 commits from "roachpb,storage: rename GetSnapshotForMerge to Subsume" (#28910) * 1/1 commits from "build: fix generation rules for settings.html" (#28884) * 1/1 commits from "storage: deflake TestStoreRangeMergeTimestampCacheCausality" (#28928) * 1/1 commits from "storage: check for in-progress merge before installing new lease" (#28894) * 6/7 commits from " storage: enable merge queue by default" (#28961) Please see individual PRs for details. /cc @cockroachdb/release Co-authored-by: Tobias Schottdorf <[email protected]> Co-authored-by: David Taylor <[email protected]> Co-authored-by: Nikhil Benesch <[email protected]>
@nvanbenschoten I figured this was in your wheelhouse but feel free to defer the review to @tschottdorf!
When applying a merge, teach the leaseholder of the LHS range to update
its timestamp cache for the keyspace previously owned by the RHS range
appropriately.
Release note: None