Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: update the tscache appropriately after a merge #28793

Merged
merged 1 commit into from
Aug 21, 2018

Conversation

benesch
Copy link
Contributor

@benesch benesch commented Aug 18, 2018

@nvanbenschoten I figured this was in your wheelhouse but feel free to defer the review to @tschottdorf!


When applying a merge, teach the leaseholder of the LHS range to update
its timestamp cache for the keyspace previously owned by the RHS range
appropriately.

Release note: None

@benesch benesch requested review from tbg, nvanbenschoten and a team August 18, 2018 18:04
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/store.go, line 2473 at r1 (raw file):

		// difficult.
		s.tsCache.SetLowWater(rightDesc.StartKey.AsRawKey(), rightDesc.EndKey.AsRawKey(), s.Clock().Now())
	}

I'm kind of unsure about all of this. Leases are finicky; is looking at r.mu.state.Lease like I do here correct? Is my statement about the chain of causality correct? And (as mentioned in the TODO) would it be better to thread the RHS's timestamp at the moment it executed the GetSnapshotForMerge request?

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/store.go, line 2473 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

I'm kind of unsure about all of this. Leases are finicky; is looking at r.mu.state.Lease like I do here correct? Is my statement about the chain of causality correct? And (as mentioned in the TODO) would it be better to thread the RHS's timestamp at the moment it executed the GetSnapshotForMerge request?

The code here looks good based on my understanding of merges, but I'd like to understand the interactions a little better before signing off. What invariants does GetSnapshotForMerge place on requests on the RHS of the merge? I'm pretty sure it blocks all concurrent reads and writes (even those at larger timestamps?), but does it check the tscache or somehow ensure that no reads or writes have been performed earlier at larger timestamp? If not then I think we'll need to do what we're doing here. Also, it's not possible for a Range frozen by a GetSnapshotForMerge to transfer its lease without aborting the entire merge, right?

There's also an interesting optimization that you alluded to where we don't need to touch the tscache if the RHS replica on the same store also had the lease. I'm not sure whether this would be worth it or not.

Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/store.go, line 2473 at r1 (raw file):

What invariants does GetSnapshotForMerge place on requests on the RHS of the merge? I'm pretty sure it blocks all concurrent reads and writes (even those at larger timestamps?), but does it check the tscache or somehow ensure that no reads or writes have been performed earlier at larger timestamp?

So it declares that it reads and writes every key in the range to flush out the command queue before it executes. That flushes out even those commands that are executing at higher timestamps, if I'm understanding things correctly. It doesn't look at the tscache at all, though.

Also, it's not possible for a Range frozen by a GetSnapshotForMerge to transfer its lease without aborting the entire merge, right?

Right, it's not possible for the RHS to transfer its lease away after it's executed a GetSnapshotForMerge unless the merge is aborted. It is, however, possible for its lease to expire, and then another range can come along and acquire the lease. (See TestStoreRangeMergeRHSLeaseExpiration if you're curious.) But upon applying this lease, the new leaseholder will notice that a merge was initiated and refuse to serve any commands.

To summarize, I think the following is a true statement: after a GetSnapshotForMerge begins executing, the RHS's tscache will never be updated again unless the merge commits.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/store.go, line 2473 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

What invariants does GetSnapshotForMerge place on requests on the RHS of the merge? I'm pretty sure it blocks all concurrent reads and writes (even those at larger timestamps?), but does it check the tscache or somehow ensure that no reads or writes have been performed earlier at larger timestamp?

So it declares that it reads and writes every key in the range to flush out the command queue before it executes. That flushes out even those commands that are executing at higher timestamps, if I'm understanding things correctly. It doesn't look at the tscache at all, though.

Also, it's not possible for a Range frozen by a GetSnapshotForMerge to transfer its lease without aborting the entire merge, right?

Right, it's not possible for the RHS to transfer its lease away after it's executed a GetSnapshotForMerge unless the merge is aborted. It is, however, possible for its lease to expire, and then another range can come along and acquire the lease. (See TestStoreRangeMergeRHSLeaseExpiration if you're curious.) But upon applying this lease, the new leaseholder will notice that a merge was initiated and refuse to serve any commands.

To summarize, I think the following is a true statement: after a GetSnapshotForMerge begins executing, the RHS's tscache will never be updated again unless the merge commits.

If the RHS leaseholder is not colocated with the LHS and served a read at, say, 100 while the LHS leaseholder's clock bumbles around at 90, what makes sure the 100 is absorbed by the LHS' clock first? I think you're saying that we necessarily absorb a timestamp >= 100 when we send the GetSnapshotForMergeRequest (which btw we should probably rename) because the br.Timestamp will be set to the read timestamp (which will necessarily contain all read timestamps because of having declared all keys).

This seems reasonable, but the test doesn't seem to exercise it because (I think) the mtc shares the clocks across all nodes, so please add a test for this that exercises it more directly (or do anything else that establishes confidence).
I think the comment about causality could also be a little more specific. We've rarely relied on br.Timestamp for anything, let alone anything critical :-)
Another option is to return the batch timestamp as data from the GetSnapshotForMerge request, to make matters more explicit.

Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/store.go, line 2473 at r1 (raw file):

If the RHS leaseholder is not colocated with the LHS and served a read at, say, 100 while the LHS leaseholder's clock bumbles around at 90, what makes sure the 100 is absorbed by the LHS' clock first? I think you're saying that we necessarily absorb a timestamp >= 100 when we send the GetSnapshotForMergeRequest (which btw we should probably rename) because the br.Timestamp will be set to the read timestamp (which will necessarily contain all read timestamps because of having declared all keys).

Yep, this is about what I was thinking. I left the comment vague in the hopes that someone else could re-derive it without being led astray by subtle flaws in whatever reasoning I laid out.

because the br.Timestamp will be set to the read timestamp (which will necessarily contain all read timestamps because of having declared all keys).

AFAICT it's actually br.Now that distSender uses to update its clock. But the same reasoning applies.

Another option is to return the batch timestamp as data from the GetSnapshotForMerge request, to make matters more explicit.

Yeah, I'm not sold on this approach, but I was trying to follow what we do for lease transfers, which implicitly rely on causality. If we want to pass timestamps around explicitly, I think we can do better than using the batch timestamp of the GetSnapshotForMerge: we can use max(tscache.GetMaxRead(RHS), tscache.GetMaxWrite(RHS)), which is easy to compute when GetSnapshotForMerge executes.

Thoughts?

And yeah, to your parenthetical: I've got a PR prepared to rename it to SubsumeRequest.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/store.go, line 2473 at r1 (raw file):

we can use max(tscache.GetMaxRead(RHS), tscache.GetMaxWrite(RHS))

But that won't work if the RHS' leaseholder isn't colocated with the LHS', right?

I'm fine relying on the same mechanism we use for leases (do we really only rely on br.Now there?) but I'd like to see it tested end-to-end (i.e. one that fails if you disable that mechanism, which I don't think the one here will). This test could only test that specific mechanism. Or am I missing some existing test coverage here? You could reference another test that does it. (It's a shame that mtc doesn't let you use individual clocks).

Let's make the comment explicit.

@benesch
Copy link
Contributor Author

benesch commented Aug 19, 2018 via email

@benesch
Copy link
Contributor Author

benesch commented Aug 19, 2018 via email

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that we could send that value along in the response to GetSnapshotForMerge. In which case it’s fine if they’re not collocated. AFAICT.

Oh, yeah, this works and it's more transparent than taking clock values and relying on some ordering. I like the explicitness of that and it's easy enough to change in the future.

Pretty sure mtc will allow you to specify multiple clocks!

Oh, good!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@nvanbenschoten
Copy link
Member

I like the idea of making the timestamp explicit in the response for GetSnapshotForMerge and not relying on br.Now, but do we need to pull the time from the timestamp cache directly? That's not a particularly efficient operation. Can't we just return GetSnapshotForMergeResponse{Now: hlc.Now()}, which will necessarily be larger than any previous operations performed on the old leaseholder?

@benesch
Copy link
Contributor Author

benesch commented Aug 20, 2018 via email

@tbg
Copy link
Member

tbg commented Aug 20, 2018 via email

@nvanbenschoten
Copy link
Member

How expensive? Merges are already extremely expensive; my intuition
suggests a few in-memory ops are unlikely to make much of a difference.

On the order of O(# reads on range in last 10 seconds). But yeah you're almost certainly right.

@benesch benesch requested a review from a team August 21, 2018 05:47
Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I went with threading the value of Clock.Now() during GetSnapshotForMerge execution through the merge trigger per @nvanbenschoten's suggestion. I also managed to concoct a convoluted scenario that verified that clocks are updated properly; see the megacomment on the new test.

FYI I rebased on top of #28865 to avoid some merge skew.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd ideally adjust replica changes to operate the same way.
Unfortunately the updated descriptor is not currently included in the
ChangeReplicasTrigger, so the migration is rather involved.

Doesn't seem that involved, but at the same time I'd rather not have you drive-by it. We'd add a new field and only populate it when the version is set. At the receiving end we'd use the field if set and fall back to the old code otherwise.

Reviewed 2 of 2 files at r2, 5 of 5 files at r3, 5 of 5 files at r4, 11 of 11 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/roachpb/api.proto, line 1294 at r5 (raw file):

  uint64 lease_applied_index = 4;

  // FreezeStart is the time at which the receiving replica promised to stop

Perhaps make this more concrete: FreezeStart is a timestamp that is guaranteed to be greater than the timestamps at which any requests were serviced by the responding replica before it stopped responding to requests altogether (in anticipation of being subsumed). It is suitable ...


pkg/roachpb/data.proto, line 144 at r5 (raw file):

  ];

  // FreezeStart is the time at which the right-hand range promised to stop

Ditto.


pkg/settings/cluster/cockroach_versions.go, line 270 at r4 (raw file):

	},
	{
		// VersionRangeMerges is XXX,

You can fill this in now.


pkg/storage/client_merge_test.go, line 472 at r5 (raw file):

// causality information is not communicated through the normal channels.
// Suppose two adjacent ranges, A and B, are collocated on S2, S3, and S4. (S1
// is omitted is for consistency with the store numbering in the test itself.)

is omitted is


pkg/storage/client_merge_test.go, line 502 at r5 (raw file):
Great test.

Importantly, S3 must also update its clock from T1 to T3.

This would also lead to inconsistencies without merges, right? The new leaseholder might accept a conflicting write.


pkg/storage/store.go, line 2954 at r5 (raw file):

		// reject it now before we reach that point.
		var err error
		log.Errorf(ctx, "updating to %v", ba.Timestamp)

Remove


pkg/storage/store.go, line 2996 at r5 (raw file):

				// (if timestamp has been forwarded).
				if ba.Timestamp.Less(br.Timestamp) {
					log.Errorf(ctx, "updating to %v", br.Timestamp)

Remove

Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/roachpb/api.proto, line 1294 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Perhaps make this more concrete: FreezeStart is a timestamp that is guaranteed to be greater than the timestamps at which any requests were serviced by the responding replica before it stopped responding to requests altogether (in anticipation of being subsumed). It is suitable ...

Done.


pkg/roachpb/data.proto, line 144 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto.

Done.


pkg/settings/cluster/cockroach_versions.go, line 270 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You can fill this in now.

Filled in in #28865. Leaving this as-is for now because it'll get rebased away shortly.


pkg/storage/client_merge_test.go, line 472 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

is omitted is

Done.


pkg/storage/client_merge_test.go, line 502 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Great test.

Importantly, S3 must also update its clock from T1 to T3.

This would also lead to inconsistencies without merges, right? The new leaseholder might accept a conflicting write.

Yeah, but lease transfers are protected by the ba.Timestamp bump. (We get screwed here because we're pull-based instead of push-based.)


pkg/storage/store.go, line 2954 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Remove

Was already removed in r6; are you seeing differently?


pkg/storage/store.go, line 2996 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Remove

Ditto above.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/store.go, line 2954 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Was already removed in r6; are you seeing differently?

No, but my comment was on r5.

@benesch benesch force-pushed the merge-tscache branch 2 times, most recently from eb85eae to 8930159 Compare August 21, 2018 16:02
When applying a merge, teach the leaseholder of the LHS range to update
its timestamp cache for the keyspace previously owned by the RHS range
appropriately.

Release note: None
@benesch
Copy link
Contributor Author

benesch commented Aug 21, 2018

Whew, ok, this is green. Merging on @tschottdorf's LGTM since this is holding up a fair few other PRs, but @nvanbenschoten please lmk if you have any post-merge feedback!

@benesch
Copy link
Contributor Author

benesch commented Aug 21, 2018

bors r=tschottdorf

craig bot pushed a commit that referenced this pull request Aug 21, 2018
27848: sql: quantize the statement counts reported to the reg server r=knz a=knz

Fixes #25114.

This preserves every count smaller than 10, then for larger values
bucket into buckets that are powers of 10.

Release note (sql change): CockroachDB now hides more information from
the statement statistics reported to Cockroach Labs.

28793: storage: update the tscache appropriately after a merge r=tschottdorf a=benesch

@nvanbenschoten I figured this was in your wheelhouse but feel free to defer the review to @tschottdorf!

---

When applying a merge, teach the leaseholder of the LHS range to update
its timestamp cache for the keyspace previously owned by the RHS range
appropriately.

Release note: None

28903: closedts/transport: avoid race when using stopper.RunWorker r=tschottdorf a=benesch

Stopper.RunWorker cannot be called in code which can run concurrently
with Stopper.Stop. Use Stopper.RunAsyncTask instead.

Fix #28755.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
@craig
Copy link
Contributor

craig bot commented Aug 21, 2018

Build succeeded

@craig craig bot merged commit 8aea353 into cockroachdb:master Aug 21, 2018
@benesch benesch deleted the merge-tscache branch August 21, 2018 21:08
craig bot pushed a commit that referenced this pull request Aug 27, 2018
29033: backport-2.1: storage: improve consistency checker diff, add debug decode-key r=a-robinson a=tschottdorf

Backport 2/2 commits from #29012.

/cc @cockroachdb/release

---

Avoid spilling non-printable characters and generally garbage into the
logs.

Release note: None


29082: backport-2.1: one week of merge PRs r=knz,tschottdorf a=benesch

Backport:
  * 1/1 commits from "sql: put rendered setting value in event log entry" (#29014)
  * 2/2 commits from " sql,settings: make SET CLUSTER SETTING wait for value to update, disallow in txn" (#29063)
  * 1/1 commits from "storage: discard a unworthwhile merge TODO" (#28885)
  * 3/3 commits from "storage: gate merges behind a cluster setting " (#28865)
  * 1/1 commits from "storage: update the tscache appropriately after a merge" (#28793)
  * 1/1 commits from "storage: avoid stopping twice in merge test" (#28902)
  * 1/1 commits from "roachpb,storage: rename GetSnapshotForMerge to Subsume" (#28910)
  * 1/1 commits from "build: fix generation rules for settings.html" (#28884)
  * 1/1 commits from "storage: deflake TestStoreRangeMergeTimestampCacheCausality" (#28928)
  * 1/1 commits from "storage: check for in-progress merge before installing new lease" (#28894)
  * 6/7 commits from " storage: enable merge queue by default" (#28961)

Please see individual PRs for details.

/cc @cockroachdb/release


Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants