
rfc: closed timestamps v2 RFC #56675

Merged (1 commit) · Dec 19, 2022

Conversation


@andreimatei andreimatei commented Nov 13, 2020

We need a more flexible closed timestamps mechanism because the
Non-blocking Transactions project will use two different closed
timestamp policies for different ranges. The existing mechanism closes
a single timestamp for all the ranges for which a node is the
leaseholder, which is not good enough.
Besides, the existing mechanism is quite complex. The alternative
proposed by this RFC seems significantly simpler.

Release note: None
Epic: None



@nvanbenschoten nvanbenschoten left a comment



docs/RFCS/20191108_closed_timestamps_v2.md, line 58 at r1 (raw file):

    - The side-transport only deals with ranges without in-flight proposals.
    - The side-transport would effectively increase the closed timestamp of the
    last applies Raft entry without going through consensus.

"applied"


docs/RFCS/20191108_closed_timestamps_v2.md, line 59 at r1 (raw file):

    - The side-transport would effectively increase the closed timestamp of the
    last applies Raft entry without going through consensus.
    - The entries coming from the side-transport don't need to be written to any storage.

I would reword this slightly to say that entries from the side-transport don't need to be stored separately upon reception, which avoids the need for something like pkg/kv/kvserver/closedts/storage.


docs/RFCS/20191108_closed_timestamps_v2.md, line 64 at r1 (raw file):

Closed-timestamps are immediately actionable by followers receiving them
(because they refer to state the followers have just applied, not to prospective
state).

This will make it easier for follower reads to wait when they can't be immediately satisfied, instead of being rejected and forwarded to the leaseholder immediately. We don't currently have a way for a follower read that fails due to the follower's closed timestamp to queue up and wait for the follower to catch up. This would be difficult to implement today because of how closed timestamp entries are only understandable in the context of a lease. Keeping these separate and having a clear, centralized location where a replica's closed timestamp increases would make this kind of queue much easier to build.
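For what it's worth, a bare-bones Go sketch of what such a queue could look like (hypothetical names and structure; nothing like this exists today): reads above the closed timestamp park on a channel, and the single point that advances the closed timestamp wakes them.

package sketch

import "sync"

type waiter struct {
	readTS int64 // stands in for an hlc.Timestamp
	done   chan struct{}
}

// closedTSWaitQueue lets follower reads above the current closed timestamp
// wait for the replica to catch up instead of being bounced to the leaseholder.
type closedTSWaitQueue struct {
	mu       sync.Mutex
	closedTS int64
	waiters  []waiter
}

// waitFor returns a channel that is closed once readTS is closed.
func (q *closedTSWaitQueue) waitFor(readTS int64) <-chan struct{} {
	q.mu.Lock()
	defer q.mu.Unlock()
	ch := make(chan struct{})
	if readTS <= q.closedTS {
		close(ch) // already satisfiable
		return ch
	}
	q.waiters = append(q.waiters, waiter{readTS: readTS, done: ch})
	return ch
}

// advance is the single, centralized place where the replica's closed
// timestamp moves; it wakes any reads that are now satisfiable.
func (q *closedTSWaitQueue) advance(newClosedTS int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if newClosedTS <= q.closedTS {
		return
	}
	q.closedTS = newClosedTS
	kept := q.waiters[:0]
	for _, w := range q.waiters {
		if w.readTS <= newClosedTS {
			close(w.done)
		} else {
			kept = append(kept, w)
		}
	}
	q.waiters = kept
}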


docs/RFCS/20191108_closed_timestamps_v2.md, line 91 at r1 (raw file):

- Ranges with expiration-based leases can now start to use closed timestamps
too. Currently, only epoch-based leases are compatible with closed timestamps,
because a closed timestamp is tied to a lease epoch.

Maybe note that this isn't a huge win in practice, but serves as an indication that this proposal is breaking down dependencies which made things more complex in the past.


docs/RFCS/20191108_closed_timestamps_v2.md, line 103 at r1 (raw file):

A concept fundamental to this proposal is that closed timestamps become
associated to Raft log positions. Each Raft command carries a closed timestamp

So you decided to move from lease index to raft log index? Why was that? I'm having a hard time determining whether that's the right change, but I don't think it is – it feels wrong.

I think that's because we have a lot less control over Raft log indexes. There are cases when empty Raft entries get added to the Raft log, like after leader elections or to unquiesce a range. I don't think we want these entries to be taken into account when describing the "minimum index a follower needs to be caught up to for a given closed timestamp".

I think it also feels wrong because the Raft log index is a Raft-level concept, but the lease index and closed timestamp are both lease-level concepts. To appreciate this distinction, imagine how this is going to work when we have a split leaseholder/leader situation. The leaseholder can't guarantee that its proposals end up in a certain order in the Raft log, but it can control the order of applied commands through the use of lease indexes. So it can still guarantee monotonicity of closed timestamps in the log, but only for non-rejected commands because the lease index protects against re-ordering and against lease transfers. So a closed timestamp in a Raft entry is only valid if that entry's command passes the lease index check, which is an indication that we should key closed timestamps off of the lease applied index, not the raft applied index.
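To make this concrete, here's a minimal Go sketch (hypothetical, simplified types; not the actual CockroachDB code) of a replica only letting a command's closed timestamp take effect when the command passes the same lease-sequence and lease-applied-index checks that gate its application:

package sketch

type replicaState struct {
	leaseSeq          int64  // sequence of the currently applied lease
	leaseAppliedIndex uint64 // highest LAI applied so far
	closedTS          int64  // closed timestamp, keyed to leaseAppliedIndex
}

type raftCommand struct {
	proposedUnderLeaseSeq int64
	leaseAppliedIndex     uint64
	closedTS              int64
}

// apply advances the closed timestamp only for commands that are not
// rejected; the closed timestamp carried by a command proposed under a stale
// lease, or re-ordered past a newer LAI, is inconsequential.
func (s *replicaState) apply(cmd raftCommand) {
	if cmd.proposedUnderLeaseSeq != s.leaseSeq {
		return // rejected: proposed under an old lease
	}
	if cmd.leaseAppliedIndex <= s.leaseAppliedIndex {
		return // rejected: re-ordered or duplicated proposal
	}
	s.leaseAppliedIndex = cmd.leaseAppliedIndex
	if cmd.closedTS > s.closedTS {
		s.closedTS = cmd.closedTS // closed timestamps never regress
	}
}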


docs/RFCS/20191108_closed_timestamps_v2.md, line 144 at r1 (raw file):

Each replica will write the closed timestamp it knows about in a new field part
of the `RangeStateAppliedKey` - the range-local key maintaining the range stats,

I'd also mention the lease applied index here. That's the piece of information that makes this most clearly the right spot to store the closed ts.


docs/RFCS/20191108_closed_timestamps_v2.md, line 147 at r1 (raw file):

among others. This key is already being written to on every command application
(because of the stats). Snapshots naturally include the `RangeStateAppliedKey`,
so new replicas will be initialized with a closed timestamp.

We'll also want to do something with splits by propagating the LHS range's closed timestamp to the RHS range in splitTriggerHelper->WriteInitialReplicaState. This is actually a lot cleaner than how things work today, which relies on some subtle below-raft handling.
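To make both points concrete, a rough Go sketch (field and function names are illustrative, not the actual RangeAppliedState proto or split code) of the closed timestamp stored next to the lease applied index, and of the split trigger propagating it to the RHS:

package sketch

// rangeAppliedState stands in for the per-range applied state discussed
// above: the closed timestamp sits next to the lease applied index it is
// keyed to and is persisted on every command application, alongside the stats.
type rangeAppliedState struct {
	raftAppliedIndex  uint64
	leaseAppliedIndex uint64
	closedTS          int64 // nanos, stands in for an hlc.Timestamp
	// ... range stats, etc.
}

// initialRHSState sketches the split handling: the RHS's initial state,
// written by the split trigger above Raft, inherits the LHS's closed
// timestamp, so no below-Raft special-casing is needed to avoid regressions.
func initialRHSState(lhs rangeAppliedState) rangeAppliedState {
	return rangeAppliedState{
		leaseAppliedIndex: 0, // the RHS starts its own LAI sequence
		closedTS:          lhs.closedTS,
	}
}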


docs/RFCS/20191108_closed_timestamps_v2.md, line 151 at r1 (raw file):

We've talked about "active/inactive" ranges, but haven't defined them. A range
is inactive at a point in time if there's no inflight write requests for it
(i.e. no writes being evaluated at the moment). The leaseholder can verify this

evaluated or going through Raft


docs/RFCS/20191108_closed_timestamps_v2.md, line 154 at r1 (raw file):

that's also the current situation (right?)

I think so, and it's a good one to verify. I think the way this would work is that the first attempt to perform a follower read would get rejected and then trigger the lease acquisition, so things would converge quickly.


docs/RFCS/20191108_closed_timestamps_v2.md, line 156 at r1 (raw file):

(that's also the current situation (right?)). When a leaseholder detects a range
to be inactive, it can choose bump its closed timestamp however it wants (as
long as it promises to not accept new writes below it). Each node will

It can't bump the closed timestamp past its lease expiration time.


docs/RFCS/20191108_closed_timestamps_v2.md, line 167 at r1 (raw file):

The inactivity notion is similar to quiescence, but otherwise distinct from it
(there's no reason to tie the two). In practice, the process that detects ranges

The process being Raft ticking, which runs every 200ms (defaultRaftTickInterval)? Depending on the rate at which we want closed timestamps to be updated, we may want to decouple that as well.

For non-blocking transactions, we need to take this period into account when determining how far in the future we write, so we may end up wanting a higher frequency of updates.


docs/RFCS/20191108_closed_timestamps_v2.md, line 204 at r1 (raw file):

  {
    group id:  g2,
    members: [{range id:  3, log index:  3}, {range id:  4, log index:  4}], 

Any reason not to just make these additions, so there are fewer concepts?


docs/RFCS/20191108_closed_timestamps_v2.md, line 241 at r1 (raw file):

follower not being to serve follower-reads at any timestamp until replication
catches up. The current implementation has to maintain a history of closed ts
updates per range for this reason, and all that can go away.

We should also talk about what the receiver does when it receives these messages. I was thinking it would iterate over each range (implicitly or explicitly) in the update, check whether it has the matching lease applied index, and if so, directly update its RangeAppliedStateKey and corresponding in-memory ReplicaState.

This would clearly be more expensive up-front than the current approach, which is lazy and defers most work to the follower reads that need to reach into the closed timestamp storage and resolve the appropriate closed timestamp. I don't know if this is a problem, though. I suspect that with even a moderate rate of follower reads, optimizing for the read-path by doing slightly more work on updates would actually pay off. We'd need to test this out.
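A toy Go sketch of that eager receive path (hypothetical message and state types; the real wire format is the compressed, group-based protocol described in the RFC):

package sketch

type sideTransportEntry struct {
	rangeID  int64
	lai      uint64 // lease applied index the closed timestamp is conditioned on
	closedTS int64
}

type replicaClosedState struct {
	appliedLAI uint64
	closedTS   int64
}

// applyUpdate eagerly applies each entry to the matching local replica, but
// only if that replica has already applied the LAI the sender conditioned on;
// otherwise the entry is simply dropped.
func applyUpdate(replicas map[int64]*replicaClosedState, entries []sideTransportEntry) {
	for _, e := range entries {
		r, ok := replicas[e.rangeID]
		if !ok {
			continue // no local replica for this range
		}
		if r.appliedLAI >= e.lai && e.closedTS > r.closedTS {
			// In the real system this would also update the persisted
			// RangeAppliedStateKey and the in-memory ReplicaState.
			r.closedTS = e.closedTS
		}
	}
}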


docs/RFCS/20191108_closed_timestamps_v2.md, line 260 at r1 (raw file):

The buckets group requests into
"epochs", each epoch being defined by the smallest timestamp allowed for a
request to evaluate at.

Might be worth re-wording to indicate that the timestamp is < what the request is allowed to evaluate at, not <=.


docs/RFCS/20191108_closed_timestamps_v2.md, line 266 at r1 (raw file):

*prev*'s timestamp, regardless of whether the proposal comes from a request in
*prev* or in *cur*. When a request arrives, it enters *cur*. A request that
finds *prev* to be empty increments *prev* and *cur*. This is the moment when

And decrements cur? Actually, you use the term increment *prev* and *cur* below as well. What do you mean by that?


docs/RFCS/20191108_closed_timestamps_v2.md, line 269 at r1 (raw file):

the range's closed timestamp moves forward - because now the timestamp that used
to be associated with *cur* becomes associated with *prev* and so it is closed.
This rule means that the first first request on a range will join *cur* but then

"first first"


docs/RFCS/20191108_closed_timestamps_v2.md, line 285 at r1 (raw file):

Lease requests are special as in they don't carry a closed timestamp. As
explained before, the lease start time acts that proposal's closed ts.

So they are not tracked, right?


docs/RFCS/20191108_closed_timestamps_v2.md, line 351 at r1 (raw file):

`closedts=15` - in other words we could have shifted the buckets also when the
last request out of *prev* exited. The logic presented seems to be the minimal
one that's sufficient for keeping closed timestamps moving forward.

So for this to be optimal, we need to shift in two spots:

  1. when increasing cur's refcnt to 1, if prev's refcnt is 0
  2. when decreasing prev's refcnt to 0, if cur's ts is set

You don't think we should do that?


docs/RFCS/20191108_closed_timestamps_v2.md, line 355 at r1 (raw file):

## Migration

!!! both systems run at the same time for a release

I'm interested to explore this further. Want to do so together sometime this week? It seems like the biggest blocker to getting started on this.


@ajwerner ajwerner left a comment



docs/RFCS/20191108_closed_timestamps_v2.md, line 167 at r1 (raw file):

The process being Raft ticking, which runs every 200ms (defaultRaftTickInterval)?

because we want this to be faster?


docs/RFCS/20191108_closed_timestamps_v2.md, line 190 at r2 (raw file):

timestamp will be `now() - kv.closed_timestamp.target_duration`.

Once closed, these timestamps need to be communicated to the followers of the

One thing that might be worth exploring is how we keep track of the set of streams that must exist and their membership. Unquiescing will remove a member. That might cover replication changes, but it might not, because you may forcibly lose the lease. It feels like there are some relatively complex synchronization dances here, and I don't have a clear vision of how all of this is going to be coordinated.

Maybe it all will just work if we're sloppy here. Is the deal just that the read path will consult both the in-memory view of the closed timestamp coming from the inactive-range transport and the durable one, and enforce the latest one?


docs/RFCS/20191108_closed_timestamps_v2.md, line 262 at r2 (raw file):

timestamp and a reference count associated with it; every request in the bucket
evaluated as a higher timestamp than the bucket's timestamp. The timestamp of
bucket *i* is larger than that of bucket i-1*. The buckets group requests into

nit: missing *


docs/RFCS/20191108_closed_timestamps_v2.md, line 266 at r2 (raw file):

request to evaluate at. At any point in time only the last two buckets are
active (and so the implementation can just maintain two buckets and continuously
swap them) - we'll call the last bucket cur* and the previous one *prev*. The

nit: missing *


docs/RFCS/20191108_closed_timestamps_v2.md, line 267 at r2 (raw file):

active (and so the implementation can just maintain two buckets and continuously
swap them) - we'll call the last bucket cur* and the previous one *prev*. The
range's closed timestamp is always prev*'s timestamp - so every proposal carries

nit: missing *


@aayushshah15 aayushshah15 left a comment



docs/RFCS/20191108_closed_timestamps_v2.md, line 109 at r2 (raw file):

associated to Raft log positions. Each Raft command carries a closed timestamp
that says: whoever applies this command *c1* with closed ts *ts1* is free to
serve reads at timestamps <= *ts* on the respective range. Closed timestamps

s/ts/ts1?


docs/RFCS/20191108_closed_timestamps_v2.md, line 111 at r2 (raw file):

serve reads at timestamps <= *ts* on the respective range. Closed timestamps
carried by commands that ultimately are rejected are inconsequential. This is a
subtle, but important different from the existing infrastructure. Currently, an

difference


docs/RFCS/20191108_closed_timestamps_v2.md, line 188 at r2 (raw file):

individual range's policy, but all ranges with the same policy are bumped to the
same timestamp (for compression purposes, see below). For "normal" ranges, that
timestamp will be `now() - kv.closed_timestamp.target_duration`.

Don't we want these timestamps to be determined by some range-level closed timestamp duration attribute? It's unclear why we care about kv.closed_timestamp.target_duration at all.


@aayushshah15 aayushshah15 left a comment



docs/RFCS/20191108_closed_timestamps_v2.md, line 351 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

So for this to be optimal, we need to shift in two spots:

  1. when increasing cur's refcnt to 1, if prev's refcnt is 0
  2. when decreasing prev's refcnt to 0, if cur's ts is set

You don't think we should do that?

If we do want to do it in two spots, might also be worth clarifying that 1 would only happen when bootstrapping whereas 2 is during regular operation?


docs/RFCS/20191108_closed_timestamps_v2.md, line 66 at r2 (raw file):

state).
- The proposal avoids closed timestamp stalls when MLAI is never reached due to
failed proposal and then quiescence.

I'm a little unclear about how something like this can actually happen in practice. Don't we check for pending proposals in shouldReplicaQuiesce?

If you meant "failed evaluations" then it's still not possible, right? That's because we call untrack with a 0 LeaseAppliedIndex in that case.


docs/RFCS/20191108_closed_timestamps_v2.md, line 76 at r2 (raw file):

Currently this happens below Raft - each replica independently figures out an
`initialMaxClosed` value suitable for the new range. With this proposal we'll be
able to simply propagate a closed timestamps above Raft, in the `SplitTrigger`.

s/timestamps/timestamp


docs/RFCS/20191108_closed_timestamps_v2.md, line 354 at r2 (raw file):

  1. Exit your bucket (i.e. decrement the bucket's refcount and, if 0, increment
    prev and cur).

You've mentioned here^ that we would shift the buckets when the last request left prev.


@andreimatei andreimatei left a comment



docs/RFCS/20191108_closed_timestamps_v2.md, line 58 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"applied"

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 59 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I would reword this slightly to say that entries from the side-transport don't need to be stored separately upon reception, which avoids the need for something like pkg/kv/kvserver/closedts/storage.

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 64 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This will make it easier for follower reads to wait when they can't be immediately satisfied, instead of being rejected and forwarded to the leaseholder immediately. We don't currently have a way for a follower read that fails due to the follower's closed timestamp to queue up and wait for the follower to catch up. This would be difficult to implement today because of how closed timestamp entries are only understandable in the context of a lease. Keeping these separate and having a clear, centralized location where a replica's closed timestamp increases would make this kind of queue much easier to build.

added as a bullet in the Flexibility section below


docs/RFCS/20191108_closed_timestamps_v2.md, line 91 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Maybe note that this isn't a huge win in practice, but serves as an indication that this proposal is breaking down dependencies which made things more complex in the past.

meh, I think everybody can already take what they want from the stated fact


docs/RFCS/20191108_closed_timestamps_v2.md, line 103 at r1 (raw file):

so a closed timestamp in a Raft entry is only valid if that entry's command passes the lease index check, which is an indication that we should key closed timestamps off of the lease applied index, not the raft applied index.

I buy this enough. Changed and added a LAIs vs. Raft log indexes section.


docs/RFCS/20191108_closed_timestamps_v2.md, line 144 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'd also mention the lease applied index here. That's the piece of information that makes this most clearly the right spot to store the closed ts.

done, good call


docs/RFCS/20191108_closed_timestamps_v2.md, line 147 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We'll also want to do something with splits by propagating the LHS range's closed timestamp to the RHS range in splitTriggerHelper->WriteInitialReplicaState. This is actually a lot cleaner than how things work today, which relies on some subtle below-raft handling.

added a note here


docs/RFCS/20191108_closed_timestamps_v2.md, line 151 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

evaluated or going through Raft

Clarified.


docs/RFCS/20191108_closed_timestamps_v2.md, line 154 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
that's also the current situation (right?)

I think so, and it's a good one to verify. I think the way this would work is that the first attempt to perform a follower read would get rejected and then trigger the lease acquisition, so things would converge quickly.

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 156 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It can't bump the closed timestamp past its lease expiration time.

done. I think it's technically the start of the stasis period, not the expiration, right?


docs/RFCS/20191108_closed_timestamps_v2.md, line 167 at r1 (raw file):

Previously, ajwerner wrote…

The process being Raft ticking, which runs every 200ms (defaultRaftTickInterval)?

because we want this to be faster?

I've added more words.


docs/RFCS/20191108_closed_timestamps_v2.md, line 204 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Any reason not to just make this additions so there are fewer concepts.

Well I guess I reserve the right to settle on the exact form at implementation time, but at least conceptually I find it cleaner to separate between the creation of a group and modifications to an existing group rather than the alternative.


docs/RFCS/20191108_closed_timestamps_v2.md, line 241 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We should also talk about what the receiver does when it receives these messages. I was thinking it would iterate over each range (implicitly or explicitly) in the update, check whether it has the matching lease applied index, and if so, directly update its RangeAppliedStateKey and corresponding in-memory ReplicaState.

This would clearly be more expensive up-front than the current approach, which is lazy and defers most work to the follower reads that need to reach into the closed timestamp storage and resolve the appropriate closed timestamp. I don't know if this is a problem, though. I suspect that with even a moderate rate of follower reads, optimizing for the read-path by doing slightly more work on updates would actually pay off. We'd need to test this out.

Added a paragraph with a twist. See if you like.


docs/RFCS/20191108_closed_timestamps_v2.md, line 260 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The buckets group requests into
"epochs", each epoch being defined by the smallest timestamp allowed for a
request to evaluate at.

Might be worth re-wording to indicate that the timestamp is < the request is allowed to evaluate at, not <=.

clarified


docs/RFCS/20191108_closed_timestamps_v2.md, line 266 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

And decrements cur? Actually, you use the term increment *prev* and *cur* below as well. What do you mean by that?

Added an explanation for the term. The model I find easiest to think about is an infinite sequence of buckets, with cur and prev pointers to the last two.


docs/RFCS/20191108_closed_timestamps_v2.md, line 269 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"first first"

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 285 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

So they are not tracked, right?

right, clarified


docs/RFCS/20191108_closed_timestamps_v2.md, line 351 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

If we do want to do it in two spots, might also be worth clarifying that 1 would only happen when bootstrapping whereas 2 is during regular operation?

I think the rules Nathan typed above are too complicated - namely, shifting when entering a bucket becomes quite counter-intuitive if we also shift when exiting a bucket. Do you like them?

The intuitive thing is to shift when exiting, so let's focus on that. Let's shift when exiting either bucket:

  1. before exiting prev, if cur's ts is set
  2. before exiting cur, if prev is empty

I think this would work, but it's still suboptimal. Consider:

|refcnt: 1 (r1)      |refcnt:0    |
|prev;ts=10          |cur;ts=15   |
---------------------------------

If r1 exits, it will carry closed ts = 10, when it could carry 15 (it carries 10 because, for sequencing purposes, the timestamp a request carries has to be determined before it decrements the refcount on its bucket).

The truth is that the more I think about these buckets, the less I like them. They seem to be lacking any sort of principle, and the scheme is hard to model. What really bothers me is that the decision about which bucket a particular request should join shouldn't be mechanical; it should depend on when the request arrives in relation to when the other requests in the candidate bucket(s) arrived, and also what their write timestamps are. Like, if the trailing target is, say, 10, and now = 100, and I'm a request with ts = 50, and prev and cur have timestamps of 9 and 10 (i.e. there are requests in there that have been evaluating for a long time), in a two-bucket world I have no choice but to enter cur and hold back the closing of 10 until I finish executing. But that seems outrageous; what gives me the right to do that? Why am I, a ts = 50 request, holding back the closing of 10? If the ts=10 request has just started evaluating, then I'd say it'd be OK; hopefully requests starting evaluation around the same time finish evaluation around the same time, so there's not much harm in me joining the ts=10 request. But if that request has been evaluating for an hour? God willing it's about to finish evaluating any time now, and the last thing I want to do is delay the closing of its timestamp by my evaluation time.

How did we get to two buckets in the first place? I guess it goes something like this: what we'd ideally want is a min-heap, with requests ordered by their write timestamps. We'd like a request write timestamp to always be bumped to now - trailing target, and we'd want any proposal to carry as a closed timestamp min(now-trailing target, the heap's min value (i.e. the minimal timestamp of a currently-evaluating request)).
That'd be the ideal, but I guess we don't want to maintain a heap of arbitrary size (the size being the number of concurrent requests). OK, so then we start compressing the requests - let's discretize their timestamps into buckets of 10ms each. A request timestamp would be rounded down to a 10ms step, and would count towards the respective bucket's refcount. Now, if we maintain these buckets in the simplest way possible - a circular buffer - then their count would still be unbounded as it depends on the evaluation time of the slowest request. What we can do is put a limit on the number of buckets, and once the limit is reached, we just have new requests join the last bucket. What should this limit be? Well, 1 doesn't really work, and the next smallest number is 2, so I guess that's how we got it.

But what if we made empty buckets free in our structure? We can bring back the heap - make a heap of buckets, with empty buckets being removed from the heap. Now the size of the heap depends on the number of slow-evaluating requests. We'd still need some limit, but any reasonable limit seems hard to reach because, if you have many slow writes, at some point the latching will also likely block more requests from starting evaluation. So the exact scheme would be that a request's write timestamp is always bumped to now-target (requests don't simply get a free pass in case they're joining a bucket with a low timestamp), then the request's bucket is determined, and then each exiting request reads the heap's min bucket to see what the current timestamp is. Closed timestamps become continuous, up to the bucket rounding duration.
So now I ask - shouldn't we do exactly this? That'd be an intuitive, simple to understand data structure that I can reason about, whereas the two-bucket scheme is not. Given that this tracker is per-range, how much contention can there be on this structure? I'm pretty tempted to see how well this would perform.
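To show the shape of that structure, here's a rough Go sketch (made-up names; a real implementation would keep a proper min-heap instead of scanning a map, and would need to worry about concurrency):

package sketch

import "math"

const bucketStep = int64(10_000_000) // 10ms buckets, in nanos

// bucketTracker counts evaluating requests per discretized write timestamp.
type bucketTracker struct {
	refcnt map[int64]int // bucket timestamp -> evaluating requests in it
}

// track ratchets the request's write timestamp up to now-target, rounds it
// down to a bucket, and registers the request. It returns the bucket key.
func (t *bucketTracker) track(writeTS, nowMinusTarget int64) int64 {
	if writeTS < nowMinusTarget {
		writeTS = nowMinusTarget
	}
	b := writeTS - writeTS%bucketStep
	if t.refcnt == nil {
		t.refcnt = map[int64]int{}
	}
	t.refcnt[b]++
	return b
}

// untrack removes the request and returns the closed timestamp its proposal
// can carry: just below the minimum occupied bucket, capped at now-target.
func (t *bucketTracker) untrack(bucket, nowMinusTarget int64) int64 {
	t.refcnt[bucket]--
	if t.refcnt[bucket] == 0 {
		delete(t.refcnt, bucket) // empty buckets are free
	}
	minBucket := int64(math.MaxInt64)
	for b := range t.refcnt {
		if b < minBucket {
			minBucket = b
		}
	}
	closed := nowMinusTarget
	if minBucket-1 < closed {
		closed = minBucket - 1 // stay below every currently-evaluating request
	}
	return closed
}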


docs/RFCS/20191108_closed_timestamps_v2.md, line 66 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I'm a little unclear about how something like this can actually happen in practice. Don't we check for pending proposals in shouldReplicaQuiesce?

If you meant "failed evaluations" then it's still not possible, right? That's because we call untrack with a 0 LeaseAppliedIndex in that case.

I believe the thinking is you propose something, the proposal's LAI gets captured as a closed ts MLAI, then the proposal gets rejected below Raft, and then there's no further proposals. You buy?


docs/RFCS/20191108_closed_timestamps_v2.md, line 76 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

s/timestamps/timestamp

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 109 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

s/ts/ts1?

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 111 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

difference

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 188 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Don't we want these timestamps to be determined by some range-level closed timestamp duration attribute? It's unclear why we care about kv.closed_timestamp.target_duration at all.

well, for "normal" ranges kv.closed_timestamp.target_duration will stay in effect (at least that's the plan). For non-blocking ranges we'll have some other policy.
No?


docs/RFCS/20191108_closed_timestamps_v2.md, line 190 at r2 (raw file):

Previously, ajwerner wrote…

One thing that might be worth exploring is how we keep track of the set of streams that must exist and their membership. Unquiescing will remove a member. That might cover replication changes, but it might not, because you may forcibly lose the lease. It feels like there are some relatively complex synchronization dances here, and I don't have a clear vision of how all of this is going to be coordinated.

Maybe it all will just work if we're sloppy here. Is the deal just that the read path will consult both the in-memory view of the closed timestamp coming from the inactive-range transport and the durable one, and enforce the latest one?

I've added

In effect, each node will broadcast information about all of its inactive ranges to
every other node (thus every recipient will only be interested in some of the
updates - namely the ones for which it has a replica). The sender doesn't try to
customize messages for each recipient.

You like?


docs/RFCS/20191108_closed_timestamps_v2.md, line 262 at r2 (raw file):

Previously, ajwerner wrote…

nit: missing *

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 266 at r2 (raw file):

Previously, ajwerner wrote…

nit: missing *

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 267 at r2 (raw file):

Previously, ajwerner wrote…

nit: missing *

these are the result of gq doing something bad :(


docs/RFCS/20191108_closed_timestamps_v2.md, line 354 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…
  1. Exit your bucket (i.e. decrement the bucket's refcount and, if 0, increment
    prev and cur).

You've mentioned here^ that we would shift the buckets when the last request left prev.

good catch; I removed that.

@aayushshah15

then the proposal gets rejected below Raft, and then there's no further proposals.

I just don't think I understand that stuff well, but I don't really get why proposals would ever get "rejected". I thought that we simply consider them "failed" after waiting for some duration and retry.

Could you educate me on what we really mean when we talk about "rejected" proposals? And why would they not get retried?


@nvanbenschoten nvanbenschoten left a comment



docs/RFCS/20191108_closed_timestamps_v2.md, line 103 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

so a closed timestamp in a Raft entry is only valid if that entry's command passes the lease index check, which is an indication that we should key closed timestamps off of the lease applied index, not the raft applied index.

I buy this enough. Changed and added a LAIs vs. Raft log indexes section.

Looks much better to me.


docs/RFCS/20191108_closed_timestamps_v2.md, line 156 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

done. I think it's technically the start of the stasis period, not the expiration, right?

I don't think so. I think the stasis period is only there to ensure that reads never miss writes in their uncertainty interval that take place on a new leaseholder. But this isn't the case for closed timestamps or follower reads – we consider such reads to happen at max(read_timestamp, max_timestamp), and so we require the entire uncertainty interval to be closed. I think if we did the same for the regular lease check, we wouldn't need the stasis period at all.

Maybe we should get rid of it? Doing so seems more consistent. It also moves us closer to a world where we validate an individual request's timestamp against a lease, instead of the current clock reading (which is supposed to incorporate the current request's timestamp). I don't see a problem with that. It seems clearer to me and we'll need to do something like it anyway for operations in the future.

I'm curious whether @tbg knows how we got into this situation. Naively, I'd imagine we would have just checked the request's timestamp against the lease. Why the indirection through the clock? Maybe it's because writes only care about the current lease, not the lease at their timestamp, because only the current leaseholder can succeed at proposing a write.

So maybe we should be restructuring lease checks in the code. I think we want logic like:

opTs = max(read_timestamp, max_timestamp)
if in_read_only_code_path {
    if closed_timestamp >= opTs {
        // proceed
    } else if leaseholder && lease_exp >= opTs {
        // proceed
    } else {
        // request lease
    }
} else {
    if !leaseholder {
        // request lease
    } else if lease_exp < opTs {
        // request lease
    } else if lease_exp < clock.Now() + max_offset {
        // request lease. Best-effort check to avoid failed
        // proposals below raft. Not needed for correctness.
    } else {
        // proceed
    } 
}

docs/RFCS/20191108_closed_timestamps_v2.md, line 204 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Well I guess I reserve the right to settle on the exact form at implementation time, but at least conceptually I find it cleaner to separate between the creation of a group and modifications to an existing group rather than the alternative.

SGTM. We can discuss when we get to the implementation.


docs/RFCS/20191108_closed_timestamps_v2.md, line 241 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Added a paragraph with a twist. See if you like.

I don't dislike it, though I don't think the complexity provides nearly as much value as it initially seems. It's only improving things for cases where a follower's log application is delayed (i.e. its closed timestamp is already lagging), then the range becomes inactive, then the follower begins hearing about the range on its side transport before the range catches up. And then it only improves things for the time between when the range does catch up to the inactivity-index and when it receives its next side-transport notification after this point. But the follower had already been lagging by this interval for the entire time that the range had been active, so what's another 200ms? And if the range becomes active again before receiving the next side-transport notification, then the following entry is guaranteed to carry a closed timestamp even higher than this.

So because we're still increasing a follower's closed timestamp as it catches up while lagging some interval behind, this really doesn't seem like it's getting us much. This is in contrast to the current closed timestamp system, where this kind of thing is a lot more important because we don't maintain a history of old closed timestamps anywhere.

What do you think?


docs/RFCS/20191108_closed_timestamps_v2.md, line 285 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

right, clarified

"acts as"


docs/RFCS/20191108_closed_timestamps_v2.md, line 351 at r1 (raw file):

If r1 exits, it will carry closed ts = 10, when it could carry 15 (it carries 10 because, for sequencing purposes, the timestamp a request carries has to be determined before it decrements the refcount on its bucket).

I don't understand this. How do we know that r1 isn't writing at ts = 11?

The truth is that the more I think about these buckets, the less I like them. They seem to be lacking any sort of principle, and the scheme is hard to model. What really bothers me is that the decision about which bucket a particular request should join shouldn't be mechanical; it should depend on when the request arrives in relation to when the other requests in the candidate bucket(s) arrived, and also what their write timestamps are. Like, if the trailing target is, say, 10, and now = 100, and I'm a request with ts = 50, and prev and cur have timestamps of 9 and 10 (i.e. there are requests in there that have been evaluating for a long time), in a two-bucket world I have no choice but to enter cur and hold back the closing of 10 until I finish executing. But that seems outrageous; what gives me the right to do that? Why am I, a ts = 50 request, holding back the closing of 10? If the ts=10 request has just started evaluating, then I'd say it'd be OK; hopefully requests starting evaluation around the same time finish evaluation around the same time, so there's not much harm in me joining the ts=10 request. But if that request has been evaluating for an hour? God willing it's about to finish evaluating any time now, and the last thing I want to do is delay the closing of its timestamp by my evaluation time.

Let's be specific here so we can get a mental model of how this behaves vs. the optimal approach of a min-heap. What we want to know is 1) what is the rate that the tracker advances, and 2) what is the lag that it imposes. Let's say that each write takes L amount of time to evaluate. The worst case is exactly as you say – a new request joins cur immediately before it shifts. So then prev takes L to shift again, as it needs to wait for the entire duration of the newly added request's evaluation. So the rate of the tracker is 1/L, with a period of L. Then there's the lag. We pick a timestamp that we want to close when initializing cur. We then need to wait for one shift for this to be added to log entries as the closed timestamp. This is then the active closed timestamp for L, so the worst-case lag is 2L.

If we compare that to the min-heap, I believe we get a rate of ∞ (period of 0), the closed timestamp can advance as fast as necessary. However, we do get a lag of L, because we need to wait for the earliest entry to finish evaluating before the closed timestamp can progress.

So the two-bucket approach leads to twice the lag of the "optimal" min-heap approach. I'd estimate that request evaluation takes somewhere on the order of 100μs, so the two-bucket approach will result in a closed timestamp that lags the desired time by 100μs more than the min-heap approach.
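To put rough, purely illustrative numbers on that: with L = 100μs, the two buckets give a worst-case lag of target + 2L versus target + L for the heap, so against a closed timestamp target measured in seconds the extra 100μs is a rounding error on top of the lag we deliberately impose anyway.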

How did we get to two buckets in the first place? I guess it goes something like this: what we'd ideally want is a min-heap, with requests ordered by their write timestamps. We'd like a request write timestamp to always be bumped to now - trailing target, and we'd want any proposal to carry as a closed timestamp min(now-trailing target, the heap's min value (i.e. the minimal timestamp of a currently-evaluating request)).
That'd be the ideal, but I guess we don't want to maintain a heap of arbitrary size (the size being the number of concurrent requests). OK, so then we start compressing the requests - let's discretize their timestamps into buckets of 10ms each. A request timestamp would be rounded down to a 10ms step, and would count towards the respective bucket's refcount. Now, if we maintain these buckets in the simplest way possible - a circular buffer - then their count would still be unbounded as it depends on the evaluation time of the slowest request. What we can do is put a limit on the number of buckets, and once the limit is reached, we just have new requests join the last bucket. What should this limit be? Well, 1 doesn't really work, and the next smallest number is 2, so I guess that's how we got it.

This all sounds right to me.

So now I ask - shouldn't we do exactly this? That'd be an intuitive, simple to understand data structure that I can reason about, whereas the two-bucket scheme is not. Given that this tracker is per-range, how much contention can there be on this structure? I'm pretty tempted to see how well this would perform.

I'm very concerned about this. It's moving in the wrong direction. We should be taking this opportunity to make this structure more efficient, not less. Adding a min-heap that every write in a range needs to synchronize on (twice!) will be quite expensive. That's a pair of O(log(concurrent_writes)) operations under an exclusive lock on each write! Plus a few allocations imposed by https://golang.org/pkg/container/heap.

We have a history of improving per-range data structures to avoid expensive exclusive locking, and each time we've improved single-range write throughput by somewhere between 10 and 40%. This is why we can now push 35k writes-per-range. We can't regress on that.

The Raft proposal buffer is a good example of how expensive synchronization can be. By replacing the exclusive locking around the Raft proposals map with a concurrent buffer, we improved single-range write throughput by up to 35%: #38343.

I'd encourage us to continue trying to rationalize the two-bucket + refcount approach and come to a set of rules that we like. One of the major benefits of that structure is that it seems possible to make concurrent. I'd start by imagining atomic refcounts and a read-write lock. Grab the read lock to perform any intra-bucket operations (changing refcounts, reading prev's timestamp, etc.) and the write lock to shift buckets.


docs/RFCS/20191108_closed_timestamps_v2.md, line 66 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I believe the thinking is you propose something, the proposal's LAI gets captured as a closed ts MLAI, then the proposal gets rejected below Raft, and then there's no further proposals. You buy?

Or you propose something, report an LAI, then crash before it gets through Raft. There seem to be all kinds of issues that could happen here with trying to predict a future LAI.


docs/RFCS/20191108_closed_timestamps_v2.md, line 188 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

well, for "normal" ranges kv.closed_timestamp.target_duration will stay in effect (at least that's the plan). For non-blocking ranges we'll have some other policy.
No?

Yes, this seems right to me. At least for now.


docs/RFCS/20191108_closed_timestamps_v2.md, line 147 at r3 (raw file):

proposed

proposer


docs/RFCS/20191108_closed_timestamps_v2.md, line 195 at r3 (raw file):

timestamp. Splits will initialize the RHS by propagating the RHS's closed
timestamp through the `SplitTrigger` structure (which btw will be a lot cleaner
than the way in which we currentl prevent closed ts regressions on the RHS - a

currently

@nvanbenschoten

I'd encourage us to continue trying to rationalize the two-bucket + refcount approach and come to a set of rules that we like. One of the major benefits of that structure is that it seems possible to make concurrent. I'd start by imagining atomic refcounts and a read-write lock. Grab the read lock to perform any intra-bucket operations (changing refcounts, reading prev's timestamp, etc.) and the write lock to shift buckets.

I hacked something like this up in nvanbenschoten@e9e13aa. It seems to work fairly well, and the code's not too bad. Feel free to take a look for inspiration.

@nvanbenschoten

then the proposal gets rejected below Raft, and then there's no further proposals.

I just don't think I understand that stuff well, but I don't really get why proposals would ever get "rejected". I thought that we simply consider them "failed" after waiting for some duration and retry.

Could you educate me on what we really mean when we talk about "rejected" proposals? And why would they not get retried?

@aayushshah15 I missed this question before. As you point out, Raft proposals are reproposed periodically if they are having trouble achieving consensus. This ensures that if they are dropped during a leadership change, snapshot, or some other Raft-level operation, they will continue to try to get committed. This is critical, but it's not quite what's being discussed here.

When discussing "rejected" proposals, what we actually mean is proposals that achieve consensus (i.e. get committed) but then get deterministically rejected by the replicated state machine. The heart of this logic is Replica.shouldApplyCommand, which calls into checkForcedErr. The comment there discusses a few reasons why a command might be rejected. The most important of these reasons is the lease check – a command that was proposed under lease sequence 12 cannot be applied if the Range's lease has changed to 13 in between the time of proposal and application. Without this check, delayed writes from old leaseholders could come back to apply under a new leaseholder and invalidate reads served by this new leaseholder. There is also a check that the lease applied indexes for commands are applied in monotonically increasing order, which is important to protect against re-orderings of writes.

So when talking about applied commands, we mean commands that have committed in the Raft log and aren't rejected by the replicated state machine.


@andreimatei andreimatei left a comment



docs/RFCS/20191108_closed_timestamps_v2.md, line 241 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't dislike it, though I don't think the complexity provides nearly as much value as it initially seems. It's only improving things for cases where a follower's log application is delayed (i.e. its closed timestamp is already lagging), then the range becomes inactive, then the follower begins hearing about the range on its side transport before the range catches up. And then it only improves things for the time between when the range does catch up to the inactivity-index and when it receives its next side-transport notification after this point. But the follower had already been lagging by this interval for the entire time that the range had been active, so what's another 200ms? And if the range becomes active again before receiving the next side-transport notification, then the following entry is guaranteed to carry a closed timestamp even higher than this.

So because we're still increasing a follower's closed timestamp as it catches up while lagging some interval behind, this really doesn't seem like it's getting us much. This is in contrast to the current closed timestamp system, where this kind of thing is a lot more important because we don't maintain a history of old closed timestamps anywhere.

What do you think?

I think you're right, I've changed accordingly.


docs/RFCS/20191108_closed_timestamps_v2.md, line 285 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"acts as"

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 351 at r1 (raw file):

If r1 exits, it will carry closed ts = 10, when it could carry 15 (it carries 10 because, for sequencing purposes, the timestamp a request carries has to be determined before it decrements the refcount on its bucket).

I don't understand this. How do we know that r1 isn't writing at ts = 11?

r1 could be writing at ts = 11, but that shouldn't mean that it can't carry a closedts of 15. There's no contradiction - you're writing at 11 and promise that future LAIs will write above 15, and the followers will rejoice.

The difficulty in getting r1 to carry 15 is one of synchronization between the ProposalBuffer and the tracker: if we go with the described design, r1 cannot carry anything higher than its bucket's closed timestamp because it might be reordered (LAI-wise) with other requests that left the buckets before it did (namely, the prev bucket). In order to carry 15 (or even higher!) we'd need to atomically get a LAI from the ProposalBuffer, and swap the buckets (thus ensuring that future requests get timestamps higher than cur's timestamp (to become prev's timestamp after the switch)). This suggests to me that the tracker should live under the ProposalBuffer - so a request would be registered with the buffer when it begins evaluation (which registration would add the request to a bucket and ratchet its write timestamp), and then the request would leave the bucket when it is "added" to the ProposalBuffer.
With this structure in place, r1 could carry 15 (or higher) because we are now free to assign a command's closed timestamp "late" - after it is sequenced by the ProposalBuffer. So, the logic could be:

  • get a LAI
  • decrement your bucket's refcount
  • if I'm exiting prev and the refcount is now 0, then I swap the buckets and the closed ts I carry is:
    • if (the new) prev is not empty, then I carry prev's timestamp (because I don't have tighter info on the lowest timestamp of a request that's currently in that bucket, so I need to be conservative)
    • if (the new) prev is empty, I'm free to carry an even higher timestamp: I can carry now() - kv.closed_timestamp.target_duration.
  • if I'm exiting cur and the refcount is now 0, then I swap the buckets iff prev is also empty. The closed ts I carry is now() - kv.closed_timestamp.target_duration. This is nice because it means that, if requests don't overlap, each request carries a closed timestamp that it computed, not one computed by the previous command, and, even better, a ts computed after it evaluates, not before.

With the design described in the text, the best we could do was to read prev's ts before entering the ProposalBuffer. That's suboptimal, and more importantly the ordering requirements between the tracker operations and the buffer operations seem subtle - so I think we should tie the two together so as to not screw up.

What do you think of this design?
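To make the bullets above concrete, a minimal Go sketch of that exit path (hypothetical types; the real thing would live under the proposal buffer's lock and be more careful about how the new cur's timestamp is chosen):

package sketch

// twoBuckets tracks evaluating requests in the last two "epochs". prevTS and
// curTS are the timestamps requests in each bucket evaluate above.
type twoBuckets struct {
	prevTS, curTS         int64
	prevRefcnt, curRefcnt int
	nextLAI               uint64
}

// exit is called while holding whatever lock serializes LAI assignment. It
// removes a request from its bucket and returns the LAI plus the closed
// timestamp the resulting proposal carries.
func (b *twoBuckets) exit(inPrev bool, nowMinusTarget int64) (lai uint64, closedTS int64) {
	b.nextLAI++
	lai = b.nextLAI
	if inPrev {
		b.prevRefcnt--
		if b.prevRefcnt > 0 {
			return lai, b.prevTS // others still in prev: be conservative
		}
		// Last one out of prev: shift the buckets.
		b.prevTS, b.prevRefcnt = b.curTS, b.curRefcnt
		b.curTS, b.curRefcnt = nowMinusTarget, 0
		if b.prevRefcnt == 0 {
			b.prevTS = nowMinusTarget // both empty: close as far as allowed
			return lai, nowMinusTarget
		}
		return lai, b.prevTS
	}
	// Exiting cur.
	b.curRefcnt--
	if b.prevRefcnt == 0 && b.curRefcnt == 0 {
		// Nothing else is evaluating: shift and close up to now-target.
		b.prevTS, b.curTS = nowMinusTarget, nowMinusTarget
		return lai, nowMinusTarget
	}
	return lai, b.prevTS
}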

I'm very concerned about this. It's moving in the wrong direction. We should be taking this opportunity to make this structure more efficient, not less. Adding a min-heap that every write in a range needs to synchronize on (twice!) will be quite expensive. That's a pair of O(log(concurrent_writes)) operations under an exclusive lock on each write!

But what you're saying is not exactly true. In the proposed scheme, it's not "every write" that needs to perform heap operations; just requests that create or destroy a bucket would - so one every, say, 10ms. My hope is that, in a high-throughput range scenario, the majority of requests would just operate on a single bucket.


docs/RFCS/20191108_closed_timestamps_v2.md, line 147 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
proposed

proposer

done


docs/RFCS/20191108_closed_timestamps_v2.md, line 195 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

currently

done


@andreimatei andreimatei left a comment



docs/RFCS/20191108_closed_timestamps_v2.md, line 156 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think so. I think the stasis period is only there to ensure that reads never miss writes in their uncertainty interval that take place on a new leaseholder. But this isn't the case for closed timestamps or follower reads – we consider such reads to happen at max(read_timestamp, max_timestamp), and so we require the entire uncertainty interval to be closed. I think if we did the same for the regular lease check, we wouldn't need the stasis period at all.

Maybe we should get rid of it? Doing so seems more consistent. It also moves us closer to a world where we validate an individual request's timestamp against a lease, instead of the current clock reading (which is supposed to incorporate the current request's timestamp). I don't see a problem with that. It seems clearer to me and we'll need to do something like it anyway for operations in the future.

I'm curious whether @tbg knows how we got into this situation. Naively, I'd imagine we would have just checked the request's timestamp against the lease. Why the indirection through the clock? Maybe it's because writes only care about the current lease, not the lease at their timestamp, because only the current leaseholder can succeed at proposing a write.

So maybe we should be restructuring lease checks in the code. I think we want logic like:

opTs = max(read_timestamp, max_timestamp)
if in_read_only_code_path {
    if closed_timestamp >= opTs {
        // proceed
    } else if leaseholder && lease_exp >= opTs {
        // proceed
    } else {
        // request lease
    }
} else {
    if !leaseholder {
        // request lease
    } else if lease_exp < opTs {
        // request lease
    } else if lease_exp < clock.Now() + max_offset {
        // request lease. Best-effort check to avoid failed
        // proposals below raft. Not needed for correctness.
    } else {
        // proceed
    } 
}

for my part in this, changed to lease expiration


docs/RFCS/20191108_closed_timestamps_v2.md, line 351 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If r1 exits, it will carry closed ts = 10, when it could carry 15 (it carries 10 because, for sequencing purposes, the timestamp a request carries has to be determined before it decrements the refcount on its bucket).

I don't understand this. How do we know that r1 isn't writing at ts = 11?

r1 could be writing at ts = 11, but that shouldn't mean that it can't carry a closedts of 15. There's no contradiction - you're writing at 11 and promise that future LAIs will write above 15, and the followers will rejoice.

The difficulty in getting r1 to carry 15 is one of synchronization between the ProposalBuffer and the tracker: if we go with the described design, r1 cannot carry anything higher than its bucket's closed timestamp because it might be reordered (LAI-wise) with other requests that left the buckets before it did (namely, the prev bucket). In order to carry 15 (or even higher!) we'd need to atomically get a LAI from the ProposalBuffer, and swap the buckets (thus ensuring that future requests get timestamps higher than cur's timestamp (to become prev's timestamp after the switch)). This suggests to me that the tracker should live under the ProposalBuffer - so a request would be registered with the buffer when it begins evaluation (which registration would add the request to a bucket and ratchet its write timestamp), and then the request would leave the bucket when it is "added" to the ProposalBuffer.
With this structure in place, r1 could carry 15 (or higher) because we are now free to assign a command's closed timestamp "late" - after it is sequenced by the ProposalBuffer. So, the logic could be:

  • get a LAI
  • decrement your bucket's refcount
  • if I'm exiting prev and the refcount is now 0, then I swap the buckets and the closed ts I carry is:
    • if (the new) prev is not empty, then I carry prev's timestamp (because I don't have tighter info on the lowest timestamp of a request that's currently in that bucket, so I need to be conservative)
    • if (the new) prev is empty, I'm free to carry an even higher timestamp: I can carry now() - kv.closed_timestamp.target_duration.
  • if I'm exiting cur and the refcount is now 0, then I swap the buckets iff prev is also empty. The closed ts I carry is now() - kv.closed_timestamp.target_duration. This is nice because it means that, if requests don't overlap, each request carries a closed timestamp that it computed, not one computed by the previous command, and, even better, a ts computed after it evaluates, not before.
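A rough, self-contained sketch of these exit/swap rules, to make them concrete. Everything here is hypothetical and simplified (plain int64 timestamps, no locking shown); exit is meant to be called under the proposal buffer's write lock, after a LAI has been assigned, and is not the actual implementation.

// Hypothetical sketch of the two-bucket exit logic described above.
package sketch

// bucket tracks in-flight proposals that were promised a common lower bound.
type bucket struct {
	refCount int
	closedTS int64 // timestamp below which no request in this bucket writes
}

// tracker holds the two buckets (prev and cur) described above.
type tracker struct {
	prev, cur *bucket
}

// exit removes a request from its bucket and returns the closed timestamp the
// resulting command carries. now and target stand in for clock.Now() and
// kv.closed_timestamp.target_duration.
func (t *tracker) exit(b *bucket, now, target int64) int64 {
	b.refCount--
	if b == t.prev && b.refCount == 0 {
		// prev emptied: swap the buckets. (The fresh cur bucket's closedTS
		// would be set as requests enter it.)
		t.prev, t.cur = t.cur, &bucket{}
		if t.prev.refCount > 0 {
			// Requests remain in the (new) prev; be conservative.
			return t.prev.closedTS
		}
		// Nothing in flight: free to carry a fresh, higher timestamp.
		return now - target
	}
	if b == t.cur && b.refCount == 0 && t.prev.refCount == 0 {
		// Nothing in flight anywhere: swap and carry a fresh timestamp.
		t.prev, t.cur = t.cur, &bucket{}
		return now - target
	}
	// Other requests are still in flight; carry prev's conservative timestamp.
	return t.prev.closedTS
}

The point the sketch tries to capture is that the closed timestamp is chosen after the request has been sequenced, under the same lock that assigns the LAI.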

With the design described in the text, the best we could do was to read prev's ts before entering the ProposalBuffer. That's suboptimal, and more importantly the ordering requirements between the tracker operations and the buffer operations seem subtle - so I think we should tie the two together so as not to screw up.

What do you think of this design?

I'm very concerned about this. It's moving in the wrong direction. We should be taking this opportunity to make this structure more efficient, not less. Adding a min-heap that every write in a range needs to synchronize on (twice!) will be quite expensive. That's a pair of O(log(concurrent_writes)) operations under an exclusive lock on each write!

But what you're saying is not exactly true. In the proposed scheme, it's not "every write" that needs to perform heap operations; just requests that create or destroy a bucket would - so one every, say, 10ms. My hope is that, in a high-throughput range scenario, the majority of requests would just operate on a single bucket.

I've changed the closing timestamp scheme by decoupling the bucket maintenance rules from the closed timestamp assignment rules. I think the current scheme is the optimal one (optimal for a fixed number of buckets, that is); I'm quite happy with it.
It's all based on an evolution of what we've discussed - tying together the tracker and the proposal buffer. This allows us to exit buckets under the buffer's write lock, which allows for more flexible logic.

PTAL

I've also incorporated your analysis, Nathan, in the text.

craig bot pushed a commit that referenced this pull request Mar 2, 2021
61137: kvserver: implement closed ts side-transport publisher r=nvanbenschoten a=andreimatei

The side-transport is a component running on each node and periodically
publishing closed timestamps on ranges with the lease on the respective
node. This complements the closing of timestamps through Raft commands
such that inactive ranges still have their timestamp advanced.

This commit introduces only the publishing side (the consumer is coming
separately) - the guy opening streaming connections to all other nodes
with follower replicas for some ranges with local leases and
periodically publishing closed timestamp updates on a bunch of ranges at
once.

Care has been taken to make the communication protocol efficient. Each
stream is stateful and the information in every message is nicely
compressed.

See [the RFC](#56675) for details.

Release justification: Needed for global tables.
Release note: None

Co-authored-by: Andrei Matei <[email protected]>
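To make the shape of that publisher concrete, here is a hand-wavy Go sketch of the loop the commit message describes: periodically closing a timestamp for the ranges with local leases and pushing compact, stateful updates to each peer. All of the types and names (stream, sideTransport, the maps) are invented for illustration; the actual implementation is considerably more involved.

// Hypothetical sketch; every name here is invented for illustration.
package sketch

import "time"

type policy int
type rangeID int64

// stream is a stand-in for a stateful connection to one other node. Each
// message carries one closed timestamp per policy plus only the ranges whose
// membership changed since the previous message.
type stream interface {
	send(closed map[policy]int64, added map[rangeID]policy, removed []rangeID) error
}

type sideTransport struct {
	interval time.Duration
	streams  []stream
	// localRanges returns the ranges with a local lease (and, per the design,
	// no in-flight proposals), grouped by closed-timestamp policy.
	localRanges func() map[rangeID]policy
	// closedTS returns the timestamp being closed for a policy on this tick.
	closedTS func(policy) int64
}

func (t *sideTransport) run(stop <-chan struct{}) {
	ticker := time.NewTicker(t.interval)
	defer ticker.Stop()
	prev := map[rangeID]policy{}
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
		}
		cur := t.localRanges()
		// Diff against the previous tick so each stateful stream only carries
		// the ranges that joined or left since the last message.
		added := map[rangeID]policy{}
		var removed []rangeID
		for id, p := range cur {
			if pp, ok := prev[id]; !ok || pp != p {
				added[id] = p
			}
		}
		for id := range prev {
			if _, ok := cur[id]; !ok {
				removed = append(removed, id)
			}
		}
		closed := map[policy]int64{0: t.closedTS(0)} // single policy for brevity
		for _, s := range t.streams {
			_ = s.send(closed, added, removed) // error handling elided
		}
		prev = cur
	}
}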
@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@tbg tbg removed their request for review June 21, 2021 11:11
Contributor

@ajwerner ajwerner left a comment

Are you going to merge this?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @andreimatei, and @nvanbenschoten)

@nvanbenschoten
Member

@andreimatei any interest in merging this?

Contributor Author

@andreimatei andreimatei left a comment

changed the status to "implemented"

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @nvanbenschoten)

@craig
Contributor

craig bot commented Dec 19, 2022

👎 Rejected by too few approved reviews

@andreimatei
Contributor Author

Give it an approval #!@*&%

@andreimatei
Contributor Author

bors r+

@craig
Contributor

craig bot commented Dec 19, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Dec 19, 2022

Build succeeded:

@craig craig bot merged commit bb6ed57 into cockroachdb:master Dec 19, 2022
@andreimatei andreimatei deleted the rfc.closed-ts branch December 20, 2022 00:28