-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hlc: document properties and uses of Hybrid Logical Clocks #72278
hlc: document properties and uses of Hybrid Logical Clocks #72278
Conversation
45224c7
to
9c9de08
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)
pkg/util/hlc/doc.go, line 107 at r1 (raw file):
additional protection here by persisting the wall time of the clock periodically, although this protection is disabled by default.
Have we considered these protections in the context of serverless sql pods? Is this going to end up being an onerous amount of waiting when scaling back up?
Furthermore, what sort of latency tracking to do we do between and pods of tenant and the host? Do we have any sort of protection from a tenant causing havoc by using far future timestamps? Feels like something we need to enter into the threat model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)
pkg/util/hlc/doc.go, line 242 at r1 (raw file):
- SQL descriptor leases. TODO(nvanbenschoten): learn how this works and whether this relies on bounded clock skew of liveness or for safety. May need to talk to Andrew about it. See LeaseManager.timeRemaining.
I don't believe that the the descriptor leases rely on bounded clock skew for correctness in any way. The chosen lease expirations are in the MVCC domain and act as transaction deadlines.
The LeaseManager
you're referring to here is startupmigrations.LeaseManager
. I do believe that thing will lose its mutual exclusion under severe skew. Fortunately, I don't think the startup migrations which touch descriptors (as they existed in the past) were susceptible to any correctness issues in the absence of mutual exclusion; they always used read-write to interact with descriptors. Those startup migrations are decreasingly used. I don't believe they have been used since 20.2 for anything new. They primarily set up new state in the cluster. I'd be happy if that code were deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @nvanbenschoten, and @tbg)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):
// transactions commit with higher timestamps, even if their read and writes // sets do not conflict with the original transaction's. This obviates the // need for uncertainty intervals and prevents the "causal reverse" anamoly
After reading the new file, it seems to me that the removed text here is still accurate (or at least it could be if we fixed up this linearizable option to work as advertised). I'm drawing a distinction between the uncertainty interval and other things that require clock synchronization within max-offset (e.g. non-cooperative lease transfers).
pkg/util/hlc/doc.go, line 30 at r1 (raw file):
clock. There are three channels through which HLC timestamps are passed between nodes
Early on we considered sprinkling more HLC channels around the code (e.g. on periodic heartbeats). These channels would not do anything meaningful for causality tracking, but they might help avoid pushes/restarts occasionally. I'm not sure if we'd ever introduce those at this point, but we might want to tweak the language here to say that these are the three ways that we convey HLC timestamps for causal purposes.
Could we go even further and say that e.g. the raft channel supports causality for cooperative lease transfers, and BatchRequest for observed timestamps, etc? It's interesting to think about the possibility of using separate clocks for these purposes. For example, I don't think the clock used for lease transfers necessarily needs to be the singleton node clock. This might help avoid a high-lock-contention global object. OTOH, observed timestamps basically assume a single clock so maybe there's not much to be gained here.
pkg/util/hlc/doc.go, line 72 at r1 (raw file):
more, see pkg/kv/kvserver/observedts/doc.go. - Non-transactional requests. Most operations in CockroachDB are transactional
"Most KV operations"
pkg/util/hlc/doc.go, line 96 at r1 (raw file):
just concerned about a transaction restarting at a timestamp above the local clock back then because we had yet to separate the "clock timestamp" domain from the "transaction timestamp" domain.
I don't remember exactly what was going on here but I think it's plausible that this is not necessary for correctness, but is needed to ensure that the clock is advanced far enough that it doesn't immediately hit the same retryable error again.
pkg/util/hlc/doc.go, line 107 at r1 (raw file):
Previously, ajwerner wrote…
Have we considered these protections in the context of serverless sql pods? Is this going to end up being an onerous amount of waiting when scaling back up?
Furthermore, what sort of latency tracking to do we do between and pods of tenant and the host? Do we have any sort of protection from a tenant causing havoc by using far future timestamps? Feels like something we need to enter into the threat model.
Do SQL pods need this? I don't think they do. Strict monotonicity is important for KV pods but I think SQL pods would not cause correctness problems if their timestamps moved backwards across a restart (within max-offset).
Also, if monotonicity of timestamps for SQL pods were required, remember that what matters is the interval from the last timestamp generated by a previous incarnation of the node to the first timestamp generated by the new incarnation. All the time spent on the scale-down and scale-up process itself still counts, so you shouldn't actually need much of a sleep.
pkg/util/hlc/doc.go, line 206 at r1 (raw file):
replicas to both consider themselves leaseholders at the same time. This can not lead to stale reads for transactional requests, because a transaction with an uncertainty interval that extends past a lease's expiration will not
Really? I was not aware that we checked the end-of-uncertainty timestamp against the lease expiration. But that sounds like a much more elegant solution to the main issue here than the stasis period.
pkg/util/hlc/doc.go, line 213 at r1 (raw file):
use uncertainty intervals, but the mechanics differ slightly. TODO(nvanbenschoten): is that really it? It's the only thing I've ever been
That may be it; I can't remember anything else right now. In the early days we cared a lot more about non-transactional operations, so they motivated a lot of stuff like this.
This is interesting because the usage of max-offset in non-cooperative lease transfers is a big part of the reason you can't just set max-offset to an extremely high value. If we added uncertainty intervals to non-transactional operations (as proposed above) and could avoid the max-offset gap between consecutive leases (and I guess the "strict monotonic" sleep at startup), then you could run with a high max-offset. You'd get read uncertainty errors all the time, but you could retry and rely on observed timestamps to make progress. This would be kind of like a distributed timestamp oracle for folks who insist that they can't get reasonable clock synchronization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)
pkg/util/hlc/doc.go, line 58 at r1 (raw file):
lease's start time. Upon application of this Raft entry, the incoming leaseholder forwards its HLC to this clock reading, transitively ensuring that its clock is >= the new lease's start time.
consider spelling out why the lease recipient should have a clock higher than the lease start.
pkg/util/hlc/doc.go, line 70 at r1 (raw file):
node, and by extension, at the time that the transaction began. This allows the transaction to avoid uncertainty restarts in some circumstances. For more, see pkg/kv/kvserver/observedts/doc.go.
do you want to hint at the subteties around pushed intents here?
pkg/util/hlc/doc.go, line 96 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I don't remember exactly what was going on here but I think it's plausible that this is not necessary for correctness, but is needed to ensure that the clock is advanced far enough that it doesn't immediately hit the same retryable error again.
AFAICS, 72fa944
doesn't deal with updating the gateway's clock; it only deals with ensuring that the txn restarts above the uncertain value it encountered. Which I guess is also what Ben is saying.
pkg/util/hlc/doc.go, line 139 at r1 (raw file):
- Transaction uncertainty intervals. The single-key linearizability property is satisfied in CockroachDB by tracking an uncertainty interval for each transaction, within which the causal ordering between two transactions is
s/causal ordering/real-time ordering ?
9c9de08
to
97973ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):
I think we still need an uncertainty interval for the same reason that readers on GLOBAL tables need to occasionally wait. Without them, it would be possible for a read on a node with a fast clock (ts@15) to observe a committed value (ts@10) and then a later read on a node with a slow clock (ts@5) to miss the committed value.
FWIW the reason Spanner doesn't need this is because it holds its locks across the commit-wait duration. So the node with a fast clock would get stuck waiting on the locks and would effectively commit-wait as well. See section 4.2.1 from the Spanner paper:
After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.
Does this check out to you, or am I still missing something, maybe related to your "fixed up to work as advertised" comment?
pkg/util/hlc/doc.go, line 30 at r1 (raw file):
I'm not sure if we'd ever introduce those at this point, but we might want to tweak the language here to say that these are the three ways that we convey HLC timestamps for causal purposes.
These three came from an audit of the code, so it doesn't look like we did. I added some words to that effect.
Could we go even further and say that e.g. the raft channel supports causality for cooperative lease transfers, and BatchRequest for observed timestamps, etc?
That's a good idea. I'll point each use of causality at the corresponding channel.
pkg/util/hlc/doc.go, line 58 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider spelling out why the lease recipient should have a clock higher than the lease start.
Done.
pkg/util/hlc/doc.go, line 70 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
do you want to hint at the subteties around pushed intents here?
Done.
pkg/util/hlc/doc.go, line 72 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
"Most KV operations"
Done.
pkg/util/hlc/doc.go, line 96 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
AFAICS,
72fa944
doesn't deal with updating the gateway's clock; it only deals with ensuring that the txn restarts above the uncertain value it encountered. Which I guess is also what Ben is saying.
But why was this needed? We don't pull the next epoch's timestamp from the clock, we pull it straight from the error:
Lines 1004 to 1015 in 72fa944
case *ReadWithinUncertaintyIntervalError: | |
// If the reader encountered a newer write within the uncertainty | |
// interval, we advance the txn's timestamp just past the last observed | |
// timestamp from the node. | |
ts, ok := txn.GetObservedTimestamp(pErr.OriginNode) | |
if !ok { | |
log.Fatalf(ctx, | |
"missing observed timestamp for node %d found on uncertainty restart. "+ | |
"err: %s. txn: %s. Observed timestamps: %s", | |
pErr.OriginNode, pErr, txn, txn.ObservedTimestamps) | |
} | |
txn.Timestamp.Forward(ts) |
Maybe we were concerned at that time about a transaction being given a timestamp that was greater than its local clock's HLC?
pkg/util/hlc/doc.go, line 107 at r1 (raw file):
I think you're right that SQL pods don't need this. Luckily, it doesn't look like they get it today, because ensureClockMonotonicity
is called in server.Server.PreStart
. SQL Pods don't use a server.Server
, they use a server.SQLServer
.
Furthermore, what sort of latency tracking to do we do between and pods of tenant and the host? Do we have any sort of protection from a tenant causing havoc by using far future timestamps? Feels like something we need to enter into the threat model.
This is a good point. This is an attack vector that we have not considered to date. We have some protection in that we will detect and reject unreasonable clock jumps in Clock.UpdateAndCheckMaxOffset
, but this isn't comprehensive. A malicious tenant could keep pushing the clock forward by max_offset-1
until it crashed a KV node.
Part of the purpose of this documentation effort was to better understand when and why clock synchronization is needed. What is shows is that the BatchRequest channel mostly isn't needed, especially in the client->server direction and not for correctness. So I propose that we gate the Clock.Update
call in Store.Send
on whether the client is trusted or not. If the request is from a secondary tenant, we can skip it.
pkg/util/hlc/doc.go, line 139 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/causal ordering/real-time ordering ?
Done.
pkg/util/hlc/doc.go, line 206 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Really? I was not aware that we checked the end-of-uncertainty timestamp against the lease expiration. But that sounds like a much more elegant solution to the main issue here than the stasis period.
Ah, you're right. I tried to make that change in #58904, but didn't get this far. Fixed.
pkg/util/hlc/doc.go, line 213 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
That may be it; I can't remember anything else right now. In the early days we cared a lot more about non-transactional operations, so they motivated a lot of stuff like this.
This is interesting because the usage of max-offset in non-cooperative lease transfers is a big part of the reason you can't just set max-offset to an extremely high value. If we added uncertainty intervals to non-transactional operations (as proposed above) and could avoid the max-offset gap between consecutive leases (and I guess the "strict monotonic" sleep at startup), then you could run with a high max-offset. You'd get read uncertainty errors all the time, but you could retry and rely on observed timestamps to make progress. This would be kind of like a distributed timestamp oracle for folks who insist that they can't get reasonable clock synchronization.
Is this accounting for the need to check the end-of-uncertainty timestamp against the lease expiration if we removed the stasis gap? If we had a higher max-offset then wouldn't the utility window of leases would become smaller, eventually to the point where leases could not be used at all?
pkg/util/hlc/doc.go, line 242 at r1 (raw file):
Previously, ajwerner wrote…
- SQL descriptor leases. TODO(nvanbenschoten): learn how this works and whether this relies on bounded clock skew of liveness or for safety. May need to talk to Andrew about it. See LeaseManager.timeRemaining.
I don't believe that the the descriptor leases rely on bounded clock skew for correctness in any way. The chosen lease expirations are in the MVCC domain and act as transaction deadlines.
The
LeaseManager
you're referring to here isstartupmigrations.LeaseManager
. I do believe that thing will lose its mutual exclusion under severe skew. Fortunately, I don't think the startup migrations which touch descriptors (as they existed in the past) were susceptible to any correctness issues in the absence of mutual exclusion; they always used read-write to interact with descriptors. Those startup migrations are decreasingly used. I don't believe they have been used since 20.2 for anything new. They primarily set up new state in the cluster. I'd be happy if that code were deleted.
Got it, thanks for the confirmation. I'll remove this section then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice to have all of this brain juice in one digestible text.
Reviewed 3 of 4 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think we still need an uncertainty interval for the same reason that readers on GLOBAL tables need to occasionally wait. Without them, it would be possible for a read on a node with a fast clock (ts@15) to observe a committed value (ts@10) and then a later read on a node with a slow clock (ts@5) to miss the committed value.
FWIW the reason Spanner doesn't need this is because it holds its locks across the commit-wait duration. So the node with a fast clock would get stuck waiting on the locks and would effectively commit-wait as well. See section 4.2.1 from the Spanner paper:
After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.
Does this check out to you, or am I still missing something, maybe related to your "fixed up to work as advertised" comment?
That makes sense at least to me. I would even find it useful to include this reasoning in the comment.
pkg/util/hlc/doc.go, line 107 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think you're right that SQL pods don't need this. Luckily, it doesn't look like they get it today, because
ensureClockMonotonicity
is called inserver.Server.PreStart
. SQL Pods don't use aserver.Server
, they use aserver.SQLServer
.Furthermore, what sort of latency tracking to do we do between and pods of tenant and the host? Do we have any sort of protection from a tenant causing havoc by using far future timestamps? Feels like something we need to enter into the threat model.
This is a good point. This is an attack vector that we have not considered to date. We have some protection in that we will detect and reject unreasonable clock jumps in
Clock.UpdateAndCheckMaxOffset
, but this isn't comprehensive. A malicious tenant could keep pushing the clock forward bymax_offset-1
until it crashed a KV node.Part of the purpose of this documentation effort was to better understand when and why clock synchronization is needed. What is shows is that the BatchRequest channel mostly isn't needed, especially in the client->server direction and not for correctness. So I propose that we gate the
Clock.Update
call inStore.Send
on whether the client is trusted or not. If the request is from a secondary tenant, we can skip it.
Sounds like something we should file an issue for.
pkg/util/hlc/doc.go, line 25 at r2 (raw file):
Causality tracking HLCs provide causality tracking through their logical component upon each
This isn't quite, exactly, true, right? The logical component doesn't track anything if observed timestamps always strictly increase in the physical component. It is more that they help establish causality between events that share a wall clock reading.
pkg/util/hlc/doc.go, line 35 at r2 (raw file):
- Raft (unidirectional): proposers of Raft commands (i.e. leaseholders) attach clock readings to these command, which are later consumed by followers when commands are applied to their Raft state machine.
Can you add a permalink to the doc? I'm not sure what you are referring to, and I did look through
type RaftCommand struct { |
WriteTimestamp
nor the ClosedTimestamp
.
pkg/util/hlc/doc.go, line 37 at r2 (raw file):
commands are applied to their Raft state machine. - BatchRequest API (bidirectional): clients and servers of the KV BatchRequest
Ditto, are you talking about this one here?
cockroach/pkg/roachpb/api.proto
Lines 1983 to 1992 in 5f43585
message Header { | |
// timestamp specifies time at which reads or writes should be performed. If | |
// the timestamp is set to zero value, its value is initialized to the wall | |
// time of the server node. | |
// | |
// Transactional requests are not allowed to set this field; they must rely on | |
// the server to set it from txn.ReadTimestamp. Also, for transactional | |
// requests, writes are performed at the provisional commit timestamp | |
// (txn.WriteTimestamp). | |
util.hlc.Timestamp timestamp = 1 [(gogoproto.nullable) = false]; |
But this isn't really populated with hlc clock readings.
pkg/util/hlc/doc.go, line 44 at r2 (raw file):
readings back to the root of the flow. Currently, this only takes place on errors, and relates to the "Transaction retry errors" interaction detailed below.
Ditto
pkg/util/hlc/doc.go, line 96 at r2 (raw file):
- Non-transactional requests (Raft + BatchRequest channels). Most KV operations in CockroachDB are transactional and receive their read timestamps from their client. They use an uncertainty interval (see below) to avoid stale reads in
"from their client" might be ambiguous. The timestamp is picked by the gateway's hlc clock when instantiating the transaction.
pkg/util/hlc/doc.go, line 98 at r2 (raw file):
client. They use an uncertainty interval (see below) to avoid stale reads in the presence of clock skew. However, the KV API also exposes the option to send a single-range, strongly consistent, "non-transaction" request. These
maybe a clearer why to phrase this is that
the KV API also exposes the option to elide the transaction for requests targeting a single range (which trivially applies to all point requests). These requests do not carry a read predetermined timestamp; instead, it is chosen from the HLC upon arrival at the leaseholder for the range. Since the hlc clock always leads the timestamp if any write served on the range, this will not result in stale reads, despite not using an uncertainty interval for such requests.
pkg/util/hlc/doc.go, line 123 at r2 (raw file):
Strict monotonicity HLCs provide strict monotonicity within and across restarts on a single node.
maybe add "as implemented by CockroachDB"
pkg/util/hlc/doc.go, line 132 at r2 (raw file):
Strictly monotonic timestamp allocation ensures that two causally dependent transactions originating from the same node are given timestamps that reflect
Maybe worth mentioning that this isn't one of CockroachDB's crucial guarantees since as a multi-node database, you can't assume that causally dependent ops hit the same node. It's a nice property anyway.
pkg/util/hlc/doc.go, line 213 at r2 (raw file):
more, see LeaseState_UNUSABLE. Note however that it is easy to overstate the salient point here if one is
understate?
pkg/util/hlc/doc.go, line 231 at r2 (raw file):
with an uncertainty interval that extends past a lease's expiration will not be able to use that lease to perform a read (which is enforced by a stasis period immediately before its expiration). However, because some
It's interesting that during a cooperative lease transfer, we don't have to pick a start timestamp for the next read that takes into account any MaxOffset-checks that was applied for reads under the current lease. I suppose that this makes sense, given that success of the lease transfer implies that there wasn't a competing leaseholder, and if there was, the transfer fails and so no harm done.
pkg/util/hlc/doc.go, line 256 at r2 (raw file):
maybe add
HAZARD: this mode of operation is completely untested.
so whatever theoretical properties we ascribe here, it's not clear that the code even works according to spec.
pkg/util/hlc/doc.go, line 266 at r2 (raw file):
clock’s offset from other nodes. This best-effort validation is done in the RemoteClockMonitor. If any node exceeds the configured maximum offset by more than 80% compared to a majority of other nodes, it self-terminates.
Interesting, the 80% is news to me (I know we use some Marzullo-type thing, but don't remember specifics). Maybe a permalink here too for sleuthers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice to have all of this brain juice in one digestible text.
Reviewed 3 of 4 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think we still need an uncertainty interval for the same reason that readers on GLOBAL tables need to occasionally wait. Without them, it would be possible for a read on a node with a fast clock (ts@15) to observe a committed value (ts@10) and then a later read on a node with a slow clock (ts@5) to miss the committed value.
FWIW the reason Spanner doesn't need this is because it holds its locks across the commit-wait duration. So the node with a fast clock would get stuck waiting on the locks and would effectively commit-wait as well. See section 4.2.1 from the Spanner paper:
After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.
Does this check out to you, or am I still missing something, maybe related to your "fixed up to work as advertised" comment?
That makes sense at least to me. I would even find it useful to include this reasoning in the comment.
pkg/util/hlc/doc.go, line 107 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think you're right that SQL pods don't need this. Luckily, it doesn't look like they get it today, because
ensureClockMonotonicity
is called inserver.Server.PreStart
. SQL Pods don't use aserver.Server
, they use aserver.SQLServer
.Furthermore, what sort of latency tracking to do we do between and pods of tenant and the host? Do we have any sort of protection from a tenant causing havoc by using far future timestamps? Feels like something we need to enter into the threat model.
This is a good point. This is an attack vector that we have not considered to date. We have some protection in that we will detect and reject unreasonable clock jumps in
Clock.UpdateAndCheckMaxOffset
, but this isn't comprehensive. A malicious tenant could keep pushing the clock forward bymax_offset-1
until it crashed a KV node.Part of the purpose of this documentation effort was to better understand when and why clock synchronization is needed. What is shows is that the BatchRequest channel mostly isn't needed, especially in the client->server direction and not for correctness. So I propose that we gate the
Clock.Update
call inStore.Send
on whether the client is trusted or not. If the request is from a secondary tenant, we can skip it.
Sounds like something we should file an issue for.
pkg/util/hlc/doc.go, line 25 at r2 (raw file):
Causality tracking HLCs provide causality tracking through their logical component upon each
This isn't quite, exactly, true, right? The logical component doesn't track anything if observed timestamps always strictly increase in the physical component. It is more that they help establish causality between events that share a wall clock reading.
pkg/util/hlc/doc.go, line 35 at r2 (raw file):
- Raft (unidirectional): proposers of Raft commands (i.e. leaseholders) attach clock readings to these command, which are later consumed by followers when commands are applied to their Raft state machine.
Can you add a permalink to the doc? I'm not sure what you are referring to, and I did look through
type RaftCommand struct { |
WriteTimestamp
nor the ClosedTimestamp
.
pkg/util/hlc/doc.go, line 37 at r2 (raw file):
commands are applied to their Raft state machine. - BatchRequest API (bidirectional): clients and servers of the KV BatchRequest
Ditto, are you talking about this one here?
cockroach/pkg/roachpb/api.proto
Lines 1983 to 1992 in 5f43585
message Header { | |
// timestamp specifies time at which reads or writes should be performed. If | |
// the timestamp is set to zero value, its value is initialized to the wall | |
// time of the server node. | |
// | |
// Transactional requests are not allowed to set this field; they must rely on | |
// the server to set it from txn.ReadTimestamp. Also, for transactional | |
// requests, writes are performed at the provisional commit timestamp | |
// (txn.WriteTimestamp). | |
util.hlc.Timestamp timestamp = 1 [(gogoproto.nullable) = false]; |
But this isn't really populated with hlc clock readings.
pkg/util/hlc/doc.go, line 44 at r2 (raw file):
readings back to the root of the flow. Currently, this only takes place on errors, and relates to the "Transaction retry errors" interaction detailed below.
Ditto
pkg/util/hlc/doc.go, line 96 at r2 (raw file):
- Non-transactional requests (Raft + BatchRequest channels). Most KV operations in CockroachDB are transactional and receive their read timestamps from their client. They use an uncertainty interval (see below) to avoid stale reads in
"from their client" might be ambiguous. The timestamp is picked by the gateway's hlc clock when instantiating the transaction.
pkg/util/hlc/doc.go, line 98 at r2 (raw file):
client. They use an uncertainty interval (see below) to avoid stale reads in the presence of clock skew. However, the KV API also exposes the option to send a single-range, strongly consistent, "non-transaction" request. These
maybe a clearer why to phrase this is that
the KV API also exposes the option to elide the transaction for requests targeting a single range (which trivially applies to all point requests). These requests do not carry a read predetermined timestamp; instead, it is chosen from the HLC upon arrival at the leaseholder for the range. Since the hlc clock always leads the timestamp if any write served on the range, this will not result in stale reads, despite not using an uncertainty interval for such requests.
pkg/util/hlc/doc.go, line 123 at r2 (raw file):
Strict monotonicity HLCs provide strict monotonicity within and across restarts on a single node.
maybe add "as implemented by CockroachDB"
pkg/util/hlc/doc.go, line 132 at r2 (raw file):
Strictly monotonic timestamp allocation ensures that two causally dependent transactions originating from the same node are given timestamps that reflect
Maybe worth mentioning that this isn't one of CockroachDB's crucial guarantees since as a multi-node database, you can't assume that causally dependent ops hit the same node. It's a nice property anyway.
pkg/util/hlc/doc.go, line 213 at r2 (raw file):
more, see LeaseState_UNUSABLE. Note however that it is easy to overstate the salient point here if one is
understate?
pkg/util/hlc/doc.go, line 231 at r2 (raw file):
with an uncertainty interval that extends past a lease's expiration will not be able to use that lease to perform a read (which is enforced by a stasis period immediately before its expiration). However, because some
It's interesting that during a cooperative lease transfer, we don't have to pick a start timestamp for the next read that takes into account any MaxOffset-checks that was applied for reads under the current lease. I suppose that this makes sense, given that success of the lease transfer implies that there wasn't a competing leaseholder, and if there was, the transfer fails and so no harm done.
pkg/util/hlc/doc.go, line 256 at r2 (raw file):
maybe add
HAZARD: this mode of operation is completely untested.
so whatever theoretical properties we ascribe here, it's not clear that the code even works according to spec.
pkg/util/hlc/doc.go, line 266 at r2 (raw file):
clock’s offset from other nodes. This best-effort validation is done in the RemoteClockMonitor. If any node exceeds the configured maximum offset by more than 80% compared to a majority of other nodes, it self-terminates.
Interesting, the 80% is news to me (I know we use some Marzullo-type thing, but don't remember specifics). Maybe a permalink here too for sleuthers.
97973ff
to
4e9bd1e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That makes sense at least to me. I would even find it useful to include this reasoning in the comment.
Done.
pkg/util/hlc/doc.go, line 107 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Sounds like something we should file an issue for.
Done in #72663.
pkg/util/hlc/doc.go, line 25 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't quite, exactly, true, right? The logical component doesn't track anything if observed timestamps always strictly increase in the physical component. It is more that they help establish causality between events that share a wall clock reading.
Done.
pkg/util/hlc/doc.go, line 35 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you add a permalink to the doc? I'm not sure what you are referring to, and I did look through
. I don't think that you mean the
type RaftCommand struct { WriteTimestamp
nor theClosedTimestamp
.
I do mean the WriteTimestamp
, which is used to update the clock on followers here. I added a link. But also, I'm coming at that code with an axe right after this, and I'll be changing this working slightly at that point.
pkg/util/hlc/doc.go, line 37 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto, are you talking about this one here?
cockroach/pkg/roachpb/api.proto
Lines 1983 to 1992 in 5f43585
message Header { // timestamp specifies time at which reads or writes should be performed. If // the timestamp is set to zero value, its value is initialized to the wall // time of the server node. // // Transactional requests are not allowed to set this field; they must rely on // the server to set it from txn.ReadTimestamp. Also, for transactional // requests, writes are performed at the provisional commit timestamp // (txn.WriteTimestamp). util.hlc.Timestamp timestamp = 1 [(gogoproto.nullable) = false]; But this isn't really populated with hlc clock readings.
Added a few links. I'm also coming to clean that code up soon 😃 Some of this is a little aspirational for the next week or so.
pkg/util/hlc/doc.go, line 44 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto
Done.
pkg/util/hlc/doc.go, line 96 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"from their client" might be ambiguous. The timestamp is picked by the gateway's hlc clock when instantiating the transaction.
Done.
pkg/util/hlc/doc.go, line 98 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
maybe a clearer why to phrase this is that
the KV API also exposes the option to elide the transaction for requests targeting a single range (which trivially applies to all point requests). These requests do not carry a read predetermined timestamp; instead, it is chosen from the HLC upon arrival at the leaseholder for the range. Since the hlc clock always leads the timestamp if any write served on the range, this will not result in stale reads, despite not using an uncertainty interval for such requests.
Done. That is clearer, thanks.
pkg/util/hlc/doc.go, line 123 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
maybe add "as implemented by CockroachDB"
Done.
pkg/util/hlc/doc.go, line 132 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe worth mentioning that this isn't one of CockroachDB's crucial guarantees since as a multi-node database, you can't assume that causally dependent ops hit the same node. It's a nice property anyway.
Done.
pkg/util/hlc/doc.go, line 213 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
understate?
I was meaning that the bounded skew is not relied upon to keep the actual MVCC intervals owned by leases disjoint.
pkg/util/hlc/doc.go, line 256 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
maybe add
HAZARD: this mode of operation is completely untested.
so whatever theoretical properties we ascribe here, it's not clear that the code even works according to spec.
I like it. Done.
pkg/util/hlc/doc.go, line 266 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Interesting, the 80% is news to me (I know we use some Marzullo-type thing, but don't remember specifics). Maybe a permalink here too for sleuthers.
Done.
4e9bd1e
to
db390fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)
pkg/util/hlc/doc.go, line 70 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done.
Done? I don't see the word "intent" anywhere.
pkg/util/hlc/doc.go, line 95 at r3 (raw file):
Separately, when a leaseholder on a given node serves a write, it ensures that the node's HLC clock is >= the written_timestamp of the write.
Hmm, but we want to get rid of this for the purposes of multi-tenancy, as we were talking about last night (i.e. to prevent a SQL pod from injecting timestamps into the cluster). So... what happens then with observed timestamps?
This commit adds a `doc.go` file to the `pkg/util/hlc package` that details that properties and the uses of Hybrid Logical Clocks throughout the system. It is meant to capture an overview of the ways in which HLCs benfit CockroachDB and to exhaustively enumerate the (few) places in which the HLC is a key component to the correctness of different subsystems. This was inspired by cockroachdb#72121, which is once again making me feel uneasy about how implicit most interactions with the HLC clock are. Specifically, the uses of the HLC for causality tracking are subtle and underdocumented. The typing changes made in cockroachdb#58349 help to isolate timestamps used for causality tracking from any other timestamps in the system, but until we remove the escape hatch of dynamically casting a `Timestamp` back to a `ClockTimestamp` with `TryToClockTimestamp()`, it is still too difficult to understand when and why clock signals are being passed between HLCs on different nodes, and where doing so is necessary for correctness. I'm looking to make that change and I'm hoping that documenting this first (with help from the reviewers!) will set that up to be successful. Some of this was adapted from section 4 of the SIGMOD 2020 paper.
db390fe
to
3392a0b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)
pkg/util/hlc/doc.go, line 70 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Done? I don't see the word "intent" anywhere.
I made a reference to the written_timestamp
above, but I'll be more explicit and say that this does not change even if an intent is pushed during resolution. Done.
pkg/util/hlc/doc.go, line 95 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Separately, when a leaseholder on a given node serves a write, it ensures that the node's HLC clock is >= the written_timestamp of the write.
Hmm, but we want to get rid of this for the purposes of multi-tenancy, as we were talking about last night (i.e. to prevent a SQL pod from injecting timestamps into the cluster). So... what happens then with observed timestamps?
We want to get rid of hlc.Clock.Update(Timestamp)
. That doesn't mean we have to get rid of ensuring that a node's HLC clock is >= the WrittenTimestamp
(note: this is a new timestamp, see #72121 (comment)) of a write, it just means that the write is originally written with a WrittenTimestamp
in some cases. FWIW, this is already the case for future time writes — just picture s/synthetic_bit=true/written_timestamp=now()/
in cases where now() < write_timestamp
.
bors r+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)
pkg/util/hlc/doc.go, line 95 at r3 (raw file):
it just means that the write is originally written with a WrittenTimestamp in some cases.
Right... But what are these cases going to be? Any timestamp above the current lease expiration? So the idea will be that KV will make a determination that the gateway for a particular write is not to be trusted and so we won't bother ensuring linearizability for its writes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)
pkg/util/hlc/doc.go, line 95 at r3 (raw file):
But what are these cases going to be?
Cases where the Txn.WriteTimestamp
is greater than the leaseholder's local HLC clock. The idea is that we'll no longer "push" operation timestamps into the local HLC clock. Instead, we'll (optionally) push into it with separate clock readings and then pull from it to determine a write's WrittenTimestamp
. As an optimization, if this WrittenTimestamp
is equal to or greater than its MVCC Timestamp
(the common case), then we don't need to explicitly include the WrittenTimestamp
in the key/value. Otherwise, we do. And this WrittenTimestamp
is what ObservedTimestamps
are compared against.
Ignoring the clock reading from secondary tenants just increases the chance that we'll fall into this second bucket and need to write both timestamps into a key/value.
Any timestamp above the current lease expiration?
No, writes still won't be allowed with MVCC timestamps above the current lease expiration. Although maybe they could be as long as their WrittenTimestamp (hlc.ClockTimestamp
) is below it and they did not read? I'd have to think through that. But that's not the current proposal.
So the idea will be that KV will make a determination that the gateway for a particular write is not to be trusted and so we won't bother ensuring linearizability for its writes?
The first part about trust is correct. We would ignore the clock reading from untrusted gateways.
The second part about linearizability is not correct. We'd still ensure linearizability as long as the untrusted tenant's HLC is within the max_offset of every other node in the system. But if it isn't, I guess that tenant would be susceptible to stale reads.
Want to talk through this at some point to make sure that I'm not missing something before I go off and pick this fight with the code?
Build failed (retrying...): |
Build succeeded: |
This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in cockroachdb#72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case. The allows us to remote one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in cockroachdb#72121 (comment), I would like to remove ability to downcast a "data-plane" `Timestamp` to a "control-plane" `CloudTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes. The other place where we cast from `Timestamp` to `ClockTimesatmp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on cockroachdb#73732.
This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in cockroachdb#72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case. The allows us to remote one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in cockroachdb#72121 (comment), I would like to remove ability to downcast a "data-plane" `Timestamp` to a "control-plane" `CloudTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes. The other place where we cast from `Timestamp` to `ClockTimesatmp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on cockroachdb#73732.
76095: kv: don't pass clock information through Raft log r=nvanbenschoten a=nvanbenschoten This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in #72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case. The allows us to remote one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in #72121 (comment), I would like to remove ability to downcast a "data-plane" `Timestamp` to a "control-plane" `CloudTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes. The other place where we cast from `Timestamp` to `ClockTimesatmp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on #73732. 76163: bazel: remove old protos when generating new ones r=ajwerner a=ajwerner This is what the Makefile did. It was painful to have the old onces because they'd lead to spurious diffs. Release note: None 76166: sql: support version numbers on descriptor validation r=fqazi a=fqazi Previously, the descriptor validation code did not take a version number, so it was not possible to version gate new validation logic. This was inadequate because when new fields are introduced we don't want their validation to kick in for certain cases like the debug doctor, such as any new fields with non-zero defaults. To address this, this patch add supports for version numbers inside the validation, and updates unit tests to pass this in as well. It also adds a new option on debug doctor to run a version of validation. Release note (cli change): Add new optional version argument to the doctor examine command. This can be used to enable / disable validation when examining older zip directories. 76188: ci: make sure metamorphic nightly uses proper formatter for issues r=nicktrav a=rickystewart The `--formatter=pebble-metamorphic` option got lost in #75585. Release note: None 76191: gazelle: exclude `.pb.go` files r=dt a=rickystewart Should prevent Gazelle from getting confused and adding these to `srcs`. Release note: None 76192: ci: make sure extra env vars are set for `roachtest` jobs r=rail a=rickystewart We [need these](https://github.com/cockroachdb/cockroach/blob/5e7690d6da09821ff431ef32fe8d1430d05aed9f/pkg/cmd/internal/issues/issues.go#L159-L171). Release note: None 76194: tree: remove TODOs about bytea/float cast volatility r=mgartner a=rafiss These comments aren't needed, since the current cast volatility is correct, as far as I can tell. We might want to report this as a bug in Postgres if it describes these casts as immutable incorrectly. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Faizan Qazi <[email protected]> Co-authored-by: Ricky Stewart <[email protected]> Co-authored-by: Rafi Shamim <[email protected]>
Before this change, we were updating the local clock with each BatchResponse's WriteTimestamp. This was meant to handle cases where the batch request timestamp was forwarded during evaluation. This was unnecessary for two reasons. The first is that BatchRequest can lead the local HLC clock (explored in cockroachdb#72121 and cockroachdb#72278) as long as any clock reading information in the values reflects the state of the HLC clock (synthetic bit today, "local timestamp" tomorrow). The second is that even if this first reason was not the case, the BatchRequest will only ever be bumped one logical tick above any existing value in the range, so as long as the existing values in the range followed the rule of trailing the leaseholders HLC (NB: they don't), all new writes would too.
This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in cockroachdb#72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case. The allows us to remote one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in cockroachdb#72121 (comment), I would like to remove ability to downcast a "data-plane" `Timestamp` to a "control-plane" `CloudTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes. The other place where we cast from `Timestamp` to `ClockTimesatmp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on cockroachdb#73732.
The godoc is currently hosted on here. I'll leave that up for a few days.
This commit adds a
doc.go
file to thepkg/util/hlc
package that details thatproperties and the uses of Hybrid Logical Clocks throughout the system. It is
meant to capture an overview of the ways in which HLCs benfit CockroachDB and to
exhaustively enumerate the (few) places in which the HLC is a key component to
the correctness of different subsystems.
This was inspired by #72121, which is once again making me feel uneasy about how
implicit most interactions with the HLC clock are. Specifically, the uses of the
HLC for causality tracking are subtle and underdocumented. The typing changes
made in #58349 help to isolate timestamps used for causality tracking from any
other timestamps in the system, but until we remove the escape hatch of
dynamically casting a
Timestamp
back to aClockTimestamp
withTryToClockTimestamp()
, it is still too difficult to understand when and whyclock signals are being passed between HLCs on different nodes, and where doing
so is necessary for correctness. I'm looking to make that change and I'm hoping
that documenting this first (with help from the reviewers!) will set that up to
be successful.
Some of this was adapted from section 4 of the SIGMOD 2020 paper.