-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hlc: introduce synthetic flag on timestamps #56373
hlc: introduce synthetic flag on timestamps #56373
Conversation
fc19b2a
to
0cc3a6a
Compare
Happy to review, but it will be on Tuesday.
…On Fri, Nov 6, 2020, 18:50 cockroach-teamcity ***@***.***> wrote:
This change is [image: Reviewable]
<https://reviewable.io/reviews/cockroachdb/cockroach/56373>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#56373 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABGXPZCCBYFRNAD6WUSTAN3SOQZPBANCNFSM4TM642XQ>
.
|
…rved timestamps This commit introduces a new package which contains logic and documentation related to the observed timestamp system, which allows transactions to track causality between themselves and other, possibly-concurrent, transactions in order to avoid uncertainty related restarts. This ties into our efforts to better document the kv layer. It also ties into synthetic timestamps (cockroachdb#56373), which are important for non-blocking transactions (cockroachdb#52745) and for fixing the interaction between observed timestamps and transaction refreshing (cockroachdb#36431). The proposed fix is to split out a new EffectiveMaxTimestamp field from the MaxTimestamp field on a transaction and set the EffectiveMaxTimestamp in LimitTxnMaxTimestamp. The EffectiveMaxTimestamp field will dictate the uncertainty interval for values with normal timestamps while the original MaxTimestamp will dictate the uncertainty interval for values with synthetic timestamps. In practice, this will mean that observed timestamps do not apply to values with synthetic timestamps. This commit gives us a package to house this new complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments only -- I haven't yet looked at the last 2 commits which tbg will be reviewing, but I'll try to follow along.
Reviewed 12 of 12 files at r1, 1 of 1 files at r2, 5 of 5 files at r3, 27 of 29 files at r4, 1 of 1 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/roachpb/data.go, line 1101 at r4 (raw file):
} if t.ReadTimestamp == o.ReadTimestamp {
why is the synthetic bit relevant here?
pkg/util/hlc/legacy_timestamp.proto, line 41 at r4 (raw file):
// == operator). Consider use of the EqOrdering method when testing for // equality. optional uint32 flags = 3;
is this nullable
because it will be serialized in MVCCMetadata
and we don't want to pay the serialized cost when it is not set?
pkg/util/hlc/timestamp.go, line 272 at r4 (raw file):
func (t *Timestamp) Backward(s Timestamp) { // TODO(nvanbenschoten): if either timestamp is non-synthetic, we can remove // the synthetic bit.
Maybe phrase this TODO as: remove the synthetic bit from the older value if the younger one is non-synthetic.
pkg/util/hlc/timestamp.go, line 283 at r4 (raw file):
} // ToLegacyTimestamp converts a Timestamp to a LegacyTimestamp.
Is it possible to get rid of LegacyTimestamp
?
0cc3a6a
to
10d9549
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/roachpb/data.go, line 1101 at r4 (raw file):
Previously, sumeerbhola wrote…
why is the synthetic bit relevant here?
It isn't really relevant. This is just switching from Equal
to ==
, which has the same semantics but it more idiomatic to test for structural equality.
pkg/util/hlc/legacy_timestamp.proto, line 41 at r4 (raw file):
Previously, sumeerbhola wrote…
is this
nullable
because it will be serialized inMVCCMetadata
and we don't want to pay the serialized cost when it is not set?
Kind of. It's nullable
because we need identical encoding below Raft to ensure that all replicas stay bytewise identical even in a mixed version cluster. This is the same reason why we've never been able to completely remove LegacyTimestamp
. So since the field won't even exist on v20.2 nodes, we need to make sure it doesn't get serialized on v21.1 nodes.
pkg/util/hlc/timestamp.go, line 272 at r4 (raw file):
Previously, sumeerbhola wrote…
Maybe phrase this TODO as: remove the synthetic bit from the older value if the younger one is non-synthetic.
Done.
pkg/util/hlc/timestamp.go, line 283 at r4 (raw file):
Previously, sumeerbhola wrote…
Is it possible to get rid of
LegacyTimestamp
?
I'm trying, but it's putting up a fight: #56190. It's also used in MVCCMetadata
, which is written to "below Raft" in a number of places. I think we're going to need to run an online migration if we ever want to remove it entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still anxious that synthetic and regular timestamps will have an opportunity to mix when they shouldn't, given how much existing code is out there. In particular, .Forward()
and .Backward()
seem susceptible. I wish we had proper enum types.
From plowing through the code so far, are you worried at all? And/or do you have ideas on getting at least some amount of separation between them back (clearly separate types are not an option)?
Reviewed 41 of 41 files at r6, 4 of 4 files at r7, 1 of 1 files at r8, 5 of 5 files at r9, 35 of 35 files at r10, 2 of 2 files at r11, 53 of 53 files at r12, 8 of 8 files at r13.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/util/hlc/legacy_timestamp.proto, line 41 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Kind of. It's
nullable
because we need identical encoding below Raft to ensure that all replicas stay bytewise identical even in a mixed version cluster. This is the same reason why we've never been able to completely removeLegacyTimestamp
. So since the field won't even exist on v20.2 nodes, we need to make sure it doesn't get serialized on v21.1 nodes.
Mind putting that in the comment?
pkg/util/hlc/timestamp.go, line 257 at r10 (raw file):
// Forward updates the timestamp from the one given, if that moves it forwards // in time. Returns true if the timestamp was adjusted and false otherwise.
Clarify the language in the comment:
// Forward replaces the receiver with the argument, if that moves it forwards in time.
pkg/util/hlc/timestamp.go, line 269 at r10 (raw file):
// Backward updates the timestamp from the one given, if that moves it // backwards in time.
ditto about clarifying language
pkg/util/hlc/timestamp.proto, line 51 at r10 (raw file):
// timestamp makes no such indication. By avoiding a connection to "real // time", synthetic timestamps can be used to write values at a future // time and to indicate that observed timestamps do not apply to such
I feel that nobody will understand this in the future. Perhaps
// Synthetic timestamps are central to non-blocking transactions, which essentially write at "future timestamps". They are also used in observed timestamps, where they indicate versions that were moved from the timestamp at which they were originally written; only synthetic timestamps require observing the full uncertainty interval, whereas readings off the node's clock can tighten it for non-synthetic versions.
pkg/util/hlc/timestamp_test.go, line 46 at r10 (raw file):
} a = makeTS(1, 1) if a.EqOrdering(b) {
curious why you didn't like require.False(t, a.EqOrdering(b), "%+v == %+v", a, b)
pkg/util/hlc/timestamp_test.go, line 207 at r10 (raw file):
assert.Equal(t, c.exp, c.ts.String()) // Flags don't round-trip through parsing, so ignore them.
Shouldn't they? If not, at least update comment on ParseTimestamp.
pkg/util/span/frontier.go, line 63 at r13 (raw file):
// Less implements heap.Interface. func (h frontierHeap) Less(i, j int) bool { if h[i].ts.EqOrdering(h[j].ts) {
Can you leave that _ func()
field on the struct to disable explicit comparisons going forward on both timestamp types?
bytes no_equality = 3 [(gogoproto.nullable) = true, (gogoproto.customname) = "_"];
worked nicely in my local testing.
10d9549
to
3d517fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
I'm still anxious that synthetic and regular timestamps will have an opportunity to mix when they shouldn't, given how much existing code is out there. In particular, .Forward() and .Backward() seem susceptible. I wish we had proper enum types.
From plowing through the code so far, are you worried at all? And/or do you have ideas on getting at least some amount of separation between them back (clearly separate types are not an option)?
My current plan is to never allow synthetic timestamps to do any harm by fataling if we ever call hlc.Update
with a synthetic timestamp. This effectively means that we need to separate out the "transaction time domain" from the "clock time domain", and we'll be made aware pretty quickly if these ever mix. We already have this separation in some places. For instance, we have a BatchRequest.Timestamp
and a BatchResponse.Now
, which are independent from the timestamps passed around in Transaction
objects.
My other thought is that I'm going to split hlc.Update
into two methods – hlc.UpdateForStability
and hlc.UpdateForCausality
. This doesn't directly tie in to the discussion here, but it means that we're going to need to be a little more explicit about when, where, and why we update the hlc clock. I expect that this will uncover cases where timestamps from the "transaction time domain" and leaking back into the "clock time domain".
One more aggressive idea to enforce this separation is to use another flag to indicate which domain a timestamp is part of. A conversion from a "clock timestamp" to a "transaction timestamp" would be allowed, but not the other way around. We could then assert that no "transaction timestamp" (synthetic or otherwise) ever makes its way back into a clock. My primary concern with this would be its impact on performance. We'd probably want to make the transaction domain the zero value for the flag.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/util/hlc/legacy_timestamp.proto, line 41 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mind putting that in the comment?
Done.
pkg/util/hlc/timestamp.go, line 257 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Clarify the language in the comment:
// Forward replaces the receiver with the argument, if that moves it forwards in time.
Done.
pkg/util/hlc/timestamp.go, line 269 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto about clarifying language
Done.
pkg/util/hlc/timestamp.proto, line 51 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I feel that nobody will understand this in the future. Perhaps
// Synthetic timestamps are central to non-blocking transactions, which essentially write at "future timestamps". They are also used in observed timestamps, where they indicate versions that were moved from the timestamp at which they were originally written; only synthetic timestamps require observing the full uncertainty interval, whereas readings off the node's clock can tighten it for non-synthetic versions.
I combined the two comments. I wanted to keep some more general definition so that we aren't limited in where we use these timestamps, but I like how clear your examples are.
pkg/util/hlc/timestamp_test.go, line 46 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
curious why you didn't like
require.False(t, a.EqOrdering(b), "%+v == %+v", a, b)
Just to stay consistent with the rest of the tests in this file.
pkg/util/hlc/timestamp_test.go, line 207 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Shouldn't they? If not, at least update comment on ParseTimestamp.
Yeah, I should have just done this to begin with. Done.
pkg/util/span/frontier.go, line 63 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you leave that
_ func()
field on the struct to disable explicit comparisons going forward on both timestamp types?bytes no_equality = 3 [(gogoproto.nullable) = true, (gogoproto.customname) = "_"];
worked nicely in my local testing.
I went back and forth on it. This won't work if we want to continue using proto to generate the struct def:
pkg/util/hlc/timestamp.pb.go:139:22: cannot refer to blank field or method
pkg/util/hlc/timestamp.pb.go:174:10: cannot refer to blank field or method
pkg/util/hlc/timestamp.pb.go:177:50: cannot refer to blank field or method
pkg/util/hlc/timestamp.pb.go:178:24: cannot refer to blank field or method
pkg/util/hlc/timestamp.pb.go:204:6: cannot refer to blank field or method
pkg/util/hlc/timestamp.pb.go:206:7: cannot refer to blank field or method
pkg/util/hlc/timestamp.pb.go:300:11: cannot refer to blank field or method
pkg/util/hlc/timestamp.pb.go:432:5: cannot refer to blank field or method
pkg/util/hlc/timestamp.pb.go:433:8: cannot refer to blank field or method
But we could do so ourselves using option (gogoproto.typedecl) = false;
. But I shied away from this for two reasons:
- it increases the struct size from 16 bytes to 40 bytes. We can do better with a
_ func()
, which limits to growth to 24 bytes, but that's still runtime overhead. And it's more pointers for the GC to chase. - it adds yet another dependency on gogoproto that we'll never be able to break away from.
So in the end, I decided it wasn't worth it. It turns out that we don't end up checking for timestamp equality in very many places anyway – we pretty much always use Less
or LessEq
.
3d517fc
to
8cfda31
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 36 files at r15, 21 of 54 files at r17, 2 of 8 files at r18.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
This file was removed when RocksDB was removed.
Informs cockroachdb#52745. Informs cockroachdb#36431. This commit introduces an 8-bit flags field on the hlc timestamp struct. The flags are used to provide details about the timestamp and its meaning. They do not affect the sort order of Timestamps. The commit then introduces the first flag: SYNTHETIC. As discussed in cockroachdb#52745, a synthetic timestamp is defined as a timestamp that makes no claim about the value of clocks in the system. While standard timestamps are pulled from HLC clocks and indicate that some node in the system has a clock with a reading equal to or above its value, a synthetic timestamp makes no such indication. By avoiding a connection to "real time", synthetic timestamps can be used to write values at a future time and to indicate that observed timestamps do not apply to such writes for the purposes of tracking causality between the write and its observers. Observed timestamps will be a critical part of implementing non-blocking transactions (cockroachdb#52745) and fixing the interaction between observed timestamps and transaction refreshing (cockroachdb#36431). The original plan was to reserve the high-order bit in the logical portion of a timestamp as a "synthetic bit". This is how I began implementing things, but was turned off for a few reasons. First, it was fairly subtle and seemed too easy to get wrong. Using a separate field is more explicit and avoids a class of bugs. Second, I began to have serious concerns about how the synthetic bit would impact timestamp ordering. Every timestamp comparison would need to mask out the bit or risk being incorrect. This was even true of the LSM custom comparator. This seemed difficult to get right, and seemed particularly concerning since we're planning on marking only some of a transaction's committed values as synthetic to fix cockroachdb#36431, so if we weren't careful, we could get atomicity violations. There were also minor backwards compatibility concerns. But a separate field is more expensive in theory, so we need to be careful. However, it turns out that a separate field is mostly free in each case that we care about. In memory, the separate field is effectively free because the Timestamp struct was previously 12 bytes but was always padded out to 16 bytes when included as a field in any other struct. This means that the flags field is replacing existing padding. Over the wire, the field will not be included when zero and will use a varint encoding when not zero, so again, it is mostly free. In the engine key encoding, the field is also not included when zero, and takes up only 1 byte when non-zero, so it is mostly free.
Now that structural equality is different than semantic equality because of the new flags pointer in LegacyTimestamp, we need to be careful about comparing by value. These seem to be the only issues because of how rarely LegacyTimestamp is used. I detected these by adding a `_ func()` field to the struct and observed what fell out when compiling.
Be more explicit and avoid any issues with structural equality.
Now that structural equality is different than semantic equality because of the new flags field in Timestamp, we need to be careful about comparing by value. I detected these by adding a `_ func()` field to the struct and observed what fell out when compiling.
8cfda31
to
db9ef7e
Compare
TFTRs! bors r+ |
Build failed (retrying...): |
Strange failure in bors r- |
Canceled. |
I can't explain the failure other than some kind of network flake. The "dead" node never actually dies, it just hits a network error:
We also see this in the stderr:
I ran bors r+ |
56373: hlc: introduce synthetic flag on timestamps r=nvanbenschoten a=nvanbenschoten Informs #52745. Informs #36431. This commit introduces an 8-bit `flags` field on the hlc timestamp struct. The flags are used to provide details about the timestamp and its meaning. They do not affect the sort order of Timestamps. The commit then introduces the first flag: SYNTHETIC. As discussed in #52745, a synthetic timestamp is defined as a timestamp that makes no claim about the value of clocks in the system. While standard timestamps are pulled from HLC clocks and indicate that some node in the system has a clock with a reading equal to or above its value, a synthetic timestamp makes no such indication. By avoiding a connection to "real time", synthetic timestamps can be used to write values at a future time and to indicate that observed timestamps do not apply to such writes for the purposes of tracking causality between the write and its observers. Observed timestamps will be a critical part of implementing non-blocking transactions (#52745) and fixing the interaction between observed timestamps and transaction refreshing (#36431). The original plan was to reserve the high-order bit in the logical portion of a timestamp as a "synthetic bit". This is how I began implementing things, but was turned off for a few reasons. First, it was fairly subtle and seemed too easy to get wrong. Using a separate field is more explicit and avoids a class of bugs. Second, I began to have serious concerns about how the synthetic bit would impact timestamp ordering. Every timestamp comparison would need to mask out the bit or risk being incorrect. This was even true of the LSM custom comparator. This seemed difficult to get right and seemed particularly concerning since we're planning on marking only some of a transaction's committed values as synthetic to fix #36431, so if we weren't careful, we could get atomicity violations. There were also minor backwards compatibility concerns. But a separate field is more expensive in theory, so we need to be careful. However, it turns out that a separate field is mostly free in each case that we care about. In memory, the separate field is effectively free because the Timestamp struct was previously 12 bytes but was always padded out to 16 bytes when included as a field in any other struct. This means that the flags field is replacing existing padding. Over the wire, the field will not be included when zero and will use a varint encoding when not zero, so again, it is mostly free. In the engine key encoding, the field is also not included when zero, and takes up only 1 byte when non-zero, so it is mostly free. ---- First three commits from #56477. @sumeerbhola I'm hoping you can take a look at the engine-level changes in the `introduce synthetic flag on timestamps` commit (4th commit as of the time of writing). I think the key encoding added here makes sense, but want to make sure you're on board. One possible concern is that we introduce a new 13-byte suffix, which means that combined with a 4-byte sequence number (see #41720 (comment)), we'd collide with the 17 byte `engineKeyVersionLockTableLen`. @tbg do you mind being the primary reviewer here? I think you know the most about the motivations for this change and will have a good sense of whether this is the best way to introduce additional state on timestamps. 56437: cli, ui: dismiss release notes signup banner per environment variable r=knz,dhartunian a=nkodali Previously, the signup banner could only be dismissed manually. For internal testing purposes, this banner is unnecessary. This change provides a way to dismiss the signup banner upon start of a cluster via the cli by setting the environment variable COCKROACH_UI_RELEASE_NOTES_SIGNUP_DISMISSED=true. Resolves #46998 Release note: none 56627: sql: rework SHOW REGIONS to SHOW REGIONS FROM CLUSTER r=ajstorm a=otan Resolves #56331 Release note (sql change): SHOW REGIONS functionality is now deferred to SHOW REGIONS FROM CLUSTER. Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Namrata Kodali <[email protected]> Co-authored-by: Oliver Tan <[email protected]>
…rved timestamps This commit introduces a new package which contains logic and documentation related to the observed timestamp system, which allows transactions to track causality between themselves and other, possibly-concurrent, transactions in order to avoid uncertainty related restarts. This ties into our efforts to better document the kv layer. It also ties into synthetic timestamps (cockroachdb#56373), which are important for non-blocking transactions (cockroachdb#52745) and for fixing the interaction between observed timestamps and transaction refreshing (cockroachdb#36431). The proposed fix is to split out a new EffectiveMaxTimestamp field from the MaxTimestamp field on a transaction and set the EffectiveMaxTimestamp in LimitTxnMaxTimestamp. The EffectiveMaxTimestamp field will dictate the uncertainty interval for values with normal timestamps while the original MaxTimestamp will dictate the uncertainty interval for values with synthetic timestamps. In practice, this will mean that observed timestamps do not apply to values with synthetic timestamps. This commit gives us a package to house this new complexity.
Build failed (retrying...): |
Build succeeded: |
…rved timestamps This commit introduces a new package which contains logic and documentation related to the observed timestamp system, which allows transactions to track causality between themselves and other, possibly-concurrent, transactions in order to avoid uncertainty related restarts. This ties into our efforts to better document the kv layer. It also ties into synthetic timestamps (cockroachdb#56373), which are important for non-blocking transactions (cockroachdb#52745) and for fixing the interaction between observed timestamps and transaction refreshing (cockroachdb#36431). The proposed fix is to split out a new EffectiveMaxTimestamp field from the MaxTimestamp field on a transaction and set the EffectiveMaxTimestamp in LimitTxnMaxTimestamp. The EffectiveMaxTimestamp field will dictate the uncertainty interval for values with normal timestamps while the original MaxTimestamp will dictate the uncertainty interval for values with synthetic timestamps. In practice, this will mean that observed timestamps do not apply to values with synthetic timestamps. This commit gives us a package to house this new complexity.
55148: kvserver: don't allow raft forwarding of lease requests r=andreimatei a=andreimatei This patch aims to improve the behavior in scenarios where a follower replica is behind, unaware of the latest lease, and it tries to acquire a lease in its ignorant state. That lease acquisition request is bound to fail (because the lease that it's based on is stale), but while it fails (or, rather, until the behind replica finds out that it failed) local requests are blocked. This blocking can last for a very long time in situations where a snapshot is needed to catch up the follower, and the snapshot is queued up behind many other snapshots (e.g. after a node has been down for a while and gets restarted). This patch tries an opinionated solution: never allow followers to acquire leases. If there is a leader, it's a better idea for the leader to acquire the lease. The leader might have a lease anyway or, even if it doesn't, having the leader acquire it saves a leadership transfer (leadership follows the lease). We change the proposal path to recognize lease requests and reject them early if the current replica is a follower and the leader is known. The rejection points to the leader, which causes the request that triggered the lease acquisition to make its way to the leader and attempt to acquire a lease over there. The next commit takes this further by short-circuiting the lease proposal even sooner - but that patch is more best-effort. Fixes #37906 Release note (bug fix): A bug causing queries sent to a freshly-restarted node to sometimes hang for a long time while the node catches up with replication has been fixed. 56392: kvserver/observedts: extract package to house logic and docs for observed timestamps r=nvanbenschoten a=nvanbenschoten This commit introduces a new package that contains logic and documentation related to the observed timestamp system, which allows transactions to track causality between themselves and other, possibly-concurrent, transactions in order to avoid uncertainty related restarts. This ties into our efforts to better document the kv layer. It also ties into synthetic timestamps (#56373), which are important for non-blocking transactions (#52745) and for fixing the interaction between observed timestamps and transaction refreshing (#36431). The proposed fix is to split out a new EffectiveMaxTimestamp field from the MaxTimestamp field on a transaction and set the EffectiveMaxTimestamp in LimitTxnMaxTimestamp. The EffectiveMaxTimestamp field will dictate the uncertainty interval for values with normal timestamps while the original MaxTimestamp will dictate the uncertainty interval for values with synthetic timestamps. In practice, this will mean that observed timestamps do not apply to values with synthetic timestamps. This commit gives us a package to house this new complexity. Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/util/hlc/timestamp.go, line 283 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm trying, but it's putting up a fight: #56190. It's also used in
MVCCMetadata
, which is written to "below Raft" in a number of places. I think we're going to need to run an online migration if we ever want to remove it entirely.
Where is MVCCMetadata
written to below Raft? I am adding the bool txn_did_not_update_meta
field to MVCCMetadata
and I suppose I will need to make it truly optional.
I wish there was a way with gogoproto
to make something optional but tell it to not use a pointer, and so avoid an allocation, and not marshal when the value is the default.
Hmm, I see this come up in gogo/protobuf#142, which you referenced in a commit 2 years ago, and proto3 was recommended as the solution. What is our story on proto3 adoption/conversion -- we are using it in some places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola and @tbg)
pkg/util/hlc/timestamp.go, line 283 at r4 (raw file):
Where is MVCCMetadata written to below Raft? I am adding the bool txn_did_not_update_meta field to MVCCMetadata and I suppose I will need to make it truly optional.
I believe the calls to MVCCPutProto
with an empty timestamp in StateLoader
are responsible for most of the uses below Raft.
I wish there was a way with gogoproto to make something optional but tell it to not use a pointer, and so avoid an allocation, and not marshal when the value is the default.
This is how proto3 works, but not proto2. We use proto3 in most places, but can't in cases where we need deterministic proto encoding for messages that are written below Raft.
One option would be to split MVCCMetadata
into a proto3 MVCCMetadata
without compatibility concerns and a proto2 LegacyMVCCMetadata
with compatibility concerns. This would mirror what we did with Timestamp
/LegacyTimestamp
. I say this having not explored how difficult doing so would be.
…imestamp This commit splits off a new enginepb.TxnTimestamp type (see `pkg/storage/enginepb/mvcc.go`) from the existing hlc.Timestamp type through a type alias. While the two types share the same memory and proto representation, they have different purposes and properties. hlc.Timestamp continues to serve in its original role — representing a real timestamp pulled from a clock on one of the nodes in the system. enginepb.TxnTimestamp is introduced to serve in hlc.Timestamp's current secondary role – representing MVCC time and the time that transactions read and write at. In the words of cockroachdb#56373 (review), this splits "clock-domain timestamps" from "transaction-domain timestamps", allowing us to use to type system to help keep them separate and to make their interactions more explicit. This results in boiling most of the ocean, but comes with some benefits in the end. - It makes conversion from the clock-domain to the transaction-domain explicit. This must pass through a call to `enginepb.ToTxnTimestamp`. - It makes conversions from the transaction-domain to the clock-domain impossible (not quite there yet). - Similarly, it makes it impossible for transaction-domain timestamps to be passed into `hlc.Clock.Update`. - It allows us to more clearly state that clock-domain timestamps must always trail present time but transaction-domain timestamps can move into the future. As such, only transaction-domain timestamps will ever have the synthetic bit set. I'm interested to get people's take on this. These benefits are nice, but the size of this diff (it still doesn't come close to compiling) and the newly introduced burden of needing to choose between multiple timestamp types is not.
Informs #52745.
Informs #36431.
This commit introduces an 8-bit
flags
field on the hlc timestamp struct. The flags are used to provide details about the timestamp and its meaning. They do not affect the sort order of Timestamps.The commit then introduces the first flag: SYNTHETIC. As discussed in #52745, a synthetic timestamp is defined as a timestamp that makes no claim about the value of clocks in the system. While standard timestamps are pulled from HLC clocks and indicate that some node in the system has a clock with a reading equal to or above its value, a synthetic timestamp makes no such indication. By avoiding a connection to "real time", synthetic timestamps can be used to write values at a future time and to indicate that observed timestamps do not apply to such writes for the purposes of tracking causality between the write and its observers. Observed timestamps will be a critical part of implementing non-blocking transactions (#52745) and fixing the interaction between observed timestamps and transaction refreshing (#36431).
The original plan was to reserve the high-order bit in the logical portion of a timestamp as a "synthetic bit". This is how I began implementing things, but was turned off for a few reasons. First, it was fairly subtle and seemed too easy to get wrong. Using a separate field is more explicit and avoids a class of bugs. Second, I began to have serious concerns about how the synthetic bit would impact timestamp ordering. Every timestamp comparison would need to mask out the bit or risk being incorrect. This was even true of the LSM custom comparator. This seemed difficult to get right and seemed particularly concerning since we're planning on marking only some of a transaction's committed values as synthetic to fix #36431, so if we weren't careful, we could get atomicity violations. There were also minor backwards compatibility concerns.
But a separate field is more expensive in theory, so we need to be careful. However, it turns out that a separate field is mostly free in each case that we care about. In memory, the separate field is effectively free because the Timestamp struct was previously 12 bytes but was always padded out to 16 bytes when included as a field in any other struct. This means that the flags field is replacing existing padding. Over the wire, the field will not be included when zero and will use a varint encoding when not zero, so again, it is mostly free. In the engine key encoding, the field is also not included when zero, and takes up only 1 byte when non-zero, so it is mostly free.
First three commits from #56477.
@sumeerbhola I'm hoping you can take a look at the engine-level changes in the
introduce synthetic flag on timestamps
commit (4th commit as of the time of writing). I think the key encoding added here makes sense, but want to make sure you're on board. One possible concern is that we introduce a new 13-byte suffix, which means that combined with a 4-byte sequence number (see #41720 (comment)), we'd collide with the 17 byteengineKeyVersionLockTableLen
.@tbg do you mind being the primary reviewer here? I think you know the most about the motivations for this change and will have a good sense of whether this is the best way to introduce additional state on timestamps.