Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Etcd2Topo: Use node's ModRevision consistently for in-memory topo.Version value #15847

Merged
merged 4 commits into from
May 12, 2024

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented May 6, 2024

Description

Etcd has several version related fields/values in each key's header: Revision, ModRevision, and Version. These are related but distinct things:

To summarize:

  • Revision: a logical clock used to track changes to the etcd keyspace (a transaction / snapshot ID)
  • Version: a logical clock used to track the changes to an individual key since it was last created (it is not monotonic)
  • ModRevision: the etcd keyspace revision number for the key at that point in time (the logical version of the key)

The Revision is what allows for consistent reads and writes of multiple keys (in a transaction). It allows you, for example, to do a consistent read / scan of N keys at that logical point-in-time.

In Vitess we also have awareness of each value and read/use these values in memory in the etcd2topo topo server implementation. There's also a generic and opaque topo.Version type — which can be set to either of the two lower level node/key specific values: ModRevision or Version — that we use for strict serializability.

In #15701 I aligned the Vitess and etcd2topo lower level version related fields, variable names, and comments for watches: https://github.com/vitessio/vitess/pull/15701/files/f1fed6dd787feea26a4f4e13f660e8eaa91e64bb#diff-3f055ca1a4446966abba5a840e13c5c6a11e73121143a66b06070a356425d676

In later unrelated investigations, however, I realized that the use of Version: EtcdVersion(initial.Kvs[0].ModRevision) in Watch() was intentional and consistent with how we're using ModRevision for the node's topo.Version value in file.go. The reason being that we use this to enforce strict serializability when a topo.Version value is provided — failing the write if the topo.Version value provided is not equal to the current ModRevision of the key — and we return the new topo.Version value based on the new ModRevision value (technically we return the Revision value from the response header, but that would also be the ModRevision value for the key we updated in the transaction) on successful update:

// Update is part of the topo.Conn interface.
func (s *Server) Update(ctx context.Context, filePath string, contents []byte, version topo.Version) (topo.Version, error) {
nodePath := path.Join(s.root, filePath)
if version != nil {
// We have to do a transaction. This means: if the
// current file revision is what we expect, save it.
txnresp, err := s.cli.Txn(ctx).
If(clientv3.Compare(clientv3.ModRevision(nodePath), "=", int64(version.(EtcdVersion)))).
Then(clientv3.OpPut(nodePath, string(contents))).
Commit()
if err != nil {
return nil, convertError(err, nodePath)
}
if !txnresp.Succeeded {
return nil, topo.NewError(topo.BadVersion, nodePath)
}
return EtcdVersion(txnresp.Header.Revision), nil
}
// No version specified. We can use a simple unconditional Put.
resp, err := s.cli.Put(ctx, nodePath, string(contents))
if err != nil {
return nil, convertError(err, nodePath)
}
return EtcdVersion(resp.Header.Revision), nil
}

We use the ModRevision value there as etcd's transaction response includes the new ModRevision from the commit
(again, technically it's the Revision value from the response header, but that would also be the ModRevision value for the key we updated in the transaction) but NOT the new node/key Version. This makes sense because a transaction can update any arbitrary number of keys. So the only logical version that can be provided for that transaction is the keyspace Revision, which would be the ModRevision value for each of the keys updated in the transaction. This is the logical point in time for the state of the data in etcd — equivalent to a GTID position/set in MySQL.

That is used e.g. with Shard records/keys and the ShardInfo in memory structure:

vitess/go/vt/topo/shard.go

Lines 165 to 222 in f27d287

// GetShard is a high level function to read shard data.
// It generates trace spans.
func (ts *Server) GetShard(ctx context.Context, keyspace, shard string) (*ShardInfo, error) {
if err := ValidateKeyspaceName(keyspace); err != nil {
return nil, err
}
if _, _, err := ValidateShardName(shard); err != nil {
return nil, err
}
span, ctx := trace.NewSpan(ctx, "TopoServer.GetShard")
span.Annotate("keyspace", keyspace)
span.Annotate("shard", shard)
defer span.Finish()
shardPath := shardFilePath(keyspace, shard)
data, version, err := ts.globalCell.Get(ctx, shardPath)
if err != nil {
return nil, err
}
value := &topodatapb.Shard{}
if err = value.UnmarshalVT(data); err != nil {
return nil, vterrors.Wrapf(err, "GetShard(%v,%v): bad shard data", keyspace, shard)
}
return NewShardInfo(keyspace, shard, value, version), nil
}
// updateShard updates the shard data, with the right version.
// It also creates a span, and dispatches the event.
func (ts *Server) updateShard(ctx context.Context, si *ShardInfo) error {
span, ctx := trace.NewSpan(ctx, "TopoServer.UpdateShard")
span.Annotate("keyspace", si.keyspace)
span.Annotate("shard", si.shardName)
defer span.Finish()
data, err := si.Shard.MarshalVT()
if err != nil {
return err
}
shardPath := shardFilePath(si.keyspace, si.shardName)
newVersion, err := ts.globalCell.Update(ctx, shardPath, data, si.version)
if err != nil {
return err
}
si.version = newVersion
event.Dispatch(&events.ShardChange{
KeyspaceName: si.Keyspace(),
ShardName: si.ShardName(),
Shard: si.Shard,
Status: "updated",
})
return nil
}

One way for us to think about this: an etcd watch is equivalent to a (filtered) binlog stream with MySQL. We don’t know the version of a table or row, we only know their state at a logical point in time — the GTID position. The revision in etcd is that logical point in time / position in the stream of events/changes.

So with all of this in mind, we should also be using ModRevision for the topo.Version value in Watch() — and beyond that, we should be using it for all topo.Version values within the etcd2topo implementation. Please see the discussion from here on for further reasons why.

So in the end, this PR does just that: it uses the etcd ModRevision value for the logical topo.Version value of keys across the etcd2topo implementation and adds a test to check and enforce that for Watch().

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Copy link
Contributor

vitess-bot bot commented May 6, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels May 6, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone May 6, 2024
@mattlord mattlord force-pushed the etcd2_node_version branch from 22ee6d1 to d49f984 Compare May 6, 2024 14:43
@mattlord mattlord removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 6, 2024
@mattlord mattlord marked this pull request as ready for review May 6, 2024 14:44
@mattlord mattlord requested a review from deepthi as a code owner May 6, 2024 14:44
@mattlord mattlord requested a review from rohit-nayak-ps May 6, 2024 14:44
@mattlord mattlord changed the title Topo: Use node's ModRevision consistently for in-memory Version value Topo: Use node's ModRevision consistently for in-memory topo.Version value May 6, 2024
deepthi
deepthi previously approved these changes May 6, 2024
@rohit-nayak-ps
Copy link
Contributor

To clarify, the use of Version is not causing a functional issue in Vitess, right? This change is only to be consistent?

Since we are watching a specific key, it shouldn't matter if we use Version or ModRevision as both will increment anytime a key is updated.

@deepthi
Copy link
Member

deepthi commented May 6, 2024

To clarify, the use of Version is not causing a functional issue in Vitess, right? This change is only to be consistent?

Since we are watching a specific key, it shouldn't matter if we use Version or ModRevision as both will increment anytime a key is updated.

Quote from the description:
this PR is really a 1 line revert from the noted PR.

So the way I'm reading this is that there was an unnecessary change in a previous PR which may actually cause incorrect behavior, and this PR is fixing that.

Copy link

codecov bot commented May 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.42%. Comparing base (f118ba2) to head (0ea964b).
Report is 66 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15847      +/-   ##
==========================================
+ Coverage   68.40%   68.42%   +0.01%     
==========================================
  Files        1556     1559       +3     
  Lines      195121   196537    +1416     
==========================================
+ Hits       133479   134482    +1003     
- Misses      61642    62055     +413     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mattlord
Copy link
Contributor Author

mattlord commented May 6, 2024

To clarify, the use of Version is not causing a functional issue in Vitess, right? This change is only to be consistent?

Since we are watching a specific key, it shouldn't matter if we use Version or ModRevision as both will increment anytime a key is updated.

Version and ModRevision will both be increasing, yes. So you can use one or the other as long as you're consistent. They just aren't the same logical or literal value (as Revision is incremented each time the etcd keyspace is updated). So it matters which one you return, as the caller can then pass that back in on subsequent topo calls, as we then check that this value is EQUAL TO the key's current ModRevision (if a topo.Value was passed in, e.g. to Update()).

AFAIK, the logical version within watches only comes in to play when we need to resume/restart the watch:

var rev = initial.Header.Revision
var watchRetries int
for {
select {
case <-s.running:
return
case <-watchCtx.Done():
// This includes context cancellation errors.
notifications <- &topo.WatchData{
Err: convertError(watchCtx.Err(), nodePath),
}
return
case wresp, ok := <-watcher:
if !ok {
if watchRetries > 10 {
t := time.NewTimer(time.Duration(watchRetries) * time.Second)
select {
case <-t.C:
t.Stop()
case <-s.running:
t.Stop()
continue
case <-watchCtx.Done():
t.Stop()
continue
}
}
watchRetries++
// Cancel inner context on retry and create new one.
watchCancel()
watchCtx, watchCancel = context.WithCancel(ctx)
newWatcher := s.cli.Watch(watchCtx, nodePath, clientv3.WithRev(rev))

As you can see there, it's already using the revision value (etcd keyspace version) for both. BUT, in looking at that I noticed that in a relatively new usage of watchData{} from #10906 it was using Version rather than ModRevision. I can now understand WHY that was done at the time, but I've now updated that to use ModRevision just as the normal (non-recursive) Watch does for the same reasons as noted in the issue description.

@mattlord mattlord changed the title Topo: Use node's ModRevision consistently for in-memory topo.Version value Etcd2Topo: Use node's ModRevision consistently for in-memory topo.Version value May 6, 2024
@rohit-nayak-ps
Copy link
Contributor

I checked out your branch. I seem to see Version and ModRevision still used inconsistently. What am I missing?

Since no test is failing, I am assuming we don't really use the version value anywhere. Looking at a couple of watchers (Shard and SrvVSchema, we only look at the new contents of the key and not the version). Otherwise I would have expected some unit test to have failed with the changes in this PR or the previous one.

In etcd2topo/watch.go.Watch()
_the initial value uses ModRevision

wd := &topo.WatchData{
Contents: initial.Kvs[0].Value,
// ModRevision is used for the topo.Version value as we get the new Revision value back
// when updating the file/key within a transaction in file.go and so this is the opaque
// version that we can use to enforce serializabile writes for the file/key.
Version: EtcdVersion(initial.Kvs[0].ModRevision),
}

For new values it sends Version
case mvccpb.PUT:
notifications <- &topo.WatchData{
Contents: ev.Kv.Value,
Version: EtcdVersion(ev.Kv.Version),
}

In WatchRecursive() we first use Version for the initial value and then ModRevision for new values.

initial:

for _, kv := range initial.Kvs {
var wd topo.WatchDataRecursive
wd.Path = string(kv.Key)
wd.Contents = kv.Value
wd.Version = EtcdVersion(initial.Kvs[0].Version)
initialwd = append(initialwd, &wd)
}

watcher:

case mvccpb.PUT:
notifications <- &topo.WatchDataRecursive{
Path: string(ev.Kv.Key),
WatchData: topo.WatchData{
Contents: ev.Kv.Value,
Version: EtcdVersion(ev.Kv.ModRevision),
},
}

@rohit-nayak-ps
Copy link
Contributor

rohit-nayak-ps commented May 6, 2024

OK, I just saw after my comment above that you have a new commit where everything is set as ModRevision. I guess this should be good. It will still be useful to have a unit test that shows that we get newer revisions uniformly in the watcher, since we think it is worthwhile to modify this behaviour.

@mattlord
Copy link
Contributor Author

mattlord commented May 6, 2024

OK, I just saw after my comment above that you have a new commit where everything is set as ModRevision. I guess this should be good.

Yes, I noticed this too in answering your previous question and I updated things to be consistent (also updated the description for further clarification).

It will still be useful to have a unit test that shows that we get newer revisions uniformly in the watcher, since we think it is worthwhile to modify this behaviour.

Sure, I can add one.

@rohit-nayak-ps
Copy link
Contributor

Thinking about this some more: the watchers are watching a specific key, right? So shouldn't we be passing Version which is key-specific, rather than the global ModRevision. For example, in a unit test, if the SrvVSchema is changed thrice, we can expect the watcher to stream us versions 1, 2 and 3, but we have no idea from the ModRevision whether we have received all the updates.

When will using Version everywhere be wrong?

@deepthi deepthi dismissed their stale review May 6, 2024 20:02

Needs to be re-reviewed since more changes have been committed.

@mattlord
Copy link
Contributor Author

mattlord commented May 6, 2024

Thinking about this some more: the watchers are watching a specific key, right? So shouldn't we be passing Version which is key-specific, rather than the global ModRevision. For example, in a unit test, if the SrvVSchema is changed thrice, we can expect the watcher to stream us versions 1, 2 and 3, but we have no idea from the ModRevision whether we have received all the updates.

ModRevision is a metadata property of the key. That value is monotonically increasing. Version goes back to 0 and then 1 if the key is deleted and recreated. The watchers have never guaranteed that you don't miss an intermediate value AFAIK. They can't, as etcd will do compactions and a user can request it as well. This effectively prunes the history. So in various scenarios (including etcd failover or vitess process restart/failover) you will have to restart your watch and you may have missed intermediate values as they are no longer available and you start with the current/latest version.

// Create the Watcher. We start watching from the response we
// got, not from the file original version, as the server may
// not have that much history.
watcher := s.cli.Watch(watchCtx, nodePath, clientv3.WithRev(initial.Header.Revision))

https://github.com/etcd-io/etcd/blob/35361955578322a55841c1ed962c53e28658d951/client/v3/watch.go#L53-L82

It’s worth noting why Watch actually exists. It’s not because you want to process every change of a key (it doesn’t offer those guarantees) — it’s that you want to get the new version of a key when it’s changed. In other words, you always want your in-memory representation to be up-to-date / have the latest values from the database. https://etcd.io/docs/v3.5/tutorials/how-to-watch-keys/ It only guarantees that you (eventually) get notifications about future changes to the key (with "future" being based on the revision). That’s it. If you’re not already caught up, you can miss intermediate changes.

And it's important to note that the etcd client library only takes revisions on Get,Watch,Update,List etc as there is no clientv3.WithVersion:

s.cli.Watch(watchCtx, nodePath, clientv3.WithRev(rev))

For example, get: https://github.com/etcd-io/etcd/blob/35361955578322a55841c1ed962c53e28658d951/client/v3/kv.go#L41-L49

You're saying that you want to watch this key from this database snapshot. And this is the only way that you can coordinate across keys, whether you're using transactions, multi-key watches (as we do with WatchRecursive), prefix scans, etc.

It's also worth noting that I haven't yet found places where Etcd itself, in the product or in the documentation, uses the key's Version. For example: https://etcd.io/docs/v3.5/dev-guide/interacting_v3/#watch-historical-changes-of-keys

So perhaps the key's Version was more of an etcdv2 thing — as keyspace revisions were added in storage v3 — that is being phased out. I'm not sure. It also makes sense to do so given the support for (prefix) scans and transactions in v3 so that you have uniformity in the API whether you're interacting with a single key or multiple ones.

When will using Version everywhere be wrong?

Wrong is not the correct word to use here IMO. We have to define the semantics and guarantees — only then can we know the best option. I've made every attempt to explain why ModRevision has to be used in the places where we care most about the topo.Version value (see updated issue description). So we know that we can't use it everywhere. If you only mean everywhere in the Watch then I get the inclination. The downside I see there is that if you read the latest version of a key from the watcher, then update it, you won't be able to save that back in the topo if you provide the topo.Version value you have. So you'd be forced to do a blind write.

So I take this in total to mean that the key's Version value is handy for humans but what really matters is its ModRevision value — as that's what is used for consistent reads and writes and thus what you pass when you want to read, watch, update the key etc — really any time you're interfacing with the database on a request now.

@mattlord
Copy link
Contributor Author

mattlord commented May 6, 2024

After further investigation (see my previous comment: #15847 (comment)) I think that we should use Revision/ModRevision everywhere as the topo.Version value because that seems to be the only thing that etcdv3 cares about. In this PR, we're already doing that with one exception:

❯ git grep -E 'Version|Revision' | grep etcd2topo | grep -v "Apache License, Version 2.0"
go/vt/topo/etcd2topo/election.go:		clientv3.WithSort(clientv3.SortByModRevision, clientv3.SortAscend),
go/vt/topo/etcd2topo/election.go:		clientv3.WithSort(clientv3.SortByModRevision, clientv3.SortAscend),
go/vt/topo/etcd2topo/election.go:	watcher := mp.s.cli.Watch(ctx, electionPath, clientv3.WithPrefix(), clientv3.WithRev(initial.Header.Revision))
go/vt/topo/etcd2topo/election.go:					clientv3.WithSort(clientv3.SortByModRevision, clientv3.SortAscend),
go/vt/topo/etcd2topo/file.go:func (s *Server) Create(ctx context.Context, filePath string, contents []byte) (topo.Version, error) {
go/vt/topo/etcd2topo/file.go:		If(clientv3.Compare(clientv3.Version(nodePath), "=", 0)).
go/vt/topo/etcd2topo/file.go:	return EtcdVersion(txnresp.Header.Revision), nil
go/vt/topo/etcd2topo/file.go:func (s *Server) Update(ctx context.Context, filePath string, contents []byte, version topo.Version) (topo.Version, error) {
go/vt/topo/etcd2topo/file.go:			If(clientv3.Compare(clientv3.ModRevision(nodePath), "=", int64(version.(EtcdVersion)))).
go/vt/topo/etcd2topo/file.go:			return nil, topo.NewError(topo.BadVersion, nodePath)
go/vt/topo/etcd2topo/file.go:		return EtcdVersion(txnresp.Header.Revision), nil
go/vt/topo/etcd2topo/file.go:	return EtcdVersion(resp.Header.Revision), nil
go/vt/topo/etcd2topo/file.go:func (s *Server) Get(ctx context.Context, filePath string) ([]byte, topo.Version, error) {
go/vt/topo/etcd2topo/file.go:	return resp.Kvs[0].Value, EtcdVersion(resp.Kvs[0].ModRevision), nil
go/vt/topo/etcd2topo/file.go:		results[n].Version = EtcdVersion(pairs[n].ModRevision)
go/vt/topo/etcd2topo/file.go:func (s *Server) Delete(ctx context.Context, filePath string, version topo.Version) error {
go/vt/topo/etcd2topo/file.go:			If(clientv3.Compare(clientv3.ModRevision(nodePath), "=", int64(version.(EtcdVersion)))).
go/vt/topo/etcd2topo/file.go:					return topo.NewError(topo.BadVersion, nodePath)
go/vt/topo/etcd2topo/lock.go:		If(clientv3.Compare(clientv3.Version(newKey), "=", 0)).
go/vt/topo/etcd2topo/lock.go:	return newKey, txnresp.Header.Revision, nil
go/vt/topo/etcd2topo/server.go:			MinVersion:         tls.VersionTLS12,
go/vt/topo/etcd2topo/version.go:// EtcdVersion is etcd's idea of a version.
go/vt/topo/etcd2topo/version.go:// It implements topo.Version.
go/vt/topo/etcd2topo/version.go:type EtcdVersion int64
go/vt/topo/etcd2topo/version.go:// String is part of the topo.Version interface.
go/vt/topo/etcd2topo/version.go:func (v EtcdVersion) String() string {
go/vt/topo/etcd2topo/watch.go:		// ModRevision is used for the topo.Version value as we get the new Revision value back
go/vt/topo/etcd2topo/watch.go:		Version: EtcdVersion(initial.Kvs[0].ModRevision),
go/vt/topo/etcd2topo/watch.go:	watcher := s.cli.Watch(watchCtx, nodePath, clientv3.WithRev(initial.Header.Revision))
go/vt/topo/etcd2topo/watch.go:		var rev = initial.Header.Revision
go/vt/topo/etcd2topo/watch.go:				rev = wresp.Header.GetRevision()
go/vt/topo/etcd2topo/watch.go:							Version:  EtcdVersion(ev.Kv.ModRevision),
go/vt/topo/etcd2topo/watch.go:		wd.Version = EtcdVersion(initial.Kvs[0].ModRevision)
go/vt/topo/etcd2topo/watch.go:	watcher := s.cli.Watch(watchCtx, nodePath, clientv3.WithRev(initial.Header.Revision), clientv3.WithPrefix())
go/vt/topo/etcd2topo/watch.go:		var rev = initial.Header.Revision
go/vt/topo/etcd2topo/watch.go:				rev = wresp.Header.GetRevision()

That one exception is when explicitly doing create operations on the Vitess side:

  • Keys:
    // Create is part of the topo.Conn interface.
    func (s *Server) Create(ctx context.Context, filePath string, contents []byte) (topo.Version, error) {
    nodePath := path.Join(s.root, filePath)
    // We have to do a transaction, comparing existing version with 0.
    // This means: if the file doesn't exist, create it.
    txnresp, err := s.cli.Txn(ctx).
    If(clientv3.Compare(clientv3.Version(nodePath), "=", 0)).
    Then(clientv3.OpPut(nodePath, string(contents))).
    Commit()
  • Locks:
    // newUniqueEphemeralKV creates a new file in the provided directory.
    // It is linked to the Lease.
    // Errors returned are converted to topo errors.
    func (s *Server) newUniqueEphemeralKV(ctx context.Context, cli *clientv3.Client, leaseID clientv3.LeaseID, nodePath string, contents string) (string, int64, error) {
    // Use the lease ID as the file name, so it's guaranteed unique.
    newKey := fmt.Sprintf("%v/%v", nodePath, leaseID)
    // Only create a new file if it doesn't exist already
    // (version = 0), to avoid two processes using the
    // same file name. Since we use the lease ID, this should never happen.
    txnresp, err := cli.Txn(ctx).
    If(clientv3.Compare(clientv3.Version(newKey), "=", 0)).
    Then(clientv3.OpPut(newKey, contents, clientv3.WithLease(leaseID))).
    Commit()

The usage there makes sense to me because it's a special case when creating a key — we want to enforce that it doesn't already exist because at lower layers / within etcd it's a put operation whether you're creating a key or updating it.

So I'm going to add a test to confirm and enforce that we are using ModRevision in watches and that we observe the relevant semantics. This way we are at least consistent, testing the current behavior, and preventing (unintentional) changes going forward.

@mattlord
Copy link
Contributor Author

mattlord commented May 6, 2024

@rohit-nayak-ps Let's also look at an example since you seem to suspect that Version would give us better properties/semantics in determining order / sequence of operations and a key's relative logical version (is this key value newer than another one):

# It doesn't yet exist so its Version and ModRevision are both 0
❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":51,"raft_term":2}}

❯ etcdctl put /testval "foo"
OK

❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":52,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":52,"mod_revision":52,"version":1,"value":"Zm9v"}],"count":1}

❯ etcdctl put /testval "foo2"
OK

❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":53,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":52,"mod_revision":53,"version":2,"value":"Zm9vMg=="}],"count":1}

❯ etcdctl put /testval "foo3"
OK

❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":54,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":52,"mod_revision":54,"version":3,"value":"Zm9vMw=="}],"count":1}

❯ etcdctl put -w json /testval "fooX"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":55,"raft_term":2}}
❯ etcdctl put -w json /testval "fooX"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":56,"raft_term":2}}
❯ etcdctl put -w json /testval "fooX"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":57,"raft_term":2}}
❯ etcdctl put -w json /testval "fooX"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":58,"raft_term":2}}

❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":58,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":52,"mod_revision":58,"version":7,"value":"Zm9vWA=="}],"count":1}

❯ etcdctl del /testval
1

❯ etcdctl put -w json /testval "foo"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":60,"raft_term":2}}

❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":60,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":60,"mod_revision":60,"version":1,"value":"Zm9v"}],"count":1}

As you can see, the Revision value is monotonic. So we can always know that one logical version of a key is newer than another one when using ModRevision because the Revision value is a global monotonically increasing value giving us a global total order and the ModRevision value gives us a total order for the key — including when deletes are mixed in. Version, however, is NOT monotonic — it resets to 0 when it's deleted. So if we used that to try and determine logical versions and whether we're getting things in total order and determine relative order from that — is this version of the key logically newer than another — then a logically older version of the key could easily be seen as the most recent one. If you look at the above output, the Version goes from 0-7, then back to 1. So if we used the Version we'd see logical time going backwards and we'd think an old version of the node/key (let's say version 7) was actually newer than the latest (version 1).

Hopefully this, in combination with my previous comments, helps make it clear why using the Version value gives us worse properties overall IMO and we shouldn't choose to use it.

@rohit-nayak-ps
Copy link
Contributor

Let's also look at an example since you seem to suspect that Version would give us better properties/semantics in determining order / sequence of operations and a key's relative logical version (is this key value newer than another one):

I do understand the differences between Version and ModRevision and that if a key is deleted we restart the versioning.

My only context to ask about Version vs ModRevision is related to the watcher, since that is the code path this PR affects. My understanding in this context is:

If we don't care about the version of the watched object (has it been changed X times) and/or if it is a newly created key then it doesn't matter which one we use.

If it does matter, do we delete the watched key ever? If not, it doesn't matter which one we use.

If so, does it matter to us that a key was deleted? If not, it doesn't matter which one we use.

If so, we do want the Version (or both the CreateRevision and ModRevision) since it will tell us that we have a new key.

So in terms of watching specific keys, I believe the Version gives us more information than ModRevision. Or possibly sending all the version numbers, Version, CreateRevision and ModRevision is most flexible.

Just to round off the discussion on this PR, from my side, I am fine with whatever the others decide we want to do now for the watch versioning since we don't seem to depend on the versions in this context anyway.

We can always change it later if any specific functionality is needed that depends on this decision.

@mattlord mattlord requested a review from deepthi May 7, 2024 17:10
@mattlord
Copy link
Contributor Author

mattlord commented May 7, 2024

@rohit-nayak-ps and @deepthi thank you for the reviews and discussion! I've now thoroughly documented the reason for this PR and the reasoning for the changes made. I've also now added a unit test to confirm and enforce the new expected behavior.

I'm considering this work done unless I get any further review comments. Thanks again!

@deepthi deepthi merged commit 042e73e into vitessio:main May 12, 2024
93 checks passed
@deepthi deepthi deleted the etcd2_node_version branch May 12, 2024 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants