-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
recover lessor before recovering mvcc store and transactionally revoke leases #6098
Conversation
xiang90
commented
Aug 4, 2016
- always recover lessor first. When we recover the mvcc.KV it will reattach keys to its leases. If we recover mvcc.KV first, it will attach the keys to the wrong lessor before it recovers.
- any touch to mvcc.KV will increase the consistent index. Increasing consistent index will avoid the applier to re-apply the same entry. For the same entry, we must ensure all kv ops in the entry executed in a txn.
lgtm following CI fixups |
@heyitsanthony There is one additional fix I am working on. |
The previous logic is wrong. When we have hisotry like Put(foo, bar, lease1), and Put(foo, bar, lease2), we will end up with attaching foo to two leases 1 and 2. Similar things can happen for deattach by clearing the lease of a key. Now we try to fix this by starting to attach leases at the end of the recovery. We use a map to keep the last lease attachment state.
@heyitsanthony PTAL. We def need to add tests around leases for recovery, snapshot path. |
@@ -219,15 +224,27 @@ func (le *lessor) Revoke(id LeaseID) error { | |||
le.mu.Unlock() | |||
|
|||
if le.rd != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if le.rd == nil { return nil }
to pull back the indent for the interesting path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
lgtm maybe worth failpointing all the Recover()s for general testing |
@heyitsanthony We should do it for sure. This is too easy to get wrong. The original design was simple that only lessor knows about KV. After we decided to put leaseID into the key, lessor and KV are coupled together, causing a few issues. |
Test failed with
Do you think it is related to this change? |
@xiang90 that test resets an etcdserver; I wouldn't be surprised if it's affected by recovery path changes |
@heyitsanthony I will take a closer look. |
@heyitsanthony I checked the log. etcd restarting is fine. gRPC call blocks for more than 5 seconds. comparing to a successful case, there is one additional reconnecting happened which should not after the node restarted.
|
@xiang90 OK, I suspect this is related to the grpc backoff strategy / reconnect logic. Given the failing test took about ~13s before reaching the |
@heyitsanthony I reran the tests for a few times. No failure occured. So shall we assume it is unrelated and get this merged? I will open a new issue for the failed test. |
@xiang90 yeah, it's fine. The reconnection code has diverged in master since the last grpc vendoring anyway. |