
recover lessor before recovering mvcc store and transactionally revoke leases #6098

Merged
merged 5 commits into etcd-io:master from the lease branch on Aug 5, 2016

Conversation

xiang90
Contributor

@xiang90 xiang90 commented Aug 4, 2016

  1. Always recover the lessor first. When we recover mvcc.KV, it re-attaches keys to their leases; if we recover mvcc.KV first, the keys get attached to a lessor that has not recovered yet, i.e. to the wrong lease state. (A minimal sketch of this ordering follows the list.)
  2. Any touch to mvcc.KV increases the consistent index, and once the consistent index moves, the applier will not re-apply the same entry. So for any single entry, all of its kv ops must execute inside one txn.
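
A minimal sketch of that recovery ordering, assuming hypothetical lessor/kvStore stand-ins rather than etcd's real interfaces:

```go
package main

import "fmt"

type LeaseID int64

type lessor struct {
	leases map[LeaseID]map[string]struct{} // lease ID -> set of attached keys
}

// recover rebuilds the lessor's lease set (here just lease 1, hard-coded).
func (le *lessor) recover() {
	le.leases = map[LeaseID]map[string]struct{}{
		1: {},
	}
}

// attach only succeeds once the lease itself has been recovered.
func (le *lessor) attach(id LeaseID, key string) error {
	keys, ok := le.leases[id]
	if !ok {
		return fmt.Errorf("lease %d not recovered yet", id)
	}
	keys[key] = struct{}{}
	return nil
}

type kvStore struct{ le *lessor }

// recover walks the store and re-attaches keys to their leases; doing this
// before the lessor recovers would attach to the wrong (empty) lessor state.
func (kv *kvStore) recover() error {
	return kv.le.attach(1, "foo")
}

func main() {
	le := &lessor{}
	kv := &kvStore{le: le}

	le.recover() // 1. recover the lessor first
	if err := kv.recover(); err != nil { // 2. then recover mvcc.KV, which re-attaches keys
		fmt.Println("recover:", err)
	}
}
```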

@xiang90 xiang90 changed the title from Lease to Fix Lease on Aug 4, 2016
@heyitsanthony
Contributor

lgtm following CI fixups

@xiang90
Contributor Author

xiang90 commented Aug 4, 2016

@heyitsanthony There is one additional fix I am working on.

The previous logic is wrong. When we have a history like Put(foo, bar, lease1)
followed by Put(foo, bar, lease2), we end up attaching foo to both lease 1 and
lease 2. The same problem exists for detach, when a key's lease is cleared.

Now we fix this by attaching leases only at the end of recovery, using a map to
keep the last lease attachment state for each key.
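
A minimal sketch of that last-wins map approach; the rev type, the NoLease sentinel, and the replayed history are illustrative assumptions, not etcd's actual recovery code:

```go
package main

import "fmt"

type LeaseID int64

// NoLease marks a key whose lease was cleared; the name is an assumption.
const NoLease LeaseID = 0

type rev struct {
	key   string
	lease LeaseID
}

func main() {
	// History replayed during recovery: foo was put under lease 1, then put
	// again under lease 2. Attaching eagerly per revision would leave foo
	// attached to both leases.
	history := []rev{
		{"foo", 1},
		{"foo", 2},
	}

	// Keep only the last attachment state per key.
	last := map[string]LeaseID{}
	for _, r := range history {
		last[r.key] = r.lease
	}

	// Attach once, at the end of recovery; cleared keys are skipped.
	for key, id := range last {
		if id == NoLease {
			continue
		}
		fmt.Printf("attach %q to lease %d\n", key, id) // only foo -> lease 2
	}
}
```
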
@xiang90
Contributor Author

xiang90 commented Aug 4, 2016

@heyitsanthony PTAL. We definitely need to add tests around leases for the recovery and snapshot paths.

@@ -219,15 +224,27 @@ func (le *lessor) Revoke(id LeaseID) error {
le.mu.Unlock()

if le.rd != nil {
Contributor

if le.rd == nil { return nil } to pull back the indent for the interesting path?
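
A minimal illustration of that early-return shape, with stand-in types rather than the real lessor/Revoke code, and with the attached keys deleted in a single txn as described in the PR:

```go
package main

import "sync"

type LeaseID int64

// txn is a stand-in for a delete transaction on the backing store.
type txn interface {
	DeleteRange(key string)
	End()
}

type lessor struct {
	mu sync.Mutex
	rd func() txn // range deleter; nil when there is no backing store
}

func (le *lessor) Revoke(id LeaseID) error {
	le.mu.Lock()
	keys := []string{"foo"} // keys attached to the lease (lookup elided)
	le.mu.Unlock()

	// Early return keeps the interesting path at the top indent level
	// instead of nesting it under `if le.rd != nil { ... }`.
	if le.rd == nil {
		return nil
	}

	// Delete every attached key inside a single txn, so the whole revoke is
	// applied (and the consistent index moves) exactly once for this entry.
	t := le.rd()
	for _, k := range keys {
		t.DeleteRange(k)
	}
	t.End()
	return nil
}

func main() {}
```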

Contributor Author

ok.

@heyitsanthony
Contributor

heyitsanthony commented Aug 4, 2016

lgtm

maybe worth failpointing all the Recover()s for general testing
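
A rough sketch of the failpoint idea as a plain test hook; this only illustrates the concept and is not etcd's actual failpoint tooling:

```go
package main

import (
	"errors"
	"fmt"
)

// recoverFailpoint is a hypothetical test hook: nil in production, set by a
// test to force a failure inside Recover.
var recoverFailpoint func() error

type lessor struct{}

func (le *lessor) Recover() error {
	if recoverFailpoint != nil {
		if err := recoverFailpoint(); err != nil {
			return err
		}
	}
	// ... real recovery work would go here ...
	return nil
}

func main() {
	le := &lessor{}

	// A test arms the failpoint to exercise the error path during recovery.
	recoverFailpoint = func() error { return errors.New("injected recover failure") }
	fmt.Println(le.Recover())
}
```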

@xiang90
Contributor Author

xiang90 commented Aug 4, 2016

@heyitsanthony We should do it for sure. This is too easy to get wrong. The original design was simple: only the lessor knew about the KV. After we decided to put the leaseID into the key, the lessor and KV became coupled, which has caused a few issues.

@xiang90
Contributor Author

xiang90 commented Aug 5, 2016

@heyitsanthony

Test failed with

--- FAIL: TestKVPutFailGetRetry (18.43s)
    kv_test.go:622: timed out waiting for get
    kv_test.go:609: grpc: the client connection is closing

Do you think it is related to this change?

@heyitsanthony
Contributor

@xiang90 that test resets an etcdserver; I wouldn't be surprised if it's affected by recovery path changes

@xiang90
Contributor Author

xiang90 commented Aug 5, 2016

@heyitsanthony I will take a closer look.

@xiang90
Contributor Author

xiang90 commented Aug 5, 2016

@heyitsanthony I checked the log. The etcd restart itself is fine, but a gRPC call blocks for more than 5 seconds.

Compared to a successful run, there is one additional reconnect that should not have happened after the node restarted.

2016-08-05 03:51:32.583520 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node670786626.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node670786626.sock.bridge" <nil>}
2016-08-05 03:51:48.157978 I | v3rpc/grpc: grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
2016-08-05 03:51:48.158019 I | integration: terminating node670786626 (unix://localhost:node670786626.sock.bridge)

@heyitsanthony
Contributor

@xiang90 OK, I suspect this is related to the grpc backoff strategy / reconnect logic. Given that the failing test took ~13s before reaching the select timeout, I can see the retries getting out of hand.

@xiang90
Contributor Author

xiang90 commented Aug 5, 2016

@heyitsanthony I reran the tests a few times and no failures occurred. Shall we assume it is unrelated and get this merged? I will open a new issue for the failing test.

@heyitsanthony
Contributor

@xiang90 yeah, it's fine. The reconnection code has diverged in master since the last grpc vendoring anyway.

@xiang90 xiang90 merged commit 4a7fabd into etcd-io:master Aug 5, 2016
@xiang90 xiang90 deleted the lease branch August 5, 2016 17:08
@gyuho gyuho changed the title from Fix Lease to recover lessor before recovering mvcc store and transactionally revoke leases on Aug 19, 2016