
mvcc: cannot detach lease not found #6093

Closed
nekto0n opened this issue Aug 3, 2016 · 21 comments

@nekto0n

nekto0n commented Aug 3, 2016

Hi!
Just encountered a very unexpected issue. I'm using the latest etcd and had a key that was used as a lock, i.e. with a lease attached to it. For some reason the lease was still there even after I had stopped all clients for a while. I decided to remove the key using etcdctl del. After running this command, the whole cluster (5 nodes) went down with the error:

mvcc: cannot detach lease not found

I had to patch etcd to ignore this error; after that the cluster came back to life.

Not quite sure how I managed to get there (a stale lease), but I think you may be interested. Maybe there was a bug a while ago which is now fixed, but it lurked in my installation and revealed itself only now.

@xiang90
Contributor

xiang90 commented Aug 3, 2016

Is your deployment upgraded from a pre-release version of etcd3?

@nekto0n
Author

nekto0n commented Aug 3, 2016

Not sure, but that's quite possible, yes.

@xiang90
Contributor

xiang90 commented Aug 3, 2016

If you have time, could you help reproduce it? Set up the same lock logic and let the lease expire. In the meantime, I will take a look at the lease logic sometime this week.

@nekto0n
Author

nekto0n commented Aug 3, 2016

Thanks for such a prompt response!
I am quite sure I have another etcd ensemble with the same issue right now.

@xiang90
Contributor

xiang90 commented Aug 4, 2016

@nekto0n It would be really helpful if you could somehow reproduce it from a fresh cluster.

@xiang90
Contributor

xiang90 commented Aug 4, 2016

@nekto0n I think I found the bug. I will try to get a fix soon. Could you provide me with the full etcd log? Did a snapshot sending/receiving event happen?

@nekto0n
Author

nekto0n commented Aug 4, 2016

@xiang90 I tried to, but I failed to reproduce the bug on fresh/testing installations.

Could you provide me with the full etcd log? Did a snapshot sending/receiving event happen?

You'd like to look at the logs from around the etcdctl del call? It seems I'm missing them =( I only have logs from already-broken runs: http://pastebin.com/PYzy0pXn

@nekto0n
Author

nekto0n commented Aug 4, 2016

But as I said, I'm pretty sure I have at least one other cluster in a similar state, so I can issue etcdctl del and grab all the info you need :)

@xiang90
Contributor

xiang90 commented Aug 4, 2016

@nekto0n In the pre-release, there was a bug where, if one lease was attached to multiple keys, only the first key would be removed. In your case, did you attach multiple keys to one lease?

@nekto0n
Author

nekto0n commented Aug 4, 2016

@xiang90 No, I don't think so.

@xiang90
Contributor

xiang90 commented Aug 5, 2016

@nekto0n

We made a couple of fixes here: #6098.

Most of them are on the recovery path. Previously, etcd could mess up lease items during crash recovery.

If you can throw some workload onto the patched etcd, that would be great...

@nekto0n
Author

nekto0n commented Aug 5, 2016

Sure thing! I'll give it a spin over the weekend.

@xiang90
Contributor

xiang90 commented Aug 6, 2016

No, I don't think so.

Did you implement your own lock, or were you using our lock implementation? If you were using ours, we actually attach all lock keys to one lease.

@nekto0n
Author

nekto0n commented Aug 8, 2016

I used your Session implementation. I couldn't use Lock because I found no way to stop the worker when the lease expires. Either way, at the moment I use only one lock per process.

@xiang90
Contributor

xiang90 commented Aug 8, 2016

Can you please create an issue for the lock thing? We can improve it. I cannot really figure out where the bug was... We changed quite a lot of stuff between 2.3.x and 3.0. I think we can close this for now. Let us know if it happens again; it would be great if you can reproduce it.

xiang90 closed this as completed Aug 8, 2016
@nekto0n
Author

nekto0n commented Aug 8, 2016

Yeah, sure. I just thought it was meant to look like sync.Mutex, which cannot be unlocked behind your back :)

@nekto0n
Author

nekto0n commented Aug 8, 2016

BTW, I built and installed etcd from master, and something is still holding the lock file. Can I now remove it with etcdctl without shutting down the whole ensemble?

@heyitsanthony
Contributor

@nekto0n what do you mean by lock file?

If etcd is refusing to start, then an etcd process is still running on the target member directory and needs to be stopped before running the new version of etcd. It can be done one-by-one so the cluster stays up.

If the mutex is still held, you could try revoking the lease on the mutex key with the lowest create revision, which will abort the holder's session.

@nekto0n
Author

nekto0n commented Aug 8, 2016

Sorry for being vague. By "lock file" I meant a key with an attached lease. I can confirm that the issue is fixed: after removing the key with the stale lease I got an error, E | mvcc: cannot detach lease not found, not a panic.

@xiang90
Contributor

xiang90 commented Aug 8, 2016

@nekto0n If you start with a fresh cluster, even that error should never appear. Otherwise there is still a bug somewhere.

@nekto0n
Author

nekto0n commented Aug 8, 2016

@xiang90 Right, I failed to reproduce this with a fresh cluster. Thanks a lot for your help!
