Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix unintended deadlock on key prefixes #6284

Merged
merged 1 commit into from
Aug 30, 2016
Merged

Conversation

glycerine
Copy link
Contributor

After winning a campaign, the call to waitDeletes() was missing a
slash after the key prefix. The result was deadlock due to
waiting on wrong keys; if a prior key exists that has the new key
as a prefix.

Fixes #6278

@glycerine
Copy link
Contributor Author

@xiang90, @heyitsanthony
Could you take a look.

The CI tests seem to fail for "package not found" reasons I don't understand, and seem unrelated to the patch.

@xiang90
Copy link
Contributor

xiang90 commented Aug 28, 2016

@glycerine Sorry for the inconvenience. The current master is broken due to a testing issue. We first need to get that first tomorrow.

@glycerine
Copy link
Contributor Author

@xiang90 -- no worries. As for the patch, I was wondering -- would it be better to push the change into the function waitDeletes(), so that any other future caller wouldn't make the same mistake? I didn't see any other uses of waitDeletes(), but I also can't tell if there are other expectations/legacy users I'm unaware of.

@xiang90
Copy link
Contributor

xiang90 commented Aug 28, 2016

@glycerine

I think we can actually add / or any other hard to conflict char to the end of the keyPrefix. So users will not accidentally pollute the key space used for election or locking. Right now, not only waitDeletes watches on raw prefix but also some other functions (https://github.com/coreos/etcd/blob/master/clientv3/concurrency/election.go#L147-L158) too.

I even think we probably should add prefixes like /lock/xy/. It can avoid conflicts and we can support operations like list locks, elections in the future. But that can break existing users :(.

It would be helpful if you can also write a test case for this in https://github.com/coreos/etcd/blob/master/integration/v3_election_test.go.

With all that being side, @heyitsanthony knows this part better than me and might provide better opinions.

@heyitsanthony
Copy link
Contributor

@glycerine thanks for the patch. I agree with @xiang90 that it'd be cleaner to add the "/" to the prefix in NewMutex/NewElection constructors up front and replacing the Sprintf("%s/%s"...) with Sprintf("%s%s").

Also please fix the commit title or CI won't accept it (the error is Expected commit title format '<package>{", "<package>}: <description>').

@xiang90 I feel like prefixing with /lock/xy breaks etcd3's clean keyspace guarantee since it reserves /lock/ from the start as having some additional meaning.

@glycerine
Copy link
Contributor Author

Done:

  1. title in commit changed to include package clientv3/concurrency.
  2. test added to integration/v3_election_test.go.
  3. slash added to prefix during NewElection and NewMutex construction, per suggestion.

@glycerine glycerine force-pushed the fix6278 branch 7 times, most recently from cc4e96c to be0cc34 Compare August 29, 2016 09:37
@glycerine
Copy link
Contributor Author

TravisCI passes. Semaphoreci points out that TestTxnReadRetry and TestWatchReconnRequest are flakey tests, independent of this change.

@xiang90, @heyitsanthony: ready to merge?

// Also a good basic test of single candidate election
// and simultaneous correct observation of an election.
//
func TestElectionOnPrefixOfExistingKey(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is more complicated than it needs to be. I think this should trigger the old, broken behavior with a lot less code:

clus := NewClusterV3(t, &ClusterConfig{Size : 1})
defer clus.Terminate(t)

cli := clus.RandClient()
if _, err := cli.Put(context.TODO(), "testa"); err != nil {
    t.Fatal(err)
}
s, serr :=  concurrency.NewSession(cli)
if serr != nil {
    return serr
}
e := concurrency.NewElection(s, "test")
ctx, cancel := context.WithTimeout(context.TODO(), 5*time.Second)
err := e.Campaign(ctx, "abc")
cancel()
if err != nil {
    t.Fatal(err)
}

@xiang90
Copy link
Contributor

xiang90 commented Aug 29, 2016

@glycerine Can you rebase with the master? We just fixed the CI issue on master. Sorry about the inconvenience.

@xiang90 xiang90 added this to the v3.1.0 milestone Aug 29, 2016
@glycerine
Copy link
Contributor Author

@heyitsanthony Cool. This is really a pretty small change and obviously you know the code much better than me. Why do you go ahead and merge your test version and fix per your preference.

@heyitsanthony
Copy link
Contributor

@glycerine are you abandoning this PR or what? Just update the test and I can merge this...?

After winning an election or obtaining a lock, we
auto-append a slash after the provided key prefix.
This avoids the previous deadlock due to waiting
on the wrong key.

Fixes etcd-io#6278
@glycerine
Copy link
Contributor Author

@heyitsanthony :Okay, rebased and updated test.

@heyitsanthony
Copy link
Contributor

lgtm

@xiang90
Copy link
Contributor

xiang90 commented Aug 30, 2016

@glycerine Thanks for the contribution!

@xiang90 xiang90 merged commit 547bf1a into etcd-io:master Aug 30, 2016
@glycerine
Copy link
Contributor Author

cool -- @heyitsanthony one quick question for future reference so I understand how these tests and context interact -- why is cancel() called after the Campaign() call? -- the Campaign() has already succeeded or timed-out by that point, so why is it needed?

e := concurrency.NewElection(s, "test")
ctx, cancel := context.WithTimeout(context.TODO(), 5*time.Second)
err := e.Campaign(ctx, "abc") // pauses for 5 seconds without the bug fix
cancel() // what does this do?
if err != nil {

@xiang90
Copy link
Contributor

xiang90 commented Aug 30, 2016

@glycerine That is the intended behavior for releasing the resource associated with the context. See the doc in context pkg: https://godoc.org/golang.org/x/net/context#WithCancel

@heyitsanthony heyitsanthony mentioned this pull request Aug 31, 2016
@glycerine glycerine deleted the fix6278 branch September 8, 2016 04:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

etcdctl elect results in deadlock: won election never reported to winner, if key is a prefix of existing key
3 participants