-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upgrade to etcd/client/v3 #12
base: flatcar-master
Are you sure you want to change the base?
Conversation
47c3147
to
a1a5db4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we will need to use Txn
in one more place.
I'm on the fence with what you did with the tests. I kinda like the former table-based testing. :) Any reason to move away from that?
lock/etcd.go
Outdated
setopts := &client.SetOptions{ | ||
PrevIndex: sem.Index, | ||
} | ||
|
||
_, err = c.keyapi.Set(context.Background(), c.keypath, string(b), setopts) | ||
_, err = c.keyapi.Put(context.Background(), c.keypath, string(b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be turned into a transaction too, because the old code had a SetOption to apply the changes if the previous index is a value known to us. I think that with v3 it could be something like:
response, err = c.keyapi.Txn(context.Background()).If
client.Compare(client.Version(c.keypath), "=", sem.Index),
).Then(
client.OpPut(c.keypath, string(payload)),
).Commit()
if err != nil {
return err
}
// If it's true, it means that the "then" branch was taken, false - "else" branch was taken.
if !response.Succeeded {
return fmt.Errorf("failed to set the semaphore - it got updated in the meantime")
}
return nil
Also, maybe the Index
field in the Semaphore
struct should be renamed to Version
?
This change will probably allow us to get rid of the Set
method in the KeyAPI
interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - using Txn
instead of Put
will provide more controls on the insertion. The s/Index/Version/
makes sense too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it would be nice to have some congestion tests around that to make sure things behave as expected in distributed environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@invidian I don't know what are congestion tests - do you have some example to provide ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant, if more than one client initializes etc at the same time. Maybe https://github.com/golang/mock could be useful here to simulate the races?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree that we should cover this section - but I also think that etcd
provides the right tool: concurrency.Mutex
to ensure that we don't have this kind of race conditions.
Package concurrency implements concurrency operations on top of etcd such as distributed locks, barriers, and elections.
When I started to upgrade to v3
I knew that the changes will be quite something so I tried to mitigate the migration impact. That's why I did not used the concurrency.Mutex
as mentioned in the PR description:
did not use the client/v3/concurrency. in order to not break the current locksmithctl semaphore logic
But now, I feel we are actually redeveloping this concurrency.Mutex
through Txn
and racing tests - so it might actually be the correct choice to go ahead with the Mutex
. What do you think ? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree that we should cover this section - but I also think that
etcd
provides the right tool:concurrency.Mutex
to ensure that we don't have this kind of race conditions.Package concurrency implements concurrency operations on top of etcd such as distributed locks, barriers, and elections.
When I started to upgrade to
v3
I knew that the changes will be quite something so I tried to mitigate the migration impact. That's why I did not used theconcurrency.Mutex
as mentioned in the PR description:did not use the client/v3/concurrency. in order to not break the current locksmithctl semaphore logic
But now, I feel we are actually redeveloping this
concurrency.Mutex
throughTxn
and racing tests - so it might actually be the correct choice to go ahead with theMutex
. What do you think ? :)
Up to you. :) On one hand it would probably make the code easier to understand and the tests would be easier to write (maybe). On the other hand, it introduces more back-and-forth communication with etcdserver. Not sure if the latter is a big problem, so using the mutex may be a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mm, just quickly read the documentation - it seems we can't really use the concurrency.Mutex
because it does not allow to use custom semaphores. It relies on the key
to lock/unlock - so we need to think about how we could authorize many instances ( > 1) to reboot - in other terms, how we could implement the set-max
command using concurrency.Mutex
. (see also: etcd-io/etcd#12490)
Also it gives a better reading of the running tests + can eventually be run in parallel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some general suggestions. Overall, it seems it would be nice to have a better test suite for this code, testing implied properties of the code, e.g. what happens on write conflicts etc.
I really appreciate your work @tormath1 and I'm sure it's worth an effort.
BTW, Did we consider what will happen when this patch hits users? Some locksmiths will be using v2 storage and some v3, which means more nodes will be able to update at a time than it's configured, but only when 2 different version updates are happening at the same time I guess (v2 -> v3 and v3 -> another using v3). So I guess it should be fine?
github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf | ||
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f | ||
github.com/godbus/dbus v4.1.0+incompatible // indirect | ||
github.com/godbus/dbus/v5 v5.0.3 | ||
github.com/gogo/protobuf v1.3.1 // indirect | ||
github.com/godbus/dbus/v5 v5.0.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: dbus lib has also been updated here.
lock/etcd.go
Outdated
|
||
"golang.org/x/net/context" | ||
) | ||
|
||
// ErrNotFound is used when a key is not found - which means | ||
// it returns 0 value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// it returns 0 value | |
// it returns 0 value. |
lock/etcd.go
Outdated
// So we first try to get the value, if the value is not found we create the key | ||
// with a default semaphore value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// So we first try to get the value, if the value is not found we create the key | |
// with a default semaphore value | |
// So we first try to get the value, if the value is not found we create the key | |
// with a default semaphore value. |
lock/etcd.go
Outdated
// there is no proper way to handle non-existing value for a | ||
// given key | ||
// https://github.com/etcd-io/etcd/issues/6089 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// there is no proper way to handle non-existing value for a | |
// given key | |
// https://github.com/etcd-io/etcd/issues/6089 | |
// There is no proper way to handle non-existing value for a | |
// given key. | |
// See https://github.com/etcd-io/etcd/issues/6089 for more details. |
lock/etcd.go
Outdated
if err != nil { | ||
return err | ||
} | ||
if _, err := c.Get(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd flip the error logic here to save on indentation. if err == nil { return nil }
, then if !errors.Is(err, ErrNotFound) { return fmt.Errorf("...") }
, I think it should be easier to read.
lock/etcd_test.go
Outdated
err error | ||
} | ||
|
||
func (t *testTxn) If(cs ...client.Cmp) client.Txn { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe if defined like this it won't have to be a pointer in a struct and it's zero-value can be used, so you don't have to initialize it in all tests?
func (t *testTxn) If(cs ...client.Cmp) client.Txn { | |
func (t testTxn) If(cs ...client.Cmp) client.Txn { |
lock/etcd.go
Outdated
return fmt.Errorf("unable to marshal initial semaphore: %w", err) | ||
} | ||
|
||
if _, err := c.keyapi.Txn(context.Background()).If( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is not a top-level context, I think using context.TODO()
is more appropriate here.
if _, err := c.keyapi.Txn(context.Background()).If( | |
if _, err := c.keyapi.Txn(context.Background()).If( |
go.mod
Outdated
go.uber.org/zap v1.16.0 // indirect | ||
golang.org/x/net v0.0.0-20201031054903-ff519b6c9102 | ||
google.golang.org/grpc v1.33.2 // indirect | ||
github.com/stretchr/testify v1.7.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't add a testify as a dependency here. Is it really worth it for few lines saved?
lock/etcd.go
Outdated
setopts := &client.SetOptions{ | ||
PrevIndex: sem.Index, | ||
} | ||
|
||
_, err = c.keyapi.Set(context.Background(), c.keypath, string(b), setopts) | ||
_, err = c.keyapi.Put(context.Background(), c.keypath, string(b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it would be nice to have some congestion tests around that to make sure things behave as expected in distributed environment.
lock/etcd_test.go
Outdated
putResp: &client.PutResponse{}, | ||
txn: &testTxn{}, | ||
}, | ||
"", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make ""
(group) a const named like testGroup
, so it's easier to figure out what is it?
Signed-off-by: Mathieu Tortuyaux <[email protected]>
Signed-off-by: Mathieu Tortuyaux <[email protected]>
Signed-off-by: Mathieu Tortuyaux <[email protected]>
we now use the `KV` client to Get and Set the semaphore Signed-off-by: Mathieu Tortuyaux <[email protected]>
the main difference is that we don't use Transport but TLSConfig directly. We also return a `KV` client. Signed-off-by: Mathieu Tortuyaux <[email protected]>
Using etcd/v3, it seems we don't need to specify the scheme (`http`/`https`). If it's specified, it will use the ones provided by default and stick to it - even if it's HTTP and we have a HTTPS configured etcd. So it will fail because we send http request to a https service Signed-off-by: Mathieu Tortuyaux <[email protected]>
Signed-off-by: Mathieu Tortuyaux <[email protected]>
10d3cf7
to
a229739
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I've tried reviewing this PR couple of times and I still get stuck on reviewing the test suite, as I'm afraid modifying both test suite logic and the code may introduce some regressions.
I wonder if we could modify the test suite (and possibly code) in a way that upgrading to new etcd client version won't affect those tests. However, this requires capturing the essence of existing behavior and writing tests for it, which is not always easy.
google.golang.org/grpc => google.golang.org/grpc v1.29.1 | ||
) | ||
// Most recent etcd version is not compatible with grpc v1.31.x. | ||
replace google.golang.org/grpc => google.golang.org/grpc v1.29.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this can be dropped. v3.5.0 use grpc v1.38.0: https://github.com/etcd-io/etcd/blob/946a5a6f25c3b6b89408ab447852731bde6e6289/go.mod#L35
} | ||
|
||
if !response.Succeeded { | ||
return fmt.Errorf("failed to set the semaphore - it got updated in the meantime") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just:
return fmt.Errorf("failed to set the semaphore - it got updated in the meantime") | |
return fmt.Errorf("semaphore got updated in the meantime") |
if err != nil { | ||
return err | ||
return fmt.Errorf("unable to marshal initial semaphore: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("unable to marshal initial semaphore: %w", err) | |
return fmt.Errorf("marshaling initial semaphore: %w", err) |
client.OpPut(c.keypath, string(payload)), | ||
). | ||
Commit(); err != nil { | ||
return fmt.Errorf("unable to commit initial transaction: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("unable to commit initial transaction: %w", err) | |
return fmt.Errorf("committing initial transaction: %w", err) |
pb "go.etcd.io/etcd/api/v3/mvccpb" | ||
client "go.etcd.io/etcd/client/v3" | ||
|
||
"golang.org/x/net/context" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pb "go.etcd.io/etcd/api/v3/mvccpb" | |
client "go.etcd.io/etcd/client/v3" | |
"golang.org/x/net/context" | |
pb "go.etcd.io/etcd/api/v3/mvccpb" | |
client "go.etcd.io/etcd/client/v3" | |
"golang.org/x/net/context" |
getResp: &client.GetResponse{ | ||
Count: 1, | ||
Kvs: []*pb.KeyValue{ | ||
&pb.KeyValue{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&pb.KeyValue{ | |
{ |
getResp: &client.GetResponse{ | ||
Count: 1, | ||
Kvs: []*pb.KeyValue{ | ||
&pb.KeyValue{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&pb.KeyValue{ | |
{ |
}) | ||
t.Run("Error", func(t *testing.T) { | ||
_, err := NewEtcdLockClient(&testEtcdClient{ | ||
txn: testTxn{err: errors.New("connection refused")}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set the error here, you should be able to make use of errors.Is()
perhaps?
if err.Error() != "unable to init etcd lock client: unable to commit initial transaction: connection refused" { | ||
t.Fatalf("error should be 'unable to init etcd lock client: unable to commit initial transaction: connection refused', got: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string error could be put in variable at least, to avoid duplication, if we stay on string comparison for errors.
testGroup, | ||
) | ||
if err != nil { | ||
t.Fatalf("error should be nil, got: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to errors, it's good to explain what went wrong when you fail the test. For example:
t.Fatalf("error should be nil, got: %v", err) | |
t.Fatalf("Failed creating new etcd lock client: %v", err) |
I think we should leverage airlock as etcd3 client through the fleetlock protocol. Then we don't need to implement this again in locksmith. |
in this PR, we migrate from
etcd/client/v2
toetcd/client/v3
.Some high level changes:
KV
instead ofKeyAPI
from V2client/v3/concurrency.
in order to not break the currentlocksmithctl
semaphore logicTesting done
run kola tests with locksmith:
coreos.locksmith.reboot
coreos.locksmith.tls
cl.locksmith.cluster
How to use
easy way:
or using the SDK:
then
emerge-amd64-usr -av app-admin/locksmith && ./build_image
SSH into the VM - then play a bit with the
./locksmithctl
and assert it consumes correctlyETCDCTL_API=3
Note for reviewers
This one fde90bd was a hard one to solve and I think it results from an implementation mistake - no matter which
LOCKSMITHD_ENDPOINT
we set as environment variable it will still be added to the default ones - with V2 it seems it was not an issue but with V3, in the test, we set a TLSetcd
node on::2379
since the connection is considered as a success with the defaulthttp://localhost:2379
this endpoint will be used as the correct one .. but each call will fail because we sendhttp
to ahttps
service.