upgrade to etcd/client/v3 #12

Open
wants to merge 7 commits into
base: flatcar-master

tormath1
Contributor

@tormath1 tormath1 commented Aug 13, 2021

In this PR, we migrate from etcd/client/v2 to etcd/client/v3.

Some high-level changes:

  • rework the Init method: we first check whether the key exists and, if it does not, we create it
  • use KV instead of KeyAPI from v2
  • did not use client/v3/concurrency, in order not to break the current locksmithctl semaphore logic
  • rework the tests a bit, as a first step towards decent coverage

Testing done

  • update the unit tests

run kola tests with locksmith:

  • coreos.locksmith.reboot
  • coreos.locksmith.tls
  • cl.locksmith.cluster

How to use

easy way:

make
scp -P 2222 ./bin/locksmithctl [email protected]:/home/core

or using the SDK:

diff --git a/app-admin/locksmith/locksmith-9999.ebuild b/app-admin/locksmith/locksmith-9999.ebuild
index 1fab0c6e4..2fca232f3 100644
--- a/app-admin/locksmith/locksmith-9999.ebuild
+++ b/app-admin/locksmith/locksmith-9999.ebuild
@@ -11,7 +11,7 @@ inherit cros-workon systemd coreos-go
 if [[ "${PV}" == 9999 ]]; then
        KEYWORDS="~amd64 ~arm64"
 else
-       CROS_WORKON_COMMIT="085ff774311dba979a53d049f6a776e156224437" # flatcar-master
+       CROS_WORKON_COMMIT="cc4db15e48c1afc979ac1c73c3ac680cb2369c43" # tormath1/upgrade-etcd
        KEYWORDS="amd64 arm64"
 fi

then

emerge-amd64-usr -av app-admin/locksmith && ./build_image

SSH into the VM, then play a bit with ./locksmithctl and check that it correctly uses the etcd v3 API (the key is readable with ETCDCTL_API=3):

$ sudo systemctl start etcd-member.service
$ ./locksmithctl status
Available: 1
Max: 1
$ ./locksmithctl set-max 123
Old-Max: 1
Max: 123
$ ETCDCTL_API=3 etcdctl get "coreos.com/updateengine/rebootlock/semaphore"
coreos.com/updateengine/rebootlock/semaphore
{"semaphore":123,"max":123,"holders":[]}
$ ./locksmithctl lock
$ ETCDCTL_API=3 etcdctl get "coreos.com/updateengine/rebootlock/semaphore"
coreos.com/updateengine/rebootlock/semaphore
{"semaphore":122,"max":123,"holders":["0f80dd16217b482eb89ffea2e1412115"]}

Note for reviewers

This one (fde90bd) was a hard one to solve, and I think it results from an implementation mistake: no matter which LOCKSMITHD_ENDPOINT we set as an environment variable, it is still appended to the default endpoints. With v2 this did not seem to be an issue. With v3, in the test, we set up a TLS etcd node on ::2379; since the connection to the default http://localhost:2379 is considered a success, that endpoint is used as the correct one, but each call then fails because we send HTTP to an HTTPS service.

@tormath1 tormath1 self-assigned this Aug 13, 2021
@tormath1 tormath1 changed the title [wip] upgrade to etcd/client/v3 upgrade to etcd/client/v3 Aug 16, 2021
@tormath1 tormath1 marked this pull request as ready for review August 16, 2021 17:12
@tormath1 tormath1 requested review from a team and invidian August 16, 2021 17:12
Member

@krnowak krnowak left a comment

Looks like we will need to use Txn in one more place.

I'm on the fence with what you did with the tests. I kinda like the former table-based testing. :) Any reason to move away from that?

lock/etcd.go Outdated
Comment on lines 117 to 132
setopts := &client.SetOptions{
PrevIndex: sem.Index,
}

_, err = c.keyapi.Set(context.Background(), c.keypath, string(b), setopts)
_, err = c.keyapi.Put(context.Background(), c.keypath, string(b))
Member

I think this needs to be turned into a transaction too, because the old code had a SetOption to apply the changes if the previous index is a value known to us. I think that with v3 it could be something like:

response, err = c.keyapi.Txn(context.Background()).If(
	client.Compare(client.Version(c.keypath), "=", sem.Index),
).Then(
	client.OpPut(c.keypath, string(payload)),
).Commit()
if err != nil {
	return err
}
// If it's true, it means that the "then" branch was taken, false - "else" branch was taken.
if !response.Succeeded {
	return fmt.Errorf("failed to set the semaphore - it got updated in the meantime")
}
return nil

Also, maybe the Index field in the Semaphore struct should be renamed to Version?

This change will probably allow us to get rid of the Set method in the KeyAPI interface.

Contributor Author

I agree - using Txn instead of Put will give us more control over the insertion. The s/Index/Version/ makes sense too.

Member

It seems it would be nice to have some congestion tests around that, to make sure things behave as expected in a distributed environment.

Contributor Author

@invidian I don't know what congestion tests are - do you have an example?

Member

I meant, if more than one client initializes etc at the same time. Maybe https://github.com/golang/mock could be useful here to simulate the races?

Contributor Author

I do agree that we should cover this section - but I also think that etcd provides the right tool, concurrency.Mutex, to ensure that we don't have this kind of race condition.

Package concurrency implements concurrency operations on top of etcd such as distributed locks, barriers, and elections.

When I started the upgrade to v3 I knew that the changes would be substantial, so I tried to limit the migration impact. That's why I did not use concurrency.Mutex, as mentioned in the PR description:

did not use client/v3/concurrency, in order not to break the current locksmithctl semaphore logic

But now I feel we are actually redeveloping concurrency.Mutex through Txn and racing tests, so it might actually be the correct choice to go ahead with the Mutex. What do you think? :)

Member

I do agree that we should cover this section - but I also think that etcd provides the right tool, concurrency.Mutex, to ensure that we don't have this kind of race condition.

Package concurrency implements concurrency operations on top of etcd such as distributed locks, barriers, and elections.

When I started the upgrade to v3 I knew that the changes would be substantial, so I tried to limit the migration impact. That's why I did not use concurrency.Mutex, as mentioned in the PR description:

did not use client/v3/concurrency, in order not to break the current locksmithctl semaphore logic

But now I feel we are actually redeveloping concurrency.Mutex through Txn and racing tests, so it might actually be the correct choice to go ahead with the Mutex. What do you think? :)

Up to you. :) On one hand it would probably make the code easier to understand and the tests would be easier to write (maybe). On the other hand, it introduces more back-and-forth communication with etcdserver. Not sure if the latter is a big problem, so using the mutex may be a good idea.

Contributor Author

Mm, I just quickly read the documentation - it seems we can't really use concurrency.Mutex because it does not support custom semaphores. It relies on the key to lock/unlock, so we need to think about how we could allow more than one instance to reboot at a time; in other words, how we could implement the set-max command using concurrency.Mutex. (see also: etcd-io/etcd#12490)

@tormath1
Contributor Author

I'm on the fence with what you did with the tests. I kinda like the former table-based testing. :) Any reason to move away from that?

Table tests were OK until now because we were only using an {err, response} struct, but we now have {err, getResponse, setResponse, Txn}, which can be quite big to define in a table: a1a5db4#diff-509fec5f7f91a11e078c85b72f3645555168d99ae5a8a036f2d0c807b0eb673fR138-R146 😕

Also, it makes the running tests easier to read, and they can eventually be run in parallel.

Member

@invidian invidian left a comment

Some general suggestions. Overall, it seems it would be nice to have a better test suite for this code, testing implied properties of the code, e.g. what happens on write conflicts etc.

I really appreciate your work @tormath1 and I'm sure it's worth the effort.

BTW, did we consider what will happen when this patch hits users? Some locksmiths will be using v2 storage and some v3, which means more nodes than configured will be able to update at a time, but only when two different version updates are happening at the same time, I guess (v2 -> v3, and v3 -> another version using v3). So I guess it should be fine?

github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f
github.com/godbus/dbus v4.1.0+incompatible // indirect
github.com/godbus/dbus/v5 v5.0.3
github.com/gogo/protobuf v1.3.1 // indirect
github.com/godbus/dbus/v5 v5.0.4
Member

Note: dbus lib has also been updated here.

lock/etcd.go Outdated

"golang.org/x/net/context"
)

// ErrNotFound is used when a key is not found - which means
// it returns 0 value
Member

Suggested change
// it returns 0 value
// it returns 0 value.

lock/etcd.go Outdated
Comment on lines 74 to 75
// So we first try to get the value, if the value is not found we create the key
// with a default semaphore value
Member

Suggested change
// So we first try to get the value, if the value is not found we create the key
// with a default semaphore value
// So we first try to get the value, if the value is not found we create the key
// with a default semaphore value.

lock/etcd.go Outdated
Comment on lines 99 to 101
// there is no proper way to handle non-existing value for a
// given key
// https://github.com/etcd-io/etcd/issues/6089
Member

Suggested change
// there is no proper way to handle non-existing value for a
// given key
// https://github.com/etcd-io/etcd/issues/6089
// There is no proper way to handle non-existing value for a
// given key.
// See https://github.com/etcd-io/etcd/issues/6089 for more details.

lock/etcd.go Outdated
if err != nil {
return err
}
if _, err := c.Get(); err != nil {
Member

I'd flip the error logic here to save on indentation. if err == nil { return nil }, then if !errors.Is(err, ErrNotFound) { return fmt.Errorf("...") }, I think it should be easier to read.

err error
}

func (t *testTxn) If(cs ...client.Cmp) client.Txn {
Member

Maybe if it is defined like this, it won't have to be a pointer in a struct and its zero value can be used, so you don't have to initialize it in all tests?

Suggested change
func (t *testTxn) If(cs ...client.Cmp) client.Txn {
func (t testTxn) If(cs ...client.Cmp) client.Txn {

lock/etcd.go Outdated
return fmt.Errorf("unable to marshal initial semaphore: %w", err)
}

if _, err := c.keyapi.Txn(context.Background()).If(
Member

Given that this is not a top-level context, I think using context.TODO() is more appropriate here.

Suggested change
if _, err := c.keyapi.Txn(context.Background()).If(
if _, err := c.keyapi.Txn(context.TODO()).If(

go.mod Outdated
go.uber.org/zap v1.16.0 // indirect
golang.org/x/net v0.0.0-20201031054903-ff519b6c9102
google.golang.org/grpc v1.33.2 // indirect
github.com/stretchr/testify v1.7.0
Member

I wouldn't add testify as a dependency here. Is it really worth it for a few lines saved?

putResp: &client.PutResponse{},
txn: &testTxn{},
},
"",
Member

Can we make "" (group) a const named something like testGroup, so it's easier to figure out what it is?

Mathieu Tortuyaux added 7 commits August 31, 2021 15:47
Signed-off-by: Mathieu Tortuyaux <[email protected]>
Signed-off-by: Mathieu Tortuyaux <[email protected]>
Signed-off-by: Mathieu Tortuyaux <[email protected]>
we now use the `KV` client to Get and Set the semaphore

Signed-off-by: Mathieu Tortuyaux <[email protected]>
the main difference is that we don't use Transport but TLSConfig
directly.

We also return a `KV` client.

Signed-off-by: Mathieu Tortuyaux <[email protected]>
Using etcd/v3, it seems we don't need to specify the scheme
(`http`/`https`).

If it's specified, it will use the ones provided by default and stick to
it - even if it's HTTP and we have a HTTPS configured etcd.

So it will fail because we send http request to a https service

Signed-off-by: Mathieu Tortuyaux <[email protected]>
@tormath1
Contributor Author

  • rebased onto flatcar-master to solve a conflict
  • squashed fixup! commits from first review
  • added a test to mock transaction result
  • removed the Put to use Txn instead
  • removed the testify module dependency (still in go.sum, it seems to be used by etcd)

Member

@invidian invidian left a comment

Hmm, I've tried reviewing this PR a couple of times and I still get stuck on the test suite, as I'm afraid that modifying both the test suite logic and the code may introduce some regressions.

I wonder if we could modify the test suite (and possibly code) in a way that upgrading to new etcd client version won't affect those tests. However, this requires capturing the essence of existing behavior and writing tests for it, which is not always easy.

google.golang.org/grpc => google.golang.org/grpc v1.29.1
)
// Most recent etcd version is not compatible with grpc v1.31.x.
replace google.golang.org/grpc => google.golang.org/grpc v1.29.1
Member

}

if !response.Succeeded {
return fmt.Errorf("failed to set the semaphore - it got updated in the meantime")
Member

Maybe just:

Suggested change
return fmt.Errorf("failed to set the semaphore - it got updated in the meantime")
return fmt.Errorf("semaphore got updated in the meantime")

if err != nil {
return err
return fmt.Errorf("unable to marshal initial semaphore: %w", err)
Member

Suggested change
return fmt.Errorf("unable to marshal initial semaphore: %w", err)
return fmt.Errorf("marshaling initial semaphore: %w", err)

client.OpPut(c.keypath, string(payload)),
).
Commit(); err != nil {
return fmt.Errorf("unable to commit initial transaction: %w", err)
Member

Suggested change
return fmt.Errorf("unable to commit initial transaction: %w", err)
return fmt.Errorf("committing initial transaction: %w", err)

Comment on lines +21 to 24
pb "go.etcd.io/etcd/api/v3/mvccpb"
client "go.etcd.io/etcd/client/v3"

"golang.org/x/net/context"
Member

Suggested change
pb "go.etcd.io/etcd/api/v3/mvccpb"
client "go.etcd.io/etcd/client/v3"

"golang.org/x/net/context"
pb "go.etcd.io/etcd/api/v3/mvccpb"
client "go.etcd.io/etcd/client/v3"
"golang.org/x/net/context"

getResp: &client.GetResponse{
Count: 1,
Kvs: []*pb.KeyValue{
&pb.KeyValue{
Member

Suggested change
&pb.KeyValue{
{

getResp: &client.GetResponse{
Count: 1,
Kvs: []*pb.KeyValue{
&pb.KeyValue{
Member

Suggested change
&pb.KeyValue{
{

})
t.Run("Error", func(t *testing.T) {
_, err := NewEtcdLockClient(&testEtcdClient{
txn: testTxn{err: errors.New("connection refused")},
Member

If you set the error here, you should be able to make use of errors.Is() perhaps?

Comment on lines +94 to +95
if err.Error() != "unable to init etcd lock client: unable to commit initial transaction: connection refused" {
t.Fatalf("error should be 'unable to init etcd lock client: unable to commit initial transaction: connection refused', got: %v", err)
Member

The error string could at least be put in a variable to avoid duplication, if we stay with string comparison for errors.

testGroup,
)
if err != nil {
t.Fatalf("error should be nil, got: %v", err)
Member

Similar to errors, it's good to explain what went wrong when you fail the test. For example:

Suggested change
t.Fatalf("error should be nil, got: %v", err)
t.Fatalf("Failed creating new etcd lock client: %v", err)

@pothos
Member

pothos commented Apr 22, 2024

I think we should leverage airlock as etcd3 client through the fleetlock protocol. Then we don't need to implement this again in locksmith.
