upgrade to etcd/client/v3 #12

tormath1 · 2021-08-13T15:31:20Z

In this PR, we migrate from etcd/client/v2 to etcd/client/v3.

Some high-level changes:

rework the Init method a bit: we first check whether the key exists and, if it does not, we create it
use KV instead of KeyAPI from V2
did not use client/v3/concurrency, in order not to break the current locksmithctl semaphore logic
rework the tests a bit, as a first step toward decent coverage

Testing done

update the unit tests

run kola tests with locksmith:

coreos.locksmith.reboot
coreos.locksmith.tls
cl.locksmith.cluster

How to use

easy way:

make
scp -P 2222 ./bin/locksmithctl [email protected]:/home/core

or using the SDK:

diff --git a/app-admin/locksmith/locksmith-9999.ebuild b/app-admin/locksmith/locksmith-9999.ebuild
index 1fab0c6e4..2fca232f3 100644
--- a/app-admin/locksmith/locksmith-9999.ebuild
+++ b/app-admin/locksmith/locksmith-9999.ebuild
@@ -11,7 +11,7 @@ inherit cros-workon systemd coreos-go
 if [[ "${PV}" == 9999 ]]; then
        KEYWORDS="~amd64 ~arm64"
 else
-       CROS_WORKON_COMMIT="085ff774311dba979a53d049f6a776e156224437" # flatcar-master
+       CROS_WORKON_COMMIT="cc4db15e48c1afc979ac1c73c3ac680cb2369c43" # tormath1/upgrade-etcd
        KEYWORDS="amd64 arm64"
 fi

then

emerge-amd64-usr -av app-admin/locksmith && ./build_image

SSH into the VM, then play a bit with ./locksmithctl and check that it correctly uses the etcd v3 API (ETCDCTL_API=3):

$ sudo systemctl start etcd-member.service
$ ./locksmithctl status
Available: 1
Max: 1
$ ./locksmithctl set-max 123
Old-Max: 1
Max: 123
$ ETCDCTL_API=3 etcdctl get "coreos.com/updateengine/rebootlock/semaphore"
coreos.com/updateengine/rebootlock/semaphore
{"semaphore":123,"max":123,"holders":[]}
$ ./locksmithctl lock
$ ETCDCTL_API=3 etcdctl get "coreos.com/updateengine/rebootlock/semaphore"
coreos.com/updateengine/rebootlock/semaphore
{"semaphore":122,"max":123,"holders":["0f80dd16217b482eb89ffea2e1412115"]}

Note for reviewers

This one, fde90bd, was a hard one to solve and I think it results from an implementation mistake: no matter which LOCKSMITHD_ENDPOINT we set as an environment variable, it is still added to the default ones. With V2 this did not seem to be an issue, but with V3, in the test, we set up a TLS etcd node on ::2379. Since the connection to the default http://localhost:2379 is considered a success, this endpoint is used as the correct one, but each call then fails because we send HTTP to an HTTPS service.
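To illustrate the behaviour described in this note, here is a small sketch of "override instead of append" endpoint resolution. It is not necessarily what fde90bd does: `resolveEndpoints` is a hypothetical helper, and treating the variable as a comma-separated list is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultEndpoint is the default mentioned in the note above.
const defaultEndpoint = "http://localhost:2379"

// resolveEndpoints is a hypothetical helper: endpoints given by the user
// replace the default instead of being appended to it. The value is assumed
// to be a comma-separated list.
func resolveEndpoints(env string) []string {
	var endpoints []string
	for _, e := range strings.Split(env, ",") {
		if e = strings.TrimSpace(e); e != "" {
			endpoints = append(endpoints, e)
		}
	}
	if len(endpoints) == 0 {
		// Nothing was configured, so fall back to the default only.
		return []string{defaultEndpoint}
	}
	return endpoints
}

func main() {
	fmt.Println(resolveEndpoints(""))                       // prints [http://localhost:2379]
	fmt.Println(resolveEndpoints("https://localhost:2379")) // prints [https://localhost:2379]
}
```

With this shape, a TLS-only test node never competes with a plain-HTTP default for the same port.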

lock/etcd.go

krnowak

Looks like we will need to use Txn in one more place.

I'm on the fence with what you did with the tests. I kinda like the former table-based testing. :) Any reason to move away from that?

krnowak · 2021-08-18T11:33:34Z

lock/etcd.go

-	setopts := &client.SetOptions{
-		PrevIndex: sem.Index,
-	}
-
-	_, err = c.keyapi.Set(context.Background(), c.keypath, string(b), setopts)
+	_, err = c.keyapi.Put(context.Background(), c.keypath, string(b))


I think this needs to be turned into a transaction too, because the old code had a SetOption to apply the changes if the previous index is a value known to us. I think that with v3 it could be something like:

	response, err = c.keyapi.Txn(context.Background()).If(
		client.Compare(client.Version(c.keypath), "=", sem.Index),
	).Then(
		client.OpPut(c.keypath, string(payload)),
	).Commit()
	if err != nil {
		return err
	}

	// If it's true, it means that the "then" branch was taken, false - "else" branch was taken.
	if !response.Succeeded {
		return fmt.Errorf("failed to set the semaphore - it got updated in the meantime")
	}

	return nil

Also, maybe the Index field in the Semaphore struct should be renamed to Version?

This change will probably allow us to get rid of the Set method in the KeyAPI interface.

I agree - using Txn instead of Put will provide more control over the insertion. The s/Index/Version/ makes sense too.

It seems it would be nice to have some congestion tests around that to make sure things behave as expected in a distributed environment.

@invidian I don't know what congestion tests are - do you have an example?

I meant, if more than one client initializes etc at the same time. Maybe https://github.com/golang/mock could be useful here to simulate the races?

I do agree that we should cover this section - but I also think that etcd provides the right tool, concurrency.Mutex, to ensure that we don't have this kind of race condition.

Package concurrency implements concurrency operations on top of etcd such as distributed locks, barriers, and elections.

When I started the upgrade to v3, I knew that the changes would be substantial, so I tried to limit the impact of the migration. That's why I did not use concurrency.Mutex, as mentioned in the PR description:

did not use the client/v3/concurrency. in order to not break the current locksmithctl semaphore logic

But now, I feel we are actually redeveloping concurrency.Mutex through Txn and race tests - so it might actually be the right choice to go ahead with the Mutex. What do you think? :)

Up to you. :) On one hand it would probably make the code easier to understand and the tests would be easier to write (maybe). On the other hand, it introduces more back-and-forth communication with etcdserver. Not sure if the latter is a big problem, so using the mutex may be a good idea.

Mm, I just quickly read the documentation: it seems we can't really use concurrency.Mutex because it does not allow custom semaphores. It relies on the key itself to lock/unlock, so we need to think about how we could allow more than one instance to reboot at a time; in other words, how we could implement the set-max command using concurrency.Mutex. (see also: etcd-io/etcd#12490)

tormath1 · 2021-08-18T12:38:17Z

I'm on the fence with what you did with the tests. I kinda like the former table-based testing. :) Any reason to move away from that?

Table tests were OK until now because we were only using an {err, response} struct, but we now have {err, getResponse, setResponse, Txn}, which can be quite big to define in a table: a1a5db4#diff-509fec5f7f91a11e078c85b72f3645555168d99ae5a8a036f2d0c807b0eb673fR138-R146 😕

Also, it makes the output of the running tests easier to read, and the tests can eventually be run in parallel.

invidian

Some general suggestions. Overall, it seems it would be nice to have a better test suite for this code, testing implied properties of the code, e.g. what happens on write conflicts etc.

I really appreciate your work @tormath1 and I'm sure it's worth an effort.

BTW, Did we consider what will happen when this patch hits users? Some locksmiths will be using v2 storage and some v3, which means more nodes will be able to update at a time than it's configured, but only when 2 different version updates are happening at the same time I guess (v2 -> v3 and v3 -> another using v3). So I guess it should be fine?

invidian · 2021-08-18T12:17:31Z

go.mod

 	github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf
 	github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f
 	github.com/godbus/dbus v4.1.0+incompatible // indirect
-	github.com/godbus/dbus/v5 v5.0.3
-	github.com/gogo/protobuf v1.3.1 // indirect
+	github.com/godbus/dbus/v5 v5.0.4


Note: dbus lib has also been updated here.

invidian · 2021-08-18T12:26:05Z

lock/etcd.go


 	"golang.org/x/net/context"
 )

+// ErrNotFound is used when a key is not found - which means
+// it returns 0 value


Suggested change

// it returns 0 value

// it returns 0 value.

invidian · 2021-08-18T12:26:18Z

lock/etcd.go

+// So we first try to get the value, if the value is not found we create the key
+// with a default semaphore value


Suggested change

// So we first try to get the value, if the value is not found we create the key

// with a default semaphore value

// So we first try to get the value, if the value is not found we create the key

// with a default semaphore value.

invidian · 2021-08-18T12:26:59Z

lock/etcd.go

+	// there is no proper way to handle non-existing value for a
+	// given key
+	// https://github.com/etcd-io/etcd/issues/6089


Suggested change

// there is no proper way to handle non-existing value for a

// given key

// https://github.com/etcd-io/etcd/issues/6089

// There is no proper way to handle non-existing value for a

// given key.

// See https://github.com/etcd-io/etcd/issues/6089 for more details.

invidian · 2021-08-18T12:29:45Z

lock/etcd.go

-	if err != nil {
-		return err
-	}
+	if _, err := c.Get(); err != nil {


I'd flip the error logic here to save on indentation. if err == nil { return nil }, then if !errors.Is(err, ErrNotFound) { return fmt.Errorf("...") }, I think it should be easier to read.

invidian · 2021-08-18T12:35:41Z

lock/etcd_test.go

+	err   error
+}
+
+func (t *testTxn) If(cs ...client.Cmp) client.Txn {


Maybe if defined like this it won't have to be a pointer in a struct and its zero value can be used, so you don't have to initialize it in all tests?

Suggested change

func (t *testTxn) If(cs ...client.Cmp) client.Txn {

func (t testTxn) If(cs ...client.Cmp) client.Txn {

invidian · 2021-08-18T12:37:41Z

lock/etcd.go

+		return fmt.Errorf("unable to marshal initial semaphore: %w", err)
+	}
+
+	if _, err := c.keyapi.Txn(context.Background()).If(


Given that this is not a top-level context, I think using context.TODO() is more appropriate here.

Suggested change

if _, err := c.keyapi.Txn(context.Background()).If(

if _, err := c.keyapi.Txn(context.TODO()).If(

invidian · 2021-08-18T12:46:37Z

go.mod

-	go.uber.org/zap v1.16.0 // indirect
-	golang.org/x/net v0.0.0-20201031054903-ff519b6c9102
-	google.golang.org/grpc v1.33.2 // indirect
+	github.com/stretchr/testify v1.7.0


I wouldn't add testify as a dependency here. Is it really worth it for a few lines saved?

invidian · 2021-08-18T12:47:29Z

lock/etcd.go

-	setopts := &client.SetOptions{
-		PrevIndex: sem.Index,
-	}
-
-	_, err = c.keyapi.Set(context.Background(), c.keypath, string(b), setopts)
+	_, err = c.keyapi.Put(context.Background(), c.keypath, string(b))


It seems it would be nice to have some congestion tests around that to make sure things behave as expected in distributed environment.

invidian · 2021-08-18T12:48:12Z

lock/etcd_test.go

+			putResp: &client.PutResponse{},
+			txn:     &testTxn{},
+		},
+			"",


Can we make "" (group) a const named like testGroup, so it's easier to figure out what it is?

Using etcd/v3, it seems we don't need to specify the scheme (`http`/`https`). If it's specified, the client will use the default endpoints and stick to them, even if the default is HTTP and etcd is configured for HTTPS. Calls then fail because we send an HTTP request to an HTTPS service. Signed-off-by: Mathieu Tortuyaux <[email protected]>


tormath1 · 2021-08-31T14:01:34Z

rebased onto flatcar-master to solve a conflict
squashed fixup! commits from first review
added a test to mock transaction result
removed the Put to use Txn instead
removed the testify module dependency (still in go.sum, it seems to be used by etcd)

invidian

Hmm, I've tried reviewing this PR a couple of times and I still get stuck on the test suite, as I'm afraid that modifying both the test suite logic and the code may introduce some regressions.

I wonder if we could modify the test suite (and possibly code) in a way that upgrading to new etcd client version won't affect those tests. However, this requires capturing the essence of existing behavior and writing tests for it, which is not always easy.

invidian · 2021-09-02T10:32:44Z

go.mod

-	google.golang.org/grpc => google.golang.org/grpc v1.29.1
-)
+// Most recent etcd version is not compatible with grpc v1.31.x.
+replace google.golang.org/grpc => google.golang.org/grpc v1.29.1


It seems this can be dropped. v3.5.0 uses grpc v1.38.0: https://github.com/etcd-io/etcd/blob/946a5a6f25c3b6b89408ab447852731bde6e6289/go.mod#L35

invidian · 2021-09-02T10:34:11Z

lock/etcd.go

+	}
+
+	if !response.Succeeded {
+		return fmt.Errorf("failed to set the semaphore - it got updated in the meantime")


Maybe just:

Suggested change

return fmt.Errorf("failed to set the semaphore - it got updated in the meantime")

return fmt.Errorf("semaphore got updated in the meantime")

invidian · 2021-09-02T10:34:33Z

lock/etcd.go

 	if err != nil {
-		return err
+		return fmt.Errorf("unable to marshal initial semaphore: %w", err)


Suggested change

return fmt.Errorf("unable to marshal initial semaphore: %w", err)

return fmt.Errorf("marshaling initial semaphore: %w", err)

invidian · 2021-09-02T11:54:19Z

lock/etcd.go

+			client.OpPut(c.keypath, string(payload)),
+		).
+		Commit(); err != nil {
+		return fmt.Errorf("unable to commit initial transaction: %w", err)


Suggested change

return fmt.Errorf("unable to commit initial transaction: %w", err)

return fmt.Errorf("committing initial transaction: %w", err)

invidian · 2021-09-02T12:01:39Z

lock/etcd_test.go

+	pb "go.etcd.io/etcd/api/v3/mvccpb"
+	client "go.etcd.io/etcd/client/v3"
+
 	"golang.org/x/net/context"


Suggested change

pb "go.etcd.io/etcd/api/v3/mvccpb"

client "go.etcd.io/etcd/client/v3"

"golang.org/x/net/context"

pb "go.etcd.io/etcd/api/v3/mvccpb"

client "go.etcd.io/etcd/client/v3"

"golang.org/x/net/context"

invidian · 2021-09-06T06:41:10Z

lock/etcd_test.go

+			getResp: &client.GetResponse{
+				Count: 1,
+				Kvs: []*pb.KeyValue{
+					&pb.KeyValue{


Suggested change

&pb.KeyValue{

{

invidian · 2021-09-06T06:41:24Z

lock/etcd_test.go

+			getResp: &client.GetResponse{
+				Count: 1,
+				Kvs: []*pb.KeyValue{
+					&pb.KeyValue{


Suggested change

&pb.KeyValue{

{

invidian · 2021-09-06T15:17:09Z

lock/etcd_test.go

+	})
+	t.Run("Error", func(t *testing.T) {
+		_, err := NewEtcdLockClient(&testEtcdClient{
+			txn:     testTxn{err: errors.New("connection refused")},


If you set the error here, you should be able to make use of errors.Is() perhaps?

invidian · 2021-09-06T15:17:26Z

lock/etcd_test.go

+		if err.Error() != "unable to init etcd lock client: unable to commit initial transaction: connection refused" {
+			t.Fatalf("error should be 'unable to init etcd lock client: unable to commit initial transaction: connection refused', got: %v", err)


The error string could at least be put in a variable to avoid duplication, if we stay on string comparison for errors.

invidian · 2021-09-06T15:19:48Z

lock/etcd_test.go

+			testGroup,
+		)
+		if err != nil {
+			t.Fatalf("error should be nil, got: %v", err)


Similar to errors, it's good to explain what went wrong when you fail the test. For example:

Suggested change

t.Fatalf("error should be nil, got: %v", err)

t.Fatalf("Failed creating new etcd lock client: %v", err)

pothos · 2024-04-22T07:04:27Z

I think we should leverage airlock as etcd3 client through the fleetlock protocol. Then we don't need to implement this again in locksmith.

tormath1 self-assigned this Aug 13, 2021

This was referenced Aug 13, 2021

locksmithctl/locksmithcl: fix endpoints resilience #11

Merged

app-admin/locksmith: bump commit ID flatcar-archive/coreos-overlay#1161

Merged

tormath1 force-pushed the tormath1/upgrade-etcd branch from 47c3147 to a1a5db4 Compare August 16, 2021 17:06

tormath1 changed the title ~~[wip] upgrade to etcd/client/v3~~ upgrade to etcd/client/v3 Aug 16, 2021

tormath1 marked this pull request as ready for review August 16, 2021 17:12

tormath1 requested review from a team and invidian August 16, 2021 17:12

krnowak reviewed Aug 16, 2021

krnowak requested changes Aug 18, 2021

invidian reviewed Aug 18, 2021

Mathieu Tortuyaux added 7 commits August 31, 2021 15:47

vendor: go mod vendor

eeff4cd

Signed-off-by: Mathieu Tortuyaux <[email protected]>

sum: go mod tidy

84b96f5

Signed-off-by: Mathieu Tortuyaux <[email protected]>

mod: upgrade to etcd/client/v3

0bec1d8

Signed-off-by: Mathieu Tortuyaux <[email protected]>

lock/etcd: port to etcd/v3

7df0fdc

we now use the `KV` client to Get and Set the semaphore Signed-off-by: Mathieu Tortuyaux <[email protected]>

locksmithcl: configure client

cbcacee

the main difference is that we don't use Transport but TLSConfig directly. We also return a `KV` client. Signed-off-by: Mathieu Tortuyaux <[email protected]>

lock/etcd_test: rewrite tests using etcd/v3

a229739

Signed-off-by: Mathieu Tortuyaux <[email protected]>

tormath1 force-pushed the tormath1/upgrade-etcd branch from 10d3cf7 to a229739 Compare August 31, 2021 13:58

invidian reviewed Sep 6, 2021

bmbeverst mentioned this pull request Mar 29, 2024

locksmith fails when -etcd-cafile is specified flatcar/Flatcar#948

Open
