tests: Migrate member add tests to common framework #14281
Conversation
tests/e2e/ctl_v3_member_test.go
Outdated
func TestCtlV3MemberAddPeerTLS(t *testing.T) { | ||
testCtl(t, memberAddTest, withCfg(*e2e.NewConfigPeerTLS())) | ||
} | ||
func TestCtlV3MemberAdd(t *testing.T) { testCtl(t, memberAddTest) } | ||
func TestCtlV3MemberAddForLearner(t *testing.T) { testCtl(t, memberAddForLearnerTest) } |
Isn't this also covered?
The underlying implementations of TestCtlV3MemberAdd and TestCtlV3MemberAddForLearner use the member add and member add --learner commands.
I thought it would be valuable to keep them in the e2e testing, to cover the real command usage scenarios.
Do we need to keep them?
Let me know if I understood you. The new framework runs e2e tests, but runs etcdctl only with --format=json, so we never test the real command usage scenario etcdctl member add, only the flavor that uses JSON?
Oh, I've got the point.
I thought our common framework doesn't run e2e test. Now I know it does.
I'm going to remove these redundant functions.
I thought our common framework doesn't run e2e test. Now I know it does.
The common framework is an interface for testing that covers both e2e and integration. Tests in tests/common that use the common framework will automatically run as both e2e and integration tests. Write a test once and double the benefit.
Instead of having two separate frameworks and redundant tests, we want just one framework with shared tests. For now we are migrating the tests/e2e directory; however, if you want, you can look into the tests/integration directory and check whether some test scenarios are already covered and can be removed.
It's a good point that we don't test the exact output of etcdctl member add; however, that's much lower on our priority list than testing the e2e process.
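To illustrate the "write once, run as both e2e and integration" idea, here is a minimal Go sketch. It is not the real etcd test framework: the Cluster interface, integrationCluster, e2eCluster and runEverywhere are hypothetical names invented for this example, and both implementations are fakes that only record what they were asked to do.

```go
package main

import "fmt"

// Cluster is a hypothetical stand-in for a shared test interface.
// It is NOT the real etcd framework API.
type Cluster interface {
	Kind() string
	MemberAdd(name string, peerURLs []string) error
}

// integrationCluster pretends to run members in-process.
type integrationCluster struct{ members []string }

func (c *integrationCluster) Kind() string { return "integration" }

func (c *integrationCluster) MemberAdd(name string, _ []string) error {
	c.members = append(c.members, name)
	return nil
}

// e2eCluster pretends to drive a CLI; here it only records the command.
type e2eCluster struct{ commands []string }

func (c *e2eCluster) Kind() string { return "e2e" }

func (c *e2eCluster) MemberAdd(name string, _ []string) error {
	c.commands = append(c.commands, "member add "+name)
	return nil
}

// memberAddScenario is written once, against the interface only.
func memberAddScenario(c Cluster) error {
	return c.MemberAdd("newmember", []string{"http://240.0.0.0:65535"})
}

// runEverywhere executes the same scenario against every implementation.
func runEverywhere(scenario func(Cluster) error, clusters ...Cluster) map[string]error {
	results := make(map[string]error, len(clusters))
	for _, c := range clusters {
		results[c.Kind()] = scenario(c)
	}
	return results
}

func main() {
	results := runEverywhere(memberAddScenario, &integrationCluster{}, &e2eCluster{})
	for _, kind := range []string{"integration", "e2e"} {
		fmt.Printf("%s: err=%v\n", kind, results[kind])
	}
}
```

The scenario function never knows which kind of cluster it is talking to, which is what makes a separate e2e-only copy of the same test redundant.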
Codecov Report
@@ Coverage Diff @@
## main #14281 +/- ##
==========================================
- Coverage 75.49% 75.22% -0.28%
==========================================
Files 457 457
Lines 37084 37084
==========================================
- Hits 27996 27895 -101
- Misses 7341 7427 +86
- Partials 1747 1762 +15
Found error:
Let me know if you need help with debugging. |
This clue shows that a member of a cluster starts serving clients before it's connected to all peers.
To add or remove a member from a cluster, a client needs to wait a little bit. I've reproduced it on my computer. To correct the test cases, I'm going to keep trying |
Hmm, maybe we should use |
One potential problem is adding members to a 1-node cluster. In that situation we grow from a 1 -> 2 member cluster. This means that quorum is now 2 members; however, in tests we don't actually add a working member (just a blackhole URL). This means that the 1-node cluster will become unavailable. On the other hand, a 3-node cluster grows to 4 members, so the quorum becomes 3, which the 3 working members still satisfy. Please try removing test cases that use a cluster of size 1. |
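The quorum arithmetic behind this concern can be checked with a short Go sketch. It assumes the usual majority rule (quorum = n/2 + 1 with integer division); quorum and availableAfterAdd are illustrative helper names, not etcd functions. Under this rule a 4-member cluster needs 3 votes, which 3 working members still provide, while a 2-member cluster needs 2 votes and a single working member cannot provide them.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int { return n/2 + 1 }

// availableAfterAdd reports whether a cluster with `healthy` live voting
// members still has quorum after one unreachable voting member is added.
func availableAfterAdd(healthy int) bool {
	return healthy >= quorum(healthy+1)
}

func main() {
	for _, n := range []int{1, 3} {
		fmt.Printf("size %d -> %d: quorum %d, available=%v\n",
			n, n+1, quorum(n+1), availableAfterAdd(n))
	}
	// size 1 -> 2: quorum 2, available=false
	// size 3 -> 4: quorum 3, available=true
}
```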
Wow, this is an amazing function; it looks like what I need. I'll give it a try. |
sg. |
I think cluster create should already call it, so I would look into the second suggestion of skipping test cases where the cluster size is equal to 1. |
As I went through the code base, I found that |
Skipping test cases where the cluster size is equal to 1 is unlikely to make the tests work. The test cases failing at the moment are the ones with cluster size > 1. When cluster size=1, the single node accepts |
cf7ca3a
to
bcb1923
Compare
I would recommend rebasing, as there was a change that greatly reduces test flakiness: #14283 |
bcb1923
to
76c57f8
Compare
Maybe we need to implement a
type Cluster interface {
Members() []Member
// HERE
WaitLeader() int
Client() Client
Close() error
}
type Member interface {
Client() Client
Start() error
Stop()
}
type Client interface {
Put(key, value string, opts config.PutOptions) error
Get(key string, opts config.GetOptions) (*clientv3.GetResponse, error)
Delete(key string, opts config.DeleteOptions) (*clientv3.DeleteResponse, error)
Compact(rev int64, opts config.CompactOption) (*clientv3.CompactResponse, error)
Status() ([]*clientv3.StatusResponse, error)
HashKV(rev int64) ([]*clientv3.HashKVResponse, error)
Health() error
Defragment(opts config.DefragOption) error
AlarmList() (*clientv3.AlarmResponse, error)
AlarmDisarm(alarmMember *clientv3.AlarmMember) (*clientv3.AlarmResponse, error) |
47c487a
to
d2ed176
Compare
Do you need to wait for #14304 to be merged and then update this PR? |
d2ed176
to
d05c894
Compare
tests/common/member_test.go
Outdated
var addResp *clientv3.MemberAddResponse | ||
var err error | ||
if tc.learner { | ||
addResp, err = cc.MemberAddAsLearner("newmember", []string{blackHolePeerUrl}) |
Why not use a real peer URL, something like http://localhost:xxx, instead of http://240.0.0.0:65535?
http://240.0.0.0:65535 is used to avoid the "Peer URLs already exists" error. If http://localhost:xxx happens to be the peer URL of a started etcd instance, the member add operation would fail with that error. I considered using http://localhost:xxx with an unused random port such as 2777 or 15981, but I'm slightly worried about port conflicts in the future.
The
Therefore, in e2e cluster, etcd/server/etcdserver/server.go Lines 1350 to 1358 in d05c894
A PR #14360 was opened to discuss this. |
Please resolve the conflict. |
d05c894
to
8b4b70a
Compare
tests/common/member_test.go
Outdated
} | ||
} | ||
|
||
func checkAddResp(t *testing.T, addResp *clientv3.MemberAddResponse, err error) { |
This checks the error and validates the response. Could we maybe split this:
- rename checkAddResp to validateMemberAdd and remove the err argument.
- for error checking we can use require.NoError(err).
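The suggested split (error handling in the caller, response validation in its own helper) could look roughly like the Go sketch below. The member and memberAddResponse types are hypothetical stand-ins for the real client response type, and the specific checks are assumptions made for illustration; the helper returns an error instead of taking *testing.T so the sketch runs outside go test.

```go
package main

import (
	"errors"
	"fmt"
)

// member and memberAddResponse are hypothetical stand-ins for the client
// library's response types, reduced to what this sketch needs.
type member struct {
	ID       uint64
	PeerURLs []string
}

type memberAddResponse struct {
	Member *member
}

// validateMemberAdd only inspects the response. The caller is expected to
// have handled the error from the member add call before getting here.
// The checks below are illustrative assumptions, not the PR's actual checks.
func validateMemberAdd(resp *memberAddResponse) error {
	if resp == nil || resp.Member == nil {
		return errors.New("member add response has no member")
	}
	if len(resp.Member.PeerURLs) == 0 {
		return errors.New("added member has no peer URLs")
	}
	return nil
}

func main() {
	resp := &memberAddResponse{Member: &member{ID: 1, PeerURLs: []string{"http://240.0.0.0:65535"}}}
	fmt.Println("valid response:", validateMemberAdd(resp))
	fmt.Println("empty response:", validateMemberAdd(&memberAddResponse{}))
}
```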
Thanks for the suggestions! I'd love to do so, as they make the code clearer.
8b4b70a
to
34d9b03
Compare
tests/common/member_test.go
Outdated
cc := clus.Client() | ||
|
||
testutils.ExecuteUntil(ctx, t, func() { | ||
clus.WaitLeader(t) |
Do we need WaitLeader here? I haven't seen it used in other tests, as I would expect NewCluster to return the cluster after a leader is established.
No. It turns out we don't need WaitLeader here. I'll remove it.
The first time I saw the etcdserver: unhealthy cluster error after calling MemberAdd, I thought it was because the etcd members had not all agreed on the same leader. Therefore, I added a WaitLeader function to the Cluster interface. But the etcdserver: unhealthy cluster error did not disappear. Finally I found it was caused by --strict-reconfig-check.
tests/common/member_test.go
Outdated
if nc.config.ClusterSize == 1 { | ||
checkAddResp(t, addResp, err) |
I would expect the reverse result: adding an invalid member (localhost:123 will never respond) should fail in a single-node cluster, as you cannot establish quorum. Of 2 members only 1 is alive, but quorum is 2.
Actually, adding an invalid member succeeds in the original e2e test. I also tried this in my terminal, and it succeeded too. Is this a long-standing bug?🤣
tests/common/member_test.go
Outdated
} | ||
} | ||
|
||
func TestMemberAdd_SleetHealthInterval(t *testing.T) { |
func TestMemberAdd_SleetHealthInterval(t *testing.T) { | |
func TestMemberAdd_SleepHealthInterval(t *testing.T) { |
tests/common/member_test.go
Outdated
} | ||
} | ||
|
||
func TestMemberAdd_SleetHealthInterval(t *testing.T) { |
I don't see a similar test removed in this PR. Where has this test come from?
This is a scenario not covered before:
- In the first 5 seconds (HealthInterval) after the creation of a cluster (with --strict-reconfig-check enabled), any call to MemberAdd and MemberRemove should return an etcdserver: unhealthy cluster error.
- Once the first 5 seconds (HealthInterval) have passed, calls to these functions should succeed.
This behaviour is controlled by the isConnectedFullySince function:
etcd/server/etcdserver/server.go
Lines 1350 to 1358 in d05c894
if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberId(), s.cluster.VotingMembers()) {
	lg.Warn(
		"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum",
		zap.String("local-member-id", s.MemberId().String()),
		zap.String("requested-member-add", fmt.Sprintf("%+v", memb)),
		zap.Error(errors.ErrUnhealthy),
	)
	return errors.ErrUnhealthy
}
OK, please rename/rewrite the test so the reason for the behavior is known. For example, TestMemberAdd_BeforeConnectedToAllPeers or TestMemberAdd_WaitForQuorum.
tests/common/member_test.go
Outdated
clus.WaitLeader(t) | ||
var addResp *clientv3.MemberAddResponse | ||
var err error | ||
time.Sleep(etcdserver.HealthInterval) |
Why does adding sleep here make a difference?
Overall note: instead of having 3 different tests, we could merge them into one table test. For me the scenario in all those tests is almost the same:
We could just add |
I was worried about making a single test function too large and complex. I'll try and see if I can handle this. |
ae04eab
to
236f836
Compare
Signed-off-by: Clark <[email protected]>
236f836
to
fcc076f
Compare
I followed your suggestion and it works great! I've force pushed. Please take a look. |
Context #13637