-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests linearizability: reproduce and prevent 14571 #14819
tests linearizability: reproduce and prevent 14571 #14819
Conversation
b532575
to
58a78e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm sorry - prematurly clicked approve)
58a78e3
to
c15f110
Compare
Codecov Report
@@ Coverage Diff @@
## main #14819 +/- ##
=======================================
Coverage 74.87% 74.87%
=======================================
Files 415 415
Lines 34288 34288
=======================================
Hits 25672 25672
Misses 6994 6994
Partials 1622 1622
Flags with carried forward coverage won't be shown. Click here to find out more. 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
c15f110
to
a4b6d80
Compare
52baa47
to
b279ce6
Compare
Tested 60 times with v3.5.5 binary with experimental-snapshot-catchup-entry commit. |
b279ce6
to
a97756a
Compare
Thanks @chaochn47, please provide a detail steps to reproduce the issue. I may take a look sometime on weekend or next week. |
0e48598
to
1a04dcb
Compare
The PR is ready for review. @ptabor @serathius Could you please take a second look, thanks! |
Please resolve conficts. |
381854f
to
aa9c390
Compare
https://github.com/etcd-io/etcd/actions/runs/3953458632/jobs/6769755835 |
aa9c390
to
797e013
Compare
Signed-off-by: Chao Chen <[email protected]>
797e013
to
635d4b3
Compare
Conflicts have been resolved. Would you mind take a second look? @serathius Thanks!! |
@@ -33,7 +36,8 @@ const ( | |||
) | |||
|
|||
var ( | |||
KillFailpoint Failpoint = killFailpoint{} | |||
KillFailpoint Failpoint = killFailpoint{target: AnyMember} | |||
EnableAuthKillFailpoint Failpoint = killFailpoint{enableAuth: true, target: Follower} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to target follower here? I understand that 1471 happens when follower is killed, however we don't need to hardcode it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's not necessary targeting follower. But if kill is randomly targeted against leader, clients need to wait on leader election (1 - 2 seconds) which increases test duration (It was an optimization when test repeat time is 60)
@@ -33,7 +36,8 @@ const ( | |||
) | |||
|
|||
var ( | |||
KillFailpoint Failpoint = killFailpoint{} | |||
KillFailpoint Failpoint = killFailpoint{target: AnyMember} | |||
EnableAuthKillFailpoint Failpoint = killFailpoint{enableAuth: true, target: Follower} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why failpoint needs to be aware of auth. It's true that at this moment it creates the client, but maybe we can move client creation to different place. Either do dependency injection and provide failpoint with client that is already authorized. Alternative would be to add method to e2e.EtcdProcessCluster that provides the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion! I can explore each option and see what's the best fit.
return nil | ||
|
||
if f.enableAuth { | ||
require.NoError(t, addTestUserAuth(ctx, endpoints)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't like that auth setup is part of failpoint injection. Those are totally separate things. Please move auth setup to cluster setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually a failure injection only for the issue 14571 that the test user won't be applied on the restarted member.
In short, auth traffic can be one type of failure injections. Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that enabling auth is orthogonal to failure injection.
failpoint Failpoint | ||
config e2e.EtcdProcessClusterConfig | ||
traffic *trafficConfig | ||
clientCount int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This client count doesn't seem to be used. Please remove it and use trafficConfig.clientCount
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch. Will remove it.
@@ -124,39 +149,40 @@ func TestLinearizability(t *testing.T) { | |||
t.Fatal(err) | |||
} | |||
defer clus.Close() | |||
lg := zaptest.NewLogger(t, zaptest.WrapOptions(zap.AddCaller())).Named(tc.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice, but please consider moving it to separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. It's not a big change.
for i := 0; i < config.clientCount; i++ { | ||
i := i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be needed as we pass i
as argument to goroutine funtion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right. It must be left behind due to rebase from main.
if qps < config.minimalQPS { | ||
t.Errorf("Requiring minimal %f qps for test results to be reliable, got %f qps", config.minimalQPS, qps) | ||
} | ||
return operations | ||
} | ||
|
||
func simulatePostFailpointTraffic(ctx context.Context, wg *sync.WaitGroup, endpoints []string, clientId int, ids identity.Provider, h *model.History, mux *sync.Mutex, config trafficConfig, limiter *rate.Limiter, lm identity.LeaseIdStorage) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't understand why you cannot incorporate this into normal traffic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To incorporate this into normal traffic, a client with test user authorization has to be set up upfront. However, before the user is added to the cluster, client creation will fail.
After a couple of failed attempts, I adopted this workaround. It's more deterministic to reliably reproduce on the impacted 3.5 versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please setup authorization in cluster setup.
func (t readWriteSingleKey) PreRun(ctx context.Context, c interfaces.Client, lg *zap.Logger) error { | ||
if t.AuthEnabled() { | ||
lg.Info("set up auth") | ||
return setupAuth(ctx, c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Authorization setup should be done at cluster setup, not at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
Revisit this task again, I think the linearizability test does not have to reproduce the exact same scenario how #14571 happened. #14571 uncovers the issue that auth recover from snapshot failed to update To avoid the back and force on this PR, the new proposed plan will be At cluster setup stage:
Traffic generator
Fault injector
The assumption is root user client should not see a key value is inconsistent from different etcd servers. In <=3.5.5, granted/revoked permissions will not be carried over to the restarted member. @serathius Let me know if this is aligned with the overall linearizability test design principles. Thanks! |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Related to #14045
Signed-off-by: Chao Chen [email protected]
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.