tests linearizability: reproduce and prevent 14571 #14819

Closed

chaochn47
Member

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from b532575 to 58a78e3 Compare November 22, 2022 02:32
@chaochn47 chaochn47 closed this Nov 22, 2022
@chaochn47 chaochn47 reopened this Nov 22, 2022
@chaochn47 chaochn47 marked this pull request as draft November 22, 2022 03:17
Contributor

@ptabor ptabor left a comment

(I'm sorry - prematurely clicked approve)

@codecov-commenter

codecov-commenter commented Dec 19, 2022

Codecov Report

Merging #14819 (6200b22) into main (6200b22) will not change coverage.
The diff coverage is n/a.

❗ Current head 6200b22 differs from pull request most recent head 1a04dcb. Consider uploading reports for the commit 1a04dcb to get more accurate results

@@           Coverage Diff           @@
##             main   #14819   +/-   ##
=======================================
  Coverage   74.87%   74.87%           
=======================================
  Files         415      415           
  Lines       34288    34288           
=======================================
  Hits        25672    25672           
  Misses       6994     6994           
  Partials     1622     1622           
Flag Coverage Δ
all 74.87% <0.00%> (ø)

Flags with carried forward coverage won't be shown.

tests/linearizability/failpoints.go: 5 review threads (outdated, resolved)
@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from c15f110 to a4b6d80 Compare December 20, 2022 01:29
@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch 3 times, most recently from 52baa47 to b279ce6 Compare January 6, 2023 10:26
@chaochn47 chaochn47 marked this pull request as ready for review January 6, 2023 10:26
@chaochn47 chaochn47 requested a review from ptabor January 6, 2023 10:27
@chaochn47
Member Author

chaochn47 commented Jan 6, 2023

Tested 60 times with the v3.5.5 binary plus the experimental-snapshot-catchup-entry commit. The result is linearizability_test.go:277: Model is not linearizable, while on mainline the model is linearizable.

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from b279ce6 to a97756a Compare January 6, 2023 10:49
@ahrtr
Member

ahrtr commented Jan 6, 2023

Tested 60 times with the v3.5.5 binary plus the experimental-snapshot-catchup-entry commit. The result is linearizability_test.go:277: Model is not linearizable, while on mainline the model is linearizable.

Thanks @chaochn47, please provide detailed steps to reproduce the issue. I may take a look sometime over the weekend or next week.

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch 2 times, most recently from 0e48598 to 1a04dcb Compare January 6, 2023 11:19
@chaochn47
Member Author

chaochn47 commented Jan 6, 2023

Hi @ahrtr, it is not a new issue. The new linearizability test case in this PR is meant to reproduce #14571 and prevent it from recurring. The original issue #14571 has been fixed in v3.5.6.

@chaochn47 chaochn47 requested review from serathius and removed request for ptabor January 6, 2023 11:57
@chaochn47
Member Author

The PR is ready for review.

@ptabor @serathius Could you please take a second look, thanks!

@serathius
Member

Please resolve conflicts.

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch 4 times, most recently from 381854f to aa9c390 Compare January 18, 2023 22:26
@chaochn47
Member Author

chaochn47 commented Jan 18, 2023

TestLinearizability_ClusterOfSize3

linearizability_test.go:428: Linearization timed out

https://github.com/etcd-io/etcd/actions/runs/3953458632/jobs/6769755835

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from aa9c390 to 797e013 Compare January 19, 2023 00:03
@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from 797e013 to 635d4b3 Compare January 19, 2023 00:19
@chaochn47
Member Author

Conflicts have been resolved. Would you mind taking a second look? @serathius Thanks!!

@@ -33,7 +36,8 @@ const (
 )

 var (
-	KillFailpoint Failpoint = killFailpoint{}
+	KillFailpoint Failpoint = killFailpoint{target: AnyMember}
+	EnableAuthKillFailpoint Failpoint = killFailpoint{enableAuth: true, target: Follower}
Member

Do you need to target a follower here? I understand that #14571 happens when a follower is killed; however, we don't need to hardcode it.

Member Author

Yeah, targeting a follower is not necessary. But if the kill randomly hits the leader, clients need to wait for leader election (1-2 seconds), which increases test duration. (It was an optimization from when the test was repeated 60 times.)

@@ -33,7 +36,8 @@ const (
 )

 var (
-	KillFailpoint Failpoint = killFailpoint{}
+	KillFailpoint Failpoint = killFailpoint{target: AnyMember}
+	EnableAuthKillFailpoint Failpoint = killFailpoint{enableAuth: true, target: Follower}
Member

I don't understand why the failpoint needs to be aware of auth. It's true that at this moment it creates the client, but maybe we can move client creation to a different place. One option is dependency injection: provide the failpoint with a client that is already authorized. An alternative would be to add a method to e2e.EtcdProcessCluster that provides the client.

Member Author

Good suggestion! I can explore each option and see what's the best fit.

return nil

if f.enableAuth {
require.NoError(t, addTestUserAuth(ctx, endpoints))
Member

I don't like that auth setup is part of failpoint injection. Those are totally separate things. Please move auth setup to cluster setup.

Member Author

It's actually a failure injection specific to issue #14571, where the test user is not applied on the restarted member.

In short, auth traffic can be one type of failure injection. Does that make sense?

Member

I think that enabling auth is orthogonal to failure injection.

failpoint Failpoint
config e2e.EtcdProcessClusterConfig
traffic *trafficConfig
clientCount int
Member

@serathius serathius Jan 20, 2023

This client count doesn't seem to be used. Please remove it and use trafficConfig.clientCount

Member Author

Thanks for the catch. Will remove it.

@@ -124,39 +149,40 @@ func TestLinearizability(t *testing.T) {
t.Fatal(err)
}
defer clus.Close()
lg := zaptest.NewLogger(t, zaptest.WrapOptions(zap.AddCaller())).Named(tc.name)
Member

This is nice, but please consider moving it to a separate PR.

Member Author

Will do. It's not a big change.

for i := 0; i < config.clientCount; i++ {
i := i
Member

This should not be needed, as we pass i as an argument to the goroutine function.

Member Author

Yeah, you are right. It must have been left behind after rebasing on main.
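
For readers unfamiliar with the idiom under discussion, the following sketch shows why the `i := i` copy is redundant when the loop index is passed as a goroutine argument. `startClients` is a hypothetical helper written for this illustration, not the PR's traffic code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// startClients launches one goroutine per client and passes the loop index
// as an argument, so each goroutine receives its own copy without `i := i`.
func startClients(clientCount int) []int {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		ids []int
	)
	for i := 0; i < clientCount; i++ {
		wg.Add(1)
		go func(clientID int) {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			ids = append(ids, clientID)
		}(i) // i is evaluated here, when the goroutine is created
	}
	wg.Wait()
	sort.Ints(ids) // goroutines finish in arbitrary order
	return ids
}

func main() {
	fmt.Println(startClients(4)) // [0 1 2 3]
}
```

The shadowing copy only matters when a closure captures the loop variable directly instead of receiving it as a parameter.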

if qps < config.minimalQPS {
t.Errorf("Requiring minimal %f qps for test results to be reliable, got %f qps", config.minimalQPS, qps)
}
return operations
}

func simulatePostFailpointTraffic(ctx context.Context, wg *sync.WaitGroup, endpoints []string, clientId int, ids identity.Provider, h *model.History, mux *sync.Mutex, config trafficConfig, limiter *rate.Limiter, lm identity.LeaseIdStorage) {
Member

I don't understand why you cannot incorporate this into normal traffic.

Member Author

To incorporate this into normal traffic, a client with test user authorization has to be set up upfront. However, client creation fails until the user has been added to the cluster.

After a couple of failed attempts, I adopted this workaround. It reproduces the issue more deterministically on the impacted 3.5 versions.

Member

Please set up authorization in cluster setup.

func (t readWriteSingleKey) PreRun(ctx context.Context, c interfaces.Client, lg *zap.Logger) error {
if t.AuthEnabled() {
lg.Info("set up auth")
return setupAuth(ctx, c)
Member

Authorization setup should be done at cluster setup, not at this point.

Member Author

Makes sense.

@chaochn47
Member Author

chaochn47 commented Feb 15, 2023

Revisiting this task, I think the linearizability test does not have to reproduce the exact scenario in which #14571 happened.

#14571 uncovered that auth recovery from a snapshot failed to update rangePermCache. This is an in-memory map that maintains, for each user, an interval tree used to check whether the requested range from key start to key end is a subset of the permitted intervals.
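
The check described above can be illustrated with a deliberately simplified sketch. It assumes half-open key ranges and replaces the interval tree with a linear scan over a slice; `keyRange`, `permCache`, and `allowed` are hypothetical names, not etcd's types.

```go
package main

import "fmt"

// keyRange is a half-open interval [begin, end) over keys.
type keyRange struct{ begin, end string }

// permCache is a simplified stand-in for rangePermCache: user name -> permitted
// ranges. A linearly scanned slice replaces the interval tree.
type permCache map[string][]keyRange

// allowed reports whether the requested range lies inside one permitted range
// of the given user. A user missing from the cache is denied everything.
func (c permCache) allowed(user string, req keyRange) bool {
	for _, p := range c[user] {
		if p.begin <= req.begin && req.end <= p.end {
			return true
		}
	}
	return false
}

func main() {
	cache := permCache{"test-user": {{begin: "foo", end: "zoo"}}}
	fmt.Println(cache.allowed("test-user", keyRange{"goo", "hoo"}))  // true
	fmt.Println(cache.allowed("test-user", keyRange{"a", "b"}))      // false
	fmt.Println(cache.allowed("not-cached", keyRange{"goo", "hoo"})) // false
}
```

The last call hints at the failure mode: if a member's cache is not refreshed after restoring from a snapshot, its answers can differ from those of members whose cache is current.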

To avoid the back and forth on this PR, the new proposed plan is:

At cluster setup stage:

  1. Set snapshot catch-up entries and snapshot count both as low as 1, to speed up raft log compaction so that the leader always asks a follower to download a snapshot, even after a brief downtime.
  2. Create the root user; the root role has access to all operations.
  3. Create a test user and a test role, and grant the test role RW permissions on the key range from foo to zoo.

Traffic generator

  1. Half of the clients use root user permissions.
  2. Half of the clients use test user permissions; all of their operations act on the key range foo to zoo.
  3. A root user client periodically grants / revokes test role permissions.

Fault injector

  1. kill a random member

The assumption is that a root user client should not see inconsistent values for a key from different etcd servers. In <=3.5.5, granted/revoked permissions are not carried over to the restarted member.

@serathius Let me know if this is aligned with the overall linearizability test design principles. Thanks!
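
As a rough illustration of the plan above, the settings and the client split could be captured as plain data. This is only a sketch: the `plan` struct, its field names, and the `test-user` name are assumptions and do not come from the etcd test framework.

```go
package main

import "fmt"

// plan gathers the settings named in the proposal. All identifiers are
// illustrative; none of them come from the etcd test framework.
type plan struct {
	snapshotCount          uint64 // proposal: as low as 1
	snapshotCatchUpEntries uint64 // proposal: as low as 1
	rangeBegin, rangeEnd   string // key range granted to the test role
	clientCount            int
}

func proposedPlan(clientCount int) plan {
	return plan{
		snapshotCount:          1,
		snapshotCatchUpEntries: 1,
		rangeBegin:             "foo",
		rangeEnd:               "zoo",
		clientCount:            clientCount,
	}
}

// clientUsers alternates users so that half of the clients act as root and
// half as the test user (root gets the extra client when the count is odd).
func (p plan) clientUsers() []string {
	users := make([]string, 0, p.clientCount)
	for i := 0; i < p.clientCount; i++ {
		if i%2 == 0 {
			users = append(users, "root")
		} else {
			users = append(users, "test-user")
		}
	}
	return users
}

func main() {
	p := proposedPlan(4)
	fmt.Println(p.clientUsers()) // [root test-user root test-user]
}
```

Keeping the split in one place would let the traffic generator stay unaware of how users were created, which matches the reviewers' request to do auth setup at cluster setup.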

@stale

stale bot commented May 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 21, 2023
@stale stale bot closed this Jun 18, 2023
6 participants