
Update consitent_index when applying fails #13942

Merged: 1 commit merged into etcd-io:main from move_cindex_on_apply_fail on Apr 21, 2022

Conversation

ahrtr
Member

@ahrtr ahrtr commented Apr 14, 2022


Fix issues/13937

When a client has no permission to perform an operation, the apply may fail. We should still move consistent_index forward in this case; otherwise consistent_index may be smaller than the snapshot index.

If Auth isn't enabled, users will not run into this issue. K8s usually relies on certificates/TLS to communicate with etcd and doesn't enable Auth, so that case should be fine.

Even with Auth enabled, it isn't easy to run into this issue, because the default value of --snapshot-count is 100000 and consistent_index is updated on every successful apply. The issue is bypassed as long as there is at least one successful apply after each snapshot is generated.

If users do run into this issue, build a binary with the following patch on top of 3.5.3, replace the binary, and start the etcd cluster again. Afterwards, put & delete at least one K/V pair. Finally, stop the cluster and roll back the binary. Everything will be fine by then.

$ git diff
diff --git a/server/etcdserver/backend.go b/server/etcdserver/backend.go
index 2beef5763..44e583b66 100644
--- a/server/etcdserver/backend.go
+++ b/server/etcdserver/backend.go
@@ -104,6 +104,9 @@ func recoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snap
        if snapshot.Metadata.Index <= consistentIndex {
                return oldbe, nil
        }
+       if true {
+               return oldbe, nil
+       }
        oldbe.Close()
        return openSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
 }

@ahrtr
Member Author

ahrtr commented Apr 14, 2022

Without this fix, the newly added test case TestV3AuthEmptyUserPut always fails.

logger.go:130: 2022-04-14T07:42:28.034+0800	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestV3AuthEmptyUserPut3114970556/002/etcd371615291", "error": "backend.ConsistentIndex (8) must be >= last snapshot index (16)"}

After applying the fix in this PR, the test case succeeds.

@codecov-commenter

codecov-commenter commented Apr 14, 2022

Codecov Report

Merging #13942 (77ee442) into main (4555fc3) will decrease coverage by 0.37%.
The diff coverage is 100.00%.

❗ Current head 77ee442 differs from pull request most recent head 6eef7ed. Consider uploading reports for the commit 6eef7ed to get more accurate results

@@            Coverage Diff             @@
##             main   #13942      +/-   ##
==========================================
- Coverage   72.52%   72.14%   -0.38%     
==========================================
  Files         469      469              
  Lines       38413    38411       -2     
==========================================
- Hits        27859    27712     -147     
- Misses       8776     8901     +125     
- Partials     1778     1798      +20     
Flag Coverage Δ
all 72.14% <100.00%> (-0.38%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
server/etcdserver/server.go 83.80% <100.00%> (-0.59%) ⬇️
server/proxy/grpcproxy/election.go 8.69% <0.00%> (-73.92%) ⬇️
server/proxy/grpcproxy/lock.go 33.33% <0.00%> (-66.67%) ⬇️
...ver/proxy/grpcproxy/adapter/lock_client_adapter.go 33.33% <0.00%> (-66.67%) ⬇️
...proxy/grpcproxy/adapter/election_client_adapter.go 6.89% <0.00%> (-62.07%) ⬇️
...xy/grpcproxy/adapter/maintenance_client_adapter.go 5.71% <0.00%> (-28.58%) ⬇️
server/proxy/grpcproxy/maintenance.go 68.29% <0.00%> (-14.64%) ⬇️
client/pkg/v3/fileutil/lock_linux.go 72.22% <0.00%> (-8.34%) ⬇️
server/storage/schema/cindex.go 93.33% <0.00%> (-6.67%) ⬇️
client/v3/leasing/util.go 91.66% <0.00%> (-6.67%) ⬇️
... and 30 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4555fc3...6eef7ed.

Review thread on server/etcdserver/server.go (outdated, resolved)
@ahrtr ahrtr force-pushed the move_cindex_on_apply_fail branch from 94ae5ee to d72aebd Compare April 16, 2022 19:35
@ahrtr
Member Author

ahrtr commented Apr 16, 2022

Just rebased this PR.

@ahrtr ahrtr force-pushed the move_cindex_on_apply_fail branch 2 times, most recently from 7b60b76 to b5bda01 Compare April 18, 2022 23:24
Review thread on server/etcdserver/server.go (outdated, resolved)
@serathius serathius mentioned this pull request Apr 19, 2022
@ptabor ptabor changed the title Update conssitent_index when applying fails Update consitent_index when applying fails Apr 19, 2022
@ahrtr ahrtr force-pushed the move_cindex_on_apply_fail branch from b5bda01 to 1ffc0f2 Compare April 20, 2022 01:40
@ahrtr ahrtr force-pushed the move_cindex_on_apply_fail branch from 1ffc0f2 to 449785a Compare April 20, 2022 08:05
Review thread on server/etcdserver/server.go (outdated, resolved)
Contributor

@ptabor ptabor left a comment


Test failure - flake (https://github.com/etcd-io/etcd/runs/6091231156?check_suite_focus=true):

    logger.go:130: 2022-04-20T08:11:00.714Z	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync501892535/002/etcd4044943490", "error": "backend.ConsistentIndex (24) must be >= last snapshot index (25)"}

requires a second look.

@ahrtr ahrtr force-pushed the move_cindex_on_apply_fail branch 2 times, most recently from 91f4fdb to 30ec049 Compare April 20, 2022 11:39
@ahrtr
Member Author

ahrtr commented Apr 20, 2022

Test failure - flake (https://github.com/etcd-io/etcd/runs/6091231156?check_suite_focus=true):

    logger.go:130: 2022-04-20T08:11:00.714Z	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync501892535/002/etcd4044943490", "error": "backend.ConsistentIndex (24) must be >= last snapshot index (25)"}

requires a second look.

The root cause should be in server.go#L919, and it should be a legacy issue. When sending a snapshot to a slow follower, appliedIndex and appliedTerm are used as the snapshot index and term. But if the sender's consistent_index is somehow smaller than the appliedIndex, then we run into this issue.

logger.go:130: 2022-04-20T08:10:41.907Z INFO m0 applying snapshot {"member": "m0", "current-snapshot-index": 0, "current-applied-index": 8, "incoming-leader-snapshot-index": 24, "incoming-leader-snapshot-term": 17}
logger.go:130: 2022-04-20T08:11:00.714Z	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync501892535/002/etcd4044943490", "error": "backend.ConsistentIndex (23) must be >= last snapshot index (24)"}

@ahrtr ahrtr force-pushed the move_cindex_on_apply_fail branch 2 times, most recently from 6bd4edb to f087e20 Compare April 20, 2022 13:27
When clients have no permission to perform an operation, the
apply may fail. We should also move consistent_index forward
in this case; otherwise consistent_index may be smaller than
the snapshot index.
@ahrtr ahrtr force-pushed the move_cindex_on_apply_fail branch from f087e20 to 6eef7ed Compare April 20, 2022 13:49
@ahrtr
Member Author

ahrtr commented Apr 20, 2022

Everything should be good now.

The final if statement is as below. It's really subtle & interesting :)

if !applyV3Performed || (ar != nil && ar.err != nil) 

@serathius
Member

serathius commented Apr 21, 2022

I think the change makes sense; however, it's just a band-aid solution for the consistent index code, which needs an architectural redesign.

@serathius serathius merged commit e02ac59 into etcd-io:main Apr 21, 2022
@ahrtr
Member Author

ahrtr commented Apr 21, 2022

I think the change makes sense; however, it's just a band-aid solution for the consistent index code, which needs an architectural redesign.

Yes, I agree. But a redesign isn't an easy task; I will try to drive it if @ptabor doesn't have the bandwidth, and it will probably be built on top of @ptabor's PR 13878.

@serathius
Member

serathius commented Apr 21, 2022

I think that @ptabor's PR was meant to fix the development issue of picking either LockInsideApply or LockOutsideApply. However, looking at the apply code changed in this PR, it will still be hard to decide whether CI should be flushed or not. For v3.5.3 we fixed one type of unintentional update of the consistent index, but broke a case where the update should still happen when authorization fails. We can continue to detect and fix such issues, but it might never end as new features and error states are added.

@ptabor
Contributor

ptabor commented Apr 21, 2022

  1. "broke a case where the update should still happen when authorization fails" -> I think it was a preexisting issue, just observed now that we dump snapshots more frequently.

  2. I think PR "Encapsulation of applier logic: Move Txn related code out of applier.go" #13878 helps with overall code quality:

    • breaks the huge etcdserver package into 2 subpackages with cleaner dependencies between them
    • reduces the objects exposed directly in EtcdServer
    • makes the wrapping story for appliers 'cleaner', with no cyclic dependency

So I would not consider it an 'alternative', even if it doesn't contribute directly to the new approach.

  3. I don't see a way to avoid passing some form of applyingContext to all the transactions. We will need to do this on every code path leading to a transaction, or make it part of 'applying'-specific helpers on top of objects representing post-raft state. In general I'm in favor of strongly separating pre-raft and post-raft state objects... and then we know that post-raft object mutators are responsible for cindex.

@ahrtr
Member Author

ahrtr commented Apr 21, 2022

In general I agree with @ptabor, but we still need to clarify and break down the details. I will keep thinking about it and will probably send out a draft/doc for review later.

Successfully merging this pull request may close these issues:

etcd panic on startup (auth enabled)

4 participants