[Fourth solution] Fix the potential data loss for clusters with only one member (raft layer change) #14411

Closed

Conversation

@ahrtr (Member) commented Sep 1, 2022

The fourth solution to fix #14370

@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol_ptabor branch 2 times, most recently from 83bada5 to cf9306b on September 1, 2022 10:41
@codecov-commenter commented Sep 1, 2022

Codecov Report

Merging #14411 (e60cb56) into main (5707147) will decrease coverage by 0.29%.
The diff coverage is 92.30%.

@@            Coverage Diff             @@
##             main   #14411      +/-   ##
==========================================
- Coverage   75.56%   75.26%   -0.30%     
==========================================
  Files         457      458       +1     
  Lines       37183    37202      +19     
==========================================
- Hits        28098    28001      -97     
- Misses       7335     7432      +97     
- Partials     1750     1769      +19     
Flag Coverage Δ
all 75.26% <92.30%> (-0.30%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...st/interaction_env_handler_leader_report_status.go 77.77% <77.77%> (ø)
raft/node.go 87.02% <100.00%> (-0.53%) ⬇️
raft/raft.go 90.20% <100.00%> (+0.02%) ⬆️
raft/rafttest/interaction_env_handler.go 83.33% <100.00%> (+0.52%) ⬆️
server/etcdserver/api/rafthttp/peer_status.go 87.87% <0.00%> (-12.13%) ⬇️
server/auth/simple_token.go 80.00% <0.00%> (-8.47%) ⬇️
server/etcdserver/api/rafthttp/peer.go 87.01% <0.00%> (-8.45%) ⬇️
client/pkg/v3/fileutil/purge.go 66.03% <0.00%> (-7.55%) ⬇️
client/v3/experimental/recipes/queue.go 58.62% <0.00%> (-6.90%) ⬇️
client/v3/namespace/watch.go 87.87% <0.00%> (-6.07%) ⬇️
... and 21 more


@tbg tbg self-requested a review September 1, 2022 13:42
@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol_ptabor branch from cf9306b to d1957fe on September 2, 2022 00:57
@ptabor (Contributor) left a comment

Thank you. This change looks good to me.
The fixed tests are of huge value, regardless of whether we go with Step or the in-line approach.

raft/node.go (review comment resolved)
@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol_ptabor branch from 0773176 to a799416 on September 2, 2022 08:59
For a cluster with only one member, raft always sends identical
unstable entries and committed entries to etcdserver, and etcd
responds to the client once it finishes (actually only partially
finishes) the apply workflow.

When the client receives the response, it doesn't mean etcd has already
successfully saved the data to boltDB and the WAL, because:
   1. etcd commits the boltDB transaction periodically instead of on each request;
   2. etcd saves WAL entries in parallel with applying the committed entries.
Accordingly, data loss can occur if etcd crashes immediately after
responding to the client but before boltDB and the WAL have persisted
the data to disk.
Note that this issue can only happen in clusters with only one member.

For clusters with multiple members, this isn't an issue, because etcd will
not commit & apply the data before it has been replicated to a majority of members.
When the client receives the response, the data must have been applied,
which in turn means it must have been committed.
Note: for clusters with multiple members, raft will never send identical
unstable entries and committed entries to etcdserver.

Signed-off-by: Benjamin Wang <[email protected]>
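
To make the ordering concrete, here is a minimal sketch of a Ready-handling loop, assuming the go.etcd.io/etcd/raft/v3 API; saveToWAL and applyEntry are hypothetical helpers standing in for etcd's WAL and apply layers, not etcd's actual code. For a one-member cluster, rd.Entries and rd.CommittedEntries can contain the same entries, so persisting has to complete before applying (and before answering the client) for the durability guarantee to hold:

```go
// Minimal sketch, not etcd's real implementation; saveToWAL and applyEntry
// are hypothetical helpers.
package sketch

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

func readyLoop(n raft.Node,
	saveToWAL func(raftpb.HardState, []raftpb.Entry), // must fsync before returning
	applyEntry func(raftpb.Entry)) {
	for rd := range n.Ready() {
		// For a one-member cluster, rd.Entries and rd.CommittedEntries can be
		// identical, so the entries must be durable before they are applied.
		saveToWAL(rd.HardState, rd.Entries)

		// Only now is it safe to apply the committed entries and respond to
		// the client. Running this step in parallel with the WAL write is the
		// optimization that opens the data-loss window described above.
		for _, ent := range rd.CommittedEntries {
			applyEntry(ent)
		}

		n.Advance()
	}
}
```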
1. Added one more command, "report-status", so that the leader can acknowledge
   that the entries have already been persisted.
2. Regenerated some test data.

Signed-off-by: Benjamin Wang <[email protected]>
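
For context, the rafttest interaction tests are datadriven text files; a hypothetical step using the new command might look like the following (the node id and the elided output are illustrative only, not copied from the actual testdata):

```
# Hypothetical datadriven step: ask node 1 (the leader) to report its status
# once the entries have been persisted.
report-status 1
----
...
```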
@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol_ptabor branch from a799416 to e60cb56 on September 2, 2022 18:03
@ahrtr (Member, Author) commented Sep 5, 2022

leave it to @tbg to fix the raft side in #14413

Labels: None yet
Successfully merging this pull request may close these issues:
Durability API guarantee broken in single node cluster
3 participants