
Durability API guarantee broken in single node cluster #14370

Closed
Tracked by #14438
hasethuraman opened this issue Aug 22, 2022 · 58 comments · Fixed by #14400 or #14413
Labels: priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.), type/bug

Comments

@hasethuraman

I observed the possibility of data loss and I would like the community to comment, or to correct me otherwise.

Before explaining that, I would like to explain the happy path when a user does a PUT <key, value>. I have included only the necessary steps to keep this issue focused, and considered a single etcd instance.

====================================================================================
----------api thread --------------

1. User calls etcdctl PUT k v.

2. It lands in v3_server.go::put with the message about k,v.

3. The call delegates through a series of function calls and enters v3_server.go::processInternalRaftRequestOnce.

4. It registers for a signal with the wait utility against this keyid.

5. The call delegates further through a series of function calls and enters raft/node.go::stepWithWaitOption(..message..).

6. It wraps this message in a msgResult channel and updates its result channel; then it sends the message to the propc channel.

7. After sending, it waits on msgResult.channel.
----------api thread waiting --------------

8. On seeing a message in the propc channel, raft/node.go::run() wakes up and a sequence of calls adds the message.Entries to the raftLog.

9. It notifies msgResult.channel.

----------api thread wakes--------------
10. Upon seeing msgResult.channel, the api thread wakes, returns down the stack to v3_server.go::processInternalRaftRequestOnce, and waits for the signal it registered at step #4.
----------api thread waiting --------------

11. In the next iteration of raft/node.go::run(), it gets the entry from the raftLog and adds it to readyc.
12. etcdserver/raft.go::start wakes up on seeing this entry in readyc and adds it to the applyc channel,
13. and synchronously writes to the wal log ---------------------> wal log
14. etcdserver/server.go wakes up on seeing the entry in the applyc channel (added in step #12).
15. From step #14, the call goes through a series of calls and lands in server.go::applyEntryNormal.
16. applyEntryNormal calls applyV3.apply, which eventually puts the KV into the mvcc kvstore txn kvindex.
17. applyEntryNormal now sends the signal for this key, which wakes the api thread waiting at step #7.
----------api thread wakes--------------
18. The user thread wakes here and sends back the acknowledgement.
----------user sees ok--------------

19. The batcher flushes the entries added to the kvstore txn kvindex to the database file. (This can also happen before step #18, based on its timer.)

Here, if the step #13 thread is pre-empted and rescheduled by the underlying operating system only after step #18 completes, and there is a power failure at the end of step #18, then after the user has seen OK the kv is written neither to the wal nor to the database file.

I think this is not seen today because the window is small: the server has to restart immediately after step #18, and immediately after step #12 the underlying OS must have pre-empted etcdserver/raft.go::start and put it at the end of the runnable queue. Given these multiple conditions, it appears that we don't see data loss in practice.

But it appears from the code that it is possible. To simulate it, I added a sleep after step #12 (plus an exit) and after step #19. I was able to see OK, but the data is in neither the wal nor the db.
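
To make this race easier to see outside of etcd, here is a minimal, self-contained Go sketch (this is not etcd code; the channel and function names are invented for illustration) of the ordering described above: the apply/acknowledge path and the WAL-save path run on separate goroutines, so the acknowledgement can be sent before the WAL write has happened.

package main

import (
	"fmt"
	"time"
)

// Toy model of steps #12-#18 above; all names are illustrative only.
func main() {
	applyc := make(chan string, 1) // entry handed to the apply path (step #12)
	acked := make(chan struct{})   // client acknowledgement sent (step #18)
	walDone := make(chan struct{}) // WAL write finished (step #13)

	// Goroutine modelling etcdserver/raft.go::start: it hands the entry to the
	// apply path, then persists it to the WAL.
	go func() {
		applyc <- "put /k1 v1"
		time.Sleep(100 * time.Millisecond) // simulate being pre-empted before the WAL write
		close(walDone)                     // a crash before this point loses the entry
	}()

	// Goroutine modelling the apply path: it applies the entry in memory and
	// signals the waiting API thread, which then acknowledges the client.
	go func() {
		<-applyc // "apply" to the in-memory store
		close(acked)
	}()

	<-acked
	fmt.Println("OK returned to client")

	select {
	case <-walDone:
		fmt.Println("WAL write completed, but only after the ack")
	case <-time.After(50 * time.Millisecond):
		fmt.Println("a crash here loses an acknowledged write (neither in WAL nor in db)")
	}
}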

If I am not correct, my apologies; please also correct my understanding.

Before repro, please make the following changes:

  1. Make the code changes in raft.go (screenshot omitted).
  2. Make the code changes in tx.go (screenshot omitted).
  3. Rebuild the etcd server.

Now follow the steps to repro
//1. Start etcd server with changes

//2. Add a key value. Allow etcdserver to acknowledge and exit immediately (with just sleep and exit to simulate the explanation)
$ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1
OK

//3. Remove this control flag file and restart the etcd server
$ rm /tmp/exitnow

//4. Check if key present
$ ./bin/etcdctl get /k --prefix
$

// We can see no key-value

@hasethuraman (Author)

The explanation and steps above look tedious to follow. Instead, please use the following.

//1. Incorporate this code diff and run etcd-server
https://github.com/etcd-io/bbolt/compare/v1.3.6...hasethuraman:bbolt:v1.3.6-test?expand=1

https://github.com/etcd-io/etcd/compare/release-3.5...hasethuraman:release-3.5?expand=1

//2. Add a key value. Allow etcdserver to acknowledge and exit immediately (with just sleep and exit to simulate the explanation)
$ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1
OK

//3. Remove this control flag file and restart the etcd server
$ rm /tmp/exitnow

//4. Check if key present
$ ./bin/etcdctl get /k --prefix
$

// We can see no key-value

@ahrtr (Member) commented Aug 23, 2022

It seems that you raised a duplicate of #14364.

I have already answered it in #14364 (comment) and #14364 (comment).

@hasethuraman (Author) commented Aug 24, 2022

@ahrtr it is not about HA. I am saying there appears to be a data loss.
When we acknowledge OK to the user and there is still a possibility of data loss after acknowledging, shouldn't we synchronously write to the WAL before acknowledging?

@lavacat commented Aug 25, 2022

@hasethuraman I haven't tried to repro, but I agree with @ahrtr: you should try a 3-node cluster. The Raft protocol doesn't really work on 1 node. I bet you won't be able to reproduce this.

Etcd allows launching 1-node clusters only to make it easy to experiment with the API, not for any production use.

@hasethuraman (Author)

Thanks @lavacat. I thought so too; but isn't the raft message RTT covering up this observation? I am posting a screenshot here that also shows what I think: basically, the wal write can go first, before the database/raft apply, for strong consistency.

(screenshot omitted)

@serathius (Member)

Hey @hasethuraman, can you please not use screenshots? It's much easier to just link the code:

select {
case r.applyc <- ap:
case <-r.stopped:
	return
}
// the leader can write to its disk in parallel with replicating to the followers and them
// writing to their disks.
// For more details, check raft thesis 10.2.1
if islead {
	// gofail: var raftBeforeLeaderSend struct{}
	r.transport.Send(r.processMessages(rd.Messages))
}
// Must save the snapshot file and WAL snapshot entry before saving any other entries or hardstate to
// ensure that recovery after a snapshot restore is possible.
if !raft.IsEmptySnap(rd.Snapshot) {
	// gofail: var raftBeforeSaveSnap struct{}
	if err := r.storage.SaveSnap(rd.Snapshot); err != nil {
		r.lg.Fatal("failed to save Raft snapshot", zap.Error(err))
	}
	// gofail: var raftAfterSaveSnap struct{}
}
// gofail: var raftBeforeSave struct{}
if err := r.storage.Save(rd.HardState, rd.Entries); err != nil {
	r.lg.Fatal("failed to save Raft hard state and entries", zap.Error(err))
}

@serathius (Member) commented Aug 25, 2022

> Thanks @lavacat. I thought so too; but isn't the raft message RTT covering up this observation? I am posting a screenshot here that also shows what I think: basically, the wal write can go first, before the database/raft apply, for strong consistency.

I don't think you correctly identified the order of operations. Please remember that committing to the db is done after the HardState and WAL entries are saved to the WAL. So the invocations you highlighted, case r.applyc <- ap: and r.storage.Save, are not saving the same entries.

EDIT: In a single-node cluster rd.Entries == rd.CommittedEntries, which is the exact problem that causes the durability issue.
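
To illustrate the EDIT above, here is a minimal sketch with placeholder types (not the real raft API, though the field names mirror raft.Ready): in a single-node cluster the same new entry shows up in both Entries and CommittedEntries of one Ready, so applying and acknowledging it before the WAL save means an acknowledged entry may not exist anywhere on disk.

package main

import "fmt"

// Simplified stand-in for raft.Ready; only the two relevant fields are modelled.
type ready struct {
	Entries          []string // must be saved (and fsynced) to the WAL
	CommittedEntries []string // may be applied to the state machine
}

func saveWAL(entries []string) { fmt.Println("fsync to WAL:", entries) }
func apply(entries []string)   { fmt.Println("apply and acknowledge:", entries) }

func main() {
	// In a single-node cluster the proposal is committed immediately, so the
	// same entry arrives in both slices of the same Ready.
	rd := ready{
		Entries:          []string{"put /k1 v1"},
		CommittedEntries: []string{"put /k1 v1"},
	}

	// Durable order: persist first, then apply/acknowledge. Reversing these two
	// calls reproduces the window described in this issue.
	saveWAL(rd.Entries)
	apply(rd.CommittedEntries)
}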

@ahrtr (Member) commented Aug 25, 2022

I agree that it's a little confusing that etcd returns success/OK but the data is actually lost, even though there is only one member in the cluster.

In theory, we could make everything a synchronous call and not respond to the client until everything (including boltDB and the WAL) is successfully persisted. Obviously that would cause a huge reduction in performance, and such a design doesn't make sense at all.

To prevent data loss, etcd has the WAL, which is similar to the redo log of MySQL. To prevent a member going completely down for whatever reason (e.g. disk corruption), we recommend setting up a cluster with at least 3 members. There is no perfect solution, but this is a good solution for now.

But you intentionally fail both the WAL persistence and the BoltDB commit, and with only one member. So it's expected behavior by design.

@ahrtr ahrtr closed this as completed Aug 25, 2022
@serathius (Member) commented Aug 26, 2022

I'm just flabbergasted by this conclusion. It means that a single-node cluster doesn't provide a durability guarantee. The documentation about API guarantees doesn't state that anywhere: https://etcd.io/docs/v3.5/learning/api_guarantees/#durability

@ahrtr (Member) commented Aug 26, 2022

I will take care of this, either updating the doc or enhancing the existing etcdserver/raft workflow. Please note that it can only happen in a cluster with only one member.

@ahrtr ahrtr reopened this Aug 26, 2022
@ahrtr ahrtr self-assigned this Aug 26, 2022
@smarterclayton (Contributor) commented Aug 26, 2022

Kube was designed with the assumption that PUT was durable and etcd was crash consistent for accepted writes, regardless of quorum size. Maybe @xiang90 or @philips could weigh in on whether this was intentional - it is certainly not expected and I'd say my read of the ecosystem over the last 10 years has been that all accepted writes should be crash safe regardless of cluster size.

@liggitt (Contributor) commented Aug 26, 2022

I'd say my read of the ecosystem over the last 10 years has been that all accepted writes should be crash safe regardless of cluster size.

Agreed. The only discussion around relaxing single node safety guarantees I was aware of was #11930, and that reinforced my expectation that even single-node etcd servers default to being crash safe w.r.t. successfully persisting to disk prior to treating a write request as a success, and that overriding that default requires the admin explicitly opting into something unsafe (--unsafe-no-fsync).

@ahrtr (Member) commented Aug 26, 2022

I will deliver a PR to fix this for clusters with only one member.

@hasethuraman (Author)

@ahrtr Since I did this locally and it worked as expected, I thought of sharing it. Please let me know if the fix you are planning is going to be different.

https://github.com/etcd-io/etcd/compare/release-3.5...hasethuraman:etcd:14370?expand=1#diff-54bdb7a5ed2d92f64598fb472372562ff64f8417e63d7ac672eaa485704cea9f

@ahrtr (Member) commented Aug 27, 2022

I just delivered a PR #14394 to fix this issue.

Please let me know whether you can still reproduce the issue with the PR. cc @hasethuraman

@ahrtr ahrtr added type/bug priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 27, 2022
@serathius (Member)

The question is how we want to roll out this fix. It's not a breaking change, as it restores etcd durability, which matches user expectations. However, it is expected to come with a performance regression. For v3.6 we should make single etcd instances durable, but we should avoid the backport being too disruptive.

I think it's crucial to do benchmarking to confirm how impactful the change is. If the impact is small, I would backport it as it is. If it could disrupt larger clusters, I would want to leave an escape-hatch flag so users can return to the previous behavior. If the regression is very big, the backport would need to be off by default.

@serathius (Member) commented Aug 28, 2022

Minimal repro, by using pre-existing failpoints in etcdserver:

FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd & 
# Wait for etcd to start
curl http://127.0.0.1:22381/go.etcd.io/etcd/server/etcdserver/raftBeforeSave -XPUT -d'panic'
./bin/etcdctl put a 1
# Expect put to return OK and etcd to crash
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd & 
./bin/etcdctl get a
# Expect to get empty response

@serathius serathius changed the title [Help Wanted] [With steps to repro] Possibility of data loss when server restarts immediately after key-value put with explained conditions Durability API guarantee broken in single node cluster Aug 28, 2022
@ahrtr (Member) commented Aug 29, 2022

> I think it's crucial to do benchmarking to confirm how impactful the change is. If the impact is small, I would backport it as it is. If it could disrupt larger clusters, I would want to leave an escape-hatch flag so users can return to the previous behavior. If the regression is very big, the backport would need to be off by default.

Please see my comments #14394 (comment), #14394 (comment) and #14394 (comment). In short, I think we should backport the fix to 3.5, and probably 3.4.

FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
curl http://127.0.0.1:22381/go.etcd.io/etcd/server/etcdserver/raftBeforeSave -XPUT -d'panic'
./bin/etcdctl put a 1
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
./bin/etcdctl get a

These steps don't look correct to me. Although you can reproduce the issue easily with them, the repro isn't stable. We need to make sure that both boltDB and the WAL fail to save the data, but in your steps boltDB may save the data successfully, or the client might get an error response. The correct steps are:

go get go.etcd.io/gofail/runtime   # execute this command for both server and etcdutl
FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &

curl http://127.0.0.1:22381/etcdserver/raftBeforeLeaderSend -XPUT -d'sleep(100)'
curl http://127.0.0.1:22381/etcdserver/raftBeforeSave -XPUT -d'panic'
curl http://127.0.0.1:22381/backend/beforeCommit -XPUT -d'sleep(200)'
./etcdctl  put k1 v1   

# The client will get an "OK" response, and etcd crashes when running this step. Please start etcd again.

./etcdctl  get k1   ## no data because the data was lost

For reference: https://github.com/etcd-io/gofail/blob/master/doc/design.md

@ptabor (Contributor) commented Aug 29, 2022

I think we should fix etcd's RAFT implementation rather than the etcd "apply" layer.

The RAFT protocol is pretty explicit about this:

  1. "Fortunately, the leader can write to its disk in parallel with replicating to the followers and them writing to their disks." (this is about writing the WAL)
  2. "The leader may even commit an entry before it has been written to its own disk, if a majority of followers have written it to their disks;"

It seems that etcd's RAFT implementation skips the rules needed for 'committing' an entry. I think the rule intends to say: "The leader may even commit an entry before it has been written to its own disk, if the followers that have written it to their disks form a majority of all members."

So in the case of a 1-member RAFT cluster, the leader needs to write and flush first - before considering the entry to be "committed".

@ahrtr (Member) commented Aug 29, 2022

Thanks @ptabor for the feedback. Section 10.2.1 is a performance optimization for multi-member clusters, and I don't think it is the root cause of this issue. The optimization relies on the key point that a follower must sync the WAL entries before responding to the leader. Anyway, it's for multi-member clusters.

For the case of a 1-member cluster, I believe the solution "the leader needs to write and flush first - before considering the entry to be 'committed'" definitely works, and several people raised the same solution. But I mentioned the reasons why I did not adopt it in #14394 (comment). Let me recap the points:

  1. Saving WAL entries in parallel with committing & applying them gives some performance improvement, although it might not be a big one.
  2. When etcdserver fails to apply the entries for whatever reason, it can respond to the client as soon as possible, with no need to wait for the WAL sync at all in that case.
  3. The raft package is relatively stable compared to etcdserver, so we are always cautious about enhancing/refactoring it. Raft handles the commit logic in the same way whether it's a one-member or a multi-member cluster; for a one-member cluster it just commits the proposal immediately, because it doesn't need to get confirmation from itself. It might be OK for now.

Anyway, I agree that we should enhance the raft protocol for one-member clusters, but we'd better do it in the future instead of now, because we have higher-priority things to do. WDYT? @ptabor @serathius @spzala

@ptabor (Contributor) commented Aug 29, 2022

Thank you @ahrtr. My intuition is that 1. and 2. are premature optimisations.

  1. Committing/flushing bolt is the most expensive operation, and in both cases we are not doing it on the synchronous path. From the in-memory apply I would expect insignificant overhead.
  2. That's interesting. This would apply if etcd is used with many transactions with failing preconditions... this used to be the case more before the introduction of leases... as such a usage model is expensive due to the need to write (but not flush) the WAL log for all such (practically read-only) entries. I would not optimize for patterns that should rather be avoided.
  3. TBH I failed to find where the exception happens in the RAFT implementation (see the sketch after this list):

     etcd/raft/raft.go, lines 585 to 588 at f56e0d0:

     func (r *raft) maybeCommit() bool {
     	mci := r.prs.Committed()
     	return r.raftLog.maybeCommit(mci, r.Term)
     }

     is the place that moves the committed index, and

     // The smallest index into the array for which the value is acked by a
     // quorum. In other words, from the end of the slice, move n/2+1 to the
     // left (accounting for zero-indexing).
     pos := n - (n/2 + 1)

     is the check for the majority index. Do you see where there is a special case for a single-node cluster?
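
To make the absence of a special case concrete, here is a small, self-contained Go sketch (it only re-implements the majority-index formula quoted above, not the raft package itself): for a single-voter cluster the quorum index is simply the leader's own in-memory match index, so an entry counts as committed as soon as the leader has appended it, before any disk flush.

package main

import (
	"fmt"
	"sort"
)

// quorumIndex re-implements the quoted formula: sort the match indexes and
// pick the highest index acknowledged by a majority (n/2+1) of the voters.
func quorumIndex(matchIndexes []uint64) uint64 {
	srt := append([]uint64(nil), matchIndexes...)
	sort.Slice(srt, func(i, j int) bool { return srt[i] < srt[j] })
	n := len(srt)
	pos := n - (n/2 + 1)
	return srt[pos]
}

func main() {
	// Single voter: the leader's own (not yet flushed) match index is the
	// quorum index, so the entry is "committed" before it reaches disk.
	fmt.Println(quorumIndex([]uint64{5})) // prints 5

	// Three voters with match indexes 5, 3, 0: only index 3 is acknowledged by
	// a majority (2 of 3), i.e. written to a majority of disks.
	fmt.Println(quorumIndex([]uint64{5, 3, 0})) // prints 3
}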

If the Raft change is an O(10-20) line change and it does not show the (more significant) performance degradation, I would prefer that over introducing another 'case' (walNotifyc) on the etcd side. I think it will:

  • simplify the code paths.
  • avoid similar errors for other users of etcd's raft that do not distinguish the 1-member and multi-member cases.

If an application wants to do 'hedging' (apply early but do not commit to bolt when the probability of commit is high), such an optimisation can be added in the etcd layer on top of a proper raft protocol (although I don't think it's worth it).

The Raft change would need to be Config-gated to keep the behavior stable - with the previous behavior as the default in 3.4 & 3.5 (overridden in etcd).

serathius added a series of commits to serathius/etcd referencing this issue (Oct 20-24, 2022).
serathius added a commit that referenced this issue Oct 25, 2022
tests: Add linearizability tests scenario for #14370
serathius added a commit to serathius/etcd that referenced this issue May 8, 2024
…equire first and last request to be persisted

This assumption is not true during durability issues like etcd-io#14370.
In reality we want to avoid situations where the WAL was truncated; for
that, it's enough to ensure that the first and last operations are present.

Found this when running `make test-robustness-issue14370`: instead of
getting `Model is not linearizable`, I got that the assumptions were broken.

Signed-off-by: Marek Siarkowicz <[email protected]>
serathius added further commits to serathius/etcd referencing this issue (May 8, 2024).
serathius added a commit that referenced this issue May 9, 2024
Update the robustness README and fix the #14370 reproduction case