Possible data loss on corruption of coordinating node replica #679
Russell brought this up today, and perhaps he can elaborate on this:
/cc @jtuple @jrwest @evanmcc @Vagabond @russelldb
Comments
Yeah. That captures it @engelsanchez. I think this is a problem we introduced recently by treating errors on local read as `not_found`. If for whatever reason a coordinating PUT to a vnode fails to read an existing K/V out of the backend (we have a catch for it here: https://github.com/basho/riak_kv/blob/develop/src/riak_kv_vnode.erl#L1258) then we have this problem. @engelsanchez pointed out we have the same problem if a user removes the backend data and not the vnodeid, or removes the backend data while the vnode is running. At this point I'd say all bets are off and there is nothing we can do. In the case of corruption, where we do get an error attempting to read the data, we can do something (get the data from another vnode, generate a new vnodeid). Ignoring the error and returning a `not_found` leads to this:

1. Imagine putting to key K with an empty vclock; the item has been written by vnode A before.
2. Vnode A coordinates the write for this PUT.
3. Vnode A's backend is corrupt due to cosmic rays.
4. Vnode A attempts to read key K to get a local vector clock. Vnode A gets an error, but we silently parlay that into a `not_found`.
5. K's data is returned as a frontier object (not!) to the FSM, which sends it downstream to vnodes B and C for local merge.
6. Vnodes B and C see that their local vclocks for K dominate the one they just received, drop the write, and ack the FSM. The FSM acks the user.
7. The user tries to read K. A's replica is dominated by B's and C's, read repair kicks in, and A's data is replaced. We lost the data.

If, on a read error on a co-ordinating PUT, we increment the vnodeid at A, the write would be a sibling of the data at B and C and survive (even if it is a false sibling, it is better than a dropped write). |
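To make the failure mode concrete, here is a minimal sketch of the pattern being described: a backend read error collapsed into the same path as a genuine `not_found`. This is illustrative only, not the riak_kv code; the function names and return shapes are assumptions.

```erlang
%% Illustrative sketch only; not the riak_kv implementation. ReadLocal
%% is a fun standing in for the backend get.
-module(local_read_sketch).
-export([prepare_coordinated_put/2]).

prepare_coordinated_put(ReadLocal, IncomingObj) when is_function(ReadLocal, 0) ->
    case ReadLocal() of
        {ok, LocalObj} ->
            %% Normal path: hand both objects on so the coordinator can
            %% merge them and its new vclock descends from the history
            %% already on disk ({merge, ...} is just a placeholder).
            {coord, {merge, LocalObj, IncomingObj}};
        not_found ->
            %% Genuinely new key: coordinate the write as-is.
            {coord, IncomingObj};
        {error, _Reason} ->
            %% The case this issue describes: a read *error* (corruption,
            %% missing backend files, ...) collapsed into the not_found
            %% path. The resulting vclock does not descend from the lost
            %% local history, so replicas that still hold that history
            %% judge the write dominated and drop it.
            {coord, IncomingObj}
    end.
```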
@russelldb I don't believe that changing/incrementing the vnodeid is feasible. Even if the backend could tell us 100% of the time that there was corruption, we run the risk of it telling us that it noticed corruption 1000 times/second (to pull an arbitrary but not totally silly figure out of the air). There would be a different kind of actor explosion happening then, wouldn't there? Is this totally crazy?
cc: @jtuple @jonmeredith Thoughts? |
Just to summarize a long discussion in HipChat: a per-vnode counter, incremented on every co-ordinated write. (Is that it, @jtuple, @slfritchie, @jonmeredith?) |
The counter's starting value should be really big, e.g. billions? Otherwise, you're in the same bad place as using a constant. |
@slfritchie I don't understand why. If there is no counter on disk to read, but there is a vnodeid, then create a new vnodeid; otherwise the counter value on disk + threshold means we get a frontier count. Why the billions? |
@slfritchie I am starting work on this now, wondering if you could articulate your objection to a low starting value. |
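For concreteness, a minimal sketch of the counter-plus-threshold idea as discussed above (the module name, on-disk format, and lease size are all assumptions, not the riak_kv implementation). The limit persisted on disk is always ahead of anything issued, so after a crash the vnode resumes from the on-disk value and cannot reuse a count.

```erlang
%% Sketch of the per-vnode counter idea; names and format are assumed.
-module(vnode_counter_sketch).
-export([init/1, next/1]).

-define(LEASE, 10000).

-record(cntr, {next, limit, path}).

%% On (re)start, every value below the limit on disk may already have
%% been issued, so resume from that limit and persist a new one.
init(Path) ->
    Limit = case file:read_file(Path) of
                {ok, Bin} -> binary_to_integer(Bin);
                {error, enoent} -> 0
            end,
    NewLimit = Limit + ?LEASE,
    ok = file:write_file(Path, integer_to_binary(NewLimit)),
    #cntr{next = Limit, limit = NewLimit, path = Path}.

%% Called on every co-ordinated write; only touches disk when the
%% current lease is exhausted.
next(#cntr{next = N, limit = L, path = P} = C) when N >= L ->
    NewLimit = L + ?LEASE,
    ok = file:write_file(P, integer_to_binary(NewLimit)),
    {N, C#cntr{next = N + 1, limit = NewLimit}};
next(#cntr{next = N} = C) ->
    {N, C#cntr{next = N + 1}}.
```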
The plan above is no good. See riak_kv#726 and https://github.com/basho/riak_test/compare/bug;rdb;gh679 for details. I think there are a number of possible sources of this behaviour and we should probably partition them a little better.

One question: why store the vnodeid on disk separate from the backend? If the vnodeid was in the backend, then one source of the issue (backend data deleted, vnodeid data is not) would be resolved. I imagine this leads to issues for the memory backend, though.

We could also revert the change that treats an error on local read as a `not_found`. Then we are left with the case where there is undetectable corruption (only a/some key/value is affected but the vnodeid is still present in the backend). I have no idea how to solve this case at the moment. I also need to do a little analysis to figure out if there is different behaviour when the PUT has a vclock and when it does not.

If we stick with the solution in riak_kv#726, the problem is manifested as a concurrent write causally dominating a write on disk. If we stick with what we have, a write concurrent with data at a replica is accepted and subsequently dropped as it is dominated by what is on disk. Which is worse? Is there another way? I'm wondering if I'm just overthinking a tiny edge case? |
OK, I had a think and hacked together a branch that partially solves the problem. Since we never compare vclocks across keys, creating a new vnodeid (based on the vnodeid ++ some epoch) as the actor whenever there is a local read error on a co-ordinating PUT avoids the dropped write described above. It raises issues too… but… discuss. |
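A rough sketch of what "vnodeid ++ some epoch" could look like as an actor id. The epoch source here (wall-clock microseconds) is just a stand-in; the branch may well derive it differently.

```erlang
%% Sketch only; not the code in the branch. The epoch source is assumed.
-module(epoch_actor_sketch).
-export([new_epoch_actor/1]).

%% Build a fresh actor id from the existing vnodeid plus an epoch, so a
%% co-ordinated PUT that hit a local read error writes under an actor
%% the other replicas have never seen.
new_epoch_actor(VnodeId) when is_binary(VnodeId) ->
    {Mega, Sec, Micro} = os:timestamp(),
    Epoch = (Mega * 1000000 + Sec) * 1000000 + Micro,
    <<VnodeId/binary, Epoch:64/unsigned-integer>>.
```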
This[1] works* for the case where the local read on a co-ordinating PUT returns an error. [1] https://github.com/basho/riak_kv/compare/bug;rdb;gh679-crazy-ivan
|
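To see why the fresh actor survives where the reused one is dropped, here is a small demonstration using the vclock module that ships with riak_core (assuming its classic fresh/0, increment/2 and descends/2 API; actor names are made up).

```erlang
%% Demonstration only; requires riak_core's vclock module on the path.
-module(sibling_demo).
-export([run/0]).

run() ->
    %% What replicas B and C hold for key K: two events by vnode A.
    OnDisk0 = vclock:increment(vnode_a, vclock:fresh()),
    OnDisk  = vclock:increment(vnode_a, OnDisk0),

    %% Coordinator reuses actor vnode_a after the local read error:
    %% the resulting clock is already covered by what is on disk.
    Reused = vclock:increment(vnode_a, vclock:fresh()),

    %% Coordinator uses a fresh "vnodeid ++ epoch" actor instead:
    %% the clocks are concurrent, so the value is kept as a sibling.
    Fresh = vclock:increment(vnode_a_epoch2, vclock:fresh()),

    true  = vclock:descends(OnDisk, Reused),  %% dominated -> dropped
    false = vclock:descends(OnDisk, Fresh),   %% concurrent -> sibling
    false = vclock:descends(Fresh, OnDisk),
    ok.
```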
Hi, Russell. I've a couple of worries, ignoring the fact that I still like the counter idea.
One way around the latency hit would be to do the vnode status update asynchronously:
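For example (a rough sketch; `file:write_file/2` and the message shape are just stand-ins for however the status file is really written):

```erlang
%% Sketch only: push the status write off the vnode's critical path.
%% The vnode carries on immediately; the trade-off is a window in which
%% the status on disk lags counters/ids the vnode has already used.
update_vnode_status_async(StatusPath, Status) ->
    Vnode = self(),
    spawn_link(fun() ->
        ok = file:write_file(StatusPath, term_to_binary(Status)),
        Vnode ! {vnode_status_written, StatusPath}
    end),
    ok.
```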
|
Sounds like we're overloading the meaning of `not_found`. |
Sure, that deals with one case, but there are others: when you remove the data at the backend, for instance, this issue is still an issue.
|
Adding wood to the fire: a scenario related to this discussion painfully surfaced recently and was discussed by the Core/KV cabal. It is possible for hinted handoff to finish much, much later than necessary. See #847. That, combined with deletes when delete_mode is not set to keep, can trigger the data loss described in this issue. |
As @jrwest rightly points out, delete mode never (i.e. don't reap tombstones) is a workaround for the doomstone data-loss flavour of this bug. Given that is the most likely cause, and the least "byzantine", maybe that is enough for 2.1, with more comprehensive fixes for later versions. |
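For reference, a sketch of that workaround as it would appear in the riak_kv section of the node's advanced.config (app.config on older releases):

```erlang
%% Keep tombstones instead of reaping them ("delete mode never"),
%% which avoids the doomstone variant of this bug at the cost of
%% tombstones accumulating on disk.
[
 {riak_kv, [
     {delete_mode, keep}
 ]}
].
```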
Riak Tests for scenarios of basho/riak_kv#679
Erm, I think we could have closed this by now, pretty sure it's dealt with |
Agreed |