Possible data loss on corruption of coordinating node replica #679
Russell brought this up today, and perhaps he can elaborate on this:
/cc @jtuple @jrwest @evanmcc @Vagabond @russelldb
Comments
Yeah. That captures it @engelsanchez. I think this is a problem we introduced recently by treating errors on local read as `not_found`. If for whatever reason a coordinating PUT to a vnode fails to read an existing K/V out of the backend (we have a catch for it here: https://github.com/basho/riak_kv/blob/develop/src/riak_kv_vnode.erl#L1258) then we have this problem. @engelsanchez pointed out we have the same problem if a user removes the backend data and not the vnodeid, or removes the backend data while the vnode is running. At this point I'd say all bets are off and there is nothing we can do. In the case of corruption, where we do get an error attempting to read the data, we can do something (get the data from another vnode, generate a new vnodeid). Ignoring the error and returning a `not_found` leads to this:

1. Imagine putting to key K with an empty vclock; the item has been written by vnode A before.
2. Vnode A coordinates the write for this PUT.
3. Vnode A's backend is corrupt due to cosmic rays.
4. Vnode A attempts to read key K to get a local vector clock. Vnode A gets an error, but we silently parlay that into a `not_found`.
5. K's data is returned as a frontier object (not!) to the FSM, which sends it downstream to vnodes B and C for local merge.
6. Vnodes B and C see that their local vclocks for K dominate the one they just received, drop the write, and ack the FSM. The FSM acks the user.
7. The user tries to read K. A's replica is dominated by B's and C's, read repair kicks in, and A's data is replaced. We lost the data.

If, on a read error on a co-ordinating PUT, we increment the vnodeid at A, the write would be a sibling of the data at B and C and survive (even if it is a false sibling, it is better than a dropped write). |
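To make the failure mode concrete, here is a minimal sketch of the pattern being described: a backend read error collapsed into the same path as a genuine `not_found`. This is illustrative only, not the riak_kv code; the function names and return shapes are assumptions.

```erlang
%% Illustrative sketch only; not the riak_kv implementation. ReadLocal
%% is a fun standing in for the backend get.
-module(local_read_sketch).
-export([prepare_coordinated_put/2]).

prepare_coordinated_put(ReadLocal, IncomingObj) when is_function(ReadLocal, 0) ->
    case ReadLocal() of
        {ok, LocalObj} ->
            %% Normal path: hand both objects on so the coordinator can
            %% merge them and its new vclock descends from the history
            %% already on disk ({merge, ...} is just a placeholder).
            {coord, {merge, LocalObj, IncomingObj}};
        not_found ->
            %% Genuinely new key: coordinate the write as-is.
            {coord, IncomingObj};
        {error, _Reason} ->
            %% The case this issue describes: a read *error* (corruption,
            %% missing backend files, ...) collapsed into the not_found
            %% path. The resulting vclock does not descend from the lost
            %% local history, so replicas that still hold that history
            %% judge the write dominated and drop it.
            {coord, IncomingObj}
    end.
```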
@russelldb I don't believe that changing/incrementing the vnodeid is feasible. Even if the backend could tell us 100% of the time that there was corruption, we run the risk of it telling us that it noticed corruption 1000 times/second (to pull an arbitrary but not totally silly figure out of the air). There would be a different kind of actor explosion happening then, wouldn't there? Is this totally crazy?
cc: @jtuple @jonmeredith Thoughts? |
Just to summarize a long discussion in HipChat: a per-vnode counter, incremented on every co-ordinated write. (Is that it, @jtuple, @slfritchie, @jonmeredith?) |
The counter's starting value should be really big, e.g. billions? Otherwise, you're in the same bad place as using a constant. |
@slfritchie I don't understand why. If there is no counter on disk to read, but there is a vnodeid, then create a new vnodeid; otherwise the counter value on disk + threshold means we get a frontier count. Why the billions? |
@slfritchie I am starting work on this now, wondering if you could articulate your objection to a low starting value. |
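For concreteness, a minimal sketch of the counter-plus-threshold idea as discussed above (the module name, on-disk format, and lease size are all assumptions, not the riak_kv implementation). The limit persisted on disk is always ahead of anything issued, so after a crash the vnode resumes from the on-disk value and cannot reuse a count.

```erlang
%% Sketch of the per-vnode counter idea; names and format are assumed.
-module(vnode_counter_sketch).
-export([init/1, next/1]).

-define(LEASE, 10000).

-record(cntr, {next, limit, path}).

%% On (re)start, every value below the limit on disk may already have
%% been issued, so resume from that limit and persist a new one.
init(Path) ->
    Limit = case file:read_file(Path) of
                {ok, Bin} -> binary_to_integer(Bin);
                {error, enoent} -> 0
            end,
    NewLimit = Limit + ?LEASE,
    ok = file:write_file(Path, integer_to_binary(NewLimit)),
    #cntr{next = Limit, limit = NewLimit, path = Path}.

%% Called on every co-ordinated write; only touches disk when the
%% current lease is exhausted.
next(#cntr{next = N, limit = L, path = P} = C) when N >= L ->
    NewLimit = L + ?LEASE,
    ok = file:write_file(P, integer_to_binary(NewLimit)),
    {N, C#cntr{next = N + 1, limit = NewLimit}};
next(#cntr{next = N} = C) ->
    {N, C#cntr{next = N + 1}}.
```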
The plan above is no good. See riak_kv#726 and https://github.com/basho/riak_test/compare/bug;rdb;gh679 for details. I think there are a number of possible sources of this behaviour and we should probably partition them a little better.

One question: why store the vnodeid on disk separate from the backend? If the vnodeid was in the backend, then one source of the issue (backend data deleted, vnodeid data is not) would be resolved. I imagine this leads to issues for the memory backend, though.

We could also revert the change that treats an error on local read as a `not_found`. Then we are left with the case where there is undetectable corruption (only a/some key/value is affected but the vnodeid is still present in the backend). I have no idea how to solve this case at the moment. I also need to do a little analysis to figure out if there is different behaviour when the PUT has a vclock and when it does not.

If we stick with the solution in riak_kv#726, the problem is manifested as a concurrent write causally dominating a write on disk. If we stick with what we have, a write concurrent with data at a replica is accepted and subsequently dropped as it is dominated by what is on disk. Which is worse? Is there another way? I'm wondering if I'm just overthinking a tiny edge case? |
OK, I had a think and hacked together a branch that partially solves the problem. Since we never compare vclocks across keys, creating a new vnodeid (based on the vnodeid ++ some epoch) as the actor whenever there is a local read error on a co-ordinating PUT avoids the dropped write described above. It raises issues too… but… discuss. |
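A rough sketch of what "vnodeid ++ some epoch" could look like as an actor id. The epoch source here (wall-clock microseconds) is just a stand-in; the branch may well derive it differently.

```erlang
%% Sketch only; not the code in the branch. The epoch source is assumed.
-module(epoch_actor_sketch).
-export([new_epoch_actor/1]).

%% Build a fresh actor id from the existing vnodeid plus an epoch, so a
%% co-ordinated PUT that hit a local read error writes under an actor
%% the other replicas have never seen.
new_epoch_actor(VnodeId) when is_binary(VnodeId) ->
    {Mega, Sec, Micro} = os:timestamp(),
    Epoch = (Mega * 1000000 + Sec) * 1000000 + Micro,
    <<VnodeId/binary, Epoch:64/unsigned-integer>>.
```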
This[1] works* for the case where the local read on a co-ordinating PUT returns an error. [1] https://github.com/basho/riak_kv/compare/bug;rdb;gh679-crazy-ivan
|
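To see why the fresh actor survives where the reused one is dropped, here is a small demonstration using the vclock module that ships with riak_core (assuming its classic fresh/0, increment/2 and descends/2 API; actor names are made up).

```erlang
%% Demonstration only; requires riak_core's vclock module on the path.
-module(sibling_demo).
-export([run/0]).

run() ->
    %% What replicas B and C hold for key K: two events by vnode A.
    OnDisk0 = vclock:increment(vnode_a, vclock:fresh()),
    OnDisk  = vclock:increment(vnode_a, OnDisk0),

    %% Coordinator reuses actor vnode_a after the local read error:
    %% the resulting clock is already covered by what is on disk.
    Reused = vclock:increment(vnode_a, vclock:fresh()),

    %% Coordinator uses a fresh "vnodeid ++ epoch" actor instead:
    %% the clocks are concurrent, so the value is kept as a sibling.
    Fresh = vclock:increment(vnode_a_epoch2, vclock:fresh()),

    true  = vclock:descends(OnDisk, Reused),  %% dominated -> dropped
    false = vclock:descends(OnDisk, Fresh),   %% concurrent -> sibling
    false = vclock:descends(Fresh, OnDisk),
    ok.
```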
Hi, Russell. I've a couple of worries, ignoring the fact that I still like the counter idea.
One way around the latency hit would be to do the vnode status update asynchronously:
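For example (a rough sketch; `file:write_file/2` and the message shape are just stand-ins for however the status file is really written):

```erlang
%% Sketch only: push the status write off the vnode's critical path.
%% The vnode carries on immediately; the trade-off is a window in which
%% the status on disk lags counters/ids the vnode has already used.
update_vnode_status_async(StatusPath, Status) ->
    Vnode = self(),
    spawn_link(fun() ->
        ok = file:write_file(StatusPath, term_to_binary(Status)),
        Vnode ! {vnode_status_written, StatusPath}
    end),
    ok.
```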
|
Sounds like we're overloading the meaning of `not_found`. |
Sure, that deals with one case, but there are others: when you remove the data at the backend, for instance, this issue is still an issue.
|
Adding wood to the fire: a scenario related to this discussion painfully surfaced recently and was discussed by the Core/KV cabal. It is possible for hinted handoff to finish much, much later than necessary. See #847. That, combined with deletes when delete_mode is not set to keep, can trigger the data loss described in this issue. |
As @jrwest rightly points out, delete mode never (i.e. don't reap tombstones) is a workaround for the doomstone data-loss flavour of this bug. Given that is the most likely cause, and the least "byzantine", maybe that is enough for 2.1, with more comprehensive fixes for later versions. |
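For reference, a sketch of that workaround as it would appear in the riak_kv section of the node's advanced.config (app.config on older releases):

```erlang
%% Keep tombstones instead of reaping them ("delete mode never"),
%% which avoids the doomstone variant of this bug at the cost of
%% tombstones accumulating on disk.
[
 {riak_kv, [
     {delete_mode, keep}
 ]}
].
```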
Riak Tests for scenarios of basho/riak_kv#679
Erm, I think we could have closed this by now, pretty sure it's dealt with |
Agreed |