stability: gap in meta2 addressing records on beta cluster #9265
The error message indicates that this is a lookupMismatchError, which is returned by this check:

```go
// It rarely may be possible that we somehow got grouped in with the
// wrong RangeLookup (eg. from a double split), so if we did, return
// a retryable lookupMismatchError with an unmodified eviction token.
if res.desc != nil {
	if (!useReverseScan && !res.desc.ContainsKey(key)) || (useReverseScan && !res.desc.ContainsExclusiveEndKey(key)) {
		return nil, evictToken, lookupMismatchError{
			desiredKey:     key,
			mismatchedDesc: res.desc,
		}
	}
}
return res.desc, res.evictToken, res.err
```
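For reference, here is a minimal, self-contained sketch of the containment semantics that check relies on. The types are simplified stand-ins rather than the real roachpb descriptor, and the exact method semantics are stated as an assumption: a descriptor covers the half-open span [StartKey, EndKey), and a reverse scan addresses a key via the exclusive end key instead.

```go
package main

import (
	"bytes"
	"fmt"
)

// Desc is a simplified stand-in for a range descriptor covering the
// half-open key span [StartKey, EndKey).
type Desc struct {
	StartKey, EndKey []byte
}

// ContainsKey mirrors the forward-scan check: StartKey <= key < EndKey.
func (d Desc) ContainsKey(key []byte) bool {
	return bytes.Compare(d.StartKey, key) <= 0 && bytes.Compare(key, d.EndKey) < 0
}

// ContainsExclusiveEndKey mirrors the reverse-scan check: the key is
// addressable by this range if StartKey < key <= EndKey.
func (d Desc) ContainsExclusiveEndKey(key []byte) bool {
	return bytes.Compare(d.StartKey, key) < 0 && bytes.Compare(key, d.EndKey) <= 0
}

func main() {
	d := Desc{StartKey: []byte("c"), EndKey: []byte("f")}
	fmt.Println(d.ContainsKey([]byte("b")))             // false: the mismatch case above
	fmt.Println(d.ContainsKey([]byte("c")))             // true
	fmt.Println(d.ContainsExclusiveEndKey([]byte("f"))) // true: reverse scans address the end key
}
```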
We stopped node 9 and looked at the meta2 entries (range 1 hasn't split, so all the addressing is on a single range):
The (newest) meta2 entry which refers to range 9194 seems identical to the one we grabbed from the raft status page. Since the request's key was smaller than the key range covered by that descriptor, we looked at the meta2 entry preceding that of 9194's descriptor (again, its newest version).
Notice that there's a gap in the meta2 keyspace, which is an anomalous condition: 6111's EndKey isn't identical to 9194's StartKey (it's smaller). The range before 6111 (range_id:2884) exhibits no such gap:
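The violated invariant is easy to state in code. Here is a rough, self-contained sketch (with a simplified descriptor type and made-up keys, not the real meta2 scan machinery) of the contiguity check that the gap breaks:

```go
package main

import (
	"bytes"
	"fmt"
)

// desc is a minimal stand-in for a range descriptor covering [StartKey, EndKey).
type desc struct {
	RangeID          int
	StartKey, EndKey []byte
}

// checkContiguous verifies that descriptors sorted by StartKey tile the
// keyspace: each descriptor's EndKey must equal the next one's StartKey.
// A gap such as the one between 6111 and 9194 would be reported here.
func checkContiguous(descs []desc) error {
	for i := 1; i < len(descs); i++ {
		prev, cur := descs[i-1], descs[i]
		if !bytes.Equal(prev.EndKey, cur.StartKey) {
			return fmt.Errorf("gap/overlap between r%d (EndKey=%q) and r%d (StartKey=%q)",
				prev.RangeID, prev.EndKey, cur.RangeID, cur.StartKey)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkContiguous([]desc{
		{2884, []byte("a"), []byte("c")},
		{6111, []byte("c"), []byte("e")},
		{9194, []byte("f"), []byte("h")}, // "e" != "f": a gap like the one observed
	}))
}
```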
We found the missing range: https://gist.github.com/tschottdorf/fe956827f7c97e55f9b76e382492fd4b. We've stopped beta to avoid further damage (the GC queue won't check that the correct descriptor is returned from RangeLookup and will delete based only on StoreIDs), and it shouldn't be restarted for that same reason. Unfortunately, it seems the damage is already (mostly) done: there's only one Replica remaining (node5), but that one is legitimately to be GC'ed - the descriptor reads nodes 3, 2, and 6. The last lease this replica saw was on
Action items so far:
Does the range log show anything?
We didn't look -- I forgot about rangelog (we checked eventlog instead).
For those following along, the fact that the meta2 range has not split rules out #2266 as the culprit. I've pulled all the replicas of meta2, and range 1495 is missing from all of them (so it's not a case of replica divergence as in #5291 or #7224; if it were, the consistency checker should have caught it too). So either the meta2 records for range 1495 were never properly created, or they were created and then somehow destroyed. The range was rebalanced multiple times (each of which would have rewritten the addressing records), which suggests that it was working for a while. How, then, were the addressing records deleted? It's not through normal replica GC, which leaves the addressing records alone. The only time we intentionally delete from meta2 is in AdminMerge, and that's not supposed to happen anywhere (although it is technically exposed through the API...)
How can you tell? The replica IDs for that range are in the 20s, but all
I didn't say anything about whether the rebalances happened before or after the destruction of the meta records, only that the fact that the rebalances occurred at all means that range 1495 was created and subsequently destroyed, instead of never being created in the first place.
OK. Just to clarify, my working hypothesis is that 1495 once covered what is now [1495, 9194], and then it split, and the existing meta records were overwritten with 9194's descriptor, and the new meta records for the now-shrunken 1495 were never written. Is that what you had in mind? I'm confused by your use of "destroyed".
Ah, I see what you're saying. The correct location for 1495's address records changed, so they could have been written (correctly), rebalanced (correctly), and then overwritten (correctly), and the incorrect part is that they were never written to their new home. I was using "destroyed" as in "deleted", which isn't supposed to happen except in the case of a merge. A split that fails to update the addressing records for the LHS makes more sense. This does resemble #5291 even though the end result is different, since our theory there involved intent resolution being weird during a split.
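To make that hypothesis concrete, here is a hedged sketch of the pair of addressing writes a split has to perform. The helper names and the plain map standing in for the split's transactional batch are invented for illustration, not the actual split trigger code; the assumption is only that meta records are keyed by a range's EndKey, so the RHS inherits the old record's key while the shrunken LHS needs a brand-new record - the write that appears to be missing here.

```go
package main

import "fmt"

// desc is a minimal stand-in for a range descriptor (half-open [StartKey, EndKey)).
type desc struct {
	RangeID          int
	StartKey, EndKey string
}

// metaKey is a hypothetical helper: addressing records live at meta2/<EndKey>.
func metaKey(endKey string) string { return "meta2/" + endKey }

// splitAddressing sketches the two meta2 writes a split of [a, z) at m must
// perform: overwrite the record at the old end key with the RHS descriptor,
// and create a brand-new record for the shrunken LHS at the split key. If the
// second write is lost, keys in the LHS resolve to the RHS descriptor and
// fail the containment check - the symptom seen in this issue.
func splitAddressing(batch map[string]desc, lhs, rhs desc) {
	batch[metaKey(rhs.EndKey)] = rhs // overwrite: meta2/<oldEndKey> -> RHS (e.g. 9194)
	batch[metaKey(lhs.EndKey)] = lhs // create:    meta2/<splitKey>  -> shrunken LHS (e.g. 1495)
}

func main() {
	batch := map[string]desc{}
	lhs := desc{RangeID: 1495, StartKey: "a", EndKey: "m"}
	rhs := desc{RangeID: 9194, StartKey: "m", EndKey: "z"}
	splitAddressing(batch, lhs, rhs)
	fmt.Println(batch) // map[meta2/m:{1495 a m} meta2/z:{9194 m z}]
}
```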
Any idea when the cluster entered this problematic state? From what I can discern from
The range log should tell. We weren't able to figure it out yesterday from any of the histories (because GC) or the block writer logs (because rotation) or Grafana (because erratic).
I'm collecting some of the interesting pieces of data (via
Anything else before I proceed (which would be tonight, as I'm about to head out)?
Uploading to gs://cockroach-backup/meta2-bug-9265/.
OK, I think I've figured out what happened (or at least I've found a bug that could have caused this; I hope we don't have more than one of these). This comes from staring at the code; I haven't verified this against any of the data salvaged from gamma (and I'm not sure if there's any evidence that would remain). TL;DR: a transaction which A) issues multiple writes in its first batch B) with out-of-order keys may race with both the
Tracing the bug:
The fix, I think, is simple: when
More related action items:
Wow, nice sleuthing. I think all of the related action items are good, except I'm wary about having DistSender guarantee execution of
Great find, Ben.
Rather than the last action item, perhaps the semantics of
RangeLookup currently takes a point, not a span. Changing it to take a span makes some sense, although I'm not sure what we'd get if we start pulling on that thread.
Great job @bdarnell, what a perfect storm. I can sleep again at night.
^- Going to grab this one
@bdarnell right. Rephrasing my earlier comment: Rather than the last action item, perhaps the semantics of
^- including this one.
@tamird Yeah, sanity-checking like that sounds good.
In case anyone else is wondering, @spencerkimball and I just realized something else was exacerbating this. The Txn should only be GCed after one hour of inactivity because of this code:
We looked at the transaction entries which are created in the Push, and lo and behold, they don't set OrigTimestamp or LastHeartbeat, only Timestamp. @spencerkimball will be fixing that part.
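To illustrate why that matters, here is a rough sketch of the inactivity check; field and function names are simplified stand-ins, not the real GC queue code. The assumption is that activity is judged by the later of OrigTimestamp and LastHeartbeat, so a push-created record with both left at zero looks like it has been inactive forever and clears the one-hour threshold immediately.

```go
package main

import (
	"fmt"
	"time"
)

// txnRecord is a simplified stand-in for the persisted transaction record.
type txnRecord struct {
	OrigTimestamp, LastHeartbeat, Timestamp int64 // wall-clock nanos, simplified
}

// lastActivity mirrors the idea behind the GC check: the record counts as
// active as of the later of its original timestamp and its last heartbeat.
// The Timestamp field is deliberately not consulted.
func lastActivity(t txnRecord) int64 {
	if t.LastHeartbeat > t.OrigTimestamp {
		return t.LastHeartbeat
	}
	return t.OrigTimestamp
}

// gcable reports whether the record has been inactive for longer than the
// one-hour threshold mentioned above.
func gcable(t txnRecord, now int64) bool {
	return now-lastActivity(t) > int64(time.Hour)
}

func main() {
	now := time.Now().UnixNano()
	// A record created by a push with only Timestamp set: its activity is
	// zero, so it is GC-able immediately despite being brand new.
	pushed := txnRecord{Timestamp: now}
	fmt.Println(gcable(pushed, now)) // true
	// A record with a recent heartbeat survives.
	healthy := txnRecord{OrigTimestamp: now, LastHeartbeat: now}
	fmt.Println(gcable(healthy, now)) // false
}
```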
This bug is really the perfect storm.
Fix the main issue in cockroachdb#9265. When a transaction is aborted before having written its initial transaction record, intents written by the transaction before creating its record may be aborted. If the GC queue deletes that ABORTED entry, there must be a mechanism which prevents the original transaction from creating what it considers the "initial" transaction record, or that transaction could commit successfully but lose some of its writes. Such a mechanism is added here: the GC queue keeps track of the threshold used for grooming the transaction span and communicates this into the replicated Range state, where BeginTransaction can check it and return an appropriate error. Fixes cockroachdb#9265.
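A minimal sketch of the mechanism that commit message describes. The names (rangeState, TxnSpanGCThreshold, checkBeginTransaction) are illustrative, not necessarily the identifiers used in the actual fix: the GC queue records the timestamp up to which it groomed the transaction span in the replicated range state, and BeginTransaction refuses to (re)create a record for a transaction that started at or below that threshold.

```go
package main

import "fmt"

// rangeState is a simplified stand-in for the replicated per-range state the
// commit message refers to.
type rangeState struct {
	// TxnSpanGCThreshold is the timestamp up to which the GC queue has groomed
	// the transaction span; records at or below it may already be gone.
	TxnSpanGCThreshold int64
}

// checkBeginTransaction sketches the guard: a transaction whose original
// timestamp falls at or below the threshold must not be allowed to recreate
// its "initial" record, since an ABORTED record for it may already have been
// GC'ed and some of its intents aborted.
func checkBeginTransaction(state rangeState, txnOrigTimestamp int64) error {
	if txnOrigTimestamp <= state.TxnSpanGCThreshold {
		return fmt.Errorf("txn record may have been GC'ed (orig=%d <= threshold=%d); abort",
			txnOrigTimestamp, state.TxnSpanGCThreshold)
	}
	return nil
}

func main() {
	state := rangeState{TxnSpanGCThreshold: 100}
	fmt.Println(checkBeginTransaction(state, 90))  // error: must abort
	fmt.Println(checkBeginTransaction(state, 150)) // <nil>: safe to begin
}
```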
I marked three items in the TODO list above as completed: #9359 makes sure RangeLookup's callers always get what they want; #9377 makes sure that BeginTransaction fails and correctly marks the push-created Transaction record as active (before, it had a zero activity timestamp, so the GC queue would nuke it immediately); #9374 makes sure splits and replica changes anchor their transaction entry first (which should lead to fewer aborts). @spencerkimball, I think I recall you thinking about this remaining one below - is that accurate? Perhaps file yourself an issue so that we can wrap this one up:
We've discarded the idea of having DistSender order BeginTransaction to the front, so I checked that box as well.
Filed #9399, which I believe allows me to close this.
Throughput is zero on all nodes. We get the following log messages on all nodes:
This log was added to catch potential endless retry loops in DistSender, and it appears to have found one. A quick search shows that range 9194 is the only one mentioned in this type of message across the cluster:
The first insight is that the error is correct. From
curl -Ok https://beta.gce.cockroachdb.com:8080/_status/raft
we get information about range 9194 (which seems healthy). Combined with the above log message, we see that the request's key doesn't appear to be contained in the descriptor of 9194 (and the descriptor of 9194 which DistSender uses is the same).
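As an aside, the guard mentioned above ("added to catch potential endless retry loops in DistSender") boils down to counting attempts in the retry loop. A hedged sketch with made-up names and thresholds, not the actual DistSender code:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// sendWithRetries sketches the idea behind that log line: keep retrying on
// descriptor mismatches, but once one request has looped suspiciously many
// times, log loudly so an endless loop becomes visible instead of silent.
func sendWithRetries(key string, attemptOnce func() error, maxAttempts int) error {
	const warnAfter = 10 // illustrative threshold
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = attemptOnce(); err == nil {
			return nil
		}
		if attempt >= warnAfter {
			log.Printf("have been retrying range lookup for key %q for %d attempts: %v",
				key, attempt, err)
		}
		// In the real code the cached descriptor would be evicted here before
		// the next attempt; that part is elided in this sketch.
	}
	return err
}

func main() {
	err := sendWithRetries("someKey", func() error {
		return errors.New("key outside of bounds of range") // simulated mismatch
	}, 12)
	fmt.Println(err)
}
```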
Candidates: