stability: gap in meta2 addressing records on beta cluster #9265

Closed
tbg opened this issue Sep 10, 2016 · 30 comments

@tbg
Member

tbg commented Sep 10, 2016

Throughput is zero on all nodes. We get the following log messages on all nodes:

W160910 21:27:18.984763 168 kv/dist_sender.go:695  [n1] 640 retries for an RPC at sequence 2, last error was: key "/Table/51/1/8735125081319803271/\"ca595ccf-c7a8-4d96-84c4-11d63b553261\"/183430/0" not contained in range lookup's resulting decriptor range_id:9194 start_key:"\273\211\375y9\265\036T\207\322\333\022c4bc6ab8-bc1f-433a-9c5a-0f76301bda5c\000\001\370\024\307\211" end_key:"\273\211\375y=\262H-\233\363\212\0225d03b71f-7582-432c-88a6-1c695a3f4fa4\000\001\370h8\327" replicas:<node_id:10 store_id:10 replica_id:95 > replicas:<node_id:2 store_id:2 replica_id:96 > replicas:<node_id:3 store_id:3 replica_id:97 > next_replica_id:98 , remaining key ranges {"/Table/51/1/8735125081319803271/\"ca595ccf-c7a8-4d96-84c4-11d63b553261\"/183430/0" "/Table/51/1/8735125081319803271/\"ca595ccf-c7a8-4d96-84c4-11d63b553261\"/183430/4/1/NULL"}: BeginTransaction [/Table/51/1/8735125081319803271/"ca595ccf-c7a8-4d96-84c4-11d63b553261"/183430/0,/Min), ConditionalPut [/Table/51/1/8735125081319803271/"ca595ccf-c7a8-4d96-84c4-11d63b553261"/183430/0,/Min), ConditionalPut [/Table/51/1/8735125081319803271/"ca595ccf-c7a8-4d96-84c4-11d63b553261"/183430/4/1,/Min), EndTransaction [/Table/51/1/8735125081319803271/"ca595ccf-c7a8-4d96-84c4-11d63b553261"/183430/0,/Min)

This log was added to catch potential endless retry loops in DistSender, and it appears to have found one. A quick search shows that range 9194 is the only one mentioned in this type of message across the cluster:

pssh -ih hosts/beta 'grep -Eo '\''key .* not contained in range lookup.s resulting decriptor range_id..*next_replica_id'\'' logs/cockroach.stderr | awk '\''{print $10}'\' | sort | uniq -c
   1 [10] 17:45:55 [SUCCESS] [email protected]
   1 [1] 17:45:54 [SUCCESS] [email protected]
   1 [2] 17:45:55 [SUCCESS] [email protected]
   1 [3] 17:45:55 [SUCCESS] [email protected]
   1 [4] 17:45:55 [SUCCESS] [email protected]
   1 [5] 17:45:55 [SUCCESS] [email protected]
   1 [6] 17:45:55 [SUCCESS] [email protected]
   1 [7] 17:45:55 [SUCCESS] [email protected]
   1 [8] 17:45:55 [SUCCESS] [email protected]
   1 [9] 17:45:55 [SUCCESS] [email protected]
21939 range_id:9194

The first insight is that the error is correct. From curl -Ok https://beta.gce.cockroachdb.com:8080/_status/raft we get information about range 9194 (which seems healthy). Combined with the above log message we see

/Table/51/1/8735125081319803271/\"ca595ccf-c7a8-4d96-84c4-11d63b553261\"/183430/0 # BatchRequest's currently looked up key
/Table/51/1/8735212094133621467/\"c4bc6ab8-bc1f-433a-9c5a-0f76301bda5c\"/1361801 # 9194's StartKey
/Table/51/1/8736334875241214858/\"5d03b71f-7582-432c-88a6-1c695a3f4fa4\"/6830295 # 9194's EndKey

The request's key is indeed not contained in the descriptor of 9194 (and the copy of that descriptor which DistSender uses is the same).

Candidates:

@tbg tbg added the S-1-stability label (severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting) Sep 10, 2016
@tbg tbg assigned tbg and tamird Sep 10, 2016
@tbg
Member Author

tbg commented Sep 10, 2016

The error message indicates that this is a lookupMismatchError. This is returned when the range cache looks up a descriptor but finds that the requested key isn't covered by the result. It should be straightforward to simulate this RangeLookup against the dataset and to find out what's going wrong. #2266 might well be responsible for this.

    // It rarely may be possible that we somehow got grouped in with the
    // wrong RangeLookup (eg. from a double split), so if we did, return
    // a retryable lookupMismatchError with an unmodified eviction token.
    if res.desc != nil {
        if (!useReverseScan && !res.desc.ContainsKey(key)) || (useReverseScan && !res.desc.ContainsExclusiveEndKey(key)) {
            return nil, evictToken, lookupMismatchError{
                desiredKey:     key,
                mismatchedDesc: res.desc,
            }
        }
    }
    return res.desc, res.evictToken, res.err

@tbg
Member Author

tbg commented Sep 10, 2016

We stopped node 9 and looked at the meta2 entries (range 1 hasn't split, so all the addressing is on a single range):

sudo ./cockroach debug keys --from 'raw:\x02' --to 'raw:\x04' --values /mnt/data/

The (newest) meta2 entry which refers to range 9194 seems identical to the one we grabbed from the raft status page. Since the request's key was smaller than the start of the key range covered by that descriptor, we looked at the meta2 entry preceding 9194's (again, its newest version).

meta2 before 9194 newest version
1473537923.156133391,0 /Meta2/Table/51/1/8734115388983094829/"5d03b71f-7582-432c-88a6-1c695a3f4fa4"/1998779: [
/Table/51/1/8732974589394595481/"346a415f-4ad6-4d08-a881-4dfa61cb31f1"/629155, # Start
/Table/51/1/8734115388983094829/"5d03b71f-7582-432c-88a6-1c695a3f4fa4"/1998779) # End

meta2 for 9194 newest version
1473537903.547553355,0 /Meta2/Table/51/1/8736334875241214858/"5d03b71f-7582-432c-88a6-1c695a3f4fa4"/6830295: [
/Table/51/1/8735212094133621467/"c4bc6ab8-bc1f-433a-9c5a-0f76301bda5c"/1361801, # Start
/Table/51/1/8736334875241214858/"5d03b71f-7582-432c-88a6-1c695a3f4fa4"/6830295) # End

Notice that there's a gap in the meta2 addressing, which is an anomalous condition: the preceding entry's EndKey (it belongs to range 6111) isn't identical to 9194's StartKey (it's smaller).

The range before 6111 (range_id:2884) exhibits no such gap:

# 2884
1473537306.684272016,0 /Meta2/Table/51/1/8732974589394595481/"346a415f-4ad6-4d08-a881-4dfa61cb31f1"/629155: [
/Table/51/1/8731822795255624172/"8931c524-d235-4865-93c6-c9ec27b78b3d"/1061334, # Start
/Table/51/1/8732974589394595481/"346a415f-4ad6-4d08-a881-4dfa61cb31f1"/629155) # End

# 6111
1473537923.156133391,0 /Meta2/Table/51/1/8734115388983094829/"5d03b71f-7582-432c-88a6-1c695a3f4fa4"/1998779: [
/Table/51/1/8732974589394595481/"346a415f-4ad6-4d08-a881-4dfa61cb31f1"/629155, # Start
/Table/51/1/8734115388983094829/"5d03b71f-7582-432c-88a6-1c695a3f4fa4"/1998779) # End

@tbg
Member Author

tbg commented Sep 10, 2016

We found the missing range. https://gist.github.com/tschottdorf/fe956827f7c97e55f9b76e382492fd4b

We've stopped beta to avoid further damage (the GC queue won't check that the correct descriptor is returned from RangeLookup and will delete based only on StoreIDs). It shouldn't be restarted for that same reason.

Unfortunately, it seems the damage is already (mostly) done. There's only one replica remaining (on node 5), but it is legitimately due to be GC'ed - the descriptor reads nodes 3, 2, and 6. The last lease this replica saw was on 2016-09-01 03:34:37.362648681 +0000 UTC.

@tbg
Member Author

tbg commented Sep 11, 2016

Action items so far:

  • make debug keys output information about each version (esp. whether the top-level value is a deletion, or clarify whether those are being output)
  • integrity checks for meta{1,2} (see the sketch after this list)
  • fix replicaGCQueue so that it doesn't go haywire in this situation in the future
  • consider always leaving tombstones for RangeDescriptors (we've hit a dead end here since older entries have been GCed, so we can't trace the history of the problematic range)
  • consider a general "read what you wrote" policy/mode (on the assumption that we mess up much more often than we realize, and don't find out unless it hits a vital part of the system).
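
For reference, a minimal sketch of the kind of meta{1,2} integrity check meant above, using stand-in types rather than the real roachpb ones (fragment; imports elided):

    // Addressing records are keyed on EndKey, so when the decoded descriptors
    // are walked in key order, each descriptor's StartKey must equal the
    // previous descriptor's EndKey. How the descriptors are collected (debug
    // tooling, a scan, a dedicated queue) is left open here.
    type Key string

    type RangeDescriptor struct {
        RangeID  int64
        StartKey Key
        EndKey   Key
    }

    func checkMetaContinuity(descs []RangeDescriptor) []error {
        var errs []error
        for i := 1; i < len(descs); i++ {
            prev, cur := descs[i-1], descs[i]
            if prev.EndKey != cur.StartKey {
                errs = append(errs, fmt.Errorf(
                    "meta2 gap/overlap: r%d ends at %q but r%d starts at %q",
                    prev.RangeID, prev.EndKey, cur.RangeID, cur.StartKey))
            }
        }
        return errs
    }

Run over the records dumped above, a check like this would flag exactly the discontinuity between 6111's and 9194's entries.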

@petermattis
Collaborator

Does system.rangelog contain anything interesting? It is supposed to contain a history of all the range operations. If it isn't helpful here, when would it be helpful?

@tamird
Contributor

tamird commented Sep 11, 2016

We didn't look -- I forgot about rangelog (we checked eventlog instead). We're away from computers for the night, but I think you can bring the cluster back up and poke at it. Consider backing up node 5 or at least its replica of 1495 -- there may be interesting history in its range-local data.

@bdarnell
Contributor

For those following along, the fact that the meta2 range has not split rules out #2266 as the culprit.

I've pulled all the replicas of meta2, and range 1495 is missing from all of them (so it's not a case of replica divergence as in #5291 or #7224. If it were, the consistency checker should have caught that too).

So either the meta2 records for range 1495 were never properly created, or they were created and then somehow destroyed. The range was rebalanced multiple times (each of which would have rewritten the addressing records), which suggests that it was working for a while. How, then, were the addressing records deleted? It's not through normal replica GC, which leaves the addressing records alone. The only time we intentionally delete from meta2 is in AdminMerge, and that's not supposed to happen anywhere (although it is technically exposed through the API...)

@tamird
Contributor

tamird commented Sep 11, 2016

The range was rebalanced multiple times

How can you tell? The replica IDs for that range are in the 20s, but all those rebalances could have happened before the fateful split that went awry, creating 9194 and "destroying" 1495 by not writing its new meta2 record.

@bdarnell
Contributor

I didn't say anything about whether the rebalances happened before or after the destruction of the meta records, only that the fact that the rebalances occurred at all means that range 1495 was created and subsequently destroyed, instead of never being created in the first place.

@tamird
Contributor

tamird commented Sep 11, 2016

OK. Just to clarify, my working hypothesis is that 1495 once covered what is now [1495, 9194], and then it split, and the existing meta records were overwritten with 9194's descriptor, and the new meta records for the now-shrunken 1495 were never written. Is that what you had in mind? I'm confused by your use of "destroyed".

@bdarnell
Contributor

Ah, I see what you're saying. The correct location for 1495's address records changed, so they could have been written (correctly), rebalanced (correctly), and then overwritten (correctly), and the incorrect part is that they were never written to their new home. I was using "destroyed" as in "deleted", which isn't supposed to happen except in the case of a merge. A split that fails to update the addressing records for the LHS makes more sense.

This does resemble #5291 even though the end result is different, since our theory there involved intent resolution being weird during a split.

@petermattis
Collaborator

Any idea when the cluster entered this problematic state? From what I can discern from block_writer.stdout on one of the nodes, block_writer has been seeing no throughput for days.

@tbg
Member Author

tbg commented Sep 11, 2016

The range log should tell. We weren't able to figure it out yesterday from any of the histories (because GC) or the block writer logs (because rotation) or grafana (because erratic).

@tbg tbg changed the title from "stability: block_writers on beta stuck in DistSender talking to wrong Replica" to "stability: gap in meta2 addressing records on beta cluster" Sep 11, 2016
@tbg
Member Author

tbg commented Sep 12, 2016

I'm collecting some of the interesting pieces of data (via cockroach debug keys): all of meta2 and all of the ranges mentioned by id (id-keyed data only), also all the range descriptors on all nodes. Then I'll start the cluster and try to extract the range log table.

@tbg
Member Author

tbg commented Sep 13, 2016

Uploading to gs://cockroach-backup/meta2-bug-9265/.

@bdarnell
Contributor

bdarnell commented Sep 13, 2016

OK, I think I've figured out what happened (or at least I've found a bug that could have caused this; I hope we don't have more than one of these). This comes from staring at the code; I haven't verified this against any of the data salvaged from gamma (and I'm not sure if there's any evidence that would remain).

TL;DR: a transaction which (A) issues multiple writes in its first batch and (B) does so with out-of-order keys may race with both the gcQueue and another operation, causing its first writes to be silently dropped. The transactions used by AdminSplit and ChangeReplicas both meet these criteria (as noted in #9287), but if the race occurs in ChangeReplicas the consequences are less dramatic (to my current understanding).

Tracing the bug:

  1. AdminSplit starts its transaction. It prepares a batch containing the following requests:
    • [BeginTransaction, CPut(left range descriptor), CPut(right range descriptor), Put(left meta2 record), Put(right meta2 record)]
  2. DistSender splits the batch by ranges, and prepares to execute them in lexicographic order (so the meta2 keys come first):
    • [Noop, Noop, Noop, Put(left meta2), Put(right meta2)]
    • [BeginTransaction, CPut(left range desc), CPut(right range desc), Noop, Noop]
  3. DistSender sends the first batch. Then we context switch and something else happens before the second batch is executed.
  4. Something reads from the meta2 ranges and encounters the transaction's write intents. (that something could even be the split transaction itself, as it's looking up where to send the second batch!) In this case it reads the left meta2 record but not the right one. The read can proceed (because RangeLookups are inconsistent), but we still kick off the processIntentsAsync process, which calls PushTxn in PUSH_TOUCH mode in case the intents are orphaned and need to be cleaned up.
  5. PUSH_TOUCH is only supposed to clean up transactions that are either expired, aborted, or committed. However, there is one case where it will forcibly abort a transaction: if the BeginTransaction command has not yet run and there is no transaction record on disk.
  6. Once PushTxn returns, processIntentsAsync will delete the intent on the left meta2 record.
  7. So far, this is basically fine. As described in #9287 (storage: Range addressing updated too soon), it's sub-optimal to abort transactions like this, but the BeginTransaction command will see the aborted transaction, return an error, and start over. Everything is consistent.
  8. But now the rare part of the race occurs: gcQueue processes the replica. gcQueue assumes that absent transaction records are treated as aborted, so it will GC the transaction record that was just written by PushTxn.
  9. That assumption is true except in one case: BeginTransaction. BeginTransaction simply recreates the transaction record as PENDING and continues.
  10. The rest of the split transaction executes normally and commits. When the committed transaction tries to clean up its intents, there is a silent failure on the left meta2 key.
  11. Now the left-hand range has been removed from meta2, so its replicas start destroying themselves in replicaGCQueue.

The fix, I think, is simple: when PushTxn aborts a non-existent transaction, it should write to the abort cache instead of creating the transaction record. This has the same effect, but a different GC policy, so the abort cache entry will still be there by the time BeginTransaction is executed.
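
For illustration, a rough sketch of the shape of that fix; the type and method names here (txnRecords, abortCache, onPushOfMissingTxnRecord) are hypothetical stand-ins, not the actual storage package API:

    // If a push (e.g. a PUSH_TOUCH triggered by intent cleanup) finds no
    // transaction record for the pushee, record the abort in the abort cache,
    // whose GC policy outlives this window, instead of synthesizing an ABORTED
    // transaction record that the gcQueue may delete before the pushee's
    // BeginTransaction arrives.
    type txnRecords interface {
        Get(txnID string) (rec interface{}, ok bool)
    }

    type abortCache interface {
        Put(txnID string) error
    }

    func onPushOfMissingTxnRecord(txnID string, recs txnRecords, ac abortCache) error {
        if _, ok := recs.Get(txnID); ok {
            return nil // a record exists; normal push handling applies
        }
        // Before the fix: an ABORTED transaction record is written here. The
        // gcQueue can then GC that record (absent records are treated as
        // aborted), after which a late BeginTransaction recreates the txn as
        // PENDING and the transaction can commit while silently losing the
        // writes whose intents were already cleaned up.
        //
        // After the fix: leave no transaction record and poison the abort
        // cache instead, so the late BeginTransaction still observes the abort.
        return ac.Put(txnID)
    }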

More related action items:

  • Split up the first batch of these transactions so they don't hit this race at all (#9287, storage: Range addressing updated too soon)
  • When MVCCResolveIntent is asked to commit (rather than abort) an intent and no intent is found in the MVCC metadata, it should verify that there is a value at the transaction's timestamp, and fail loudly if not.
  • Verify that if a BeginTransaction command is delayed so long that an abort cache entry could have been GC'd, the transaction fails. We check for a SQL-level deadline in EndTransaction, but I don't think we strictly enforce this for BeginTransaction.
  • Alternately, DistSender could guarantee that it executes BeginTransaction before any other writes (similar to the way it guarantees that EndTransaction comes last). This would remove a number of surprising edge cases.
  • replicaGCQueue should have more sanity checks on the results of its range lookup. Here, replicas of the left range were destroying themselves because their RangeLookups told them that they were not members of the RHS. If replicaGCQueue noticed that the range IDs didn't match, we would at least still have the three replicas of the LHS range and would have an easier time recovering. If we had gone any longer without investigating, we might not have been able to find a single replica of range 1495. A sketch of such a check follows this list.
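
A small sketch of that last check, with hypothetical names (the real queue works with roachpb descriptors and replica/store IDs; imports elided):

    // Before a replica destroys itself because a RangeLookup says its store is
    // not a member, confirm the returned descriptor actually describes this
    // replica's range. In this incident, lookups by r1495's replicas came back
    // with r9194's descriptor, and the replicas removed themselves anyway.
    func shouldGCReplica(myRangeID int64, myStoreID int32, descRangeID int64, descStoreIDs []int32) (bool, error) {
        if descRangeID != myRangeID {
            // The addressing records no longer point at our range at all:
            // refuse to self-destruct and surface the inconsistency instead.
            return false, fmt.Errorf("range lookup returned r%d, expected r%d", descRangeID, myRangeID)
        }
        for _, storeID := range descStoreIDs {
            if storeID == myStoreID {
                return false, nil // still a member; leave the replica alone
            }
        }
        return true, nil // confirmed removed from this range's current descriptor
    }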

@spencerkimball
Member

Wow, nice sleuthing. I think all of the related action items are good, except I'm wary about having DistSender guarantee execution of BeginTransaction before any other writes, unless we think there are still holes in the suggested fixes. If we separate out BeginTransaction it will add a synchronous raft consensus, guaranteeing any distributed txn has three serial raft consensus round trips.

@a-robinson
Contributor

Great find, Ben.

@tamird
Contributor

tamird commented Sep 13, 2016

Rather than the last action item, perhaps the semantics of RangeLookup should be such that it sanity checks the result? In other words, the elements of RangeLookupResponse.Ranges should be guaranteed to be contiguous and to fully cover RangeLookupRequest.Span.

@bdarnell
Contributor

RangeLookup currently takes a point, not a span. Changing it to take a span makes some sense, although I'm not sure what we'd get if we start pulling on that thread.

@tbg
Member Author

tbg commented Sep 13, 2016

Great job @bdarnell, what a perfect storm. I can sleep again at night.

@tbg
Member Author

tbg commented Sep 13, 2016

Verify that if a BeginTransaction command is delayed so long that an abort cache entry could have been GC'd, the transaction fails. We check for a SQL-level deadline in EndTransaction, but I don't think we strictly enforce this for BeginTransaction

^- Going to grab this one

@tamird
Contributor

tamird commented Sep 13, 2016

@bdarnell right. Rephrasing my earlier comment:

Rather than the last action item, perhaps the semantics of RangeLookup should be such that it sanity checks the result? In other words, the elements of RangeLookupResponse.Ranges should be guaranteed to be contiguous and the first of them guaranteed to include RangeLookupRequest.Span.Key.
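
Concretely, a sketch of that guarantee as a check on the response, reusing the Key/RangeDescriptor stand-ins from the sketch earlier in the thread (illustrative only, not the eventual implementation; imports elided):

    // The descriptors returned by a (forward) RangeLookup must be contiguous,
    // and the first one must contain the requested key. A reverse lookup would
    // check the exclusive end key instead, as in the DistSender snippet quoted
    // near the top of this issue.
    func validateRangeLookup(reqKey Key, descs []RangeDescriptor) error {
        if len(descs) == 0 {
            return errors.New("range lookup returned no descriptors")
        }
        if first := descs[0]; reqKey < first.StartKey || reqKey >= first.EndKey {
            return fmt.Errorf("lookup key %q not contained in first descriptor [%q, %q)",
                reqKey, first.StartKey, first.EndKey)
        }
        for i := 1; i < len(descs); i++ {
            if descs[i-1].EndKey != descs[i].StartKey {
                return fmt.Errorf("descriptors %d and %d in the response are not contiguous", i-1, i)
            }
        }
        return nil
    }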

@tbg
Member Author

tbg commented Sep 13, 2016

The fix, I think, is simple: when PushTxn aborts a non-existent transaction, it should write to the abort cache instead of creating the transaction record. This has the same effect, but a different GC policy, so the abort cache entry will still be there by the time BeginTransaction is executed.

^- including this one.

@bdarnell
Contributor

@tamird Yeah, sanity-checking like that sounds good.

@tbg
Member Author

tbg commented Sep 14, 2016

In case anyone else is wondering, @spencerkimball and I just realized something else was exacerbating this. The Txn should only be GCed after an hour of inactivity, because of this code:

// LastActive returns the last timestamp at which client activity definitely
// occurred, i.e. the maximum of OrigTimestamp and LastHeartbeat.
func (t Transaction) LastActive() hlc.Timestamp {
    candidate := t.OrigTimestamp
    if t.LastHeartbeat != nil && candidate.Less(*t.LastHeartbeat) {
        candidate = *t.LastHeartbeat
    }
    return candidate
}

We looked at the transaction entries which are created in the Push, and lo and behold, they don't set OrigTimestamp or LastHeartbeat, only Timestamp. @spencerkimball will be fixing that part.
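
The shape of that fix, sketched as a fragment against the Transaction fields quoted above (illustrative only; the surrounding PushTxn plumbing is omitted and the exact field usage is an assumption):

    // When PushTxn synthesizes a record for a pushee it has never seen, stamp a
    // field that LastActive() actually inspects, so the GC queue sees recent
    // activity rather than a zero timestamp.
    func synthesizeAbortedTxnRecord(now hlc.Timestamp, txn *Transaction) {
        txn.Timestamp = now
        // Previously only Timestamp was set, so LastActive() returned zero and
        // the record was eligible for GC immediately. Setting OrigTimestamp (or
        // LastHeartbeat) makes the one-hour inactivity window apply.
        txn.OrigTimestamp = now
    }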

@spencerkimball
Member

This bug is really the perfect storm.

tbg added a commit to tbg/cockroach that referenced this issue Sep 14, 2016
Fix the main issue in cockroachdb#9265. When a transaction is aborted before
having written its initial transaction record, intents written by the
transaction before creating its record may be aborted.

If the GC queue deletes that ABORTED entry, there must be a mechanism
which prevents the original transaction from creating what it considers
the "initial" transaction record, or that transaction could commit
successfully but lose some of its writes.

Such a mechanism is added here: the GC queue keeps track of the threshold
used for grooming the transaction span and communicates this into the
replicated Range state, where BeginTransaction can check it and return
an appropriate error.

Fixes cockroachdb#9265.
@tbg
Member Author

tbg commented Sep 15, 2016

I marked three items in the TODO list above as completed:

#9359 makes sure RangeLookup's callers always get what they want; #9377 makes sure that a delayed BeginTransaction fails and correctly marks the push-created Transaction record as active (previously it had a zero activity timestamp, so the GC queue would nuke it immediately); #9374 makes sure splits and replica changes anchor their transaction entry first (which should lead to fewer aborts).

@spencerkimball, I think I recall you thinking about this remaining one below - is that accurate? Perhaps file yourself an issue so that we can wrap this one up:

When MVCCResolveIntent is asked to commit (rather than abort) an intent and no intent is found in the MVCC metadata, it should verify that there is a value at the transaction's timestamp, and fail loudly if not.

We've discarded the idea of having DistSender order BeginTransaction to the front, so I checked that box as well.

@tbg
Member Author

tbg commented Sep 15, 2016

Filed #9399 which I believe allows me to close this.
