Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: propagate errors from contentionQueue, catch stalls in roachtest #37199

Merged
merged 7 commits into from
Jul 8, 2019

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Apr 30, 2019

Informs #36089.

The PR is split into a series of commits. The first fixes part of a bug that was causing #36089 to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089.

Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the contentionQueue.

The second commit bumps the TxnLivenessThreshold for clusters running kv/contention/nodes=4 to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement.

The fourth commit instructs pushers in the txnwait.Queue to inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations.

@nvanbenschoten nvanbenschoten requested review from ajwerner and a team April 30, 2019 02:20
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten
Copy link
Member Author

NOTE: I saw this get stuck once after bumping the TxnLivenessThreshold in the roachtest. That indicates that there is another stall lingering. I'm not planning on merging this PR until I gain a better understanding of what's going on there.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! I know you're not planning on merging this yet but :lgtm:

Reviewed 3 of 3 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@nvanbenschoten
Copy link
Member Author

I think I tracked down the other stall here. It has to do with a rare interleaving of requests in the contentionQueue. I found this all by looking at the /debug/requests page after bumping the page's maxActiveTraces to a more reasonable 300. The expanded view of the active traces revealed the following deadlock:

txn 8703f023
    pushing ee447be3
    behind f081497c in the queue
    stuck after noticing pushee aborted
        waiting for f081497c to notice
    aborted itself

txn ee447be3
    pushing 8703f023
    behind f081497c in the queue
    stuck after noticing pushee aborted
        waiting for f081497c to notice
    aborted itself

txn f081497c
    pushing both 8703f023 and ee447be3
    front of queue in both
    stuck pushing both txns

all other txns: stuck on f081497c

The first two transactions are stuck in the contentionQueue after observing each other as aborted. The don't realize that they themselves are aborted because transactions in the contentionQueue don't query their own disposition. They also don't immediately exit the queue when they determine that the front of the queue is aborted. Instead, they wait on their predecessor in the queue to notice to avoid a thundering herd.

So what's going wrong? Why isn't this resolving itself?

If 8703f023 and ee447be3 are aborted, why isn't f081497c noticing?

It is stuck in the TxnWaitQueue, waiting for updates to either of these transactions. Specifically, it is waiting for one of two things:

  1. updates to its pushees' transaction records
  2. its pushees' expirations, which are set to 5m!
Why aren't there any updates to 8703f023 and ee447be3's transaction records

They are stuck in the contentionQueue and haven't noticed that they are aborted. This hasn't been propagated back to their transaction coordinator so no one has rolled back their transaction records.

Why isn't deadlock detection working?

Because both 8703f023 and ee447be3 are already aborted, so the two transactions aren't sitting in each other's TxnWaitQueues.

Why do 8703f023 and ee447be3 think each other are aborted but f081497c is still in the TxnWaitQueue?

There actually appear to be two ways for us to get in this deadlock scenario. The first is due to a timestamp cache rotation. Imagine that 8703f023 and ee447be3 queried each other's transaction records after f081497c queried their transaction records. Between this time, nothing was written. However, there was a timestamp cache rotation! Each of these transactions would consider the other aborted because of ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. If the abort was noticed during a QueryTxnRequest, no one would notify f081497c, who is still sitting in the TxnWaitQueue, and it would be stuck there until the pushee expired. We can fix this case by informing the TxnWaitQueue if a QueryTxnRequest ever decides that a transaction is aborted.

The second way to get into this deadlock scenario is more tricky, but it shows up more in this workload. A contention point when designing #33523 was deciding how pushes of transactions originating from intents with txnMeta.timestamp >> txn.OrigTimestamp. The problem is that when a transaction is aborted, the write timestamp cache is only bumped to its current timestamp. However, that transaction can keep writing if it isn't immediately informed of the abort, potentially at higher timestamps. So if a contending transaction observes one of these intents with a higher timestamp and pushes, it may not realize that the transaction was already aborted because it compares its timestamp against the write timestamp cache and find that its timestamp is above the cache. This is discussed in the following three places:

// Determine whether the transaction record could ever actually be written
// in the future. We provide the TxnMeta's timestamp (which we read from an
// intent) as the upper bound on the transaction's minimum timestamp. This
// may be greater than the transaction's actually original epoch-zero
// timestamp, in which case we're subjecting ourselves to false positives
// where we don't discover that a transaction is uncommittable, but never
// false negatives where we think that a transaction is uncommittable even
// when it's not and could later complete.

and

// Performing this detection could have false positives where we determine
// that a record could still be written and conclude that we're in the first
// case. However, it cannot have false negatives where we determine that a
// record can not be written and conclude that we're in the second or third
// case. This is important, because it means that we may end up failing to
// push a finalized transaction but will never determine that a transaction
// is finalized when it still could end up committing.

and

// for the provided transaction information. Callers should provide an upper
// bound on the transaction's minimum timestamp across all epochs (typically
// the original timestamp of its first epoch). If this is not exact then the
// method may return false positives (i.e. it determines that the record could
// be created when it actually couldn't) but will never return false negatives
// (i.e. it will never determine that the record could not be created when it
// actually could).

It turns out that this scenario can cause issues when combined with the dependencies enforced by the contentionQueue. If the first transaction in the queue falls into this trap of falsely assuming a pushee is not aborted and later transactions don't, the later transactions can get stuck waiting on the first transaction to notice that the pushee has already been aborted. If the pushee itself is stuck in a second contentionQueue on the same situation then it will never rollback its transaction record and inform the first transaction in the queue that it is indeed aborted. So everything gets stuck. It turns out that it is not ok for some pushees to consider a transaction to be pending while other pushees consider a transaction to be aborted.

The solution to this isn't immediately clear. I see two options at the moment:

  1. we begin storing the EpochZeroOrigTimestamp in intents. This allows all pushers to have perfect accuracy about whether or not a transaction has been aborted by any other pusher. In other words, it allows us to get rid of this nastiness. Interestingly, it would come for free with the proposal to change transaction IDs to be a combination of the epoch-zero-orig-timestamp and a gateway ID.
  2. we go back to always leaving an ABORTED transaction record before a pushee can consider a transaction to be aborted. This feels safer (although there is still come complication around QueryTxn requests that find a missing txn record), but is also a large step backward away from a proposal like this. It also returns to letting pushers create transaction records as long as they are in the ABORTED state.

@tbg @andreimatei I'm curious whether either of you two has a strong preference here.

On the positive side, with QueryTxn requests updating the TxnWaitQueue when they observe a transaction to be aborted and with the first of these options implemented, the test seems to pass without any stalls (after about 6 hrs of running).

@tbg
Copy link
Member

tbg commented May 1, 2019

Great writeup! Before I dive in, I'm curious if any of the stuff @darinpp added to the txnWaitQ helped (#35921) or would've helped figure out where "something" was stuck. (Or, if you repro this scenario again, I'd be grateful if you gave it a try and reported back, in particular the "have been waiting" log message).

I just took a look at the contentionQ code to remind myself how it works. Oof. I'd be surprised if this were the last bug.

There actually appear to be two ways for us to get in this deadlock scenario. The first is due to a timestamp cache rotation. Imagine that 8703f023 and ee447be3 queried each other's transaction records after f081497c queried their transaction records. Between this time, nothing was written. However, there was a timestamp cache rotation!

We query the pushee periodically, but I just realized that this is based on the TxnExpiration, which is artificially high in this test. Just pointing this out for anyone else wondering.

I like option 1. We can optimize for the common case in which Timestamp == Epoch0Timestamp by adding the convention that the epoch 0 timestamp defaults to the timestamp to avoid writing more than we do today (if we also decide to make the field nullable).

@nvanbenschoten
Copy link
Member Author

I'm curious if any of the stuff @darinpp added to the txnWaitQ helped (#35921) or would've helped figure out where "something" was stuck. (Or, if you repro this scenario again, I'd be grateful if you gave it a try and reported back, in particular the "have been waiting" log message).

The "have been waiting" log message in txnqueue.go was very helpful and helped point me at the offending transactions once I knew to look for them. The metrics themselves would have pointed me at the blockage, but I'm not sure whether they would have helped in the actual debugging. The problem with these kinds of issues is that most of the questions a debugger would like to ask are "what sequence of events led to this situation?", not "what situation are we in now that we're stalled?".

I also think the reason why the "have been waiting" log message was more useful than the metrics is because the log message actually includes transaction ids. Without that, it's not really possible to come to any conclusions about the state of a deadlock. With that in mind, a "have been waiting" log message in contentionQueue.go would be helpful.

I like option 1. We can optimize for the common case in which Timestamp == Epoch0Timestamp by adding the convention that the epoch 0 timestamp defaults to the timestamp to avoid writing more than we do today (if we also decide to make the field nullable).

👍 I was leaning that way myself. It will be really nice to remove this uncertainty. I don't think the migration will be too difficult either, as the new field will need to be optional no matter what.

@tbg
Copy link
Member

tbg commented May 1, 2019

The "have been waiting" log message in txnqueue.go was very helpful and helped point me at the offending transactions once I knew to look for them. The metrics themselves would have pointed me at the blockage, but I'm not sure whether they would have helped in the actual debugging. The problem with these kinds of issues is that most of the questions a debugger would like to ask are "what sequence of events led to this situation?", not "what situation are we in now that we're stalled?".

👍 that's about as good as I hoped it was going to get. The metrics can't do the debugging for you but looks like they gave you a good starting point to logspy in.

@andreimatei
Copy link
Contributor

I like option 1 too. Option 2 - writing more ABORTED txn records - seems indeed that we'd be going backwards. I hope we get to where we never write any ABORTED records.

@tbg
Copy link
Member

tbg commented Jun 4, 2019

Friendly ping, was just looking through our roachtests and this seems like a nice wart to iron out.

@nvanbenschoten
Copy link
Member Author

Yep, this is on my plate for after we stabilize parallel commits. The two additional items that I need to polish off are:

  1. QueryTxn requests update the TxnWaitQueue when they observe a transaction to be aborted
  2. we begin storing the EpochZeroOrigTimestamp in intents. This allows all pushers to have perfect accuracy about whether or not a transaction has been aborted by any other pusher

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cqErr branch from 93cdbb8 to b3bc9fd Compare July 3, 2019 18:38
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cqErr branch from b3bc9fd to 541b3e3 Compare July 3, 2019 18:51
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 3, 2019
… observed

From cockroachdb#37199 (comment):

> There actually appear to be two ways for us to get in this deadlock scenario.
The first is due to a timestamp cache rotation. Imagine that 8703f023 and
ee447be3 queried each other's transaction records after f081497c queried their
transaction records. Between this time, nothing was written. However, there was
a timestamp cache rotation! Each of these transactions would consider the other
aborted because of ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. If the
abort was noticed during a QueryTxnRequest, no one would notify f081497c, who is
still sitting in the TxnWaitQueue, and it would be stuck there until the pushee
expired. We can fix this case by informing the TxnWaitQueue if a QueryTxnRequest
ever decides that a transaction is aborted.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cqErr branch from 541b3e3 to 01294fe Compare July 3, 2019 18:55
@nvanbenschoten
Copy link
Member Author

I've updated this PR to address the first additional item that we need to fix. The second one is larger, so I'm going to put that into a different PR. I've downgraded this to no longer say "Fixes" because it doesn't include the full fix.

PTAL @ajwerner.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cqErr branch from 01294fe to 1df7a72 Compare July 3, 2019 19:19
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Thanks for the great writeup. My mental model on the TxnWaitQueue is still fuzzier than I'd like but this change is straightforward enough. Are there any downsides to notifying all of the pushees upon a single abort?

Reviewed 3 of 5 files at r3, 1 of 2 files at r4, 2 of 2 files at r5, 2 of 3 files at r6, 10 of 10 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

Informs cockroachdb#36089.

Before this commit, requests could get stuck repeatedly attempting to
push a transaction only to repeatedly find that they themselves were
already aborted. The error would not propagate up to the transaction
coordinator and the request would get stuck.

This commit fixes this behavior by correctly propagating errors observed
by the contentionQueue.

Release note: None
…hold

Informs cockroachdb#36089.

This commit bumps the TxnLivenessThreshold for clusters running
`kv/contention/nodes=4` to 10 minutes. This is sufficiently large
such that if at any point a transaction is abandoned then all other
transactions will begin waiting for it and the test will fail to
achieve its minimum QPS requirement.

Release note: None
… observed

From cockroachdb#37199 (comment):

> There actually appear to be two ways for us to get in this deadlock scenario.
The first is due to a timestamp cache rotation. Imagine that 8703f023 and
ee447be3 queried each other's transaction records after f081497c queried their
transaction records. Between this time, nothing was written. However, there was
a timestamp cache rotation! Each of these transactions would consider the other
aborted because of ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. If the
abort was noticed during a QueryTxnRequest, no one would notify f081497c, who is
still sitting in the TxnWaitQueue, and it would be stuck there until the pushee
expired. We can fix this case by informing the TxnWaitQueue if a QueryTxnRequest
ever decides that a transaction is aborted.

Release note: None
The test was already called queue_test.go and package is
called txnwait, so the prefix on the name is unnecessary.

Only code movement.

Release note: None
We had three different defaultHeartbeatInterval constants and it was
unclear which values corresponded to which heartbeats without reading
their documentation. This commit renames these constants to avoid confusion.

Release note: None
Until cockroachdb#36089 is fixed, we know this will fail regularly. This is
especially true now that we extended TxnLivenessThreshold.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cqErr branch from 1df7a72 to 82aeef4 Compare July 8, 2019 15:50
@nvanbenschoten
Copy link
Member Author

Are there any downsides to notifying all of the pushees upon a single abort?

I don't think so. Once the pushee is aborted, all of the pushers would like to know ASAP.

Thanks for the review.

bors r+

craig bot pushed a commit that referenced this pull request Jul 8, 2019
37199: storage: propagate errors from contentionQueue, catch stalls in roachtest r=nvanbenschoten a=nvanbenschoten

Informs #36089.

The PR is split into a series of commits. The first fixes part of a bug that was causing #36089  to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089.

Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the `contentionQueue`.

The second commit bumps the TxnLivenessThreshold for clusters running `kv/contention/nodes=4` to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement.

The fourth commit instructs pushers in the `txnwait.Queue` to inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations.

38397: sql: deflake TestLogic//crdb_internal/max_retry_counter r=knz a=knz

Fixes #38062.

Release note: None

38654: exec: Handle NULLS in TopK sorter r=rohany a=rohany

This commit fixes NULLs in the TopK sorter by avoiding use
of the vec copy method, which has a bug. Instead, we add
a set method to the vec comparator, and use the templatized
comparator to perform the sets that the TopK sorter needs.

To facilitate this, we add an UnsetNull method to the Nulls
object. However, use of this method results in HasNull()
maybe returning true even if the vector doesn't have nulls.
This behavior already occurs when selection vectors are used.
Based on discussions with @solongordon and @asubiotto, this behavior
is OK, and future PR's will attempt to make this behavior better, and address
the bugs within the Vec Copy method.

38725: cli/dump: more clearly inform the user upon tables with no visible columns r=knz a=knz

Informs #37768.
Informs #28948.
This is coming up quite often on support, lately again on gitter and forum https://forum.cockroachlabs.com/t/error-while-dumping-core-backup/2901/3.

This PR aims to lessen the burden on support and propose a clear "next action" for the user.

Before:

```
kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb
CREATE TABLE t (,
        FAMILY "primary" (rowid)
);
Error: pq: at or near "from": syntax error
Failed running "dump"
```

After:

```
kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb
CREATE TABLE t (,
        FAMILY "primary" (rowid)
);
Error: table "defaultdb.public.t" has no visible columns
HINT: To proceed with the dump, either omit this table from the list of tables to dump, drop the table, or add some visible columns.
--
See: #37768
Failed running "dump"
```

Release note (cli change): `cockroach dump` will now more clearly
refer to issue #37768 when it encounters a table with no visible
columns, which (currently) cannot be dumped successfully.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rohan Yadav <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jul 8, 2019

Build succeeded

@craig craig bot merged commit 82aeef4 into cockroachdb:master Jul 8, 2019
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/cqErr branch July 8, 2019 21:34
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 10, 2019
Fixes cockroachdb#36089.
Informs cockroachdb#37199.

This commit addresses the second concern from cockroachdb#37199 (comment)
by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's
minimum timestamp. This allows pushers to have perfect accuracy
into whether an intent is part of a transaction that can eventually
be committed or whether it has been aborted by any other pusher and
uncommittable.

This allows us to get rid of the false positive cases where a pusher
incorrectly detects that a transaction is still active and begins
waiting on it. In this worst case, this could lead to transactions
waiting for the entire transaction expiration for a contending txn.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 10, 2019
Fixes cockroachdb#36089.
Informs cockroachdb#37199.

This commit addresses the second concern from cockroachdb#37199 (comment)
by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's
minimum timestamp. This allows pushers to have perfect accuracy
into whether an intent is part of a transaction that can eventually
be committed or whether it has been aborted by any other pusher and
uncommittable.

This allows us to get rid of the false positive cases where a pusher
incorrectly detects that a transaction is still active and begins
waiting on it. In this worst case, this could lead to transactions
waiting for the entire transaction expiration for a contending txn.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 10, 2019
Fixes cockroachdb#36089.
Informs cockroachdb#37199.

This commit addresses the second concern from cockroachdb#37199 (comment)
by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's
minimum timestamp. This allows pushers to have perfect accuracy
into whether an intent is part of a transaction that can eventually
be committed or whether it has been aborted by any other pusher and
uncommittable.

This allows us to get rid of the false positive cases where a pusher
incorrectly detects that a transaction is still active and begins
waiting on it. In this worst case, this could lead to transactions
waiting for the entire transaction expiration for a contending txn.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 10, 2019
Fixes cockroachdb#36089.
Informs cockroachdb#37199.

This commit addresses the second concern from cockroachdb#37199 (comment)
by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's
minimum timestamp. This allows pushers to have perfect accuracy
into whether an intent is part of a transaction that can eventually
be committed or whether it has been aborted by any other pusher and
uncommittable.

This allows us to get rid of the false positive cases where a pusher
incorrectly detects that a transaction is still active and begins
waiting on it. In this worst case, this could lead to transactions
waiting for the entire transaction expiration for a contending txn.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 11, 2019
38782: storage: persist minimum transaction timestamps in intents r=nvanbenschoten a=nvanbenschoten

Fixes #36089.
Informs #37199.

This commit addresses the second concern from #37199 (comment) by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable.

This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn.

@tbg I'm assigning you because you reviewed most of the lazy transaction record stuff (especially #33523), but let me know if you'd like me to find a different reviewer.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants