YCSB workload A kills cockroach #20448
@amq Deadlock refers to a locking pattern that prevents forward progress. Did you actually see deadlock (and if so, what locks are involved), or is it that throughput falls to 0? Zero throughput might be deadlock, or it might be something else. Regardless, it is very concerning. What version of Cockroach are you running? How many Cockroach machines are you running?
@amq, thank you for the report and the detailed reproduction instructions. I am attempting to reproduce this and will report back with the results.
I had v1.1.3 in a three-node cluster.
@arjunravinarayan this could be similar to an issue a user reported privately. I'll send it your way (quota pool/snapshot refusal). @amq any chance you're seeing messages like these in the logs?
I don't see those messages. In YCSB it looks like this:
@solongordon the action item here is to see if it's possible to reproduce this issue with the app that's provided. You should be able to use a setup similar to the one you'll use for our internal ycsb load generator.
I was able to reproduce this issue against v1.1.5 and it behaved just as @amq described. On v2.0, the behavior is somewhat better. I don't see the same unbounded accumulation of queries; there are only ever as many running queries as threads, and they're cleaned up immediately when the YCSB client exits. However, I still see precipitous drops in qps when running YCSB with a large number of threads (16 or more). I suspect this is related to it being an update-heavy workload with a zipfian distribution, which means multiple threads are likely to make concurrent updates to the same row. I'll dig in some more to see what's going on.
The drop in qps definitely looks related to update contention, specifically on a single hot key. Reading and writing other rows is plenty fast while the workload is running.
@nvanbenschoten or @tschottdorf Can you take a look at that trace? The query seems to be repeatedly running into conflicting intents on the same key.
One thing to point out is that the load generator is using the postgres jdbc driver without a transaction retry loop, so any retriable errors that are pushed to the client are presumably resulting in the transaction being thrown away (maybe not even rolled back/cleaned up). This would explain why each transaction ends up tripping over other transactions' intents and throwing more retriable errors. @solongordon are you comfortable enough with this load generator to modify it? If so, could you try adding a transaction retry loop similar to the one in our docs?
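As a concrete reference, here is a minimal sketch of the client-side retry pattern the docs describe, expressed as the SQL statements the driver would issue; the table is the one discussed later in this thread, and the key/value literals are made up for illustration:

```sql
-- Wrap the workload's statements in the retryable savepoint so that a pushed
-- transaction is retried on the same connection instead of being abandoned.
BEGIN;
SAVEPOINT cockroach_restart;

-- workload statements, e.g. a YCSB-style single-row update:
UPDATE ycsb.usertable SET FIELD0 = 'value' WHERE YCSB_KEY = 'user123';

-- on success:
RELEASE SAVEPOINT cockroach_restart;
COMMIT;

-- On a retryable error (SQLSTATE 40001) the client should instead issue
--   ROLLBACK TO SAVEPOINT cockroach_restart;
-- and re-run the workload statements within the same transaction.
```

With a loop like this, a transaction that gets pushed or aborted is retried and its intents are cleaned up by the same connection, rather than being left behind for other transactions to trip over.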
We're supposed to clean up whenever a SQL transaction fails unless it contains a SAVEPOINT statement. Also, the trace Solon posted showed one long request, so it's not just a cycle of unfair restarts caused by the lack of a within-transaction retry loop.
Yeah, it looks like you're right.
I see a bunch of these in the leaseholder's log:

And these on the gateway node:
@solongordon were there any logs mentioning "pushee last active"?
Yes, I should have mentioned, many of these as well (one per second or so):
@spencerkimball This is the issue I was mentioning. It touches retries and the push txn queue.
Interesting. @andreimatei you recently mentioned that you had found issues with how we roll back transactions after retryable errors. Could you expand on that? Do you think it could result in behavior like we see here, where transactions keep running into other abandoned transactions?
I don't think so. I was talking about cases where a txn's ctx is canceled. No reason to think that's happening here, I believe. I've seen cases of expired transactions when I was overloading a cluster - the heartbeats weren't making it in time. I'd check whether that happens - heartbeat loops not running within 2s, or the heartbeat request taking a long time to execute.
Planning to take a look tomorrow; if there have been any more insights, please post them before then. ;-)
I'd also look at the fact that the txnWaitQueue unblocks all waiting requests at once, potentially creating thundering herds and making no attempt at fairness.
I got the load generator up and running on a three-node roachprod cluster. I was quickly able to see throughput tank as concurrency increased. Thanks for making the reproduction steps so easy @amq!

First, I'll note that even when throughput is bottoming out, CPU utilization on all nodes is only at around 5% and memory use is well under 1GB. This indicates that the problem is due to bad behavior in Cockroach and not due to an overloaded system.

It's also interesting to look at what the workload is doing when performance gets bad, because it isn't very much. YCSB is essentially just running a single UPDATE statement per operation (sketched below, after the proposed schema). Because the DDL in the original report creates the table with a single column family, that UPDATE reduces to reading a key and then writing to that same key in a transaction. As concurrency increases, more and more transactions contend on that key, and performance drops. We certainly shouldn't be seeing such a bad drop in performance, but we do expect some hit as contention grows.

If we are willing to change the schema, we can improve this situation significantly by breaking up the individual fields into multiple keys. We can do this by using column families, resulting in the schema:

CREATE TABLE ycsb.usertable (
YCSB_KEY VARCHAR (255) PRIMARY KEY,
FIELD0 TEXT,
FIELD1 TEXT,
FIELD2 TEXT,
FIELD3 TEXT,
FIELD4 TEXT,
FIELD5 TEXT,
FIELD6 TEXT,
FIELD7 TEXT,
FIELD8 TEXT,
FIELD9 TEXT,
FAMILY f1 (YCSB_KEY),
FAMILY f2 (FIELD0),
FAMILY f3 (FIELD1),
FAMILY f4 (FIELD2),
FAMILY f5 (FIELD3),
FAMILY f6 (FIELD4),
FAMILY f7 (FIELD5),
FAMILY f8 (FIELD6),
FAMILY f9 (FIELD7),
FAMILY f10 (FIELD8),
FAMILY f11 (FIELD9)
);

This dramatically reduces the amount of write contention because even though all transactions still read from the same key, they now often write to different keys.
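To make the contention argument concrete, here is a sketch of the kind of statement workload A issues, assuming YCSB's default behavior of updating a single randomly chosen field per operation (the literals are illustrative):

```sql
-- Each update touches one field of one (zipfian-chosen) row:
UPDATE ycsb.usertable SET FIELD3 = 'xyz' WHERE YCSB_KEY = 'user123';

-- With a single column family, the whole row is one KV pair, so every update
-- to this row rewrites the same key and conflicts with every other update.
-- With the per-field families above, this statement only writes the KV pair
-- backing FAMILY f5 (FIELD3); concurrent updates to other fields of the same
-- row now write different keys and no longer conflict on their writes.
```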
Below is a comparison of throughput between a schema using a single column family and a schema using a column family per field. We can see that even with multiple column families, throughput is still impacted by concurrency. However, it does not drop to 0 like the setup with only one column family.

That improvement is nice, but such bad performance under contention shouldn't exist in the first place. I dug more through the logs and saw a lot of the following, which @solongordon pointed out above:

This again showed that transaction coordinators were seemingly abandoning transaction records and failing to heartbeat them. I also saw the following:

This is interesting because it shows that the txn_coord_sender is attempting to heartbeat its transaction record but failing because the record does not exist at the time that the heartbeat request is evaluated. I think this means that the heartbeat arrived at the txn record after a PushTxnRequest succeeded in aborting it. This isn't good. I'm wondering if it's possible that the first heartbeat is being delayed long enough that the record already looks expired and gets aborted.
The transaction heartbeat timeout is quite aggressive, so I wouldn't be surprised.
Wow, nice detective work.
You're not quite sure that it's maybeRejectClientLocked, or you're not quite sure why maybeRejectClientLocked is triggering?
Do you get no benefit at all without the full set of changes, or do you mean that they all provide incremental benefits?
This breaks the cases described in the comment above, right?

I'd like to propose a third alternative for discussion: add the full txn proto to HeartbeatTxnRequest and allow it to write a transaction record when none exists. This is similar to ignoring the "record not present" errors in the heartbeat loop, but it makes the heartbeats work (blocking other pushers) instead of silently ignoring failures. This feels consistent with the way we handle multi-range batches, by allowing other writes to happen before the BeginTransaction, although it does have some risk of writing a late transaction record that will just have to be GC'd. It would almost solve #23945, except that the HeartbeatTxn would be stuck in the command queue behind the BeginTransaction batch anyway. Maybe there's some way we could make the command queue smarter about txn ops.

I think that for an immediate fix, ignoring the errors is probably best. In the long run I'd look at the third option and refactor the relationship between BeginTxn and HeartbeatTxn.
What I wrote was unclear. I'm sure about both of those things. I'm not sure how the BeginTxn and Put requests are being evaluated and succeeding separately from the EndTxn, which is hitting a WriteTooOld error. They were previously being evaluated as a single batch, which is where my confusion is arising from.
We see roughly the same "deadlock" when any one of them is missing. I think we're seeing this non-linear behavior where all of the changes are needed because, with any one of them missing, the first heartbeat to a new transaction record is delayed long enough that the record expires.
This avoids the issue I was seeing, yes. I don't think it's actually the solution we want though, and I think you agree.
If we're going to move in this direction then I'd like to see if we could expand it all the way to the natural conclusion of deprecating BeginTransaction entirely.
Could you expand on this? Are you referring to how we write an aborted transaction record when a PushTxnRequest is sent for a transaction that does not yet have a transaction record?
This would only be the case if we began the heartbeat when we sent the BeginTxn request instead of when we got the first response after sending it.
Yeah, I was beginning to think about the same thing. It would be really nice if we had some notion of priority in the CommandQueue so that HeartbeatTxn requests could jump in front of other lower-priority requests.
@nvanbenschoten the txn record is written the second time around because, after the epoch increment, it hits a new path in DistSender which splits off the EndTxn from the rest of the batch. In the longer term, I agree that it'd be nice if we reworked the relationship between BeginTxn and HeartbeatTxn along those lines.
"The comment above" referred to the comment in the code above this line. This change would break things for that scenario.
Yeah, allowing HeartbeatTxn to write a txn record would be equivalent to making BeginTransaction replayable. One reason why we might want to keep both methods is to switch from BeginTxn to HeartbeatTxn after we know that the txn record has been written once, since we'd know that any subsequent missing txn record would be because it had been aborted and cleaned up. (HeartbeatTxn might also be a little cheaper since it wouldn't need to carry the TxnMeta data)
How so? The batch (BeginTxn, Put) could be blocked in the command queue on the put's key, but that batch will then block all of its keys. The heartbeat couldn't go through until the Begin/Put batch finishes, even if we've gotten responses from other parts of the split batch.
Related to cockroachdb#20448. This change adds a new error type - `TransactionNotFoundError`. This error is returned when an operation looks up a transaction record that it expects to exist and finds that it does not. This is possible for `HeartbeatTxnRequests` and `EndTxnRequests`. The change then detects this error type in the `TxnCoordSender` heartbeat loop. Unlike for any other error type, this error does not force the heartbeat loop to shut down. The reason for this is that the heartbeat loop can be started before a client is certain that a transaction record has been written. If this is ever the case, like we saw in the YCSB exploration, then terminating the heartbeat loop on a `TransactionNotFoundError` will prematurely abort the entire transaction. Release note (bug fix): Fixed bug where an expected transaction heartbeat failure aborted the transaction.
Related to cockroachdb#20448. This change adds a reason enum to `TransactionStatusError`. This enum includes a variant that is returned when an operation looks up a transaction record that it expects to exist and finds that it does not. This is possible for `HeartbeatTxnRequests` and `EndTxnRequests`. The change then detects this error type in the `TxnCoordSender` heartbeat loop. Unlike for any other error type, this error does not force the heartbeat loop to shut down. The reason for this is that the heartbeat loop can be started before a client is certain that a transaction record has been written. If this is ever the case, like we saw in the YCSB exploration, then terminating the heartbeat loop on a `TransactionStatusError` will prematurely abort the entire transaction. Release note (bug fix): Fixed bug where an expected transaction heartbeat failure aborted the transaction.
Fixes cockroachdb#23945. See cockroachdb#20448. This change addresses a case where delayed BeginTxn requests can result in txn records looking inactive immediately upon being written. We now bump the txn record's LastHeartbeat timestamp when writing the record. Release note: None
24969: opt: Hoist Exists operator and try to decorrelate r=andy-kimball a=andy-kimball

This PR has two commits for easier review. The first adds support for OuterCols as a relational property. The second uses that to implement transforming the Exists operator into a SemiJoinApply operator, and then attempting to decorrelate that.

25034: storage: bump LastHeartbeat timestamp when writing txn record r=nvanbenschoten a=nvanbenschoten

Fixes #23945. See #20448. This change addresses a case where delayed BeginTxn requests can result in txn records looking inactive immediately upon being written. We now bump the txn record's LastHeartbeat timestamp when writing the record.

Release note: None

Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
25014: storage: queue requests to push txn / resolve intents on single keys r=spencerkimball a=spencerkimball

Previously, high contention on a single key would cause every thread to push the same conflicting transaction and then resolve the same intent in parallel. This is inefficient, as only one pusher needs to succeed, only one resolver needs to resolve the intent, and then only one writer should proceed while the other readers/writers should in turn wait on the previous writer by pushing its transaction. This effectively serializes the conflicting readers/writers. One complication is that all pushers which may have a valid, writing transaction (i.e., `Transaction.Key != nil`) must push either the conflicting transaction or another transaction already pushing that transaction. This allows dependency cycles to be discovered. Fixes #20448

25791: jobs: bump default progress log time to 30s r=mjibson a=mjibson

The previous code allowed updates to be performed every 1s, which could cause the MVCC row to be very large, causing problems with splits. We can update much more slowly by default. In the case of a small backup job, the 5% fraction threshold will allow a speedier update rate. Remove a note that's not useful anymore since the referred function can now only be used in the described safe way. See #25770. Although this change didn't fix that bug, we still think it's a good idea. Release note: None

26293: opt: enable a few distsql logictests r=RaduBerinde a=RaduBerinde

- `distsql_indexjoin`: this is only a planning test. Modifying the split points and queries a bit to make the condition more restrictive and make the optimizer choose index joins. There was a single plan that was different, and the difference was minor (the old planner is emitting an unnecessary column).
- `distsql_expr`: logic-only test, enabling for opt.
- `distsql_scrub`: planning test; opt version commented out for now.

Release note: None

Co-authored-by: Spencer Kimball <[email protected]>
Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
BUG REPORT
I can consistently replicate Cockroach getting into a deadlock situation by running YCSB workload A.
This is how I run it:
- Install
- Create schema (a sketch follows this list)
- Load data
- Run
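As a sketch of the schema step, assuming the standard YCSB usertable layout, which the comments above describe as a table with a single (implicit) column family:

```sql
CREATE TABLE ycsb.usertable (
    YCSB_KEY VARCHAR(255) PRIMARY KEY,
    FIELD0 TEXT,
    FIELD1 TEXT,
    FIELD2 TEXT,
    FIELD3 TEXT,
    FIELD4 TEXT,
    FIELD5 TEXT,
    FIELD6 TEXT,
    FIELD7 TEXT,
    FIELD8 TEXT,
    FIELD9 TEXT
);
```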
On m5.xlarge with 500 GB gp2 storage (1500 IOPS with bursts up to 3000), cockroach gets into a deadlock at around `threads=96`. Please note that running a single test with `threads=96` is not sufficient, because the load accumulates with each run.

If deadlocks under a heavy load of updates are not considered a bug, please close the issue.