Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

YCSB workload A kills cockroach #20448

Closed
amq opened this issue Dec 4, 2017 · 31 comments · Fixed by #25014
Closed

YCSB workload A kills cockroach #20448

amq opened this issue Dec 4, 2017 · 31 comments · Fixed by #25014
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-community Originated from the community S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Milestone

Comments

@amq
Copy link

amq commented Dec 4, 2017

BUG REPORT

I can consistently replicate Cockroach getting into a deadlock situation by running YCSB workload A

Workload A: Update heavy workload
This workload has a mix of 50/50 reads and writes. An application example is a session store recording recent actions.

This is how I run it:

Install

wget https://github.com/brianfrankcooper/YCSB/releases/download/0.12.0/ycsb-0.12.0.tar.gz
tar -zxvf ycsb-0.12.0.tar.gz
cd ycsb-0.12.0
wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar
mv postgresql-42.1.4.jar lib
ant

Create schema

DROP DATABASE IF EXISTS ycsb;
CREATE DATABASE ycsb;
CREATE TABLE ycsb.usertable (
YCSB_KEY VARCHAR (255) PRIMARY KEY,
  FIELD0 TEXT,
  FIELD1 TEXT,
  FIELD2 TEXT,
  FIELD3 TEXT,
  FIELD4 TEXT,
  FIELD5 TEXT,
  FIELD6 TEXT,
  FIELD7 TEXT,
  FIELD8 TEXT,
  FIELD9 TEXT
);

Load data

./bin/ycsb load jdbc -P workloads/workloada -p db.driver=org.postgresql.Driver -p db.url=jdbc:postgresql://cockroach-01:26257/ycsb -p db.user=root -p db.batchsize=1000 -p recordcount=50000

Run

for threads in 1 2 4 8 16 32 64 96 128 160 192; do
  ./bin/ycsb run jdbc -P workloads/workloada -p db.driver=org.postgresql.Driver -p db.url=jdbc:postgresql://cockroach-01:26257/ycsb -p db.user=root -p operationcount=0 -p maxexecutiontime=60 -s -threads $threads
done

On m5.xlarge with 500 GB gp2 storage (1500 IOPS with bursts up to 3000) cockroach gets into a deadlock at around after threads=96. Please note that running a single test with threads=96 is not sufficient, because the load accumulates with each run.

If deadlocks under a heavy load of updates is not considered a bug, please close the issue.

@petermattis
Copy link
Collaborator

@amq Deadlock refers to a locking pattern that prevents forward progress. Did you actually see deadlock (and if so, what locks are involved), or is it that throughput falls to 0? Zero throughput might be deadlock, or it might be something else. Regardless, it is very concerning.

What version of Cockroach are you running? How many Cockroach machines are you running?

@rjnn
Copy link
Contributor

rjnn commented Dec 4, 2017

@amq, thank you for the report. i am attempting to reproduce this, thank you for the detailed reproduction instructions. I'll report back with the results of my reproduction.

@amq
Copy link
Author

amq commented Dec 4, 2017

I had v1.1.3 in three-node cluster.

  • throughput dropped to 0
  • show queries showed dozens of queries which were more than an hour old, still executing

@tbg
Copy link
Member

tbg commented Dec 4, 2017

@arjunravinarayan this could be similar to an issue a user reported privately. I'll send it your way (quota pool/snapshot refusal).

@amq any chance you're seeing message like these in the logs?

W171130 22:48:34.342958 5631 storage/quota_pool.go:201 [n3,s2,r32/3:/Table/51{-/1/-3834…}] have been waiting 1m0.000147601s attempting to acquire quota

@amq
Copy link
Author

amq commented Dec 5, 2017

I don't see attempting to acquire quota, but rather:

I171205 01:01:42.869333 149 server/status/runtime.go:223  [n1] runtime stats: 1.2 GiB RSS, 337 goroutines, 145 MiB/20 MiB/224 MiB GO alloc/idle/total, 738 MiB/881 MiB CGO alloc/total, 1037.69cgo/sec, 0.12/0.05 %(u/s)time, 0.00 %gc (0x)
W171205 01:01:46.040344 9234260 kv/dist_sender.go:1148  [client=10.0.0.172:56414,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 070bf876], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:01:46.925540 9234258 kv/dist_sender.go:1148  [client=10.0.0.172:56460,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 1b2f8854], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:01:47.409585 9234324 kv/dist_sender.go:1148  [client=10.0.0.172:56449,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 513e0c60], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:01:50.761656 9234229 kv/dist_sender.go:1148  [client=10.0.0.172:56472,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 68082f57], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:01:52.581574 9234216 kv/dist_sender.go:1148  [client=10.0.0.172:56392,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: a2dcccbf], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
<a minute later>
I171205 01:02:52.869334 149 server/status/runtime.go:223  [n1] runtime stats: 1.2 GiB RSS, 335 goroutines, 124 MiB/37 MiB/224 MiB GO alloc/idle/total, 762 MiB/889 MiB CGO alloc/total, 1021.70cgo/sec, 0.14/0.05 %(u/s)time, 0.00 %gc (1x)
W171205 01:02:53.443042 9234169 kv/dist_sender.go:1148  [client=10.0.0.172:56406,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 6f207769], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:02:53.444220 9234215 kv/dist_sender.go:1148  [client=10.0.0.172:56492,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: c58413bb], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:02:53.446547 9234260 kv/dist_sender.go:1148  [client=10.0.0.172:56414,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 08cc2705], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:02:59.962910 9234219 kv/dist_sender.go:1148  [client=10.0.0.172:56496,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 25a8eac9], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:03:00.769275 9234341 kv/dist_sender.go:1148  [client=10.0.0.172:56580,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 2e161520], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:03:00.769276 9234247 kv/dist_sender.go:1148  [client=10.0.0.172:56434,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: bb4b251d], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)
W171205 01:03:00.774334 9234239 kv/dist_sender.go:1148  [client=10.0.0.172:56420,user=root,n1] have been waiting 1m0s sending RPC to r46 for batch: [txn: 3365dda9], BeginTransaction [/Table/60/1/"user1573987489603120213"/0,/Min), Put [/Table/60/1/"user1573987489603120213"/0,/Min), EndTransaction [/Table/60/1/"user1573987489603120213"/0,/Min)

SHOW QUERIES;

+----------------------------------+---------+----------+----------------------------------+------------------------------------------------------+------------------+------------------------+-------------+-----------+
|             query_id             | node_id | username |              start               |                        query                         |  client_address  |    application_name    | distributed |   phase   |
+----------------------------------+---------+----------+----------------------------------+------------------------------------------------------+------------------+------------------------+-------------+-----------+
| 14fd401513c97d520000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd402644457ff50000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd403e83b46e6f0000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd403ea29b0dd40000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd4042fab4c1200000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd405cd4b67f520000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd4029d884891c0000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd40153fae7ebc0000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd4029f14cdf440000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd404102fd643a0000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
| 14fd4041259642a80000000000000001 |       1 | root     | 2017-12-05 00:57:49.891544+00:00 | UPDATE usertable SET field7 = $1 WHERE ycsb_key = $2 | 10.0.0.172:56406 | PostgreSQL JDBC Driver | false       | executing |
...
+----------------------------------+---------+----------+----------------------------------+------------------------------------------------------+------------------+------------------------+-------------+-----------+
(1525 rows)

In YCSB it looks like this:

2017-12-05 01:01:38:129 230 sec: 6024 operations; 0 current ops/sec; [READ: Count=0, Max=0, Min=9223372036854775807, Avg=?, 90=0, 99=0, 99.9=0, 99.99=0] [CLEANUP: Count=0, Max=0, Min=9223372036854775807, Avg=?, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=?, 90=0, 99=0, 99.9=0, 99.99=0]
Still waiting for thread Thread-98 to complete. Workload status: true
Still waiting for thread Thread-98 to complete. Workload status: true
Still waiting for thread Thread-98 to complete. Workload status: true
Still waiting for thread Thread-98 to complete. Workload status: true
Still waiting for thread Thread-98 to complete. Workload status: true
2017-12-05 01:01:48:129 240 sec: 6024 operations; 0 current ops/sec; [READ: Count=0, Max=0, Min=9223372036854775807, Avg=?, 90=0, 99=0, 99.9=0, 99.99=0] [CLEANUP: Count=0, Max=0, Min=9223372036854775807, Avg=?, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=?, 90=0, 99=0, 99.9=0, 99.99=0]
Still waiting for thread Thread-98 to complete. Workload status: true
Still waiting for thread Thread-98 to complete. Workload status: true
Still waiting for thread Thread-98 to complete. Workload status: true
Still waiting for thread Thread-98 to complete. Workload status: true
Still waiting for thread Thread-98 to complete. Workload status: true

@rjnn rjnn assigned justinj and unassigned rjnn Feb 27, 2018
@jordanlewis jordanlewis assigned solongordon and unassigned justinj Mar 6, 2018
@jordanlewis
Copy link
Member

@solongordon the action item here is to see if it's possible to reproduce this issue with the app that's provided. You should be able to use a similar setup as the one you'll use for our internal ycsb load generator.

@solongordon
Copy link
Contributor

I was able to reproduce this issue against v1.1.5 and it behaved just as @amq described. On v2.0, the behavior is somewhat better. I don't see the same unbounded accumulation of queries; there are only ever as many running queries as threads, and they're cleaned up immediately when the YCSB client exits.

However, I still see precipitous drops in qps and when running YCSB with a large number of threads (16 or more). I suspect this is related to it being an update-heavy workload with a zipfian distribution, which means multiple threads are likely to make concurrent updates to the same row. I'll dig in some more to see what's going on.

@solongordon
Copy link
Contributor

The drop in qps definitely looks related to update contention, specifically for user8425665148555282324, the most common user in the distribution. I see some updates for that row taking upwards of a minute, which seems unreasonably high for the amount of load. I took a session trace of one such update.

Reading and writing from other rows is plenty fast while the workload is running.

@petermattis
Copy link
Collaborator

@nvanbenschoten or @tschottdorf Can you take a look at that trace? The query seems to be repeatedly running into conflicting intents on the key /Table/53/1/"user8425665148555282324"/0.

@nvanbenschoten
Copy link
Member

One thing to point out is that the load generator is using the postgres jdbc driver without a transaction retry loop, so any retriable errors that are pushed to the client are presumably resulting in the transaction being thrown away (maybe not even rolled back/cleaned up). This would explain why each transaction ends up tripping over other transaction's intents and throwing WriteIntentErrors. It would also explain why no transaction is able to use intents it laid down at a previous epoch to make forward progress. Essentially, I think we're running into a livelock at higher concurrency levels because each transaction gets in the others' way for a while before throwing out all the work it performs after it hits a single retryable error.

@solongordon are you comfortable enough with this load generator to modify it? If so, could you try adding a transaction retry loop similar to the one in our docs?

@bdarnell
Copy link
Contributor

One thing to point out is that the load generator is using the postgres jdbc driver without a transaction retry loop, so any retriable errors that are pushed to the client are presumably resulting in the transaction being thrown away (maybe not even rolled back/cleaned up).

We're supposed to clean up whenever a SQL transaction fails unless it contains a SAVEPOINT statement.

Also, the trace Solon posted showed one long request, so it's not just a cycle of unfair restarts caused by the lack of a within-transaction retry loop.

@nvanbenschoten
Copy link
Member

Also, the trace Solon posted showed one long request, so it's not just a cycle of unfair restarts caused by the lack of a within-transaction retry loop.

Yeah, it looks like you're right. The repeated [Open pos:20] logs indicate that we're in an open transaction that keeps getting retryable errors and auto-retrying without returning to the client.

The PushTxn requests are interesting in that trace. They each spend a long time (~200ms) waiting for conflicting commands in the command queue. Once out of the command queue, they seem to be successful in aborting the transaction they're trying to push. @solongordon it would be interesting to run with --vmodule=cmd_push_txn=1 so that we can see what is logged during the push txn evaluation.

@solongordon
Copy link
Contributor

I see a bunch of these in the leaseholder's log:
I180314 15:52:13.530474 1409986 storage/batcheval/cmd_push_txn.go:216 [n3,s3,r33/3:/Table/58{-/1/"user…}] failed to push "sql txn" id=a569c6e3 key=/Table/58/1/"user1573987489603120213"/0 rw=true pri=0.09463374 iso=SERIALIZABLE stat=PENDING epo=1 ts=1521042732.342015627,1 orig=1521042718.978763840,1 max=1521042679.785790236,0 wto=false rop=false seq=15

And these on the gateway node:
W180314 15:53:57.859084 6398409 kv/dist_sender.go:1305 [client=10.142.0.58:49916,user=root,txn=68f574f7,n1] have been waiting 1m0s sending RPC to r33 (currently pending: [(n3,s3):3]) for batch: [txn: 68f574f7], BeginTransaction [/Table/58/1/"user1573987489603120213"/0,/Min), Put [/Table/58/1/"user1573987489603120213"/0,/Min)

@nvanbenschoten
Copy link
Member

@solongordon were there any logs from

log.Infof(ctx, "%s "+s+" %s: %s (pushee last active: %s)",
You could try greping for "pushee last active".

@solongordon
Copy link
Contributor

Yes, should have mentioned, many of these as well (one per second or so):
I180314 15:52:14.387612 1410237 storage/batcheval/cmd_push_txn.go:208 [n3,s3,r33/3:/Table/58{-/1/"user…}] 0983c622 pushed fafe4c62: pushee is expired (pushee last active: 1521042732.342015627,1)

@petermattis
Copy link
Collaborator

@spencerkimball This is the issue I was mentioning. It touches retries and the push txn queue.

@nvanbenschoten
Copy link
Member

Interesting, seeing a pushee is expired error without any nodes going down implies to me that a TxnCoodSender is abandoning a transaction record.

@andreimatei you recently mentioned that you had found issues with how we roll back transactions after retryable errors. Could you expand on that? Do you think it could result in behavior like we see here, where transactions keep running into other abandoned transactions?

@andreimatei
Copy link
Contributor

@andreimatei you recently mentioned that you had found issues with how we roll back transactions after retryable errors. Could you expand on that? Do you think it could result in behavior like we see here, where transactions keep running into other abandoned transactions?

I don't think so. I was talking about cases where a txn's ctx is canceled. No reason to think that's happening here, I believe.

I've seen cases of expired transactions when I was overloading a cluster - the heartbeats weren't making it in time. I'd check whether that happens - heartbeat loops not running within 2s, or the heartbeat request taking a long time to execute.
When I've observed that in the past, it occurred to me that we should find a way to push back on... something when it occurs.

@tbg
Copy link
Member

tbg commented Mar 14, 2018

Planning to take a look tomorrow, if there have been any more insights please post them before that. ;-)

@bdarnell
Copy link
Contributor

I'd also look at the fact that the txnWaitQueue unblocks all waiting requests at once, potentially creating thundering herds and making no attempts at fairness.

@tbg tbg self-assigned this Mar 14, 2018
@nvanbenschoten
Copy link
Member

I got the load generator up and running on a three-node roachprod cluster. I was quickly able to see throughput tank as concurrency increased. Thanks for making the reproduction steps so easy @amq!

First, I'll note that even when throughput is bottoming out, CPU utilization on all nodes is only at around 5% and memory use is well under under 1GB. This indicates that the problem is due to bad behavior in Cockroach and not due to an overloaded system.

It's also interesting to look at what the workload is doing when performance gets bad, because it isn't very much. YCSB is essentially just running the statement UPDATE usertable SET field<N> = $1 WHERE ycsb_key = $2 over and over. ycsb_key is set using a zipf distribution, so a few rows will see a lot of traffic. The field number is chosen randomly between 0 and 9. When an UPDATE statement comes in to CRDB, it will find the KV with the matching ycsb_key and update the corresponding KV for the specified field.

Because the ddl above creates the table with a single column family, the UPDATE statement reduces to reading a key and then writing to that same key in a transaction. As concurrency increases in the transaction, performance drops. We certainly shouldn't be seeing such a bad drop in performance, but we do expect some hit as contention grows. If we are willing to change the schema, we can improve this situation significantly by breaking up the individual fields into multiple keys. We can do this by using column families, resulting in the schema:

CREATE TABLE ycsb.usertable (
  YCSB_KEY VARCHAR (255) PRIMARY KEY,
  FIELD0 TEXT,
  FIELD1 TEXT,
  FIELD2 TEXT,
  FIELD3 TEXT,
  FIELD4 TEXT,
  FIELD5 TEXT,
  FIELD6 TEXT,
  FIELD7 TEXT,
  FIELD8 TEXT,
  FIELD9 TEXT,
  FAMILY f1 (YCSB_KEY),
  FAMILY f2 (FIELD0),
  FAMILY f3 (FIELD1),
  FAMILY f4 (FIELD2),
  FAMILY f5 (FIELD3),
  FAMILY f6 (FIELD4),
  FAMILY f7 (FIELD5),
  FAMILY f8 (FIELD6),
  FAMILY f9 (FIELD7),
  FAMILY f10 (FIELD8),
  FAMILY f11 (FIELD9)
);

This dramatically reduces the amount of write contention because even though all transactions still read from the same key, they now often write to different keys. Below is a comparison of throughput between a schema using a single column family and a schema using a column family per field. We can see that even with multiple column families, throughput is still impacted by concurrency. However, it does not drop to 0 like the setup with only one column family.

chart

That improvement is nice, but such bad performance under contention shouldn't exist in the first place. I dug more through the logs and saw a lot of the following, which @solongordon pointed out above:

I180314 15:52:14.387612 1410237 storage/batcheval/cmd_push_txn.go:208 [n3,s3,r33/3:/Table/58{-/1/"user…}] 0983c622 pushed fafe4c62: pushee is expired (pushee last active: 1521042732.342015627,1)

This again showed that transaction coordinators were seemingly abandoning transaction records and failing to heartbeat them. I also saw the following:

I180315 00:27:07.392188 72328 kv/txn_coord_sender.go:1097  [client=10.142.0.29:39772,user=root,n1,txn=a58eb4e2] heartbeat failed: heartbeat for transaction "sql txn" id=a58eb4e2 key=/Table/52/1/"user1573987489603120213"/0 rw=false pri=0.03439626 iso=SERIALIZABLE stat=PENDING epo=1 ts=1521073616.475806071,1 orig=1521073616.475806071,1 max=1521073614.325326222,0 wto=false rop=false seq=13 failed; record not present
W180314 23:00:59.157659 224092 kv/txn_coord_sender.go:1026  [n1] abort due to inactivity failed for "sql txn" id=1c6c653a key=/Table/52/1/"user1573987489603120213"/0 rw=false pri=0.07864293 iso=SERIALIZABLE stat=PENDING epo=1 ts=1521068454.577385592,1 orig=1521068454.577385592,1 max=1521068451.585962884,0 wto=false rop=false seq=9: txn "sql txn" id=1c6c653a key=/Table/52/1/"user1573987489603120213"/0 rw=false pri=0.07864293 iso=SERIALIZABLE stat=PENDING epo=1 ts=1521068454.577385592,1 orig=1521068454.577385592,1 max=1521068451.086758817,0 wto=false rop=false seq=10: does not exist 

This is interesting because it shows that the txn_coord_sender is attempting to heartbeat its transaction record but failing because the record does not exist at the time that the heartbeat request is evaluated. I think this means that the heartbeat arrived at the txn record after a PushTxnRequest succeeded in aborting it. This isn't good. I'm wondering if its possible that the HeartbeatTxnRequest for the transactions that are successful in laying down intents are being starved by the attack of PushTxnRequests and QueryTxnRequest sent by contending transactions. All three of these request types would need to wait for each other in the CommandQueue, so it's possible that the HeartbeatTxnRequest is being delayed there so long that its record appears expired.

@tbg
Copy link
Member

tbg commented Mar 15, 2018

The transaction heartbeat timeout is quite aggressive, so I wouldn't be surprised.

@bdarnell
Copy link
Contributor

Wow, nice detective work.

  1. transaction record then gets written and intent is laid down but not committed. WriteTooOld error on EndTxn evaluation. Another retry
    -- I'm not quite sure how this happens
  2. retry never happens because it is blocked by maybeRejectClientLocked

You're not quite sure that it's maybeRejectClientLocked, or you're not quite sure why maybeRejectClientLocked is triggering?

(seriously, it doesn't seem to work without all of them).

Do you get no benefit at all without the full set of changes, or do you mean that they all provide incremental benefits?

  •   if /*pErr == nil || ambiguousErr ||*/ newTxn.Writing {
    

This breaks the cases described in the comment above, right?

I'd like to propose a third alternative for discussion: Add the full txn proto to HeartbeatTxnRequest and allow it to write a transaction record when none exists. This is similar to ignoring the "record not present" errors in the heartbeat loop, but it makes the heartbeats work (blocking other pushers) instead of silently ignoring failures. This feels consistent with the way we handle multi-range batches, by allowing other writes to happen before the BeginTransaction, although it does have some risk of writing a late transaction record that will just have to be GC'd. It would almost solve #23945, except that the HeartbeatTxn would be stuck in the command queue behind the BeginTransaction batch anyway. Maybe there's some way we could make the command queue smarter about txn ops.

I think that for an immediate fix ignoring the errors is probably best. In the long run I'd look at the third option and refactor the relationship between BeginTxn and HeartbeatTxn.

@nvanbenschoten
Copy link
Member

You're not quite sure that it's maybeRejectClientLocked, or you're not quite sure why maybeRejectClientLocked is triggering?

What I wrote was unclear. I'm sure about both of those things. I'm not sure how the BeginTxn and Put requests are being evaluated and succeeding separately from the EndTxn, which is hitting a WriteTooOld error. They were previously being evaluated as a single batch, which is where my confusion is arising from.

Do you get no benefit at all without the full set of changes, or do you mean that they all provide incremental benefits?

We see roughly the same "deadlock" when any one of them is missing. I think we're seeing this non-linear behavior where all of the changes are needed because with any one of them missing the first heartbeat to a new transaction record is being delayed long enough that the record expires.

This breaks the cases described in the comment above, right?

This avoids the issue I was seeing, yes. I don't think it's actually the solution we want though, and I think you agree.

I'd like to propose a third alternative for discussion: Add the full txn proto to HeartbeatTxnRequest and allow it to write a transaction record when none exists.

If we're going to move in this direction then I'd like to see if we could expand it all the way to the natural conclusion of deprecating HeartbeatTxnRequest and making BeginTransactionRequest subsume its responsibility entirely. BeginTransactionRequest would then need to be made idempotent, which would affect some of the replay protection it currently provides. This might be ok though. @tschottdorf may have thoughts on this, as I think he's been thinking about idempotency lately.

This feels consistent with the way we handle multi-range batches, by allowing other writes to happen before the BeginTransaction

Could you expand on this? Are you referring to how we write an aborted transaction record when a PushTxnRequest is sent for a transaction that does not yet have a transaction record?

It would almost solve #23945, except that the HeartbeatTxn would be stuck in the command queue behind the BeginTransaction batch anyway.

This would only be the case if we began the heartbeat when we sent the BeginTxn request instead of when we got the first response after sending it.

Maybe there's some way we could make the command queue smarter about txn ops.

Yeah, I was beginning to think about the same thing. It would be really nice if we had some notion of priority in the CommandQueue so that HeartbeatTxn requests could jump in front of other lower priority requests.

@tbg
Copy link
Member

tbg commented Mar 20, 2018

@nvanbenschoten the txn record is written the second time around because after the Epoch increment, it hits a new path in DistSender which splits off the EndTransaction into a separate batch (to prevent starvation, ironically).

The Writing status is generally a mess, but in the 2.0.x time frame, wouldn't making the heartbeat loop accept the absence of the txn record (at least for the first heartbeat, but why not always?) solve the problem? I don't think we ever rely on the heartbeat loop aborting in that scenario.

In the longer term, I agree that it'd be nice if BeginTransaction and HeartbeatTxn fused into one, but also we'd want to get rid of the Writing flag, and there are lots of subtle considerations to be made (I'm fleshing up a few of those in a related upcoming tech note).

@bdarnell
Copy link
Contributor

This breaks the cases described in the comment above, right?

This avoids the issue I was seeing, yes. I don't think it's actually the solution we want though, and I think you agree.

"The comment above" referred to the comment in the code above this line. This change would break things for that scenario.

If we're going to move in this direction then I'd like to see if we could expand it all the way to the natural conclusion of deprecating HeartbeatTxnRequest and making BeginTransactionRequest subsume its responsibility entirely. BeginTransactionRequest would then need to be made idempotent, which would affect some of the replay protection it currently provides. This might be ok though. @tschottdorf may have thoughts on this, as I think he's been thinking about idempotency lately.

Yeah, allowing HeartbeatTxn to write a txn record would be equivalent to making BeginTransaction replayable. One reason why we might want to keep both methods is to switch from BeginTxn to HeartbeatTxn after we know that the txn record has been written once, since we'd know that any subsequent missing txn record would be because it had been aborted and cleaned up. (HeartbeatTxn might also be a little cheaper since it wouldn't need to carry the TxnMeta data)

This would only be the case if we began the heartbeat when we sent the BeginTxn request instead of when we got the first response after sending it.

How so? The batch (BeginTxn, Put) could be blocked in the command queue on the put's key, but that batch will then block all of its keys. The heartbeat couldn't go through until the Begin/Put batch finishes, even if we've gotten responses from other parts of the split batch.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 21, 2018
Related to cockroachdb#20448.

This change adds a new error type - `TransactionNotFoundError`. This
error is returned when an operation looks up a transaction record that
it expects to exist and finds that it does not. This is possible for
`HeartbeatTxnRequests` and `EndTxnRequests`.

The change then detects this error type in the `TxnCoordSender`
heartbeat loop. Unlike for any other error type, this error does
not force the heartbeat loop to shut down. The reason for this is
that the heartbeat loop can be started before a client is certain
that a transaction record has been written. If this is ever the
case, like we saw in the YCSB exploration, then terminating the
heartbeat loop on a `TransactionNotFoundError` will prematurely
abort the entire transaction.

Release note (bug fix): Fixed bug where an expected transaction
heartbeat failure aborted the transaction.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 2, 2018
Related to cockroachdb#20448.

This change adds a reason enum to`TransactionStatusError`. This
enum includes a variant is returned when an operation looks up a
transaction record that it expects to exist and finds that it does
not. This is possible for `HeartbeatTxnRequests` and `EndTxnRequests`.

The change then detects this error type in the `TxnCoordSender`
heartbeat loop. Unlike for any other error type, this error does
not force the heartbeat loop to shut down. The reason for this is
that the heartbeat loop can be started before a client is certain
that a transaction record has been written. If this is ever the
case, like we saw in the YCSB exploration, then terminating the
heartbeat loop on a `TransactionStatusError` will prematurely
abort the entire transaction.

Release note (bug fix): Fixed bug where an expected transaction
heartbeat failure aborted the transaction.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 2, 2018
Related to cockroachdb#20448.

This change adds a reason enum to`TransactionStatusError`. This
enum includes a variant is returned when an operation looks up a
transaction record that it expects to exist and finds that it does
not. This is possible for `HeartbeatTxnRequests` and `EndTxnRequests`.

The change then detects this error type in the `TxnCoordSender`
heartbeat loop. Unlike for any other error type, this error does
not force the heartbeat loop to shut down. The reason for this is
that the heartbeat loop can be started before a client is certain
that a transaction record has been written. If this is ever the
case, like we saw in the YCSB exploration, then terminating the
heartbeat loop on a `TransactionStatusError` will prematurely
abort the entire transaction.

Release note (bug fix): Fixed bug where an expected transaction
heartbeat failure aborted the transaction.
@petermattis petermattis modified the milestones: 2.0, 2.1 Apr 5, 2018
@knz knz added O-community Originated from the community C-performance Perf of queries or internals. Solution not expected to change functional behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting and removed O-community-questions labels Apr 24, 2018
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 24, 2018
Fixes cockroachdb#23945.
See cockroachdb#20448.

This change addresses a case where delayed BeginTxn requests can result
in txn records looking inactive immediately upon being written. We now
bump the txn record's LastHeartbeat timestamp when writing the record.

Release note: None
craig bot pushed a commit that referenced this issue Apr 24, 2018
24969: opt: Hoist Exists operator and try to decorrelate r=andy-kimball a=andy-kimball

This PR has two commits for easier review. The first adds support for OuterCols as a relational property. The second uses that to implement transforming the Exists operator into a SemiJoinApply operator, and then attempting to decorrelate that.

25034: storage: bump LastHeartbeat timestamp when writing txn record r=nvanbenschoten a=nvanbenschoten

Fixes #23945.
See #20448.

This change addresses a case where delayed BeginTxn requests can result
in txn records looking inactive immediately upon being written. We now
bump the txn record's LastHeartbeat timestamp when writing the record.

Release note: None

Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten nvanbenschoten added the A-kv-transactions Relating to MVCC and the transactional model. label Apr 24, 2018
craig bot pushed a commit that referenced this issue May 31, 2018
25014: storage: queue requests to push txn / resolve intents on single keys r=spencerkimball a=spencerkimball

Previously, high contention on a single key would cause every thread to
push the same conflicting transaction then resolve the same intent in
parallel. This is inefficient as only one pusher needs to succeed, and
only one resolver needs to resolve the intent, and then only one writer
should proceed while the other readers/writers should in turn wait on
the previous writer by pushing its transaction. This effectively
serializes the conflicting reader/writers.
    
One complication is that all pushers which may have a valid, writing
transaction (i.e., `Transaction.Key != nil`), must push either the
conflicting transaction or another transaction already pushing that
transaction. This allows dependency cycles to be discovered.

Fixes #20448 

25791: jobs: bump default progress log time to 30s r=mjibson a=mjibson

The previous code allowed updates to be performed every 1s, which could
cause the MVCC row to be very large causing problems with splits. We
can update much more slowly by default. In the case of a small backup
job, the 5% fraction threshold will allow a speedier update rate.

Remove a note that's not useful anymore since the referred function
can now only be used in the described safe way.

See #25770. Although this change didn't fix that bug, we still think
it's a good idea.

Release note: None

26293: opt: enable a few distsql logictests r=RaduBerinde a=RaduBerinde

 - `distsql_indexjoin`: this is only a planning test. Modifying the
   split points and queries a bit to make the condition more
   restrictive and make the optimizer choose index joins. There was a
   single plan that was different, and the difference was minor (the
   old planner is emitting an unnecessary column).

 - `distsql_expr`: logic-only test, enabling for opt.

 - `distsql_scrub`: planning test; opt version commented out for now.

Release note: None

Co-authored-by: Spencer Kimball <[email protected]>
Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@craig craig bot closed this as completed in #25014 May 31, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-community Originated from the community S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Projects
None yet
Development

Successfully merging a pull request may close this issue.