roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #39005
Comments
SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog
|
After a completed instance of the load, the next one never quite got to starting. The chaos is this: cockroach/pkg/cmd/roachtest/tpcc.go Lines 735 to 744 in 3cf2306
and there's a neverending stream of
until the test times out. I take this to mean that the prepare phase takes longer than the time until the gateway node is killed. Nodes are killed randomly every 90s, so I assume we somehow got into a state where the PREPAREs never returned. I looked into some of the node logs and didn't see anything out of the ordinary, but of course they're pretty messy thanks to the chaos. The last failure mode in this test's history prior to this issue was an assertion that has since been fixed, but the test has also timed out before: https://teamcity.cockroachdb.com/viewLog.html?buildTypeId=Cockroach_Nightlies_WorkloadNightly&buildId=1367354 |
SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/cfdaadc3514e7e8660f6c009ba159fdfd604f0a8 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1409070&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/5bd37e8eb58ca66b9293c234bc572411057fec3a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417287&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1460982&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/66bd279c9aa682c2b7adcec87ec0c639b8039a33 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1461635&tab=buildLog
|
Looks like this test needs an update after the recent partitioning changes. |
SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1462518&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&tab=buildLog
|
^- this seems pretty bad,
|
Oh wait, that's the same thing I'm already fixing in |
40248: opt: calculate number of rows processed when costing joins r=rytaft a=rytaft

This PR updates the costing of joins to take into account the number of rows processed by the operator. This number may be larger than the number of output rows if an additional filter is applied as part of the ON condition that is not used to determine equality columns for the join.

For example, consider the query `SELECT * FROM abc JOIN def ON a = e AND b = 3;` Assuming there is no index on b, if a lookup join is used to execute this query, the number of rows processed is actually the same as the query `SELECT * FROM abc JOIN def ON a = e;` The difference is that the filter b=3 must also be applied to every row in the first query. The coster now takes this into account when determining the cost of joins.

Fixes #34810

Release note: None

40431: workload: fix partition commands in tpcc import r=solongordon a=solongordon

The commands for partitioning indexes in the TPCC import were erroring out due to a syntax change introduced in #39332. I updated them to use `ALTER PARTITION ... OF INDEX` rather than `ALTER PARTITION ... OF TABLE`.

Fixes #39005 Fixes #40360 Fixes #40416

Release note: None

Co-authored-by: Rebecca Taft <[email protected]> Co-authored-by: Solon Gordon <[email protected]>
I fixed a related issue but looks like there is still an outstanding one being handled by @tbg. |
SHA: https://github.com/cockroachdb/cockroach/commits/4784fe3c51545db5fb5d411937ec1db2ef2b9761 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1472753&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/201b400c5a59d42d436f417a284a129cff3ed7b3 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1471316&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/47bb2a58c87fc1259291ec9dde78de3e54bd8a3d Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1475396&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/bdf41f7d03f0dafeeaf3bc6aac40f502ab069a6a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1478770&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/66832694652037f18cd4b29e1471cd237009ef98 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1478788&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/991282eacbbe1315fde694be9785ad8f6fa929d3 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1481778&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/42d307e191ff6787a45e058be164fa452c47f368 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1487519&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1488785&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1489712&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/169729d6a3d1c18d5d652fc40a87bbf5a3bb8a00 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1491709&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/d6a7e59e653596b8baca946b6be714956a0e4c2c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1496394&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition
|
SHA: https://github.com/cockroachdb/cockroach/commits/073999b81ddfed3bbc8409d534912fea12b6d500 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1498740&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition
|
@nvanbenschoten this doesn't seem like a real problem? Is the test configured to run for longer than the default timeout of 10 hours by accident? |
SHA: https://github.com/cockroachdb/cockroach/commits/d6a7e59e653596b8baca946b6be714956a0e4c2c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1499672&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition
|
SHA: https://github.com/cockroachdb/cockroach/commits/a92c7d01d3076eabafbd536d8a344511ec9081c6 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1500206&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition
|
Same as above, I think? I think this test is passing and just needs to be taught not to kill itself at the end. Who owns this fella? @tbg @nvanbenschoten |
Let's see |
Doesn't seem as simple as that. The test is here cockroach/pkg/cmd/roachtest/tpcc.go Lines 394 to 402 in 2fd18e8
and basically starts at 600 warehouses, line searching to figure out the max warehouse count that passes. From the log I see that it passes 600, 610, 630, 670 and then something goes wrong. It never missed the tpmC goal, so IMO it can't have been ready to quit (is that right @nvanbenschoten?). Lots of random stuff happens in the logs so I'm not quite sure what's going on, but I think it's two things: first, "something" goes wrong that stalls the test. Then, at the 12h mark, the machines get nuked and we see `ssh: connect to host 35.184.118.128 port 22: Connection refused` for various IPs (presumably the nodes in the cluster), followed by the test terminating. I'm going to have to take a closer look at what the "something" above is. |
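For readers unfamiliar with the search, here is a minimal sketch of a line search that is consistent with the probe sequence reported above (600, 610, 630, 670). It is not the roachtest implementation: the function name `lineSearch`, its parameters, the step-doubling rule, and the bisection phase are all assumptions made for illustration, as is the pass criterion used in `main`.

```go
package main

import "fmt"

// lineSearch is a hypothetical helper, not the roachtest code. It returns the
// largest value in [start, limit] for which pass reports true, assuming pass
// is monotone (true up to some threshold, false above it). It returns -1 if
// even start fails.
func lineSearch(start, step, limit int, pass func(int) bool) int {
	if !pass(start) {
		return -1
	}
	lo := start // highest value known to pass
	hi := -1    // lowest value known to fail, -1 while unknown

	// Phase 1: walk upward, doubling the step after every passing probe.
	for lo < limit {
		next := lo + step
		if next > limit {
			next = limit
		}
		if !pass(next) {
			hi = next
			break
		}
		lo = next
		step *= 2
	}
	if hi == -1 {
		return lo // everything up to limit passed
	}

	// Phase 2: bisect between the last pass and the first failure.
	for hi-lo > 1 {
		mid := lo + (hi-lo)/2
		if pass(mid) {
			lo = mid
		} else {
			hi = mid
		}
	}
	return lo
}

func main() {
	// Assumed pass criterion for the demo: up to 700 warehouses are sustained.
	var probes []int
	best := lineSearch(600, 10, 2000, func(w int) bool {
		probes = append(probes, w)
		return w <= 700
	})
	fmt.Println("probes:", probes)
	fmt.Println("max passing warehouses:", best)
}
```

With a starting point of 600 and an initial step of 10, the first four probes are 600, 610, 630 and 670, matching the log. A search like this only terminates once some probe fails or the upper limit is reached, which fits the observation that the test could not have been ready to quit without ever missing its goal.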
I think what's happening is that the chaos runner is restarting a node, and while that is happening, the test harness is stopping the cluster. I see in CHAOS.log:
which interleaves with
in the main log. This does seem to check out with the code, where we use the outer cockroach/pkg/cmd/roachtest/tpcc.go Lines 758 to 781 in 2fd18e8
I started a run with today's |
This reverts commit f56a83d. That fix broke backward compatibility with v2.1, and it is no longer necessary due to 65c5e37. Refers cockroachdb#39005 Release note: None Release justification: Non-production code change
40984: workload: revert partition fix for tpcc import r=solongordon a=solongordon This reverts commit f56a83d. That fix broke backward compatibility with v2.1, and it is no longer necessary due to 65c5e37. Refers #39005 Release note: None Release justification: Non-production code change Co-authored-by: Solon Gordon <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/6b14c0aa3ed1b4ba6d5f937e9352c5383afe1c37 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1501050&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition
|
40981: roachtest: deflake tpccbench chaos tests r=nvanbenschoten a=tbg These tests were running a chaos agent across cluster restarts. Whenever a cluster restart would overlap with the chaos agent restarting a node, one of the two operations would fail and jam the test. Fixes #39005. I suggest https://github.com/cockroachdb/cockroach/pull/40981/files?w=1 for reviewing (i.e. ignore whitespace). Release justification: de-flakes a roachtest without changes to the release binary. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/1ca35fc4a0e2665e7f6efd945e65a0db97984fa7
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1396096&tab=buildLog