roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #70410

Closed
cockroach-teamcity opened this issue Sep 20, 2021 · 6 comments · Fixed by #70447

cockroach-teamcity (Member) commented Sep 20, 2021

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ d9278939d4be4d1109e9ed0a84c6e9cd2a8705bb:

		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:902
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1987
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.loadTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:966
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1066
		  | main.(*monitorImpl).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:106
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) output in run_092812.543732632_n1_cockroach_workload_fixtures_import_tpcc
		Wraps: (7) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-3466946-1632119031-36-n12cpu4-geo:1 -- ././cockroach workload fixtures import tpcc --warehouses=3000 --checks=false --partitions=3 --zones="us-east1-b,us-west1-b,europe-west2-b" returned
		  | stderr:
		  | I210920 09:28:13.712212 1 ccl/workloadccl/fixture.go:345  [-] 1  starting import of 9 tables
		  | I210920 09:28:30.516886 68 ccl/workloadccl/fixture.go:478  [-] 2  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 16.795294584s, 0.46 MiB/s)
		  | I210920 09:28:35.638880 31 ccl/workloadccl/fixture.go:478  [-] 3  imported 3.0 MiB in district table (30000 rows, 0 index entries, took 21.918323337s, 0.13 MiB/s)
		  | I210920 09:28:37.368016 30 ccl/workloadccl/fixture.go:478  [-] 4  imported 160 KiB in warehouse table (3000 rows, 0 index entries, took 23.647565382s, 0.01 MiB/s)
		  | I210920 09:29:21.186560 67 ccl/workloadccl/fixture.go:478  [-] 5  imported 383 MiB in new_order table (27000000 rows, 0 index entries, took 1m7.464939651s, 5.68 MiB/s)
		  | I210920 09:34:46.272774 66 ccl/workloadccl/fixture.go:478  [-] 6  imported 4.7 GiB in order table (90000000 rows, 90000000 index entries, took 6m32.551220907s, 12.30 MiB/s)
		  | I210920 09:35:29.312540 33 ccl/workloadccl/fixture.go:478  [-] 7  imported 6.5 GiB in history table (90000000 rows, 0 index entries, took 7m15.591492291s, 15.18 MiB/s)
		  | I210920 09:43:21.245596 32 ccl/workloadccl/fixture.go:478  [-] 8  imported 52 GiB in customer table (90000000 rows, 90000000 index entries, took 15m7.524314635s, 58.41 MiB/s)
		  | I210920 09:49:11.939503 69 ccl/workloadccl/fixture.go:478  [-] 9  imported 90 GiB in stock table (300000000 rows, 0 index entries, took 20m58.217797335s, 73.36 MiB/s)
		  | I210920 09:50:36.478020 70 ccl/workloadccl/fixture.go:478  [-] 10  imported 51 GiB in order_line table (900015304 rows, 0 index entries, took 22m22.75670363s, 38.65 MiB/s)
		  | Error: importing fixture: importing table district: pq: relation "district" is offline: importing
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ``````
		  |   | ././cockroach workload fixtures import tpcc --warehouses=3000 --checks=false --partitions=3 --zones="us-east1-b,us-west1-b,europe-west2-b"
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (8) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *exec.ExitError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


Epic CRDB-7779

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels on Sep 20, 2021
tbg assigned dt on Sep 20, 2021
tbg (Member) commented Sep 20, 2021

Looks similar to #61708.

ajwerner (Contributor) commented

If you download the debug zip, you'll see that the district table's import job has been marked successful. You'll also notice that, if you decode the descriptors, the table is not marked public.

cat ~/Downloads/system.descriptor.txt | awk '{print $(NF)}' | ~/local/1/cockroach debug decode-proto | tail -n+2 | jq '{"id": .table.id, "name": .table.name, "state": .table.state} | select(.state)'

{
  "id": 60,
  "name": "district",
  "state": "OFFLINE"
}

adityamaru added a commit to adityamaru/cockroach that referenced this issue Sep 20, 2021
This change pulls the logic for inserting table statistic stubs for
new tables created during an import out of the txn closure that moves
the tables from `offline` to `public`. It was motivated by the failure
mode observed in cockroachdb#70410, where we failed to write the stats
stub for a particular table and ended up with a descriptor published
in the offline state.

Fixes: cockroachdb#70410

Release note: None
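
A minimal sketch of the shape of this fix, under assumed names (descsTxn, publishTables, and insertStatsStubs are illustrative stand-ins for the import job's helpers, not CockroachDB's actual API): the stats stubs are written only after the publishing txn has committed, so an error from the stats write can no longer interfere with the txn that flips tables to public.

package main

import "log"

// Illustrative stand-ins for the import job's helpers; these names are
// assumptions for the sketch, not CockroachDB's actual API.
func descsTxn(body func() error) error { return body() }
func publishTables() error             { return nil }
func insertStatsStubs() error          { return nil }

// publishAndWriteStats shows the post-fix ordering: first commit the txn
// that moves tables from offline to public, then write the stats stubs
// best-effort, outside any txn retry loop.
func publishAndWriteStats() error {
	if err := descsTxn(publishTables); err != nil {
		return err
	}
	// Failure to create statistics still does not fail the import, but it
	// now also cannot restart or roll back the publishing transaction.
	if err := insertStatsStubs(); err != nil {
		log.Printf("warning: failed to write stats stubs: %v", err)
	}
	return nil
}

func main() {
	if err := publishAndWriteStats(); err != nil {
		log.Fatalf("import failed: %v", err)
	}
}
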
adityamaru (Contributor) commented

What we know so far:

This code seems like the likely culprit:

for _, tbl := range details.Tables {
	if tbl.IsNew {
		desc := tabledesc.NewUnsafeImmutable(tbl.Desc)
		id := roachpb.BulkOpSummaryID(uint64(desc.GetID()), uint64(desc.GetPrimaryIndexID()))
		rowCount := uint64(res.EntryCounts[id])
		// TODO(michae2): collect distinct and null counts during import.
		distinctCount := uint64(float64(rowCount) * memo.UnknownDistinctCountRatio)
		nullCount := uint64(float64(rowCount) * memo.UnknownNullCountRatio)
		// Because we don't yet have real distinct and null counts, only produce
		// single-column stats to avoid the appearance of perfectly correlated
		// columns.
		multiColEnabled := false
		statistics, err := sql.StubTableStats(desc, jobspb.ImportStatsName, multiColEnabled)
		if err == nil {
			for _, statistic := range statistics {
				statistic.RowCount = rowCount
				statistic.DistinctCount = distinctCount
				statistic.NullCount = nullCount
			}
			err = stats.InsertNewStats(ctx, execCfg.InternalExecutor, txn, statistics)
		}
		if err != nil {
			// Failure to create statistics should not fail the entire import.
			log.Warningf(
				ctx, "error while creating statistics during import of %q: %v",
				desc.GetName(), err,
			)
		}
	}
}

It's reading from the old descriptors stored on the job, in the same DescsTxn as the one in which we publish descriptors to the public state.

We do see the writing of statistic stubs failing for the district table with:

1.unredacted/cockroach.log:W210920 09:28:31.139138 7447 ccl/importccl/import_stmt.go:2426 ⋮ [n1,job=‹694877204414857217›] 1065  error while creating statistics during import of ‹"district"›: ‹TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "unnamed" meta={id=5c415355 key=/Table/SystemConfigSpan/Start pri=0.01494186 epo=0 ts=1632130109.064229438,1 min=1632130109.003469602,0 seq=5} lock=true stat=PENDING rts=1632130109.003469602,0 wto=false gul=1632130109.503469602,0›

We swallow the above error when writing table stats stubs, so the transaction should not roll back. It does indeed appear that details.TablePublished, which is also set in the same txn, is persisted as true despite this failure. So the txn seems to have committed, BUT the descriptor was published in an offline state. This would only occur if the txn was rolled back, or if we are unintentionally overwriting the descriptor with the old descriptor stored in the job details. I'm not clear how we got into this state given my understanding of txn semantics.

nvanbenschoten (Member) commented

The code you pointed to does seem suspect. We need to propagate errors up to the transaction retry loop for retryable errors to be handled correctly. This sounds like #43067, which is related to #22615.

The behavior that we would see if we failed to propagate the error is that any side effect of the transaction up to the point of the error would roll back, but the transaction would then continue in a new epoch past that point. So any side effect from after this point could still be committed. Is that consistent with the symptoms here?
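
A minimal sketch of that failure mode, using a toy retry loop (runTxn and errRetryable are illustrative stand-ins, not CockroachDB's kv API): the runner can only restart the transaction if the closure returns the retryable error, so swallowing it leaves the closure running against a txn that has already been aborted and restarted underneath it.

package main

import (
	"errors"
	"log"
)

// errRetryable stands in for TransactionRetryWithProtoRefreshError.
var errRetryable = errors.New("retryable transaction error")

// runTxn is a toy stand-in for a kv-style retry loop: it reruns the whole
// closure from the top whenever the closure returns a retryable error.
func runTxn(body func() error) error {
	for {
		err := body()
		if errors.Is(err, errRetryable) {
			continue // restart the txn; writes from the failed attempt roll back
		}
		return err
	}
}

func insertStatsStubs() error { return errRetryable }
func publishTables() error    { return nil }

func main() {
	err := runTxn(func() error {
		if err := insertStatsStubs(); err != nil {
			// BUG (the swallowed-error pattern): runTxn never sees the
			// retryable error, so the closure keeps going; writes before
			// this point roll back while later writes can still commit.
			log.Printf("warning: %v", err)
		}
		// CORRECT: propagate instead, so runTxn restarts from the top:
		//   if err := insertStatsStubs(); err != nil { return err }
		return publishTables()
	})
	if err != nil {
		log.Fatalf("txn failed: %v", err)
	}
}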

adityamaru (Contributor) commented

> Is that consistent with the symptoms here?

Yep, that's exactly what we're observing. Thanks for this!

michae2 (Collaborator) commented Sep 21, 2021

Thank you for the fix @adityamaru and everyone else for the analysis. I promise to propagate retryable errors from now on. 😊

craig bot pushed a commit that referenced this issue Sep 21, 2021
70130: changefeedccl: check that rangefeeds are enabled earlier r=miretskiy a=stevendanna

Pushing this check earlier means that for core changefeeds, the check
happens before the backfill. Previously an error would only be seen
after the backfill when we attempted to start the rangefeed, which
some users found confusing.

Release note: None

70326: [CC-4712] ui: fix drag-to-timerange specific window issue r=Santamaura a=Santamaura

This PR fixes the issue where, when the user drags-to-timerange on a metrics page graph and the window is exactly 24 hours or 1 hour, the time window changes to the past day or past hour. This was due to the function that determines the closest time scale returning a standard option instead of a custom option, because the calculation only considers duration. Some additional logic was added to the function to handle this specific edge case.

Release note (ui change): fix drag-to-timerange for specific window issue

https://user-images.githubusercontent.com/17861665/133663386-8284d5c7-ca96-4d0b-8fbf-d62d4fd4a7e3.mp4

70447: importccl: move stub stats logic outside publish txn r=ajwerner,michae2 a=adityamaru

This change pulls the logic for inserting table statistic stubs for
new tables created during an import out of the txn closure that moves
the tables from `offline` to `public`. It was motivated by the failure
mode observed in #70410, where we failed to write the stats stub for
a particular table and ended up with a descriptor published in the
offline state.

Fixes: #70410

Release note: None

70467: colexecop: harden CloseAndLogOnErr against panics r=yuzefovich a=yuzefovich

We have seen a few cases where closing the `Closer`s would lead to
a panic that wasn't caught because we're outside of the panic-catcher
scope. This commit adds a panic-catcher to `CloseAndLogOnErr` (which is
the "root" way of closing things) in order to increase the stability of
CRDB in the face of edge-case bugs.

Addresses: #70000.
Addresses: #70438.

Release note: None

70484: workload: log histogram write/encode failures, close output file r=erikgrinaker a=stevendanna

We are currently observing incomplete histograms being output during
nightly roachperf tpccbench runs.

I don't think the changes here are likely to address the cause, as I
would expect write failures to affect a broader range of roachperf
output. But it is still good to log any failures we do encounter.

Further, we now sync and close the file explicitly.

Informs #70313

Release note: None

70491: roachtest: retry java install r=miretskiy a=stevendanna

Occasionally, the apt mirrors in GCP return 503 Service
Unavailable.

Here, we retry the install attempt 3 times with some backoff between
attempts.

I've also added the --no-install-recommends flag, although it does
very little in this case.

Release note: None

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Santamaura <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot closed this as completed in 5fba5c5 on Sep 21, 2021