roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #70410

Closed
cockroach-teamcity opened this issue Sep 20, 2021 · 6 comments · Fixed by #70447

cockroach-teamcity (Member) commented Sep 20, 2021

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ d9278939d4be4d1109e9ed0a84c6e9cd2a8705bb:

		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:902
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1987
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.loadTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:966
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1066
		  | main.(*monitorImpl).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:106
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) output in run_092812.543732632_n1_cockroach_workload_fixtures_import_tpcc
		Wraps: (7) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-3466946-1632119031-36-n12cpu4-geo:1 -- ././cockroach workload fixtures import tpcc --warehouses=3000 --checks=false --partitions=3 --zones="us-east1-b,us-west1-b,europe-west2-b" returned
		  | stderr:
		  | I210920 09:28:13.712212 1 ccl/workloadccl/fixture.go:345  [-] 1  starting import of 9 tables
		  | I210920 09:28:30.516886 68 ccl/workloadccl/fixture.go:478  [-] 2  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 16.795294584s, 0.46 MiB/s)
		  | I210920 09:28:35.638880 31 ccl/workloadccl/fixture.go:478  [-] 3  imported 3.0 MiB in district table (30000 rows, 0 index entries, took 21.918323337s, 0.13 MiB/s)
		  | I210920 09:28:37.368016 30 ccl/workloadccl/fixture.go:478  [-] 4  imported 160 KiB in warehouse table (3000 rows, 0 index entries, took 23.647565382s, 0.01 MiB/s)
		  | I210920 09:29:21.186560 67 ccl/workloadccl/fixture.go:478  [-] 5  imported 383 MiB in new_order table (27000000 rows, 0 index entries, took 1m7.464939651s, 5.68 MiB/s)
		  | I210920 09:34:46.272774 66 ccl/workloadccl/fixture.go:478  [-] 6  imported 4.7 GiB in order table (90000000 rows, 90000000 index entries, took 6m32.551220907s, 12.30 MiB/s)
		  | I210920 09:35:29.312540 33 ccl/workloadccl/fixture.go:478  [-] 7  imported 6.5 GiB in history table (90000000 rows, 0 index entries, took 7m15.591492291s, 15.18 MiB/s)
		  | I210920 09:43:21.245596 32 ccl/workloadccl/fixture.go:478  [-] 8  imported 52 GiB in customer table (90000000 rows, 90000000 index entries, took 15m7.524314635s, 58.41 MiB/s)
		  | I210920 09:49:11.939503 69 ccl/workloadccl/fixture.go:478  [-] 9  imported 90 GiB in stock table (300000000 rows, 0 index entries, took 20m58.217797335s, 73.36 MiB/s)
		  | I210920 09:50:36.478020 70 ccl/workloadccl/fixture.go:478  [-] 10  imported 51 GiB in order_line table (900015304 rows, 0 index entries, took 22m22.75670363s, 38.65 MiB/s)
		  | Error: importing fixture: importing table district: pq: relation "district" is offline: importing
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ``````
		  |   | ././cockroach workload fixtures import tpcc --warehouses=3000 --checks=false --partitions=3 --zones="us-east1-b,us-west1-b,europe-west2-b"
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (8) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *exec.ExitError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


Epic CRDB-7779

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels on Sep 20, 2021
tbg assigned dt on Sep 20, 2021
tbg (Member) commented Sep 20, 2021

Looks similar to #61708.

ajwerner (Contributor) commented

If you download the debug zip, you'll see that the district table's import job has been marked successful. You'll also notice that, if you decode the descriptors, the table is not marked public.

cat ~/Downloads/system.descriptor.txt | awk '{print $(NF)}' | ~/local/1/cockroach debug decode-proto | tail -n+2 | jq '{"id": .table.id, "name": .table.name, "state": .table.state} | select(.state)'

{
  "id": 60,
  "name": "district",
  "state": "OFFLINE"
}

adityamaru added a commit to adityamaru/cockroach that referenced this issue Sep 20, 2021
This change pulls the logic for inserting table statistic stubs for
new tables created during an import out of the txn closure that moves
the tables from `offline` to `public`. It was motivated by the failure
mode observed in cockroachdb#70410, where we failed to write the stats
stub for a particular table and ended up with a descriptor published
in the offline state.

Fixes: cockroachdb#70410

Release note: None
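
A minimal sketch of the shape of this fix, under assumed names (descsTxn, publishTables, and insertStatsStubs are illustrative stand-ins for the import job's helpers, not CockroachDB's actual API): the stats stubs are written only after the publishing txn has committed, so an error from the stats write can no longer interfere with the txn that flips tables to public.

package main

import "log"

// Illustrative stand-ins for the import job's helpers; these names are
// assumptions for the sketch, not CockroachDB's actual API.
func descsTxn(body func() error) error { return body() }
func publishTables() error             { return nil }
func insertStatsStubs() error          { return nil }

// publishAndWriteStats shows the post-fix ordering: first commit the txn
// that moves tables from offline to public, then write the stats stubs
// best-effort, outside any txn retry loop.
func publishAndWriteStats() error {
	if err := descsTxn(publishTables); err != nil {
		return err
	}
	// Failure to create statistics still does not fail the import, but it
	// now also cannot restart or roll back the publishing transaction.
	if err := insertStatsStubs(); err != nil {
		log.Printf("warning: failed to write stats stubs: %v", err)
	}
	return nil
}

func main() {
	if err := publishAndWriteStats(); err != nil {
		log.Fatalf("import failed: %v", err)
	}
}
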
adityamaru (Contributor) commented

What we know so far:

This code seems like the likely culprit:

for _, tbl := range details.Tables {
	if tbl.IsNew {
		desc := tabledesc.NewUnsafeImmutable(tbl.Desc)
		id := roachpb.BulkOpSummaryID(uint64(desc.GetID()), uint64(desc.GetPrimaryIndexID()))
		rowCount := uint64(res.EntryCounts[id])
		// TODO(michae2): collect distinct and null counts during import.
		distinctCount := uint64(float64(rowCount) * memo.UnknownDistinctCountRatio)
		nullCount := uint64(float64(rowCount) * memo.UnknownNullCountRatio)
		// Because we don't yet have real distinct and null counts, only produce
		// single-column stats to avoid the appearance of perfectly correlated
		// columns.
		multiColEnabled := false
		statistics, err := sql.StubTableStats(desc, jobspb.ImportStatsName, multiColEnabled)
		if err == nil {
			for _, statistic := range statistics {
				statistic.RowCount = rowCount
				statistic.DistinctCount = distinctCount
				statistic.NullCount = nullCount
			}
			err = stats.InsertNewStats(ctx, execCfg.InternalExecutor, txn, statistics)
		}
		if err != nil {
			// Failure to create statistics should not fail the entire import.
			log.Warningf(
				ctx, "error while creating statistics during import of %q: %v",
				desc.GetName(), err,
			)
		}
	}
}

It's reading from the old descriptors stored on the job, in the same DescsTxn as the one in which we publish descriptors to the public state.

We do see the writing of statistic stubs failing for the district table with:

1.unredacted/cockroach.log:W210920 09:28:31.139138 7447 ccl/importccl/import_stmt.go:2426 ⋮ [n1,job=‹694877204414857217›] 1065  error while creating statistics during import of ‹"district"›: ‹TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "unnamed" meta={id=5c415355 key=/Table/SystemConfigSpan/Start pri=0.01494186 epo=0 ts=1632130109.064229438,1 min=1632130109.003469602,0 seq=5} lock=true stat=PENDING rts=1632130109.003469602,0 wto=false gul=1632130109.503469602,0›

We swallow the above error when writing table stats stubs, so the transaction should not roll back. It does indeed appear that details.TablePublished, which is also set in the same txn, is persisted as true despite this failure. So the txn seems to have committed, BUT the descriptor was published in an offline state. This would only occur if the txn was rolled back, or if we are unintentionally overwriting the descriptor with the old descriptor stored in the job details. I'm not clear how we got into this state given my understanding of txn semantics.

nvanbenschoten (Member) commented

The code you pointed to does seem suspect. We need to propagate errors up to the transaction retry loop for retryable errors to be handled correctly. This sounds like #43067, which is related to #22615.

The behavior that we would see if we failed to propagate the error is that any side effect of the transaction up to the point of the error would roll back, but the transaction would then continue in a new epoch past that point. So any side effect from after this point could still be committed. Is that consistent with the symptoms here?
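
A minimal sketch of that failure mode, using a toy retry loop (runTxn and errRetryable are illustrative stand-ins, not CockroachDB's kv API): the runner can only restart the transaction if the closure returns the retryable error, so swallowing it leaves the closure running against a txn that has already been aborted and restarted underneath it.

package main

import (
	"errors"
	"log"
)

// errRetryable stands in for TransactionRetryWithProtoRefreshError.
var errRetryable = errors.New("retryable transaction error")

// runTxn is a toy stand-in for a kv-style retry loop: it reruns the whole
// closure from the top whenever the closure returns a retryable error.
func runTxn(body func() error) error {
	for {
		err := body()
		if errors.Is(err, errRetryable) {
			continue // restart the txn; writes from the failed attempt roll back
		}
		return err
	}
}

func insertStatsStubs() error { return errRetryable }
func publishTables() error    { return nil }

func main() {
	err := runTxn(func() error {
		if err := insertStatsStubs(); err != nil {
			// BUG (the swallowed-error pattern): runTxn never sees the
			// retryable error, so the closure keeps going; writes before
			// this point roll back while later writes can still commit.
			log.Printf("warning: %v", err)
		}
		// CORRECT: propagate instead, so runTxn restarts from the top:
		//   if err := insertStatsStubs(); err != nil { return err }
		return publishTables()
	})
	if err != nil {
		log.Fatalf("txn failed: %v", err)
	}
}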

adityamaru (Contributor) commented

> Is that consistent with the symptoms here?

Yep, that's exactly what we're observing. Thanks for this!

michae2 (Collaborator) commented Sep 21, 2021

Thank you for the fix @adityamaru and everyone else for the analysis. I promise to propagate retryable errors from now on. 😊

craig bot pushed a commit that referenced this issue Sep 21, 2021
70130: changefeedccl: check that rangefeeds are enabled earlier r=miretskiy a=stevendanna

Pushing this check earlier means that for core changefeeds, the check
happens before the backfill. Previously an error would only be seen
after the backfill when we attempted to start the rangefeed, which
some users found confusing.

Release note: None

70326: [CC-4712] ui: fix drag-to-timerange specific window issue r=Santamaura a=Santamaura

This PR fixes the issue where, when the user drags-to-timerange on a metrics page graph and the window is exactly 24 hours or 1 hour, the time window changes to the past day or past hour. This was due to the function that determines the closest time scale returning a standard option instead of a custom option, because the calculation only considers duration. Some additional logic was added to the function to handle this specific edge case.

Release note (ui change): fix drag-to-timerange for specific window issue

https://user-images.githubusercontent.com/17861665/133663386-8284d5c7-ca96-4d0b-8fbf-d62d4fd4a7e3.mp4

70447: importccl: move stub stats logic outside publish txn r=ajwerner,michae2 a=adityamaru

This change pulls the logic for inserting table statistic stubs for
new tables created during an import out of the txn closure that moves
the tables from `offline` to `public`. It was motivated by the failure
mode observed in #70410, where we failed to write the stats stub for
a particular table and ended up with a descriptor published in the
offline state.

Fixes: #70410

Release note: None

70467: colexecop: harden CloseAndLogOnErr against panics r=yuzefovich a=yuzefovich

We have seen a few cases where closing the `Closer`s would lead to
a panic that wasn't caught because we're outside of the panic-catcher
scope. This commit adds a panic-catcher to `CloseAndLogOnErr` (which is
the "root" way of closing things) in order to increase the stability of
CRDB in the face of edge-case bugs.

Addresses: #70000.
Addresses: #70438.

Release note: None

70484: workload: log histogram write/encode failures, close output file r=erikgrinaker a=stevendanna

We are currently observing incomplete histograms being output during
nightly roachperf tpccbench runs.

I don't think the changes here are likely to address the cause, as I
would expect write failures to affect a broader range of roachperf
output. But it is still good to log any failures we do encounter.

Further, we now sync and close the file explicitly.

Informs #70313

Release note: None

70491: roachtest: retry java install r=miretskiy a=stevendanna

Occasionally, the apt mirrors in GCP return 503 Service
Unavailable.

Here, we retry the install attempt 3 times with some backoff between
attempts.

I've also added the --no-install-recommends flag, although it does
very little in this case.

Release note: None

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Santamaura <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot closed this as completed in 5fba5c5 on Sep 21, 2021