roachtest: import/nodeShutdown/worker failed #81353

Closed
cockroach-teamcity opened this issue May 17, 2022 · 3 comments · Fixed by #89354
Assignees
Labels
branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1
C-test-failure Broken test (automatically or manually discovered).
O-roachtest
O-robot Originated from a bot.
T-disaster-recovery

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented May 17, 2022

roachtest.import/nodeShutdown/worker failed with artifacts on release-22.1 @ f6578028f0b029cbb623bd278c57c6bef85c4835:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/import/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,import.go:95,test_runner.go:875: monitor failure: monitor task failed: unexpectedly found job 762493548270747650 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:95
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:875
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 762493548270747650 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

Jira issue: CRDB-15229

@cockroach-teamcity cockroach-teamcity added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 17, 2022
@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@stevendanna stevendanna self-assigned this May 24, 2022
@stevendanna
Collaborator

stevendanna commented May 25, 2022

I'm going to leave the release blocker on this for now. This test starts a job, kills a node, and asserts that the job still runs to completion.
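
A minimal sketch of that assertion, assuming a SHOW JOBS-based poll (the real logic lives in jobSurvivesNodeShutdown in pkg/cmd/roachtest/tests/jobs.go; the function name, query, and polling interval here are illustrative, not the actual test code):

```
import (
	"context"
	gosql "database/sql"
	"time"

	"github.com/cockroachdb/errors"
)

// waitForJobSuccess polls the job's status after a node has been shut down
// and fails if the job reaches a terminal state other than "succeeded".
func waitForJobSuccess(ctx context.Context, db *gosql.DB, jobID int64) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			var status string
			if err := db.QueryRowContext(ctx,
				`SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID,
			).Scan(&status); err != nil {
				return err
			}
			switch status {
			case "succeeded":
				return nil
			case "failed", "canceled":
				// This is the failure mode reported in this issue.
				return errors.Newf("unexpectedly found job %d in state %s", jobID, status)
			}
		}
	}
}
```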

In this test run, the job was created using gateway node n2 and was adopted by n4. The test runner then killed n3.

Test runner:

05:22:44 jobs.go:73: started running job with ID 762493548270747650
05:22:45 cluster.go:659: test status: stopping nodes :3

n4:
I220517 05:22:45.776328 326 jobs/adopt.go:105 ⋮ [n4] 157  claimed 1 jobs
I220517 05:22:45.781259 2746 jobs/adopt.go:240 ⋮ [n4] 158  job 762493548270747650: resuming execution
I220517 05:22:45.790718 2749 jobs/registry.go:1134 ⋮ [n4] 159  IMPORT job 762493548270747650: stepping through state running with error: <nil>

The job then failed because n4's circuit breaker for dialing n3 was open:

I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190  IMPORT job 762493548270747650: stepping through state reverting with error: unable to dial n3: ‹breaker open›
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +(1) attached stack trace
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  -- stack trace:
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | 	github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:193
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).Dial
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | 	github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:118
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | github.com/cockroachdb/cockroach/pkg/sql.runnerRequest.run
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | 	github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:85
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).initRunners.func1
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | 	github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:115
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | runtime.goexit
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +  | 	GOROOT/src/runtime/asm_amd64.s:1581
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +Wraps: (2) unable to dial n3
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +Wraps: (3) ‹breaker open›
I220517 05:22:54.976787 2749 jobs/registry.go:1134 ⋮ [n4] 190 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

@stevendanna
Collaborator

I don't think this needs to block the release, but it is something we should continue to look into.

@stevendanna stevendanna removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 30, 2022
@cockroach-teamcity
Member Author

roachtest.import/nodeShutdown/worker failed with artifacts on release-22.1 @ 055c8b6bfb804ac6ddbfe1937ddd49d9b2e5eac1:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/import/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,import.go:95,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 788540211621658626 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:95
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 788540211621658626 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

stevendanna added a commit to stevendanna/cockroach that referenced this issue Oct 4, 2022
We would like to retry circuit breaker open errors. In fact,
jobs.IsPermanentBulkJobError already looks like it would return false
for breaker open errors.

But, there are actually two circuit breaker packages we use:

    github.com/cockroachdb/circuitbreaker
    github.com/cockroachdb/cockroach/pkg/util/circuit

Both define ErrBreakerOpen. IsPermanentBulkJobError would only catch
errors from one of these packages. Now, we test for both.

As a result, ErrBreakerOpen errors emerging from the nodedialer will
now be retried.

Fixes cockroachdb#89159
Fixes cockroachdb#85111
Fixes cockroachdb#81353

I may be being a bit optimistic that this will fully fix those
failures. Success of the job still requires that the retry of the job
is successful.

Release note (bug fix): Fix bug that resulted in some retriable errors
not being retried during IMPORT.
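
The gist of that change, as a sketch (it assumes, as the commit message says, that both packages export an ErrBreakerOpen sentinel usable with errors.Is; the helper name is illustrative rather than the actual code):

```
import (
	circuitbreaker "github.com/cockroachdb/circuitbreaker"
	"github.com/cockroachdb/cockroach/pkg/util/circuit"
	"github.com/cockroachdb/errors"
)

// isBreakerOpenError reports whether err originated from an open circuit
// breaker in either of the two breaker packages, so that
// jobs.IsPermanentBulkJobError can treat such errors as retriable instead
// of permanently failing the job.
func isBreakerOpenError(err error) bool {
	return errors.Is(err, circuitbreaker.ErrBreakerOpen) ||
		errors.Is(err, circuit.ErrBreakerOpen)
}
```
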
craig bot pushed a commit that referenced this issue Oct 5, 2022
89261: memo: fix zigzag join stats and costs r=msirek a=msirek

Fixes https://github.com/cockroachlabs/support/issues/1821

Non-covering zigzag join can have a selectivity estimate orders of
magnitude lower than competing plans, causing its cost to be
underestimated. This can make the optimizer choose zigzag join when
there are many qualified rows, which is known to perform poorly.

Also, the per-row cost of zigzag join is underestimated so that even
if selectivity estimates are accurate, the optimizer may still plan
a query using a slow zigzag join.

The selectivity issue is due to a difference between how `buildSelect`
and `buildZigZagJoin` in the `statisticsBuilder` treat constraints
(A filtered Select from the base table should have the same selectivity
as the zigzag join). In `buildSelect`, `filterRelExpr` builds a
filtered histogram via `applyFilters` with new `DistinctCount`s,
then calculates selectivity on the constrained columns, taking into
account which `histCols` already adjusted `DistinctCount`.
```
numUnappliedConjuncts, constrainedCols, histCols :=
    sb.applyFilters(filters, e, relProps, false /* skipOrTermAccounting */)
...
corr := sb.correlationFromMultiColDistinctCounts(constrainedCols, e, s)
s.ApplySelectivity(sb.
    selectivityFromConstrainedCols(constrainedCols, histCols, e, s, corr))
```
In `buildZigZagJoin`, `applyFilters` is also called, but the
information about which columns adjusted stats is not considered:
```
multiColSelectivity, _ :=
    sb.selectivityFromMultiColDistinctCounts(constrainedCols, zigzag, s)
s.ApplySelectivity(multiColSelectivity)
```
The solution is to update `buildZigZagJoin` to match the logic in
`filterRelExpr`. This can't be done for zigzag join on inverted indexes
because the constraints aren't pushed into the ON clause. Validating
zigzag join stats on inverted indexes is left for future work.
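
Concretely, the updated buildZigZagJoin would mirror the filterRelExpr path shown above, roughly as follows (a sketch assembled from the snippets in this description; the receiver and argument details are assumptions):

```
// Track which columns' histograms already adjusted DistinctCount, then feed
// that into the selectivity calculation, as filterRelExpr does.
_, constrainedCols, histCols :=
    sb.applyFilters(filters, zigzag, relProps, false /* skipOrTermAccounting */)
corr := sb.correlationFromMultiColDistinctCounts(constrainedCols, zigzag, s)
s.ApplySelectivity(
    sb.selectivityFromConstrainedCols(constrainedCols, histCols, zigzag, s, corr))
```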

The costing issue is simply that seek costs are using `seqIOCostFactor`
instead of `randIOCostFactor` like lookup join and inverted join use:
```
cost := memo.Cost(rowCount) * (2*(cpuCostFactor+seqIOCostFactor)
                                       + scanCost + filterPerRow)
```
Every time zigzag join zigs or zags and starts a new scan, that initial
read is like a random IO and incurs some startup overhead. In fact,
profiling has shown it to be quite expensive. The solution is to make
the seek cost be at least on par with lookup join by replacing
`seqIOCostFactor` with `randIOCostFactor + lookupJoinRetrieveRowCost`.
Further fine-tuning may be needed. It may be possible to speed up zigzag
join by trying a point lookup to find a match in the other index before
starting a new scan. This improvement and refinement of costs could be
done simultaneously.
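
Stated as code, the described costing change amounts to roughly the following (a sketch of the replacement, not the exact diff):

```
// Before: each zig/zag seek was charged like a sequential read.
cost := memo.Cost(rowCount) * (2*(cpuCostFactor+seqIOCostFactor) +
    scanCost + filterPerRow)

// After: charge each seek like a random read plus the per-row retrieval
// overhead used for lookup joins, so a seek costs at least as much as a
// lookup join's.
cost = memo.Cost(rowCount) * (2*(cpuCostFactor+randIOCostFactor+lookupJoinRetrieveRowCost) +
    scanCost + filterPerRow)
```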

Release note (bug fix): This patch fixes optimizer selectivity and cost
estimates of zigzag join in order to prevent query plans from using it
when it would perform poorly (when many rows are qualified).

89354: jobs,sql/importer: retry circuit breaker open errors r=dt a=stevendanna

We would like to retry circuit breaker open errors. In fact, jobs.IsPermanentBulkJobError already looks like it would return false for breaker open errors.

But, there are actually two circuit breaker packages we use:

    github.com/cockroachdb/circuitbreaker
    github.com/cockroachdb/cockroach/pkg/util/circuit

Both define ErrBreakerOpen. IsPermanentBulkJobError would only catch errors from one of these packages. Now, we test for both.

As a result, ErrBreakerOpen errors emerging from the nodedialer will now be retried.

Fixes #89159
Fixes #85111
Fixes #81353

I may be being a bit optimistic that this will fully fix those failures. Success of the job still requires that the retry of the job is successful.

Release note (bug fix): Fix bug that resulted in some retriable errors not being retried during IMPORT.

Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
@craig craig bot closed this as completed in 6953c0c Oct 5, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 5, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 21, 2022
stevendanna added a commit to stevendanna/cockroach that referenced this issue Nov 10, 2022
stevendanna added a commit that referenced this issue Nov 10, 2022
stevendanna added a commit that referenced this issue Nov 14, 2022
stevendanna added a commit that referenced this issue Nov 16, 2022