
sql/importer: TestImportWorkerFailure failed #102839

Closed
cockroach-teamcity opened this issue May 6, 2023 · 18 comments · Fixed by #108626
@cockroach-teamcity (Member) commented May 6, 2023

sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ 4226a83871bbce776bc9389fca5cf084b4bb7632:

=== RUN   TestImportWorkerFailure
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/73dfce30b9a5630b1b4dabed3c94b32c/logTestImportWorkerFailure3379689604
    test_log_scope.go:79: use -show-logs to present logs inline
    import_stmt_test.go:5430: pq: updating job details after publishing schemas: job 862797201834016769: select-job: grpc: connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)" [code 14/Unavailable]
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/73dfce30b9a5630b1b4dabed3c94b32c/logTestImportWorkerFailure3379689604
--- FAIL: TestImportWorkerFailure (42.46s)

Parameters: TAGS=bazel,gss,race

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-queries


Jira issue: CRDB-27679

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels May 6, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone May 6, 2023
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label May 6, 2023
@cucaroach (Contributor)
The full failure looks like this:

I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518  IMPORT job 862797201834016769: stepping through state reverting with error: updating job details after publishing schemas: job 862797201834016769: ‹select-job›: grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +(1) attached stack trace
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  -- stack trace:
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql/importer.(*importResumer).publishSchemas.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/importer/import_job.go:1105
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.DescsTxn.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/exec_util.go:3665
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).DescsTxn.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1594
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn.func4
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1682
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.runTxn.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:965
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/txn.go:928
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.runTxn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:964
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:927
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:902
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1670
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).DescsTxn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1592
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.DescsTxn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/exec_util.go:3664
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql/importer.(*importResumer).publishSchemas
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/importer/import_job.go:1076
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql/importer.(*importResumer).Resume
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/importer/import_job.go:331
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1624
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1625
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:474
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*StartableJob).Start.func2
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:978
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +Wraps: (2) updating job details after publishing schemas
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +Wraps: (3) attached stack trace
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  -- stack trace:
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.Updater.update.func2
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/update.go:83
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | [...repeated from below...]
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +Wraps: (4) job 862797201834016769
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +Wraps: (5) attached stack trace
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  -- stack trace:
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).execInternal.func1.1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:971
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:418
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:469
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).queryInternalBuffered
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:583
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryRowExWithCols
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:635
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryRowEx
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:621
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.Updater.update
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/update.go:102
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.Updater.Update
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/update.go:378
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.Updater.SetDetails
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:668
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql/importer.(*importResumer).publishSchemas.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/importer/import_job.go:1103
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.DescsTxn.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/exec_util.go:3665
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).DescsTxn.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1594
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn.func4
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1682
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.runTxn.func1
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:965
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/txn.go:928
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.runTxn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:964
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:927
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:902
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1670
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).DescsTxn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1592
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql.DescsTxn
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/exec_util.go:3664
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql/importer.(*importResumer).publishSchemas
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/importer/import_job.go:1076
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/sql/importer.(*importResumer).Resume
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/sql/importer/import_job.go:331
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1624
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1625
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:474
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*StartableJob).Start.func2
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:978
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | runtime.goexit
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +  | 	src/runtime/asm_amd64.s:1594
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +Wraps: (6) ‹select-job›
I230506 12:13:30.736962 231027 jobs/registry.go:1582 ⋮ [T1,n1] 518 +Wraps: (7) grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
I230506 12:13:30.736962 231027 jobs/registry.go:1582 â‹® [T1,n1] 518 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *status.Error

@cucaroach (Contributor)
I wasn't able to repro this although the failure has happened twice. I wonder if someone with strong jobs mojo could take a look?

@cucaroach (Contributor)
Using stress after 2885 runs I got:

test logs left over in: /tmp/_tmp/060246e25d27a0efc6dd14ec672a0f2b/logTestImportWorkerFailure3155831314
--- FAIL: TestImportWorkerFailure (17.54s)
    test_log_scope.go:161: test logs captured to: /tmp/_tmp/060246e25d27a0efc6dd14ec672a0f2b/logTestImportWorkerFailure3155831314
    test_log_scope.go:79: use -show-logs to present logs inline
    import_stmt_test.go:5435: query 'SELECT * FROM t ORDER BY i': expected:
        0
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        
        got:
        1
        2
        4
        5
        7
        8
        10
        11
        13
        14
        16
        17
        19
        
    panic.go:522: -- test log scope end --
FAIL
I230509 15:29:22.694464 1 (gostd) testmain.go:254  [-] 1  Test //pkg/sql/importer:importer_test exited with error code 1

@rytaft (Collaborator) commented May 9, 2023

Assigning @cucaroach to skip this test (and mark this issue with the skipped-test label); we can put it on Bugs to Fix afterward if the fix is not obvious.

@dt (Member) commented May 9, 2023

on the first one that complained about transport closing, I suspect that is this SELECT query here: https://github.com/cockroachdb/cockroach/blob/439c515b2a0058648731da73993c409544404da1/pkg/jobs/update.go

My guess is that this SELECT is being distributed, and thus when a node (other than the gateway) is terminated, execution of the distributed query fails; this code doesn't retry it, assumes any error is a failure of the job, and so the job fails?

I have no idea at all about that second one -- that one looks just... wrong... right?

@cucaroach (Contributor)

> on the first one that complained about transport closing, I suspect that is this SELECT query here: 439c515/pkg/jobs/update.go
>
> My guess is that this SELECT is being distributed? and thus when a node (other than the gateway) is terminated, the execution of the distributed query fails, but this code doesn't retry it and assumes any error is a failure of the job and so the job fails?

Thanks for looking! I believe the internal executor will always run with DistSQLMode of 0 (i.e. off), but maybe the DistSender gRPC call failed? It's a little surprising that a raw gRPC error made it all the way to the results iterator without getting wrapped/adorned with more info; if I could repro it I could try to fix that.

cucaroach added a commit to cucaroach/cockroach that referenced this issue May 10, 2023
craig bot pushed a commit that referenced this issue May 10, 2023
103016: kvserver: deflake TestAcquireLeaseTimeout r=erikgrinaker a=tbg

There are a few callers that use non-cancelable contexts, and if one of them got into the test we'd deadlock.

Touches #102975.
(The backport will close it).

Epic: none
Release note: None

103033: import: skip TestImportWorkerFailure r=cucaroach a=cucaroach

Informs #102839
Release note: none


Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
@yuzefovich (Member)

> I believe the internal executor will always run with DistSQLMode of 0 (ie off)

That's not the case anymore, as of #101486.

@cucaroach (Contributor)

> That's not the case anymore, as of #101486.

Right, but that hasn't been backported to 23.1 as far as I can tell. And wouldn't we expect to see some distsql error wrapping the underlying grpc error, like "inbox communication error"?

@yuzefovich (Member)

Hm, I didn't see that it's 23.1. Indeed, #101486 should be ignored. However, #93218 could make some internal queries use DistSQL when they didn't previously (context is in https://cockroachlabs.slack.com/archives/C0168LW5THS/p1679603545588689?thread_ts=1679437525.752149&cid=C0168LW5THS):

> I wonder if the big thing that changed was sharing of session data as opposed to using some weird minimal default session data and then that resulted in various features like distsql or the streamer being enabled when before they were not
>
> when did that change?
>
> In cases where the internal executor is used in the context of an open connExecutor transaction, we’ll now use the session data of that outer transaction since #93218


> And wouldn't we expect to see some distsql error wrapping the underlying grpc error, like "inbox communication error"?

I don't think it's guaranteed to always be the case.

@cockroach-teamcity (Member Author)

sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ e84e2cabef32a59d88041132f059a6768dfe56bb:

=== RUN   TestImportWorkerFailure
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/73dfce30b9a5630b1b4dabed3c94b32c/logTestImportWorkerFailure1950566143
    test_log_scope.go:79: use -show-logs to present logs inline
    import_stmt_test.go:5443: pq: updating job details after publishing schemas: job 870445175802920961: select-job: grpc: connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)" [code 14/Unavailable]
    panic.go:522: -- test log scope end --
--- FAIL: TestImportWorkerFailure (43.97s)

Parameters: TAGS=bazel,gss,deadlock


@cucaroach (Contributor)

I should note that this has only failed under race and deadlock builds, and I've also seen it fail under stress. It's tempting to just disable it under those conditions to get the test running again, but leave this bug open so we can get a grip on the legit failures.

@cockroach-teamcity (Member Author)

sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ 78dae31f503cec8e00fa2f18ed6a65da6042acbe:

=== RUN   TestImportWorkerFailure
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/73dfce30b9a5630b1b4dabed3c94b32c/logTestImportWorkerFailure2654984900
    test_log_scope.go:79: use -show-logs to present logs inline
    import_stmt_test.go:5439: pq: updating job details after publishing schemas: job 883166950884081665: select-job: grpc: connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)" [code 14/Unavailable]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/73dfce30b9a5630b1b4dabed3c94b32c/logTestImportWorkerFailure2654984900
--- FAIL: TestImportWorkerFailure (67.07s)

Parameters: TAGS=bazel,gss,deadlock


cucaroach added a commit to cucaroach/cockroach that referenced this issue Jul 17, 2023
…deadlock

Better to have the test running some of the time to catch regressions.

Informs: cockroachdb#102839
Epic: None
Release note: None
@cucaroach (Contributor)

Removed the skipped-test label; it's not skipped on 23.1 at all, and it's skipped only for deadlock, race, and stress on master.

@cucaroach cucaroach added the A-import Issues related to IMPORT syntax label Jul 26, 2023
@yuzefovich (Member)

I got the same concerning failure as this one when stressing for #107456.

michae2 added a commit to michae2/cockroach that referenced this issue Aug 10, 2023
michae2 added a commit to michae2/cockroach that referenced this issue Aug 10, 2023
@michae2 (Collaborator) commented Aug 11, 2023

When this test was added five years ago, it was checked in skipped, with this comment:

	// TODO(mjibson): Although this test passes most of the time it still
	// sometimes fails because not all kinds of failures caused by shutting a
	// node down are detected and retried.
	t.Skip("flakey due to undetected kinds of failures when the node is shutdown")

With this history in mind, I have a proposal for this test. Maybe @dt or @stevendanna could comment on whether this makes sense:

  1. We remove a successful import as a necessary condition for test success. My reading of that comment is that unconditional success of import during node failure has never been guaranteed, though we would like it to be (and maybe it will be one day). I don't think it makes sense to test for something that is not guaranteed.
  2. Instead, we make test success depend on import atomicity: that is, either the import fails and no data is imported, or the import succeeds and all data is imported. IMO this is more important than unconditional success, and the unconditional-success criterion is hiding this more scary failure mode, which we are now tracking separately in import: import worker failure can succeed with missing data #108547.

michae2 added a commit to michae2/cockroach that referenced this issue Aug 11, 2023
Five years ago, in cockroachdb#26881, we changed import to retry on worker
failures, which made imports much more resilient to transient failures
like nodes going down. As part of this work we created
`TestImportWorkerFailure` which shuts down one node during an import,
and checks that the import succeeded. Unfortunately, this test was
checked-in skipped, because though imports were much more resilient to
node failures, they were not completely resilient in every possible
scenario, making the test flakey.

Two months ago, in cockroachdb#105712, we unskipped this test and discovered that
in some cases the import statement succeeded but only imported a partial
dataset. This non-atomicity seems like a bigger issue than whether the
import is able to succeed in every possible transient failure scenario.

This PR changes `TestImportWorkerFailure` to remove successful import as
a necessary condition for test success. Instead, the test now only
checks whether the import was atomic; that is, whether a successful
import imported all data or a failed import imported none. This is more
in line with what we can guarantee about imports today.

This PR also completely unskips `TestImportWorkerFailure` so that we can
test the atomicity of imports more thoroughly.

Fixes: cockroachdb#102839

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Aug 14, 2023
craig bot pushed a commit that referenced this issue Aug 14, 2023
108210: cli: add limited statement_statistics to debug zip r=j82w a=j82w

This adds statement_statistics to the debug zip. It is limited to the transaction fingerprint ids in the transaction_contention_events table, because the statement_statistics table maps the fingerprint to the query text. It also adds the top 100 statements by CPU usage.

closes: #108180

Release note (cli change): Added limited statement_statistics to the debug zip.

108382: ccl/sqlproxyccl: serve a dirty cache whenever the watcher fails r=JeffSwenson a=jaylim-crl

Previously, we will invalidate all tenant metadata entries whenever the
watcher fails. This can cause issues when the directory server fails
(e.g. Kubernetes API server is down). It is possible that existing SQL pods
are still up, but we're invalidating the entire directory cache. We should
allow incoming requests with existing SQL pods to connect to those pods.

This commit addresses the issue by serving a stale cache whenever the watcher
fails and not invalidating the cache.

Release note: None

Epic: CC-25053

108626: importer: only check import *atomicity* in TestImportWorkerFailure r=dt,yuzefovich,cucaroach a=michae2

Five years ago, in #26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure` which shuts down one node during an import, and checks that the import succeeded. Unfortunately, this test was checked-in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey.

Two months ago, in #105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in #108547.

This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today.

Fixes: #102839

Release note: None

Co-authored-by: j82w <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
@craig craig bot closed this as completed in 14ef3c0 Aug 15, 2023
@github-project-automation github-project-automation bot moved this from Bugs to Fix to Done in SQL Queries Aug 15, 2023
michae2 added a commit to michae2/cockroach that referenced this issue Aug 16, 2023
michae2 added a commit that referenced this issue Aug 16, 2023
yuzefovich pushed a commit to yuzefovich/cockroach that referenced this issue Nov 8, 2023
yuzefovich pushed a commit to yuzefovich/cockroach that referenced this issue Nov 8, 2023
Labels
A-import Issues related to IMPORT syntax branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-queries SQL Queries Team