sql/importer: TestImportWorkerFailure failed #102839
The full failure looks like this:
I wasn't able to repro this, although the failure has happened twice. I wonder if someone with strong jobs mojo could take a look?
Using stress, after 2885 runs I got:
Assigning @cucaroach to skip this test (and mark this issue with the skipped-test label); we can put it on Bugs to Fix afterwards if the fix is not obvious.
On the first one that complained about the transport closing, I suspect that is this. My guess is that this SELECT is being distributed, and thus when a node (other than the gateway) is terminated, the execution of the distributed query fails, but this code doesn't retry it and assumes any error is a failure of the job, so the job fails? I have no idea at all about that second one -- that one looks just... wrong... right?
Thanks for looking! I believe the internal executor will always run with DistSQLMode of 0 (i.e. off), but maybe the DistSender gRPC failed? It's a little surprising that a raw gRPC error made it all the way to the results iterator without getting wrapped/adorned with more info; if I could repro it, I could try to fix that.
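(For illustration only, not the actual importer or jobs code: a minimal Go sketch of the classification being discussed, where a raw gRPC transport error is treated as retryable rather than as a permanent job failure. The isRetryable, runQuery, and runWithRetry names are invented for this sketch; only the grpc status and codes packages are real.)

```go
// Sketch: classify a raw gRPC transport failure as retryable instead of
// treating every error from the results iterator as a permanent job failure.
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isRetryable reports whether err looks like a transient RPC failure, e.g.
// "transport is closing" when a remote node is terminated mid-query.
func isRetryable(err error) bool {
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Unavailable, codes.Aborted:
			return true
		}
	}
	return errors.Is(err, context.DeadlineExceeded)
}

// runQuery stands in for iterating the internal executor's results; here it
// always fails with a transport-style error to exercise the retry path.
func runQuery(ctx context.Context) error {
	return status.Error(codes.Unavailable, "transport is closing")
}

// runWithRetry retries runQuery on transient errors instead of failing on
// the first error it sees.
func runWithRetry(ctx context.Context, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = runQuery(ctx); err == nil || !isRetryable(err) {
			return err
		}
		fmt.Printf("attempt %d failed with transient error: %v\n", i+1, err)
	}
	return err
}

func main() {
	if err := runWithRetry(context.Background(), 3); err != nil {
		fmt.Println("giving up:", err)
	}
}
```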
Informs cockroachdb#102839 Release note: none
103016: kvserver: deflake TestAcquireLeaseTimeout r=erikgrinaker a=tbg

There are a few callers that use non-cancelable contexts, and if one of them got into the test we'd deadlock. Touches #102975. (The backport will close it.) Epic: none Release note: None

103033: import: skip TestImportWorkerFailure r=cucaroach a=cucaroach

Informs #102839 Release note: none

Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Tommy Reilly <[email protected]>
That's not the case anymore, as of #101486.
Right, but that hasn't been backported to 23.1 as far as I can tell. And wouldn't we expect to see some distsql error wrapping the underlying gRPC error, like "inbox communication error"?
Hm, I didn't notice that this is 23.1. Indeed, #101486 should be ignored. However, #93218 could make some internal queries use DistSQL when they didn't previously (context is in https://cockroachlabs.slack.com/archives/C0168LW5THS/p1679603545588689?thread_ts=1679437525.752149&cid=C0168LW5THS):
I don't think it's guaranteed to always be the case.
sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ e84e2cabef32a59d88041132f059a6768dfe56bb:
Parameters:
I should note that this has only failed under race and deadlock, and I've also seen it fail under stress. It's tempting to just disable it under those conditions to get the test running again but leave this bug open to try to get a grip on the legit failures.
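(A rough sketch of the "disable under those conditions" idea, assuming CockroachDB's pkg/testutils/skip helpers UnderRace, UnderDeadlock, and UnderStress; the messages and placement are illustrative, not the actual change.)

```go
// Sketch: skip only in the configurations where the flake has been observed,
// so the test keeps running in ordinary builds.
package importer

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestImportWorkerFailure(t *testing.T) {
	// Skip under the builds where the failure has been seen.
	skip.UnderDeadlock(t, "flaky under deadlock; see #102839")
	skip.UnderRace(t, "flaky under race; see #102839")
	skip.UnderStress(t, "flaky under stress; see #102839")

	// ... the rest of the test runs the import and shuts down a worker node ...
}
```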
sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ 78dae31f503cec8e00fa2f18ed6a65da6042acbe:
Parameters:
…deadlock
Better to have the test running some of the time to catch regressions. Informs: cockroachdb#102839 Epic: None Release note: None
Removed the skipped-test label; it's not skipped on 23.1 at all, and it's skipped only under deadlock, race, and stress on master.
Informs: cockroachdb#102839 Release note: None
Informs: cockroachdb#102839 Release note: None
When this test was added five years ago, it was checked in skipped, with this comment:
// TODO(mjibson): Although this test passes most of the time it still
// sometimes fails because not all kinds of failures caused by shutting a
// node down are detected and retried.
t.Skip("flakey due to undetected kinds of failures when the node is shutdown")
With this history in mind, I have a proposal for this test. Maybe @dt or @stevendanna could comment on whether this makes sense:
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey. Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. This PR also completely unskips `TestImportWorkerFailure` so that we can test the atomicity of imports more thoroughly. Fixes: cockroachdb#102839 Release note: None
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey. Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in cockroachdb#108547. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. This PR also unskips `TestImportWorkerFailure` under stress so that we can test the atomicity of imports more thoroughly. Fixes: cockroachdb#102839 Release note: None
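(A hedged sketch of the atomicity check described in this PR: whether the IMPORT succeeds or fails after a node is shut down, the target table should contain either all rows or none. The checkImportAtomicity helper and its parameters are placeholder names, not the real test's code.)

```go
// Sketch: verify import atomicity after a node shutdown. On success the full
// dataset must be present; on failure nothing must be left behind.
package importer

import (
	"database/sql"
	"testing"
)

func checkImportAtomicity(t *testing.T, db *sql.DB, importErr error, expectedRows int) {
	t.Helper()
	var got int
	if err := db.QueryRow(`SELECT count(*) FROM t`).Scan(&got); err != nil {
		t.Fatal(err)
	}
	if importErr == nil && got != expectedRows {
		t.Fatalf("import succeeded but only %d of %d rows are present", got, expectedRows)
	}
	if importErr != nil && got != 0 {
		t.Fatalf("import failed (%v) yet %d rows were left behind", importErr, got)
	}
}
```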
108210: cli: add limited statement_statistics to debug zip r=j82w a=j82w

This adds statement_statistics to the debug zip. It is limited to the transaction fingerprint ids in the transaction_contention_events table, because the statement_statistics table maps the fingerprint to the query text. It also adds the top 100 statements by CPU usage. Closes: #108180 Release note (cli change): Added limited statement_statistics to the debug zip.

108382: ccl/sqlproxyccl: serve a dirty cache whenever the watcher fails r=JeffSwenson a=jaylim-crl

Previously, we would invalidate all tenant metadata entries whenever the watcher failed. This can cause issues when the directory server fails (e.g. the Kubernetes API server is down). It is possible that existing SQL pods are still up, but we're invalidating the entire directory cache; we should allow incoming requests with existing SQL pods to connect to those pods. This commit addresses the issue by serving a stale cache whenever the watcher fails, rather than invalidating the cache. Release note: None Epic: CC-25053

108626: importer: only check import *atomicity* in TestImportWorkerFailure r=dt,yuzefovich,cucaroach a=michae2

Five years ago, in #26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey. Two months ago, in #105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in #108547. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. Fixes: #102839 Release note: None

Co-authored-by: j82w <[email protected]> Co-authored-by: Jay <[email protected]> Co-authored-by: Michael Erickson <[email protected]>
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey. Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in cockroachdb#108547. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. Fixes: cockroachdb#102839 Release note: None
sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ 4226a83871bbce776bc9389fca5cf084b4bb7632:
Parameters:
TAGS=bazel,gss,race
Help
See also: How To Investigate a Go Test Failure (internal)
Jira issue: CRDB-27679