Sql Server Fails to Start with DistSQL Error #106537

Closed
jeffswenson opened this issue Jul 10, 2023 · 0 comments · Fixed by #108336
Labels: A-multitenancy (Related to multi-tenancy), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), db-cy-23, T-sql-queries (SQL Queries Team)

Comments


jeffswenson commented Jul 10, 2023

Describe the problem
SQL servers identify themselves in the system.sql_instances table. The table
contains the SQL address and RPC address of each SQL server. Here are example
contents of the sql_instances table:

[email protected]:26257/defaultdb> select * from system.sql_instances;
  id |                      addr                      |                session_id                |              locality               |                    sql_addr                    | crdb_region | binary_version
-----+------------------------------------------------+------------------------------------------+-------------------------------------+------------------------------------------------+-------------+-----------------
   1 | 10-0-3-139.us-central1.pod.cluster.local:26257 | \x02c1a251a76346fc91267ee61c139e38       | {"Tiers": "region=gcp-us-central1"} | NULL                                           | \x80        | NULL
   2 | 10-0-5-139.us-central1.pod.cluster.local:26257 | \x010180a704ae5254ce4c048c96b79f2f930dfb | {"Tiers": "region=gcp-us-central1"} | 10-0-5-139.us-central1.pod.cluster.local:26257 | \x80        | 23.1

If the sql_instances table contains a stale row whose IP address and port have
been reused by something else, SQL servers fail to start with a DistSQL
error. It looks like a query run during SQL server startup attempts to
schedule a DistSQL flow on the stale instance, and the failure of that DistSQL
query causes the process to crash.

This was discovered while attempting to deflake #105402. The bug could happen
in production if an IP address were reused by another tenant or if a SQL server
container crashed in a pod and was restarted.

To Reproduce
PR #106538 contains a regression test. Running the test fails with the following stack trace:

--- FAIL: TestStartTenantWithStaleInstance (9.74s)
    test_server_shim.go:467: migration-job-find-already-completed: failed to connect to n1 at 127.0.0.1:46163: initial connection heartbeat failed: grpc: connection error: desc = "transport: authentication handshake failed: context deadline exceeded" [code 14/Unavailable]
        (1) attached stack trace
          -- stack trace:
          | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).execInternal.func1.1
          |     github.com/cockroachdb/cockroach/pkg/sql/internal.go:1032
          | github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next.func1
          |     github.com/cockroachdb/cockroach/pkg/sql/internal.go:477
          | github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next
          |     github.com/cockroachdb/cockroach/pkg/sql/internal.go:528
          | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).queryInternalBuffered
          |     github.com/cockroachdb/cockroach/pkg/sql/internal.go:642
          | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryRowExWithCols
          |     github.com/cockroachdb/cockroach/pkg/sql/internal.go:694
          | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryRowEx
          |     github.com/cockroachdb/cockroach/pkg/sql/internal.go:680
          | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryRow
          |     github.com/cockroachdb/cockroach/pkg/sql/internal.go:664
          | github.com/cockroachdb/cockroach/pkg/upgrade/migrationstable.CheckIfMigrationCompleted
          |     github.com/cockroachdb/cockroach/pkg/upgrade/migrationstable/migrations_table.go:118
          | github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager.(*Manager).RunPermanentUpgrades.func2
          |     github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager/manager.go:220
          | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetryEx[...]
          |     github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:142
          | github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager.(*Manager).RunPermanentUpgrades
          |     github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager/manager.go:216
          | github.com/cockroachdb/cockroach/pkg/server.(*SQLServer).preStart
          |     github.com/cockroachdb/cockroach/pkg/server/server_sql.go:1627
          | github.com/cockroachdb/cockroach/pkg/server.(*SQLServerWrapper).PreStart
          |     github.com/cockroachdb/cockroach/pkg/server/tenant.go:746
          | github.com/cockroachdb/cockroach/pkg/server.(*SQLServerWrapper).Start
          |     github.com/cockroachdb/cockroach/pkg/server/tenant.go:913
          | github.com/cockroachdb/cockroach/pkg/server.(*TestServer).StartTenant
          |     github.com/cockroachdb/cockroach/pkg/server/testserver.go:1188
          | github.com/cockroachdb/cockroach/pkg/testutils/serverutils.StartTenant
          |     github.com/cockroachdb/cockroach/pkg/testutils/serverutils/test_server_shim.go:465
          | github.com/cockroachdb/cockroach/pkg/ccl/serverccl.TestStartTenantWithStaleInstance
          |     github.com/cockroachdb/cockroach/pkg/ccl/serverccl/server_sql_test.go:463
          | [...repeated from below...]
        Wraps: (2) migration-job-find-already-completed
        Wraps: (3) attached stack trace
          -- stack trace:
          | github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial
          |     github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:181
          | github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).Dial
          |     github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:103
          | github.com/cockroachdb/cockroach/pkg/sql.runnerRequest.run
          |     github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:112
          | github.com/cockroachdb/cockroach/pkg/sql.(*runnerCoordinator).init.func1
          |     github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:152
          | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
          |     github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1594
        Wraps: (4) failed to connect to n1 at 127.0.0.1:46163
        Wraps: (5) forced error mark
          | "originated at breaker breaker"
          | github.com/cockroachdb/cockroach/pkg/util/circuit/*circuit.breakerErrorMark::
        Wraps: (6) forced error mark
          | "breaker open"
          | github.com/cockroachdb/errors/withstack/*withstack.withStack::
        Wraps: (7) initial connection heartbeat failed
        Wraps: (8) grpc: connection error: desc = "transport: authentication handshake failed: context deadline exceeded" [code 14/Unavailable]
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *markers.withMark (6) *markers.withMark (7) *netutil.InitialHeartbeatFailedError (8) *status.Error

Jira issue: CRDB-29607

@jeffswenson jeffswenson added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-multitenancy Related to multi-tenancy T-sql-queries SQL Queries Team T-multitenant Issues owned by the multi-tenant virtual team labels Jul 10, 2023
jeffswenson added a commit to jeffswenson/cockroach that referenced this issue Jul 10, 2023
This PR contains a test that reproduces cockroachdb#106537.

Part of: cockroachdb#106537
Release note: None
@yuzefovich yuzefovich self-assigned this Jul 10, 2023
jeffswenson added a commit to jeffswenson/cockroach that referenced this issue Jul 11, 2023
This contains fixes to two sources of flakes in TestDirectoryConnect:
- sqlproxy http draining is now tied into the stopper. This avoids a
	source of goroutine leaks.
- The sql server is gracefully drained to work around cockroachdb#106537.

When combined with cockroachdb#106599, I was able to run the test for 25K
iterations under stress with no flakes.

Fixes: cockroachdb#105402
craig bot pushed a commit that referenced this issue Jul 11, 2023
106549: sqlproxyccl: deflake TestDirectoryConnect r=JeffSwenson a=JeffSwenson

This contains fixes to two sources of flakes in TestDirectoryConnect:
- sqlproxy http draining is now tied into the stopper. This avoids a
	source of goroutine leaks.
- The sql server is gracefully drained to work around #106537.

When combined with #106599, I was able to run the test for 25K
iterations under stress with no flakes.

Fixes: #105402

Co-authored-by: Jeff <[email protected]>
@mgartner mgartner moved this to Active in SQL Queries Jul 24, 2023
@craig craig bot closed this as completed in 1f8fa96 Aug 10, 2023
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Aug 10, 2023
@exalate-issue-sync exalate-issue-sync bot removed the T-multitenant Issues owned by the multi-tenant virtual team label Aug 10, 2023