release-23.1.0: improve tenant server start/stop in shared-process multitenancy #101450
Merged
Conversation
…enantStartup Release note: None
ahead of splitting "new" vs "start" during construction. Release note: None
This peels the call to "start" from the `newTenantServer` interface and pulls it into the orchestration retry loop. This change also incidentally reveals an earlier misdesign: we are calling `newTenantServer` _then_ `start` in the same retry loop. If `new` succeeds but `start` fails, the next retry will call `newTenantServer` again *with the same stopper*, which will leak closers from the previous call to `new`. Release note: None
Prior to this patch, if an error occurred during the initialization or startup of a secondary tenant server, the initialization would leak state into the stopper defined for that tenant. Generally, reusing a stopper across server startup failures is not safe (and an API violation). This patch fixes it by decoupling the intermediate stopper used for orchestration from the one used per tenant server. Release note: None
Prior to this patch, the test was not cleaning up its server stopper reliably at the end of each sub-test. This patch fixes it. Release note: None
This change ensures that tenant servers managed by the server controller receive a graceful drain request as part of the graceful drain process of the surrounding KV node. This change, in turn, ensures that SQL clients connected to these secondary tenant servers benefit from the same guarantees (and graceful periods) as clients to the system tenant. Release note: None
This helps while troubleshooting tests. Release note: None
This also ensures the test fails quickly if there is a deadlock while the server is shutting down. (This makes the timeout for this test shorter than the standard timeout for the stopper.Stop method, which is 15 minutes.) Release note: None
This commit extracts the core logic from `server_controller_orchestration.go` into an interface (`serverOrchestrator`) that can be mocked. We will use this in testing. This extraction is also useful because it exposes a few shortcomings in the current implementation (wrt server shutdown). This makes it easier to fix them in a later commit. Release note: None
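The extraction pattern can be sketched as follows; this is a deliberately cut-down, hypothetical version of the idea (the method set shown is invented for illustration, not the real `serverOrchestrator` interface): code that depends only on the interface can be driven by a mock in tests.

```go
package main

import "fmt"

// serverOrchestrator is a hypothetical, minimal interface in the spirit of
// the one the commit extracts: only the operations the controller needs,
// so tests can substitute a mock.
type serverOrchestrator interface {
	startServer(name string) error
	stopServer(name string)
}

// mockOrchestrator records calls instead of starting real tenant servers.
type mockOrchestrator struct{ log []string }

func (m *mockOrchestrator) startServer(name string) error {
	m.log = append(m.log, "start:"+name)
	return nil
}

func (m *mockOrchestrator) stopServer(name string) {
	m.log = append(m.log, "stop:"+name)
}

// runTenant drives any orchestrator; production code and tests share this path.
func runTenant(o serverOrchestrator, name string) error {
	if err := o.startServer(name); err != nil {
		return err
	}
	o.stopServer(name)
	return nil
}

func main() {
	m := &mockOrchestrator{}
	_ = runTenant(m, "app")
	fmt.Println(m.log)
}
```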
This moves `(o *channelOrchestrator) startControlledServer` alongside the other methods of `channelOrchestrator`. There is otherwise no code change. Release note: None
The server controller was writing to `useGracefulDrainDuringTenantShutdown` unconditionally. As a result, if two or more expedited stops caught up with a graceful stop, some of the writes to the channel could block. This would never happen on a running system, but could be exercised in tests. This patch fixes this by making the write non-blocking. This also makes the code more readable, since naked channel writes are bound to raise eyebrows. Release note: None
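The non-blocking write is the standard Go `select`/`default` idiom, sketched here with a hypothetical `requestGracefulDrain` helper (the name is illustrative): if a request is already pending on the buffered channel, the extra write is dropped rather than blocking the caller.

```go
package main

import "fmt"

// requestGracefulDrain performs a non-blocking send: if a drain request is
// already pending, the extra write is dropped instead of blocking the caller.
func requestGracefulDrain(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true // request queued
	default:
		return false // channel full: a request is already pending
	}
}

func main() {
	ch := make(chan struct{}, 1)
	fmt.Println(requestGracefulDrain(ch)) // true: first request queued
	fmt.Println(requestGracefulDrain(ch)) // false: dropped, does not block
}
```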
Prior to this patch, immediate shutdown requests for secondary tenant servers were serviced using the graceful path. This was incorrect, and resulted in excessively long shutdown sequences in tests in some cases. This patch fixes it. Release note: None
Prior to this patch, the server orchestrator was unable to notice when the surrounding server was stopping after it had started a graceful drain already. This is because once the `propagate-close` task (and its `select`) monitoring various stop signals noticed a request for graceful drains, it would stop monitoring the other stop signals to focus exclusively on the graceful drain; thereby missing when the stopper for the surrounding server was quiescing. The fix is to run the monitor for graceful and ungraceful shutdowns in separate tasks, which this patch does. This goes 80% of the way towards de-flaking `TestServerStartStop` in `pkg/ccl/serverccl`, which was exhibiting the deadlock described above. With this fix in place, the deadlock disappears entirely. However, this reveals _another_ bug in the SQL layer which also needs to be fixed to consider `TestServerStartStop` stable. We will do this in a separate commit. Release note: None
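The fix's structure can be sketched with a hypothetical `monitorShutdown` helper (the name and event strings are invented for illustration): a single `select` that, once triggered, commits to the graceful-drain path stops observing the other signals; two independent goroutines cannot miss either one.

```go
package main

import "fmt"

// monitorShutdown runs the graceful and ungraceful monitors as separate
// goroutines. A single select that, once triggered, focuses on the graceful
// drain would miss a later quiesce signal; two independent watchers cannot.
func monitorShutdown(graceful, quiesce <-chan struct{}) <-chan string {
	events := make(chan string, 2)
	go func() {
		<-graceful
		events <- "graceful drain requested"
	}()
	go func() {
		<-quiesce
		events <- "surrounding server quiescing"
	}()
	return events
}

func main() {
	graceful := make(chan struct{})
	quiesce := make(chan struct{})
	events := monitorShutdown(graceful, quiesce)

	close(graceful) // a graceful drain starts first...
	fmt.Println(<-events)
	close(quiesce) // ...and the quiesce signal is still observed afterwards
	fmt.Println(<-events)
}
```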
Prior to this patch, the task responsible for propagating a graceful drain captured a variable supposed to reference the tenant server, but before that variable was assigned. This created a possible race condition in the unlikely case where the server startup would fail _and_ a graceful drain would be requested, concurrently. This patch fixes it by only starting to propagate graceful drains after the server is fully initialized (but before it starts accepting clients, so that we don't create a window of time where clients can connect but graceful drains don't propagate). This is also achieved by extracting the two shutdown tasks into separate functions, to clarify the flow of parameters. Release note: None
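The shape of the fix can be sketched as follows (the `tenantServer` type and `startDrainPropagation` function are hypothetical names for illustration): the propagation goroutine is only launched once the server variable is fully assigned, so the closure can never observe an unassigned server.

```go
package main

import "fmt"

type tenantServer struct{ name string }

func (s *tenantServer) drain() string { return "draining " + s.name }

// startDrainPropagation is only called once srv is fully assigned, so the
// closure can never capture a nil server. The bug was launching this
// goroutine earlier, before the variable it captured was assigned.
func startDrainPropagation(srv *tenantServer, drainRequested <-chan struct{}, out chan<- string) {
	go func() {
		<-drainRequested
		out <- srv.drain()
	}()
}

func main() {
	drainRequested := make(chan struct{})
	out := make(chan string, 1)

	srv := &tenantServer{name: "app"}               // fully initialize first...
	startDrainPropagation(srv, drainRequested, out) // ...then start propagation

	close(drainRequested)
	fmt.Println(<-out)
}
```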
When a hard stop (e.g. at test exit) catches up with an ongoing graceful drain, some bytes do not get released properly in a memory monitor. This triggers a panic during shutdown, since the byte monitor code verifies that all allocated bytes have been released. This bug is relatively hard to trigger because in most cases a server is shut down either only via a graceful drain or only via a hard stop, but not both. `TestServerStartStop` happens to do both, and this is where the problem was caught. We are now tracking it as issue cockroachdb#101297. Until that issue is fixed, this commit papers over the problem by removing the assertion in the byte monitor. Release note: None
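The invariant being relaxed can be illustrated with a toy accounting model (this `byteMonitor` is invented for illustration and is not CockroachDB's actual memory monitor): the monitor tracks outstanding bytes, and at shutdown the check is that allocations and releases balance out to zero. A hard stop interrupting a graceful drain can leave the balance nonzero, which is the condition that was panicking.

```go
package main

import "fmt"

// byteMonitor is a toy version of the accounting the commit discusses: it
// tracks outstanding bytes; a clean shutdown should leave zero outstanding.
type byteMonitor struct{ used int64 }

func (m *byteMonitor) grow(n int64)   { m.used += n }
func (m *byteMonitor) shrink(n int64) { m.used -= n }

// stop reports leftover bytes. The real code asserted (panicked) on a
// nonzero value; the commit temporarily removes that assertion.
func (m *byteMonitor) stop() (leaked int64) { return m.used }

func main() {
	m := &byteMonitor{}
	m.grow(64)
	m.shrink(64)
	fmt.Println(m.stop()) // 0: allocations and releases balance

	m.grow(32) // simulate a hard stop interrupting a graceful drain
	fmt.Println(m.stop())
}
```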
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
jeffswenson approved these changes on Apr 13, 2023
LGTM
Backports:
/cc @cockroachdb/release
Release justification: prevents SQL app disruption during node restarts
Epic: CRDB-23559