release-23.1.0: improve tenant server start/stop in shared-process multitenancy #101450
Merged
Conversation
…enantStartup Release note: None
ahead of splitting "new" vs "start" during construction. Release note: None
This peels the call to "start" from the `newTenantServer` interface and pulls it into the orchestration retry loop. This change also incidentally reveals an earlier misdesign: we are calling `newTenantServer` _then_ `start` in the same retry loop. If `new` succeeds but `start` fails, the next retry will call `newTenantServer` again *with the same stopper*, which will leak closers from the previous call to `new`. Release note: None
Prior to this patch, if an error occurred during the initialization or startup of a secondary tenant server, the initialization would leak state into the stopper defined for that tenant. Generally, reusing a stopper across server startup failures is not safe (and an API violation). This patch fixes it by decoupling the intermediate stopper used for orchestration from the one used per tenant server. Release note: None
Prior to this patch, the test was not cleaning up its server stopper reliably at the end of each sub-test. This patch fixes it. Release note: None
This change ensures that tenant servers managed by the server controller receive a graceful drain request as part of the graceful drain process of the surrounding KV node. This change, in turn, ensures that SQL clients connected to these secondary tenant servers benefit from the same guarantees (and graceful periods) as clients to the system tenant. Release note: None
This helps while troubleshooting tests. Release note: None
This also ensures the test fails quickly if there is a deadlock while the server is shutting down. (This makes the timeout for this test shorter than the standard timeout for the stopper.Stop method, which is 15 minutes.) Release note: None
This commit extracts the core logic from `server_controller_orchestration.go` into an interface (`serverOrchestrator`) that can be mocked. We will use this in testing. This extraction is also useful because it exposes a few shortcomings in the current implementation (wrt server shutdown). This makes it easier to fix them in a later commit. Release note: None
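The extraction pattern can be sketched as follows; this is a deliberately cut-down, hypothetical version of the idea (the method set shown is invented for illustration, not the real `serverOrchestrator` interface): code that depends only on the interface can be driven by a mock in tests.

```go
package main

import "fmt"

// serverOrchestrator is a hypothetical, minimal interface in the spirit of
// the one the commit extracts: only the operations the controller needs,
// so tests can substitute a mock.
type serverOrchestrator interface {
	startServer(name string) error
	stopServer(name string)
}

// mockOrchestrator records calls instead of starting real tenant servers.
type mockOrchestrator struct{ log []string }

func (m *mockOrchestrator) startServer(name string) error {
	m.log = append(m.log, "start:"+name)
	return nil
}

func (m *mockOrchestrator) stopServer(name string) {
	m.log = append(m.log, "stop:"+name)
}

// runTenant drives any orchestrator; production code and tests share this path.
func runTenant(o serverOrchestrator, name string) error {
	if err := o.startServer(name); err != nil {
		return err
	}
	o.stopServer(name)
	return nil
}

func main() {
	m := &mockOrchestrator{}
	_ = runTenant(m, "app")
	fmt.Println(m.log)
}
```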
This moves `(o *channelOrchestrator) startControlledServer` alongside the other methods of `channelOrchestrator`. There is otherwise no code change. Release note: None
The server controller was writing to `useGracefulDrainDuringTenantShutdown` unconditionally. As a result, if two or more expedited stops caught up with a graceful stop, some of the writes to the channel could block. This would never happen on a running system, but could be exercised in tests. This patch fixes this by making the write non-blocking. This also makes the code more readable, since naked channel writes are bound to raise eyebrows. Release note: None
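The non-blocking write is the standard Go `select`/`default` idiom, sketched here with a hypothetical `requestGracefulDrain` helper (the name is illustrative): if a request is already pending on the buffered channel, the extra write is dropped rather than blocking the caller.

```go
package main

import "fmt"

// requestGracefulDrain performs a non-blocking send: if a drain request is
// already pending, the extra write is dropped instead of blocking the caller.
func requestGracefulDrain(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true // request queued
	default:
		return false // channel full: a request is already pending
	}
}

func main() {
	ch := make(chan struct{}, 1)
	fmt.Println(requestGracefulDrain(ch)) // true: first request queued
	fmt.Println(requestGracefulDrain(ch)) // false: dropped, does not block
}
```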
Prior to this patch, immediate shutdown requests for secondary tenant servers were serviced using the graceful path. This was incorrect, and resulted in excessively long shutdown sequences in tests in some cases. This patch fixes it. Release note: None
Prior to this patch, the server orchestrator was unable to notice when the surrounding server was stopping after it had started a graceful drain already. This is because once the `propagate-close` task (and its `select`) monitoring various stop signals noticed a request for graceful drains, it would stop monitoring the other stop signals to focus exclusively on the graceful drain; thereby missing when the stopper for the surrounding server was quiescing. The fix is to run the monitor for graceful and ungraceful shutdowns in separate tasks, which this patch does. This goes 80% of the way towards de-flaking `TestServerStartStop` in `pkg/ccl/serverccl`, which was exhibiting the deadlock described above. With this fix in place, the deadlock disappears entirely. However, this reveals _another_ bug in the SQL layer which also needs to be fixed to consider `TestServerStartStop` stable. We will do this in a separate commit. Release note: None
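The fix's structure can be sketched with a hypothetical `monitorShutdown` helper (the name and event strings are invented for illustration): a single `select` that, once triggered, commits to the graceful-drain path stops observing the other signals; two independent goroutines cannot miss either one.

```go
package main

import "fmt"

// monitorShutdown runs the graceful and ungraceful monitors as separate
// goroutines. A single select that, once triggered, focuses on the graceful
// drain would miss a later quiesce signal; two independent watchers cannot.
func monitorShutdown(graceful, quiesce <-chan struct{}) <-chan string {
	events := make(chan string, 2)
	go func() {
		<-graceful
		events <- "graceful drain requested"
	}()
	go func() {
		<-quiesce
		events <- "surrounding server quiescing"
	}()
	return events
}

func main() {
	graceful := make(chan struct{})
	quiesce := make(chan struct{})
	events := monitorShutdown(graceful, quiesce)

	close(graceful) // a graceful drain starts first...
	fmt.Println(<-events)
	close(quiesce) // ...and the quiesce signal is still observed afterwards
	fmt.Println(<-events)
}
```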
Prior to this patch, the task responsible for propagating a graceful drain captured a variable supposed to reference the tenant server, but before that variable was assigned. This created a possible race condition in the unlikely case where the server startup would fail _and_ a graceful drain would be requested, concurrently. This patch fixes it by only starting to propagate graceful drains after the server is fully initialized (but before it starts accepting clients, so that we don't create a window of time where clients can connect but graceful drains don't propagate). This is also achieved by extracting the two shutdown tasks into separate functions, to clarify the flow of parameters. Release note: None
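The shape of the fix can be sketched as follows (the `tenantServer` type and `startDrainPropagation` function are hypothetical names for illustration): the propagation goroutine is only launched once the server variable is fully assigned, so the closure can never observe an unassigned server.

```go
package main

import "fmt"

type tenantServer struct{ name string }

func (s *tenantServer) drain() string { return "draining " + s.name }

// startDrainPropagation is only called once srv is fully assigned, so the
// closure can never capture a nil server. The bug was launching this
// goroutine earlier, before the variable it captured was assigned.
func startDrainPropagation(srv *tenantServer, drainRequested <-chan struct{}, out chan<- string) {
	go func() {
		<-drainRequested
		out <- srv.drain()
	}()
}

func main() {
	drainRequested := make(chan struct{})
	out := make(chan string, 1)

	srv := &tenantServer{name: "app"}               // fully initialize first...
	startDrainPropagation(srv, drainRequested, out) // ...then start propagation

	close(drainRequested)
	fmt.Println(<-out)
}
```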
When a hard stop (e.g. at test exit) catches up with an ongoing graceful drain, some bytes do not get released properly in a memory monitor. This triggers a panic during shutdown, since the byte monitor code verifies that all allocated bytes have been released. This bug is relatively hard to trigger because in most cases a server is shut down either only via a graceful drain or only via a hard stop, but not both. `TestServerStartStop` happens to do both, and this is where the problem was caught. We are now tracking it as issue cockroachdb#101297. Until that issue is fixed, this commit papers over the problem by removing the assertion in the byte monitor. Release note: None
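The invariant being relaxed can be illustrated with a toy accounting model (this `byteMonitor` is invented for illustration and is not CockroachDB's actual memory monitor): the monitor tracks outstanding bytes, and at shutdown the check is that allocations and releases balance out to zero. A hard stop interrupting a graceful drain can leave the balance nonzero, which is the condition that was panicking.

```go
package main

import "fmt"

// byteMonitor is a toy version of the accounting the commit discusses: it
// tracks outstanding bytes; a clean shutdown should leave zero outstanding.
type byteMonitor struct{ used int64 }

func (m *byteMonitor) grow(n int64)   { m.used += n }
func (m *byteMonitor) shrink(n int64) { m.used -= n }

// stop reports leftover bytes. The real code asserted (panicked) on a
// nonzero value; the commit temporarily removes that assertion.
func (m *byteMonitor) stop() (leaked int64) { return m.used }

func main() {
	m := &byteMonitor{}
	m.grow(64)
	m.shrink(64)
	fmt.Println(m.stop()) // 0: allocations and releases balance

	m.grow(32) // simulate a hard stop interrupting a graceful drain
	fmt.Println(m.stop())
}
```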
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
jeffswenson approved these changes on Apr 13, 2023
LGTM
Backports:
/cc @cockroachdb/release
Release justification: prevents SQL app disruption during node restarts
Epic: CRDB-23559