Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: improve tenant server start/stop in shared-process multitenancy #101089

Merged
merged 22 commits into from
Apr 13, 2023

Conversation

knz
Copy link
Contributor

@knz knz commented Apr 10, 2023

@blathers-crl
Copy link

blathers-crl bot commented Apr 10, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz changed the base branch from master to release-23.1 April 10, 2023 15:08
@knz
Copy link
Contributor Author

knz commented Apr 10, 2023

I plan to only mark this ready once we have a fix for #100989.

@knz knz force-pushed the 20230410-backport-server-start branch from d12c723 to 54d9819 Compare April 10, 2023 15:18
knz added 21 commits April 13, 2023 16:01
ahead of splitting "new" vs "start" during construction.

Release note: None
This peels the call to "start" from the `newTenantServer`
interface and pulls it into the orchestration retry loop.

This change also incidentally reveals an earlier misdesign: we are
calling `newTenantServer` _then_ `start` in the same retry loop. If
`new` succeeds but `start` fails, the next retry will call
`newTenantServer` again *with the same stopper*, which will leak
closers from the previous call to `new`.

Release note: None
Prior to this patch, if an error occurred during the initialization or
startup of a secondary tenant server, the initialization would leak
state into the stopper defined for that tenant. Generally, reusing
a stopper across server startup failures is not safe (and API
violation).

This patch fixes it by decoupling the intermediate stopper used for
orchestration from the one used per tenant server.

Release note: None
Prior to this patch, the test was not cleaning up its server stopper
reliably at the end of each sub-test. This patch fixes it.

Release note: None
This change ensures that tenant servers managed by the server
controller receive a graceful drain request as part of the graceful
drain process of the surrounding KV node.

This change, in turn, ensures that SQL clients connected to these
secondary tenant servers benefit from the same guarantees (and
graceful periods) as clients to the system tenant.

Release note: None
This helps while troubleshooting tests.

Release note: None
This also ensures the test fails quickly if there is a deadlock while
the server is shutting down.

(This makes the timeout for this test shorter than the standard
timeout for the stopper.Stop method, which is 15 minutes.)

Release note: None
This commit extracts the core logic from
`server_controller_orchestration.go` into an
interface (`serverOrchestrator`) that can be mocked. We will use this
in testing.

This extraction is also useful because it exposes a few shortcomings
in the current implementation (wrt server shutdown). This makes it
easier to fix them in a later commit.

Release note: None
This moves `(o *channelOrchestrator) startControlledServer` alongside
the other methods of `channelOrchestrator`.
There is further no code change.

Release note: None
The sever controller was writing to
`useGracefulDrainDuringTenantShutdown` unconditionally. As a result,
if two or more expedited stops caught up with a graceful stop, some of
the writes to the channel could be blocked. This would never happen on
a running system, but could be exercised in tests.

This patch fixes this by making the write unblocking. This generally
makes the code more readable anyway, since naked channel writes are
bound to raise eyebrows anyways.

Release note: None
Prior to this patch, immediate shutdown requests for secondary tenant
servers were serviced using the graceful path. This was incorrect, and
resulted in excessively long shutdown sequences in tests in some
cases.

This patch fixes it.

Release note: None
Prior to this patch, the server orchestrator was unable to notice when
the surrounding server was stopping after it had started a graceful
drain already. This is because once the `propagate-close` task (and
its `select`) monitoring various stop signals noticed a request for
graceful drains, it would stop monitoring the other stop signals to
focus exclusively on the graceful drain; thereby missing when the
stopper for the surrounding server was quiescing.

The fix is to run the monitor for graceful and ungraceful shutdowns in
separate tasks, which this patch does.

This goes 80% of the way towards de-flaking `TestServerStartStop` in
`pkg/ccl/serverccl`, which was exhibiting the deadlock described
above. With this fix in place, the deadlock disappears entirely.
However, this reveals _another_ bug in the SQL layer which also needs
to be fixed to consider `TestServerStartStop` stable. We will do this
in a separate commit.

Release note: None
Prior to this patch, the task that was responsible for propagating a
graceful drain captured a variable supposed to reference the tenant
server, but before this variable was assigned.

This created a possible race condition, in the unlikely case case
where the server startup would fail _and_ a graceful drain would be
requested, concurrently.

This patch fixes it by only starting to propagate graceful drains
after the server is fully initialized. (But before it starts accepting
clients, so that we don't create a window of time where clients can
connects but graceful drains don't propagate.)

This is also achieved by extracting the two shutdown tasks into
separate functions, to clarify the flow of parameters.

Release note: None
Release note: None
When a hard stop (e.g. test exists) catches up with an ongoing
graceful drain, some bytes do not get released properly in a memory
monitor. This triggers a panic during shutdown, since the byte monitor
code verifies that all allocated bytes have been released.

This bug is relatively hard to trigger because in most cases,
a server is shut down either only using a graceful drain, or only
using a hard stop, but not both.

`TestServerStartStop` happens to do this and this is where this
problem was caught. We are now tracking it as issue cockroachdb#101297.

Until that issue is fixed, this commit papers over the problem by
removing the assertion in the byte monitor.

Release note: None
@knz knz force-pushed the 20230410-backport-server-start branch from 54d9819 to 1095b52 Compare April 13, 2023 14:04
@knz knz marked this pull request as ready for review April 13, 2023 14:58
@knz knz requested a review from a team as a code owner April 13, 2023 14:58
@knz knz requested a review from a team April 13, 2023 14:58
@knz knz requested a review from a team as a code owner April 13, 2023 14:58
@knz knz requested a review from a team April 13, 2023 14:58
@knz knz requested review from a team as code owners April 13, 2023 14:58
@knz knz requested review from a team, stevendanna and jeffswenson April 13, 2023 14:58
Copy link
Collaborator

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@knz knz merged commit e9482d7 into cockroachdb:release-23.1 Apr 13, 2023
@knz knz deleted the 20230410-backport-server-start branch April 13, 2023 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants