Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfcs: sqlproxy connection migration #75707

Merged
merged 1 commit into from
Mar 19, 2022

Conversation

jaylim-crl
Copy link
Collaborator

@jaylim-crl jaylim-crl commented Jan 29, 2022

RFC Text

This proposal describes the connection migration mechanism within the SQL
Proxy that allows us to migrate SQL connections from one SQL pod to another
by building on top of the session migration work by the SQL Experience team.

This RFC is only scoped to how to perform the migration from one SQL pod to
another within the SQL proxy. The details of when to perform the migration
will be discussed in a follow-up RFC. Note that the main use-case here is for
CockroachDB Serverless, and we should optimize for that.

What this will eventually solve:

  • rude disconnects during Serverless scale-down events
  • load imbalance during steady state and Serverless scale-up events

From the user's perspective, the connection should remain usable without
interruption, and there will be better resource utilization across all
provisioned SQL pods.

This RFC has already received some reviews in an earlier form of an internal
Google Doc, Proxy Connection Migration.

Release justification: RFC/docs only change.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jaylim-crl jaylim-crl force-pushed the jay/rfc-connection-migration branch 2 times, most recently from 5a20549 to c02ceb3 Compare January 29, 2022 17:50
@jaylim-crl jaylim-crl changed the title rfc: sqlproxy connection migration rfcs: sqlproxy connection migration Jan 29, 2022
@jaylim-crl jaylim-crl force-pushed the jay/rfc-connection-migration branch 2 times, most recently from 5c1a226 to 77db251 Compare January 29, 2022 17:58
@jaylim-crl jaylim-crl marked this pull request as ready for review January 29, 2022 18:00
@jaylim-crl jaylim-crl requested a review from a team as a code owner January 29, 2022 18:00
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely written. Only did a quick skim in order to leave a few comments about some relevant concerns about related work.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, @jeffswenson, and @rafiss)


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):

draining phase lasts for at most 10 minutes, and if connections are still active
at the end of the phase, they will be terminated abruptly, which can lead to
poor user experiences.

In addition to migration due to scaling events, we may want to migrate connections when performing a rolling upgrade of the CRDB version the pods are running. This use case also applies to Dedicated clusters and I can imagine a day in the future where we deploy SQL proxy in front of Dedicated clusters in order to get hit-less rolling upgrades. I'm not arguing for supporting Dedicated clusters in this work, but it would be nice if anything done here doesn't preclude that eventuality.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 54 at r1 (raw file):

For every client connection, the proxy selects a backend based on some routing
algorithm, performs authentication with it, and does a [transparent forwarding](https://github.com/cockroachdb/cockroach/blob/ced473fa0ed00660812a6f6678155fb30e6a871f/pkg/ccl/sqlproxyccl/proxy.go#L84-L86)
of all packets from the client to the server without intercepting messages after

Nit: the messages are still intercepted, right? That is, we're still reading from one socket and writing to another socket. It is simply that we're not decoding or altering the messages after the authentication phase.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 105 at r1 (raw file):

is aware of Postgres message boundaries.

We propose to make the proxy aware of pgwire message boundaries. This implies

Coincidentally, we likely need to make the proxy aware of pgwire message contents in order to support the pgwire cancel protocol.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 378 at r1 (raw file):

instances. These instances come with an additional decoding overhead for every
message. For the proxy, there is no need to decode the messages in a steady
state.

This might need to change to support the pgwire cancel protocol. In particular, I think we'd need to decode BackendKeyData messages sent by the server in order to alter them before sending them to the client. It isn't clear yet exactly how this is going to work, but worth staying abreast of the cancel protocol requirements. Cc @rafiss who is leading the charge on that work.

Copy link
Collaborator Author

@jaylim-crl jaylim-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs! For some reason I'm not seeing the comments on Reviewable, so I quoted again.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jeffswenson, and @rafiss)


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):
From Peter:

In addition to migration due to scaling events, we may want to migrate connections when performing a rolling upgrade of the CRDB version the pods are running. This use case also applies to Dedicated clusters and I can imagine a day in the future where we deploy SQL proxy in front of Dedicated clusters in order to get hit-less rolling upgrades. I'm not arguing for supporting Dedicated clusters in this work, but it would be nice if anything done here doesn't preclude that eventuality.

This is an interesting use-case, and I hadn't thought about that. Thanks for bringing this up. Thinking about it more, we could even use this for tenant upgrades within Serverless. Today, we just terminate old pods during upgrades, and spin up new ones, resulting in rude disconnects. There's an issue filed to make the process less disruptive: https://cockroachlabs.atlassian.net/browse/CC-4796. cc: @alyshanjahani-crl

This RFC proposes a foundation to how we will transfer connections from one pod to another, and will support the use-case that was described. The interesting bit that needs to be updated when that happens would be the pod selection algorithm. When transferring connections during upgrades, a caller could invoke the relevant APIs on the forwarder (e.g. RequestTransfer) to transfer a connection, and the pod selection algorithm would need to return a pod that is ready to serve connections (e.g. newer pods which are ready, or older pods which haven't been drained yet).


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 55 at r1 (raw file):
From Peter:

Nit: the messages are still intercepted, right? That is, we're still reading from one socket and writing to another socket. It is simply that we're not decoding or altering the messages after the authentication phase.

Right, we're still reading from one, and writing over to the other. My orignal intention was to emphasize that bytes are just forwarded in a transparent manner, without any notion of pgwire message boundaries. I'll rework this.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 105 at r1 (raw file):
From Peter:

Coincidentally, we likely need to make the proxy aware of pgwire message contents in order to support the pgwire cancel protocol.

Ack - that's good to know.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 378 at r1 (raw file):
From Peter:

This might need to change to support the pgwire cancel protocol. In particular, I think we'd need to decode BackendKeyData messages sent by the server in order to alter them before sending them to the client. It isn't clear yet exactly how this is going to work, but worth staying abreast of the cancel protocol requirements. Cc @rafiss who is leading the charge on that work.

That's fine, and the proposed design in this RFC can do that. We could peek on the interceptor to look for BackendKeyData messages during the steady state, and decode accordingly in the response processor. I'm not entirely sure how the cancel operation will be requested (e.g. user-initiated, server-initiated), but if we didn't want to decode all BackendKeyData messages during the steady state, we could have an additional signal (e.g. check on SimpleQuery in the case of user-initiated cancel operations) to start the decoding window. Regardless, with decoding, there will be some performance hit as there might be allocations and copying, depending on how large these messages are.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, @jeffswenson, and @rafiss)


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 378 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

From Peter:

This might need to change to support the pgwire cancel protocol. In particular, I think we'd need to decode BackendKeyData messages sent by the server in order to alter them before sending them to the client. It isn't clear yet exactly how this is going to work, but worth staying abreast of the cancel protocol requirements. Cc @rafiss who is leading the charge on that work.

That's fine, and the proposed design in this RFC can do that. We could peek on the interceptor to look for BackendKeyData messages during the steady state, and decode accordingly in the response processor. I'm not entirely sure how the cancel operation will be requested (e.g. user-initiated, server-initiated), but if we didn't want to decode all BackendKeyData messages during the steady state, we could have an additional signal (e.g. check on SimpleQuery in the case of user-initiated cancel operations) to start the decoding window. Regardless, with decoding, there will be some performance hit as there might be allocations and copying, depending on how large these messages are.

The BackendKeyData messages are very small. 12 or 16 bytes. I think we'd need to decode all of them. While cancellation is user-initiated, the CancelMessage from the user contains info in the BackendKeyData messages.

Copy link
Collaborator Author

@jaylim-crl jaylim-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! Interesting how the comments appeared (see timestamps):

image.png

I marked some of them as done to avoid confusion.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jeffswenson, @petermattis, and @rafiss)


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

In addition to migration due to scaling events, we may want to migrate connections when performing a rolling upgrade of the CRDB version the pods are running. This use case also applies to Dedicated clusters and I can imagine a day in the future where we deploy SQL proxy in front of Dedicated clusters in order to get hit-less rolling upgrades. I'm not arguing for supporting Dedicated clusters in this work, but it would be nice if anything done here doesn't preclude that eventuality.

Done.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 105 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Coincidentally, we likely need to make the proxy aware of pgwire message contents in order to support the pgwire cancel protocol.

Done.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 378 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The BackendKeyData messages are very small. 12 or 16 bytes. I think we'd need to decode all of them. While cancellation is user-initiated, the CancelMessage from the user contains info in the BackendKeyData messages.

Ack. Since the messages are small, I wouldn't be too worried about this since it's likely that the memory will come from the internal buffer without any additional allocations (at least for most of the time).

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! Interesting how the comments appeared (see timestamps):

Looks like some sort of reviewable bug.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jeffswenson, and @rafiss)

Copy link
Collaborator

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I do have a couple production concerns. You can add them to this RFC or open a Jira ticket.

  1. There should be metrics tracking the number of failed transfers and the number of connections closed mid-transfer. The metrics should be visible on the sqlproxy grafana page.
  2. We need a way to disable this for a specific tenant. That way if there is a workload that really tickles the known edge cases in the protocol, we can opt the tenant out of session transfers. I'm imaging something like a flag on the sqlproxy or an option returned by the tenant directory.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @andy-kimball, @jaylim-crl, and @rafiss)


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

From Peter:

In addition to migration due to scaling events, we may want to migrate connections when performing a rolling upgrade of the CRDB version the pods are running. This use case also applies to Dedicated clusters and I can imagine a day in the future where we deploy SQL proxy in front of Dedicated clusters in order to get hit-less rolling upgrades. I'm not arguing for supporting Dedicated clusters in this work, but it would be nice if anything done here doesn't preclude that eventuality.

This is an interesting use-case, and I hadn't thought about that. Thanks for bringing this up. Thinking about it more, we could even use this for tenant upgrades within Serverless. Today, we just terminate old pods during upgrades, and spin up new ones, resulting in rude disconnects. There's an issue filed to make the process less disruptive: https://cockroachlabs.atlassian.net/browse/CC-4796. cc: @alyshanjahani-crl

This RFC proposes a foundation to how we will transfer connections from one pod to another, and will support the use-case that was described. The interesting bit that needs to be updated when that happens would be the pod selection algorithm. When transferring connections during upgrades, a caller could invoke the relevant APIs on the forwarder (e.g. RequestTransfer) to transfer a connection, and the pod selection algorithm would need to return a pod that is ready to serve connections (e.g. newer pods which are ready, or older pods which haven't been drained yet).

The big asterisk for migrating sessions during the upgrade process is version skew. If we want to transfer between versions, we need the DB to guarantee the serialized session is compatible.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 227 at r1 (raw file):

coupled with the [proxy's handler](https://github.com/cockroachdb/cockroach/blob/ad27839166526a92444bf0f719cf9ec5f62c2f7f/pkg/ccl/sqlproxyccl/proxy_handler.go#L209).
At the same time, the [authenticate](https://github.com/cockroachdb/cockroach/blob/ad27839166526a92444bf0f719cf9ec5f62c2f7f/pkg/ccl/sqlproxyccl/authentication.go#L21)
function that is used for authentication is coupled with the throttler. We

The sqlproxy does not need to check the throttler when it is transferring a connection between sql pods. The throttle is intended to prevent credential guesses and the proxy only transfers authenticated connections. The design you discuss handles this case well. The sqlproxy can inject no-op throttle hooks when transferring sessions.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @andy-kimball, @jaylim-crl, and @rafiss)


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

The big asterisk for migrating sessions during the upgrade process is version skew. If we want to transfer between versions, we need the DB to guarantee the serialized session is compatible.

Currently CockroachDB changes query planning rules, adds new SQL session variables, and makes other user-visible changes automatically during upgrades.
This is not compatible with seamless session migration across versions: a SQL client could auto-detect a version upon connecting, make choices/assumptions about SQL semantics over that connection, and if we migrate to a new version these assumptions would break with inscrutable errors client-side.

At the very least, we should never transfer a session across an upgrade to a different major version of cockroachdb.

Then I am tempted to have us make the choice to transfer sessions across versions configurable for the end-user, so that they can disable this if it's incompatible with their client app. Something like a choice between the following three:

  1. enable seamless migrations, warning: this can cause your apps to experience different versions of CockroachDB within the same SQL connection.
  2. disable seamless migrations, block upgrades with open connections, warning: this can cause your apps to block cockroachdb bug fixes.
  3. disable seamless migrations, close connections upon upgrades, warning: this can cause your apps to experience a server-side disconnect upon upgrades.

Copy link
Collaborator Author

@jaylim-crl jaylim-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! Updated this slightly to reflect the implementation that was merged.

There should be metrics tracking the number of failed transfers and the number of connections closed mid-transfer. The metrics should be visible on the sqlproxy grafana page.

I added this to the meta tracking issue here: #76000

We need a way to disable this for a specific tenant. That way if there is a workload that really tickles the known edge cases in the protocol, we can opt the tenant out of session transfers. I'm imaging something like a flag on the sqlproxy or an option returned by the tenant directory.

That's going to be part of the next RFC. This RFC is only scoped to how to transfer. The mechanism of when to transfer will be handled by the rebalancer, and for that, we could have it check on the tenant metadata to know whether connections can be transferred.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @andy-kimball, @jeffswenson, and @rafiss)


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):

Previously, knz (kena) wrote…

Currently CockroachDB changes query planning rules, adds new SQL session variables, and makes other user-visible changes automatically during upgrades.
This is not compatible with seamless session migration across versions: a SQL client could auto-detect a version upon connecting, make choices/assumptions about SQL semantics over that connection, and if we migrate to a new version these assumptions would break with inscrutable errors client-side.

At the very least, we should never transfer a session across an upgrade to a different major version of cockroachdb.

Then I am tempted to have us make the choice to transfer sessions across versions configurable for the end-user, so that they can disable this if it's incompatible with their client app. Something like a choice between the following three:

  1. enable seamless migrations, warning: this can cause your apps to experience different versions of CockroachDB within the same SQL connection.
  2. disable seamless migrations, block upgrades with open connections, warning: this can cause your apps to block cockroachdb bug fixes.
  3. disable seamless migrations, close connections upon upgrades, warning: this can cause your apps to experience a server-side disconnect upon upgrades.

Those are all valid points. We will consider that in the upcoming RFC about when to transfer a connection. In CC today, tenant upgrades destroy the old SQL pods right away without putting them into the DRAINING state, causing connections to be terminated right away, so migrating across version upgrades is not possible. That said, if we wanted to be defensive to start, we could ensure that we never transfer connections whenever versions do not match.


docs/RFCS/20220129_sqlproxy_connection_migration.md, line 227 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

The sqlproxy does not need to check the throttler when it is transferring a connection between sql pods. The throttle is intended to prevent credential guesses and the proxy only transfers authenticated connections. The design you discuss handles this case well. The sqlproxy can inject no-op throttle hooks when transferring sessions.

The current design only throttles whenever we connect using the regular auth method. If there's a need in the future, we could also throttle the token approach (assuming that it becomes an end-user feature). I removed the part of decoupling throttler and authenticator since that's already implicit with the code today (i.e. authenticate takes in a throttleHook, though we still import the throttler).

@jaylim-crl jaylim-crl force-pushed the jay/rfc-connection-migration branch from 76b8559 to edb29eb Compare March 13, 2022 19:46
Copy link
Collaborator Author

@jaylim-crl jaylim-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything has been merged, and this feature is now completed. The RFC has been updated to reflect what we have merged. I'll merge this next week if no one else has concerns.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @andy-kimball, @jeffswenson, and @rafiss)

This proposal describes the connection migration mechanism within the SQL
Proxy that allows us to migrate SQL connections from one SQL pod to another
by building on top of the session migration work by the SQL Experience team.

This RFC is only scoped to _how_ to perform the migration from one SQL pod to
another within the SQL proxy. The details of _when_ to perform the migration
will be discussed in a follow-up RFC. Note that the main use-case here is for
CockroachDB Serverless, and we should optimize for that.

What this will eventually solve:
- rude disconnects during Serverless scale-down events
- load imbalance during steady state and Serverless scale-up events

From the user's perspective, the connection should remain usable without
interruption, and there will be better resource utilization across all
provisioned SQL pods.

Release justification: RFC/docs only change.

Release note: None
@jaylim-crl jaylim-crl force-pushed the jay/rfc-connection-migration branch from edb29eb to 8ce1d8a Compare March 13, 2022 19:54
@jaylim-crl
Copy link
Collaborator Author

bors r=JeffSwenson

If there are more concerns, I'd be happy to address in another PR.

@craig
Copy link
Contributor

craig bot commented Mar 19, 2022

Build succeeded:

@craig craig bot merged commit b3525c9 into cockroachdb:master Mar 19, 2022
@jaylim-crl jaylim-crl deleted the jay/rfc-connection-migration branch March 20, 2022 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants