Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql, ccl/sqlproxyccl: connection migration tracking issue #76000

Closed
13 of 14 tasks
jaylim-crl opened this issue Feb 4, 2022 · 3 comments
Closed
13 of 14 tasks

sql, ccl/sqlproxyccl: connection migration tracking issue #76000

jaylim-crl opened this issue Feb 4, 2022 · 3 comments
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-serverless

Comments

@jaylim-crl
Copy link
Collaborator

jaylim-crl commented Feb 4, 2022

This issue tracks the implementation of the connection migration mechanism, as described in this RFC: #75707.

Related Epics:

Jira issue: CRDB-12912

@blathers-crl
Copy link

blathers-crl bot commented Feb 4, 2022

Hi @jaylim-crl, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@jaylim-crl jaylim-crl added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 4, 2022
@jaylim-crl jaylim-crl self-assigned this Feb 4, 2022
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Feb 7, 2022
Informs cockroachdb#76000.

This commit adds the SHOW TRANSFER STATE observer statement, as described in
the sqlproxy connection migration RFC. This observer statement will be used
whenever a connection is about to be migrated to retrieve the relevant values
needed for the transfer process. A unique aspect to this statement is that
serialization or token generation errors will be returned as a SQL value
instead of an ErrorResponse. This will allow the sqlproxy to react accordingly,
and to reduce ambiguity issues during transferring.

This observer statement will only work on tenants (due to the need of token
generation). Since this is meant to be used internally only, there is no
release note.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Feb 7, 2022
Informs cockroachdb#76000.

This commit implements postgres interceptors, namely FrontendInterceptor and
BackendInterceptor, as described in the sqlproxy connection migration RFC.
These interceptors will be used as building blocks for the forwarder component
that we will be adding in a later PR. Since the forwarder component has not
been added, a simple proxy test (i.e. TestSimpleProxy) has been added to
illustrate how the frontend and backend interceptors can be used within the
proxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Feb 7, 2022
Informs cockroachdb#76000.

This commit refactors the ConnectionCopy call in proxy_handler.go into a new
forwarder component, which was described in the connection migration RFC. At
the moment, this forwarder component does basic forwarding through
ConnectionCopy, just like before, so there should be no behavioral changes to
the proxy. This will serve as a building block for subsequent PRs.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Feb 11, 2022
…val token

Informs cockroachdb#76000.

Previously, all the connection establishment logic is coupled with the handler
function within proxy_handler.go. This makes connecting to a new SQL pod during
connection migration difficult. This commit refactors all of those connection
logic out of the proxy handler into a connector component, as described in the
connection migration RFC. At the same time, we also add support for the session
revival token within this connector component.

Note that the overall behavior of the SQL proxy should be unchanged with this
commit.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Feb 11, 2022
Informs cockroachdb#76000.

This commit implements postgres interceptors, namely FrontendInterceptor and
BackendInterceptor, as described in the sqlproxy connection migration RFC.
These interceptors will be used as building blocks for the forwarder component
that we will be adding in a later PR. Since the forwarder component has not
been added, a simple proxy test (i.e. TestSimpleProxy) has been added to
illustrate how the frontend and backend interceptors can be used within the
proxy.

Release note: None
craig bot pushed a commit that referenced this issue Mar 3, 2022
…77328 #77335

75751: sql: Add DateStyle/IntervalStyle visitor r=e-mbrown a=e-mbrown

The DateStyle visitor allows for cast expressions with string
to interval and date/interval types to string cast to be rewritten.
These stable cast cause issues with DateStyle/IntervalStyle formatting so they
need to be wrapped in builtins containing their immutable version.

Release note: None
Release justification: Low risk update to new functionality

76705: backupccl: add prototype metadata.sst r=rhu713 a=rhu713

This adds writing of an additional file to the completion of BACKUP. This new file
is an sstable that contains the same metadata currently stored in the BACKUP_MANIFEST
file and statistics files, but organizes that data differently.

The current BACKUP_MANIFEST file contains a single binary-encoded protobuf message
of type BackupManifest, that in turn has several fields some of which are repeated
to contain e.g. the TableDescriptor for every table backed up, or every revision to
every table descriptor backed up. This can result in these manifests being quite large
in some cases, which is potentially concerning because as a single protobuf message,
one has to read and unmarshal the entire struct into memory to read any field(s) of it.

Organizing this metadata into an SSTable where repeated fields are instead stored as
separate messages under separate keys should instead allow reading it incrementally:
one can seek to a particular key or key prefix and then scan, acting on whatever data
is found as it is read, without loading the entire file at once (when opened using the
same seek-ing remote SST reader we use to read backup data ssts).

This initial prototype adds only the writer -- RESTORE does not rely on, or even open,
this new file at this time.

Release note: none.

77018: release: automate orchestration version update r=celiala a=rail

Previously, as a part of the release process we had to bump the
orchestration versions using `sed` with some error-prone regexes.

This patch adds `set-orchestration-version` subcommand to the release
tool. It uses templates in order to generate the orchestration files.

Release note: None

77055: sql: change index backfill merger to use batch api r=rhu713 a=rhu713

Use Batch API instead of txn.Scan() in order to limit the number of bytes per
batch response in the index backfill merger.

Fixes #76685.

Release note: None

77065: bazel: use test sharding more liberally r=rail a=rickystewart

Closes #76376.

Release note: None

77109: ccl/sqlproxyccl: add helpers related to connection migration r=JeffSwenson,andy-kimball a=jaylim-crl

#### ccl/sqlproxyccl: add helpers related to connection migration 

Informs #76000. Extracted from #76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

Release note: None

Release justification: Helpers added in this commit are needed for the
connection migration work. Connection migration is currently not being used
in production, and CockroachCloud is the only user of sqlproxy.
  
#### ccl/sqlproxyccl: fix math for defaultBufferSize in interceptors 

Previously, we incorrectly defined defaultBufferSize as 16K bytes. Note that
2 << 13 is 16K bytes. This commit fixes that behavior to match the original
intention of 8K bytes.

Release note: None

Release justification: This fixes an unintentional buglet within the sqlproxy
code that was introduced with the interceptors back then. Not having this in
means we're using double the memory for each connection within the sqlproxy.



77307: sql: add cluster setting to limit max size of serialized session r=otan,jaylim-crl a=rafiss

fixes #77302

The sql.session_transfer.max_session_size cluster setting can be used to
limit the max size of a session that is serialized using
crdb_internal.serialize_session.

No release note since this is not a public setting.

Release justification: high priority fix for new functionality.

Release note: None

77318: roachpb: extract keysbase to break some dependencies r=yuzefovich a=yuzefovich

This commit extracts a couple of things out of `roachpb` into new
`keysbase` package in order to break the dependency of `util/json` and
`sql/inverted` on `roachpb` (which is a part of the effort to clean up
the dependencies of `execgen`).

Addresses: #77234.

Release note: None

Release justification: low risk change to clean up the dependencies.

77319: sessiondatapb: move one enum definition into lex package r=yuzefovich a=yuzefovich

This commit moves the definition of `BytesEncodeFormat` enum from
`sessiondatapb` to `lex`. This is done in order to make `lex` not depend
on a lot of stuff (eventually on `roachpb`) and is a part of the effort
to clean up the dependencies of `execgen`. Note that the proto package
name is not changed, so this change is backwards-compatible.

Informs: #77234.

Release note: None

Release justification: low risk change to clean up the dependencies.

77328: roachtest: log stdout and stderr in sstable corruption test r=itsbilal a=nicktrav

To aid in debugging #77321, log the contents stdout and stderr if the
manifest dump command fails.

Release justification: Tests only.

Release note: None.

77335: kvserver: fix race that caused truncator to truncate non-alive replica r=tbg,erikgrinaker a=sumeerbhola

This was causing truncated state to be written to such a
replica, which would then get picked up as the
HardState.Commit value when a different replica was later
added back for the same range. See
#77030 (comment)
for the detailed explanation.

Also restore the default value of
kv.raft_log.loosely_coupled_truncation.enabled to true.

Fixes #77030

Release justification: Bug fix.
Release note: None

Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
Informs cockroachdb#76000.

This commit implements postgres interceptors, namely FrontendInterceptor and
BackendInterceptor, as described in the sqlproxy connection migration RFC.
These interceptors will be used as building blocks for the forwarder component
that we will be adding in a later PR. Since the forwarder component has not
been added, a simple proxy test (i.e. TestSimpleProxy) has been added to
illustrate how the frontend and backend interceptors can be used within the
proxy.

Release note: None
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
…y size

Informs cockroachdb#76000. Follow-up to cockroachdb#76006.

Previously, PeekMsg was returning the body size (excluding header size), which
is a bit awkward from an API point of view because most callers of PeekMsg
immediately adds the header size to the returned size previously. This commit
cleans the API design up by making PeekMsg return the message size instead,
i.e. header inclusive. At the same time, returning the message size makes it
consistent with the ReadMsg API since that returns the entire message.

Release note: None
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
…val token

Informs cockroachdb#76000.

Previously, all the connection establishment logic is coupled with the handler
function within proxy_handler.go. This makes connecting to a new SQL pod during
connection migration difficult. This commit refactors all of those connection
logic out of the proxy handler into a connector component, as described in the
connection migration RFC. At the same time, we also add support for the session
revival token within this connector component.

Note that the overall behavior of the SQL proxy should be unchanged with this
commit.

Release note: None
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
Informs cockroachdb#76000.

This commit refactors the ConnectionCopy call in proxy_handler.go into a new
forwarder component, which was described in the connection migration RFC. At
the moment, this forwarder component does basic forwarding through
ConnectionCopy, just like before, so there should be no behavioral changes to
the proxy. This will serve as a building block for subsequent PRs.

Release note: None
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
…ceptors

Informs cockroachdb#76000.

Previously, we were using io.Copy through ConnectionCopy to forward messages
between the client and SQL server. Now that the interceptors have been merged,
we will update the forwarder to use these interceptors instead of the old
approach.

There are a few notable changes in this commit:
1. We wrap clientConn with the a readTimeoutConn that was exposed in the
   previous commit. This allows us to unblock on Read whenever an activity
   occurs (e.g. context cancellation, and in the future, when transfer is
   requested).
2. There are two goroutines per connection within the forwarder: one for the
   request processor (client to server), and the other for the response
   processor (server to client).
3. We also removed unnecessary checks for codeExpiredClientConnection and
   codeIdleDisconnect errors when copying from server to client. These are
   already handled by the idle monitor's callback as well as the denylist
   watcher's callback. When those constructs were implemented back then, we
   did not remove them from ConnectionCopy.

We need to clean up error messages, and how we propagate them back to the
user because today we just close the connection without returning a response,
resulting in, I believe, a broken pipe error.

Release note: None
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
Informs cockroachdb#76000.

This commit adds the SHOW TRANSFER STATE observer statement, as described in
the sqlproxy connection migration RFC. This observer statement will be used
whenever a connection is about to be migrated to retrieve the relevant values
needed for the transfer process. A unique aspect to this statement is that
serialization or token generation errors will be returned as a SQL value
instead of an ErrorResponse. This will allow the sqlproxy to react accordingly,
and to reduce ambiguity issues during transferring.

This observer statement will only work on tenants (due to the need of token
generation). Since this is meant to be used internally only, there is no
release note.

Release note: None

Co-authored-by: Jay <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

Previously, we assumed that with the token-based authentication, the server is
ready to accept queries the moment we connect to it. This assumption has been
proved wrong during the integration tests with the forwarder, and there's
an implicit AuthenticateOK step that happens after connecting to the server.
During that time, initial connection data such as ParameterStatus and
BackendKeyData messages will be sent to the client as well. For now, we will
ignore those messages. Once we start implementing query cancellation within
the proxy, that has to be updated to cache the new BackendKeyData.

This commit also fixes a buglet to handle pgwire messages with no body.
pgproto3's Receive methods will still call Next if the body size is 0, and
previously, we were returning an io.EOF error. This commit changes that
behavior to return an empty slice.

Release note: None

Release justification: This fixes two buglets: one that was introduced when we
added token-based authentication support to the proxy in cockroachdb#76417, and another
when we added the interceptors. This is low risk as part of the code is
guarded behind the connection migration feature, which is currently not being
used in production. To add on, CockroachCloud is the only user of sqlproxy.
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Mar 10, 2022
Informs cockroachdb#76000.

This commit completes the connection migration feature in the the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.

Release justification: This completes the first half of the connection
migration feature. This is low risk as part of the code is guarded behind the
connection migration feature, which is currently not being used in production.
To add on, CockroachCloud is the only user of sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Mar 10, 2022
Informs cockroachdb#76000.

This commit completes the connection migration feature in the the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.

Release justification: This completes the first half of the connection
migration feature. This is low risk as part of the code is guarded behind the
connection migration feature, which is currently not being used in production.
To add on, CockroachCloud is the only user of sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Mar 11, 2022
Informs cockroachdb#76000.

This commit completes the connection migration feature in the the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.

Release justification: This completes the first half of the connection
migration feature. This is low risk as part of the code is guarded behind the
connection migration feature, which is currently not being used in production.
To add on, CockroachCloud is the only user of sqlproxy.

Release note: None
craig bot pushed a commit that referenced this issue Mar 13, 2022
76805: ccl/sqlproxyccl: complete connection migration support in the forwarder r=JeffSwenson a=jaylim-crl

#### ccl/sqlproxyccl: add last message details to the forwarder's processor

This commit adds last message details to the forwarder's processor, and in
particular, lastMessageType and lastMessageTransferredAt. The latter is
implemented using a simple logical clock, which will be used to determine
ordering of events. These two will be used during connection migration to
determine a safe transfer point.

At the same time, we plumb the connector and metrics into the processor, which
will be used in subsequent commits.

Release justification: Low risk, sqlproxy-only change.

Release note: None

#### ccl/sqlproxyccl: support waitResumed on the processor to block until resumption

Previously, there could be a race where suspend() was called right after
resuming the processors. If the processor goroutines have not started, suspend
will implicitly return, leading to a violation of an invariant, where we want
the processors to be suspended before proceeding. This commit adds a new
waitResumed method on the processor that allows callers to block until the
processors have been resumed.

Release justification: sqlproxy-only change.

Release note: None

#### ccl/sqlproxyccl: complete connection migration support in the forwarder

Informs #76000.

This commit completes the connection migration feature in the the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.

Release justification: This completes the first half of the connection
migration feature. This is low risk as part of the code is guarded behind the
connection migration feature, which is currently not being used in production.
To add on, CockroachCloud is the only user of sqlproxy.

Release note: None


Co-authored-by: Jay <[email protected]>
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Mar 13, 2022
…e size

Informs cockroachdb#76000, and follow up to cockroachdb#76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Mar 14, 2022
…e size

Informs cockroachdb#76000, and follow up to cockroachdb#76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
craig bot pushed a commit that referenced this issue Mar 14, 2022
77442: sqlproxyccl: Add codeUnavailable to the list of error codes r=alyshanjahani-crl a=alyshanjahani-crl

Release justification: fixes for high-priority bug in existing functionality

Previously, if a tenant cluster had maxPods set to 0 the error returned by
directory.EnsureTenantAddr was not treated as a non-retryable error.

The tenant directory implementation used in CockroachCloud now identifies
this situation and returns a status error with codes.FailedPrecondition and
a descriptive message.

This patch adds a check for the FailedPrecondition error in
connector.lookupAddr.

Release note: None

77626: sql: re-enable txnidcache by default r=maryliag,ajwerner a=Azhng

Previously, TxnID cache was disabled by default due to performance
issues. This has since been resolved in #77220.
This commit re-enables TxnID Cache by default.

Release justification: low risk high reward change.
Release note: None

77675: cmd/roachtest: tweaks to sstable corruption test r=itsbilal a=nicktrav

A few minor changes to the existing roachtest:

Reduce the warehouse count on ingest back to 100 - this should be ample,
and it matches the warehouse count on the read path in a later part of
the test.

Change the stop options to wait until the process has been terminated
before continuing.

Re-work the command to search for SSTables to corrupt - use tables from
the most recent manifest (if multiple are present); consider SSTables
with either a start or end key representing a user table; shuffle tables
to distribute corruption over the LSM; perform filtering in the bash
command rather than in the test runner itself.

Release justification: Roachtest-only change.

Release note: None.

Touches #77321.

77700: ccl/sqlproxyccl: add connection migration-related metrics r=JeffSwenson a=jaylim-crl

#### ccl/sqlproxyccl: add metric that measures connection migration latency

Previously, we added support for connection migration in #76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. Having this metric would be beneficial
as we can now know how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None

#### ccl/sqlproxyccl: add metric that records the transfer response message size

Informs #76000, and follow up to #76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None

Co-authored-by: Alyshan Jahani <[email protected]>
Co-authored-by: Azhng <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Jay <[email protected]>
craig bot pushed a commit that referenced this issue Mar 23, 2022
78276: ccl/sqlproxyccl: rename tenant.Resolver to tenant.DirectoryCache  r=JeffSwenson,andy-kimball a=jaylim-crl

Informs #76000.

#### ccl/sqlproxyccl: move the TenantResolver interface to the tenant package 

Previously, the TenantResolver interface lived in the base sqlproxyccl package,
and there was no way to enforce that the tenant.Directory struct implements
this interface since that would result in a cyclic dependency. This commit
moves the TenantResolver interface into the tenant package, and enforced that
the tenant.Directory struct implements that.

This would then allow us to rename the tenant.Directory struct to
tenant.directoryCache, and the tenant.Resolver interface to
tenant.DirectoryCache.

Release note: None

#### ccl/sqlproxyccl: rename tenant.Resolver to tenant.DirectoryCache 

This commit renames the tenant.Resolver interface to tenant.DirectoryCache.
The existing Directory struct was also renamed to directoryCache to prevent
confusions since there were previously two definitions of a "directory": one
pointing to the cache of entries, whereas another referring to the actual
directory server.

Release note: None

Co-authored-by: Jay <[email protected]>
@rafiss
Copy link
Collaborator

rafiss commented May 31, 2023

Can this be closed?

@jaylim-crl
Copy link
Collaborator Author

The connection migration work has been completed, and has been running in production for over a year now. Closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-serverless
Projects
None yet
Development

No branches or pull requests

2 participants