sql: make txn ID resolution work with ephemeral SQL Pod #75998

Closed
Tracked by #74485
Azhng opened this issue Feb 4, 2022 · 2 comments
Labels
A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity X-stale

Comments

@Azhng
Contributor

Azhng commented Feb 4, 2022

The Contention Event Store RFC describes a mechanism to resolve a transaction ID (which uniquely identifies a transaction execution) into a transaction fingerprint ID (which identifies historical transaction statistics), namely the transaction ID resolution protocol.

This protocol depends on two pieces of infrastructure:

  1. TxnID Cache - Serves as the data storage. It's an in-memory FIFO buffer living in each SQL Server that records the transaction fingerprint ID for each transaction ID.
  2. Coordinator Node ID stored in TxnIntent - Provides the routing capability for the transaction ID resolution protocol. Using the Coordinator Node ID stored in the TxnIntent, the contention event registry can send RPCs to the coordinator node to query for the transaction fingerprint ID.
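
To make the data-storage half concrete, here is a minimal sketch of a bounded FIFO buffer that maps transaction IDs to transaction fingerprint IDs. All names (`TxnIDCache`, `Record`, `Lookup`) and the simplified types are hypothetical and chosen for illustration; this is not the actual CockroachDB implementation, which also has to handle concurrency.

```go
package main

import "fmt"

// TxnID and TxnFingerprintID are simplified stand-ins (assumptions) for the
// real identifier types.
type TxnID string
type TxnFingerprintID uint64

// TxnIDCache is a bounded FIFO buffer: once it is full, recording a new
// transaction ID evicts the oldest one. It is not safe for concurrent use.
type TxnIDCache struct {
	capacity int
	order    []TxnID // insertion order, oldest first
	entries  map[TxnID]TxnFingerprintID
}

// NewTxnIDCache returns a cache holding at most capacity entries.
// A capacity of zero or less means the cache never evicts.
func NewTxnIDCache(capacity int) *TxnIDCache {
	return &TxnIDCache{
		capacity: capacity,
		entries:  make(map[TxnID]TxnFingerprintID),
	}
}

// Record stores the fingerprint ID for a transaction ID, evicting the oldest
// entry first if the cache is already at capacity.
func (c *TxnIDCache) Record(id TxnID, fp TxnFingerprintID) {
	if _, ok := c.entries[id]; ok {
		c.entries[id] = fp // already tracked; keep its position in the queue
		return
	}
	if c.capacity > 0 && len(c.order) >= c.capacity {
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.entries, oldest)
	}
	c.order = append(c.order, id)
	c.entries[id] = fp
}

// Lookup resolves a transaction ID into its fingerprint ID, if still cached.
func (c *TxnIDCache) Lookup(id TxnID) (TxnFingerprintID, bool) {
	fp, ok := c.entries[id]
	return fp, ok
}

func main() {
	cache := NewTxnIDCache(2)
	cache.Record("txn-a", 101)
	cache.Record("txn-b", 102)
	cache.Record("txn-c", 103) // cache is full, so txn-a is evicted

	_, ok := cache.Lookup("txn-a")
	fmt.Println("txn-a still cached:", ok)
	fp, ok := cache.Lookup("txn-c")
	fmt.Println("txn-c:", fp, ok)
}
```

Because the buffer lives only in process memory, everything in `entries` disappears with the process, which is the root of the problem described next.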

However, the ephemeral nature of SQL Pods prevents this protocol (as it stands today) from functioning properly:

  1. Since the TxnID Cache lives purely in memory, when a SQL Pod goes down, all TxnID <> TxnFingerprintID mappings stored in its TxnID Cache are destroyed. Hence, we no longer have data storage.
  2. When a SQL Pod goes down, the Coordinator Node ID (SQLInstanceID in this case) stored in the TxnIntent is no longer valid. Hence, the contention event registry cannot route the RPC request to the correct node for transaction ID resolution.

Other projects in CRDB today already cope with the ephemeral nature of SQL Pods. One example is session migration, where a SQL connection (along with its sessions) can be seamlessly transferred from one SQL Pod to another. This suggests that making TxnID resolution functional in serverless is possible.


As a strawman solution that tackles the two problems listed above:

  1. Data Storage Problem: Provide a pair of builtins, one that serializes the TxnID Cache and one that deserializes it onto a different node. (Alternatively, is it possible to hook into the Server Draining Event and randomly pick a peer to send a serialized version of the TxnID Cache to?)
  2. Routing Problem: Provide a way to route requests after the Coordinator Node ID becomes invalid (because the SQL Pod went away). There are two ways to do this:
    1. Option 1: Brute-force method: when routing fails, the contention registry falls back to an RPC fanout. Since we addressed the data storage issue by migrating the TxnID Cache onto a different node, this would likely work. After the brute-force RPC fanout, the contention registry can take note of the new Coordinator Node ID that replaced the coordinator node that no longer exists.
    2. Option 2: Leave behind some sort of tombstone/forwarding record when a SQL Pod goes away, so that the contention registry can use that forwarding record to route the RPC to the new node where the TxnID Cache resides.
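
The strawman can be sketched as a toy, single-process model. Everything here is an assumption made for illustration: the `Cluster`, `Pod`, `Drain`, and `Resolve` names are hypothetical, `encoding/gob` stands in for whatever wire format the builtins would use, and direct map access stands in for RPCs. `Drain` covers the data storage problem, and `Resolve` tries the forwarding record (Option 2) before falling back to a fanout (Option 1).

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Simplified stand-ins (assumptions) for the real identifier types.
type (
	TxnID            string
	TxnFingerprintID uint64
	InstanceID       int
)

// Pod models a SQL Pod holding an in-memory TxnID cache.
type Pod struct {
	ID    InstanceID
	Cache map[TxnID]TxnFingerprintID
}

// Cluster is a hypothetical view of the live pods plus forwarding records.
type Cluster struct {
	Live    map[InstanceID]*Pod
	Forward map[InstanceID]InstanceID // departed pod -> pod that took over its cache
}

// Drain serializes the cache of pod from, restores it on pod to, and removes
// from. If leaveRecord is true it also leaves a forwarding record (Option 2).
func (c *Cluster) Drain(from, to InstanceID, leaveRecord bool) error {
	src, ok := c.Live[from]
	if !ok {
		return fmt.Errorf("pod %d is not live", from)
	}
	dst, ok := c.Live[to]
	if !ok || from == to {
		return fmt.Errorf("pod %d is not a valid peer", to)
	}
	var buf bytes.Buffer // stands in for the bytes sent to the peer
	if err := gob.NewEncoder(&buf).Encode(src.Cache); err != nil {
		return err
	}
	received := map[TxnID]TxnFingerprintID{}
	if err := gob.NewDecoder(&buf).Decode(&received); err != nil {
		return err
	}
	for id, fp := range received {
		dst.Cache[id] = fp
	}
	delete(c.Live, from)
	if leaveRecord {
		c.Forward[from] = to
	}
	return nil
}

// Resolve looks up a fingerprint ID starting from the coordinator recorded in
// the intent. It returns the fingerprint ID and the pod that answered.
func (c *Cluster) Resolve(coordinator InstanceID, id TxnID) (TxnFingerprintID, InstanceID, bool) {
	// Option 2: follow forwarding records; the hop bound guards against cycles.
	target := coordinator
	for hops := 0; hops <= len(c.Forward); hops++ {
		if _, live := c.Live[target]; live {
			break
		}
		next, ok := c.Forward[target]
		if !ok {
			break
		}
		target = next
	}
	if pod, ok := c.Live[target]; ok {
		if fp, ok := pod.Cache[id]; ok {
			return fp, target, true
		}
	}
	// Option 1: brute-force fanout over every live pod.
	for pid, pod := range c.Live {
		if fp, ok := pod.Cache[id]; ok {
			return fp, pid, true
		}
	}
	return 0, 0, false
}

func main() {
	c := &Cluster{
		Live: map[InstanceID]*Pod{
			1: {ID: 1, Cache: map[TxnID]TxnFingerprintID{"txn-a": 101}},
			2: {ID: 2, Cache: map[TxnID]TxnFingerprintID{}},
		},
		Forward: map[InstanceID]InstanceID{},
	}
	if err := c.Drain(1, 2, true); err != nil {
		panic(err)
	}
	fp, pod, ok := c.Resolve(1, "txn-a")
	fmt.Println("fingerprint:", fp, "answered by pod:", pod, "found:", ok)
}
```

In this model the caller can remember the returned pod as the new coordinator for later lookups, which corresponds to the "take note of the new Coordinator Node ID" step in Option 1.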

Jira issue: CRDB-12911

@blathers-crl

blathers-crl bot commented Feb 4, 2022

Hi @Azhng, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@Azhng Azhng changed the title from "Investigate a way to work with ephemeral SQL Pod" to "sql: make txn ID resolution work with ephemeral SQL Pod" Feb 4, 2022
@Azhng Azhng added A-sql-observability Related to observability of the SQL layer T-sql-observability labels Feb 4, 2022
@Azhng Azhng added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 4, 2022
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
