kv: consistent follower reads with leaseholder coordination #72593

nvanbenschoten · 2021-11-10T04:02:51Z

To date, follower reads (rfc/follower_reads.md, rfc/follower_reads_implementation.md, rfc/non_blocking_txns.md) have always been viewed first and foremost as a tool to minimize latency for read-only operations. By avoiding all communication with a range's leaseholder, follower reads can help a transaction avoid cross-region communication, dramatically reducing latency. However, in order to avoid any coordination with the leaseholder, follower reads trade off some utility — they either require reads to be stale or writes to be pushed into the future. This limits the places where they can be used.

This issue explores an extended form of "consistent" follower read that can be used in more situations than "stale" follower reads but still requires synchronous fixed-size (with respect to data accessed) communication with the range's leaseholder, negating what we have traditionally viewed as the primary benefit of follower reads. It also explores the secondary benefits of follower reads that remain even if the leaseholder helps coordinate the read off of a follower.

Motivations

Network costs in public clouds are expensive. They are also asymmetric, with pricing dependent on the source and destination of data transfer. For example, we see from EC2's data transfer pricing page that cross-region transfer costs between $0.01-$0.02 per GB, cross-zone transfer costs $0.01 per GB, and intra-zone transfer is free. This asymmetric pricing provides a strong incentive to minimize the amount of data shipped across regions/zones, even if some communication across regions/zones is unavoidable. Recognizing that many clients often have a follower for a given range closer (in data transfer cost terms) than the leaseholder presents an opportunity for cost savings.
Load-based splitting and rebalancing can help spread out well-distributed load across a cluster of nodes. However, they cannot spread out hotspots that cannot be split into different ranges. For read-heavy hotspots, serving reads from follower replicas can provide a form of load-balancing. This is true even if the leaseholder is contacted at some point to facilitate the read from the follower, as long as the follower is the one performing the expensive portion of the read (e.g. reading from disk, sending the result set to the client over the network, etc.).
(Stretch motivation) In the future, followers may store data in a different layout than leaseholders (e.g. column-oriented instead of row-oriented), which may be better suited for large analytical-style reads. The data organization would exchange write performance for read performance, so it would be more appropriate for a follower by virtue of the fact that followers can apply log entries at a slower cadence than leaseholders (e.g. batching 100s of entries to apply at a time). Serving reads from follower replicas would allow these read-optimized followers to be used even for consistent reads.

High-Level Overview

The key idea here is that even if the leaseholder is contacted during a read, it doesn't need to be the one to serve the read's results. Instead, it can be contacted to do some light bookkeeping and then offload the heavy lifting to a follower replica, which may be a better candidate to serve the data back to the client.

For the sake of this issue, let's pretend we introduced a new request type called EstablishResolvedTimestamp (a sibling to the QueryResolvedTimestamp request).

In response to an EstablishResolvedTimestamp request, the leaseholder would concern itself with concurrency control and with determining how far the follower needs to catch up on its Raft log before its state machine contains a fully resolved view of the specified span x timestamp segment of "keyspacetime". Morally, the leaseholder would be in charge of creating a resolved timestamp over the given key span at the given timestamp. So the API would look something like this:

type EstablishResolvedTimestampRequest struct{ Transaction, Timestamp, Span }
type EstablishResolvedTimestampResponse struct{ LeaseAppliedIndex }
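
As a minimal sketch, the two messages above could be spelled out in Go as follows. `Timestamp`, `Span`, and `Transaction` are simplified, hypothetical stand-ins for the real KV types, and the `CanServe` helper is an illustrative assumption, not part of the proposal.

```go
package main

import "fmt"

// Timestamp, Span, and Transaction are simplified, hypothetical stand-ins
// for the real KV types; their fields are assumptions for illustration.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

type Span struct {
	Key, EndKey string
}

type Transaction struct {
	ID string
}

// EstablishResolvedTimestampRequest asks the leaseholder to create a resolved
// timestamp over Span at Timestamp on behalf of Txn.
type EstablishResolvedTimestampRequest struct {
	Txn       Transaction
	Timestamp Timestamp
	Span      Span
}

// EstablishResolvedTimestampResponse tells the caller how far a follower must
// apply its Raft log before it can serve the read.
type EstablishResolvedTimestampResponse struct {
	LeaseAppliedIndex uint64
}

// CanServe reports whether a follower that has applied up to appliedIndex has
// caught up far enough. This helper is hypothetical.
func (r EstablishResolvedTimestampResponse) CanServe(appliedIndex uint64) bool {
	return appliedIndex >= r.LeaseAppliedIndex
}

func main() {
	req := EstablishResolvedTimestampRequest{
		Txn:       Transaction{ID: "txn-1"},
		Timestamp: Timestamp{WallTime: 100},
		Span:      Span{Key: "a", EndKey: "c"},
	}
	resp := EstablishResolvedTimestampResponse{LeaseAppliedIndex: 42}
	fmt.Println(req.Span.Key, resp.CanServe(41), resp.CanServe(42))
}
```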

With this new API, followers can now be used to serve consistent follower reads. Either of the following approaches would work here, and each has its own benefits:

Follower-coordinated

client issues scan/get to nearest follower replica with ts
follower checks closed timestamp against ts, determines its closed timestamp is not high enough
follower sends EstablishResolvedTimestampRequest to leaseholder
leaseholder grabs latches, checks lock table, bumps timestamp cache over span, and notes current lease_applied_index
leaseholder returns EstablishResolvedTimestampResponse with lease_applied_index
follower waits to apply log entry with lease_applied_index >= one from response
follower serves read and returns to client

Benefits:

simpler
minimal API changes
can avoid leaseholder hop if closed timestamp happens to be high enough
can short-circuit leaseholder hop if closed timestamp increases while EstablishResolvedTimestamp outstanding, which can be helpful for stale reads as well
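
The follower-coordinated steps above can be sketched as a toy, in-memory model. Everything here is an assumption made for illustration: the `leaseholder` and `follower` types, the unversioned map standing in for MVCC data, and the use of `sync.Cond` to wait for log application. Latching, lock-table checks, and the short-circuit on a rising closed timestamp are elided.

```go
package main

import (
	"fmt"
	"sync"
)

// leaseholder is a toy model tracking only the timestamp cache high-water
// mark and the current lease applied index.
type leaseholder struct {
	mu                sync.Mutex
	tsCache           int64
	leaseAppliedIndex uint64
}

// establishResolvedTimestamp bumps the timestamp cache to ts and returns the
// current lease applied index.
func (l *leaseholder) establishResolvedTimestamp(ts int64) uint64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	if ts > l.tsCache {
		l.tsCache = ts
	}
	return l.leaseAppliedIndex
}

// follower is a toy replica holding an unversioned key-value map.
type follower struct {
	mu           sync.Mutex
	applied      *sync.Cond
	appliedIndex uint64
	closedTS     int64
	data         map[string]string
}

func newFollower() *follower {
	f := &follower{data: map[string]string{}}
	f.applied = sync.NewCond(&f.mu)
	return f
}

// apply installs one log entry and wakes any waiting readers.
func (f *follower) apply(index uint64, key, value string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.data[key] = value
	f.appliedIndex = index
	f.applied.Broadcast()
}

// read serves key at ts, contacting the leaseholder only when the closed
// timestamp is too low. It reports whether the leaseholder was contacted.
func (f *follower) read(lh *leaseholder, key string, ts int64) (string, bool) {
	f.mu.Lock()
	if f.closedTS >= ts {
		defer f.mu.Unlock()
		return f.data[key], false
	}
	f.mu.Unlock()

	lai := lh.establishResolvedTimestamp(ts)

	f.mu.Lock()
	defer f.mu.Unlock()
	for f.appliedIndex < lai {
		f.applied.Wait() // wait until the follower has caught up
	}
	return f.data[key], true
}

func main() {
	lh := &leaseholder{leaseAppliedIndex: 2}
	f := newFollower()
	f.apply(1, "k", "v1")

	done := make(chan string)
	go func() {
		v, _ := f.read(lh, "k", 10)
		done <- v
	}()
	f.apply(2, "k", "v2") // follower catches up to the returned index
	fmt.Println(<-done)
}
```

The read blocks until the follower's applied index reaches the index returned by the leaseholder, regardless of whether the apply happens before or after the read starts.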

Client-coordinated

client issues EstablishResolvedTimestampRequest to leaseholder
leaseholder grabs latches, checks lock table, bumps timestamp cache over span, and notes current lease_applied_index
leaseholder returns EstablishResolvedTimestampResponse with lease_applied_index
client redirects to follower with lease_applied_index
follower waits to apply log entry with lease_applied_index >= one from response
follower serves read and returns to client
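
A rough sketch of the client-side routing for this variant is shown below. The two function types are hypothetical stand-ins for the RPCs to the leaseholder and to the follower; the names and the string-based span are assumptions for illustration only.

```go
package main

import "fmt"

// establishFn models the EstablishResolvedTimestamp RPC to the leaseholder.
// It is a hypothetical signature.
type establishFn func(span string, ts int64) (leaseAppliedIndex uint64)

// followerReadFn models a read RPC to a follower, which must not serve the
// read until it has applied its log up to minLAI. Also hypothetical.
type followerReadFn func(span string, ts int64, minLAI uint64) []string

// clientCoordinatedRead first asks the leaseholder for a lease applied index,
// then redirects the actual read to a follower carrying that index.
func clientCoordinatedRead(
	establish establishFn, followerRead followerReadFn, span string, ts int64,
) []string {
	lai := establish(span, ts)
	return followerRead(span, ts, lai)
}

func main() {
	establish := func(span string, ts int64) uint64 { return 7 }
	followerRead := func(span string, ts int64, minLAI uint64) []string {
		return []string{fmt.Sprintf("%s@%d after index %d", span, ts, minLAI)}
	}
	fmt.Println(clientCoordinatedRead(establish, followerRead, "[a,c)", 10))
}
```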

Extended client-coordinated

client issues scan/get to leaseholder replica with some establish_resolved_timestamp_on_large_result flag
leaseholder grabs latches, checks lock table, bumps timestamp cache over span, evaluates
leaseholder determines if result is small or large. If small, return. If large, return lease_applied_index
client redirects to follower with lease_applied_index
follower waits to apply log entry with lease_applied_index >= one from response
follower serves read and returns to client

Benefits:

lazy determination of single-hop vs. multi-hop, based on actual result size instead of guess
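
The small-versus-large decision in the extended variant could look roughly like the sketch below. The `leaseholderResult` type, the byte-size threshold `maxInlineBytes`, and both function names are assumptions for illustration; the issue does not specify how "large" is defined.

```go
package main

import "fmt"

// leaseholderResult is a hypothetical response: either Rows (small result)
// or Redirect plus a LeaseAppliedIndex (large result).
type leaseholderResult struct {
	Rows              []string
	Redirect          bool
	LeaseAppliedIndex uint64
}

// evaluateOnLeaseholder returns the rows inline if their total size fits in
// maxInlineBytes; otherwise it returns only the lease applied index.
func evaluateOnLeaseholder(rows []string, lai uint64, maxInlineBytes int) leaseholderResult {
	size := 0
	for _, r := range rows {
		size += len(r)
	}
	if size <= maxInlineBytes {
		return leaseholderResult{Rows: rows}
	}
	return leaseholderResult{Redirect: true, LeaseAppliedIndex: lai}
}

// extendedRead returns the inline rows, or redirects to the follower with
// the lease applied index when the leaseholder asked for a redirect.
func extendedRead(res leaseholderResult, followerRead func(minLAI uint64) []string) []string {
	if !res.Redirect {
		return res.Rows
	}
	return followerRead(res.LeaseAppliedIndex)
}

func main() {
	rows := []string{"aaaa", "bbbb", "cccc"}
	follower := func(minLAI uint64) []string { return rows }

	small := evaluateOnLeaseholder(rows[:1], 5, 8)
	large := evaluateOnLeaseholder(rows, 5, 8)
	fmt.Println(small.Redirect, large.Redirect, len(extendedRead(large, follower)))
}
```

With this shape the client pays the second hop only when the leaseholder has already observed that the result is large.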

Additional unstructured notes:

- the Transaction is needed in EstablishResolvedTimestampRequest for deadlock detection
- if an EstablishResolvedTimestampRequest is scanning the entire range, it can also bump the closed timestamp
- the EstablishResolvedTimestampResponse could carry an observed timestamp to help avoid some uncertainty restarts
- uncertainty works as expected on follower
-- it *does not* need to resolve up to uncertainty interval
-- it knows that any causal predecessor will have been included in a log entry with <= lease_applied_index
- read-your-writes in read-write txn works as expected on follower
-- any intent writes will be flushed during pipeline stall on leaseholder and will have been included in a log entry with <= lease_applied_index
- how do limited scans play into this?
- how does an actor tail the log and wait for a lease_applied_index?
-- what if it sees a split? or a replica removal?
- when a follower is waiting to apply, is liveness guaranteed?
-- Does it ever need to wake up the range from quiescence or ensure an active leaseholder?

Jira issue: CRDB-11223

Epic CRDB-14991


ajwerner · 2021-12-08T21:47:29Z

Do sql pods know their AZ and the AZ of sql nodes? I assume they know the latter thing, or, at least, have node descriptors which give them some info. Do we need to plumb more info into the sql pods to help facilitate this?

This commit adds support for the `--locality` and `--max-offset` flags to the `cockroach mt start-sql` command.

The first of these is important because tenant SQL pods should know where they reside. This will be important in the future for multi-region serverless and also for projects like cockroachdb#72593.

The second of these is important because the SQL pod's max-offset setting needs to be the same as the host cluster's. If we want to be able to configure the host cluster's maximum clock offset to some non-default value, we'll need SQL pods to be configured identically.

Validation of plumbing:

```sh
./cockroach start-single-node --insecure --max-offset=250ms
./cockroach sql --insecure -e 'select crdb_internal.create_tenant(2)'

# verify --max-offset

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0
# CRDB crashes with error "locally configured maximum clock offset (250ms) does not match that of node [::]:62744 (500ms)"

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0 --max-offset=250ms
# successful

# verify --locality

./cockroach sql --insecure --port=26258 -e 'select gateway_region()'

ERROR: gateway_region(): no region set on the locality flag on this node

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0 --max-offset=250ms --locality=region=us-east1

./cockroach sql --insecure --port=26258 -e 'select gateway_region()'

  gateway_region
------------------
  us-east1
```



github-actions · 2023-10-02T11:05:15Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

blathers-crl bot added the T-kv KV Team label Nov 10, 2021

nvanbenschoten mentioned this issue Dec 16, 2021

cli: support --locality and --max-offset flags with sql tenant pods #73943

Merged

arulajmani mentioned this issue May 4, 2022

kvcoord: secondary tenants do not take network latency into account when routing batch requests #81000

Closed

ajwerner mentioned this issue Jun 1, 2022

sql,kv,storage: push column batch generation into kvserver #82323

Open

10 tasks

aayushshah15 mentioned this issue Jul 5, 2022

kv: release latches before evaluation for read-only requests #66485

Closed

nvanbenschoten mentioned this issue Sep 26, 2022

kvserver: allow committing entries not in leader's stable storage #88699

Open

github-actions bot added the no-issue-activity label Oct 2, 2023

github-actions bot added the X-stale label Oct 16, 2023

github-actions bot closed this as not planned Oct 16, 2023

exalate-issue-sync bot closed this as completed Oct 16, 2023

nvanbenschoten reopened this Oct 23, 2023

nvanbenschoten removed X-stale no-issue-activity labels Oct 23, 2023


github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Incoming in KV Aug 28, 2024
