
sql: implement immutable tables that automatically use follower reads #26301

Closed
jordanlewis opened this issue May 31, 2018 · 14 comments
Labels
A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@jordanlewis
Member

jordanlewis commented May 31, 2018

Background

It's very common in real-world database schemas to have reference tables - infrequently-updated tables, often relatively small, that are used as foreign key tables in applications. Some examples:

  • a table of zipcodes; zipcodes are very infrequently deleted or added
  • the items table from tpcc; items in tpcc are never updated
  • a table of historical closing prices in a financial database; this table is updated once a day, but not during the day

These kinds of tables present some issues for CockroachDB. In CockroachDB, doing a read from a table requires going to the table's leaseholder to make sure that reads are consistent. In an application that plays to CockroachDB's strengths, with many globally distributed datacenters, reference tables like this will cause poor performance in several situations:

  • inserts into tables that have foreign keys to a reference table will be at least as slow as a read to the leaseholder of the reference table, which might live across the globe
  • joins from tables to a reference table will require going to the leaseholder, which might live across the globe

This may be unacceptable for customers who desire low-latency reads against reference tables and are willing to sacrifice write performance against those tables.

We can support this in CockroachDB by providing users the ability to mark a table as immutable, and using that information to provide fast, transactional reads to any replica.

Proposed solution for CockroachDB: Immutable tables

Immutable tables are tables that cannot be modified. Tables can be set as immutable, and reset back to mutable, by performing a schema change with an ordinary ALTER TABLE command.

When a table is marked as immutable, it's possible to drastically improve read performance from it by altering the read semantics. There are (at least) two ways of doing this:

  1. Using inconsistent reads against the table. That's safe because we know that the table is immutable, implying that all of its replicas are up-to-date.
  2. Using follower reads (read from a follower with timestamp bound #16593) against the table, with a read timestamp somewhere in the past, divorced from the active transaction timestamp. That's safe because follower reads guarantee that follower replicas are up-to-date as of some interval of time in the past.

The first solution is harder to implement, because of the semantics of leaving the immutable state - we'd have to invent a way to wait until data has fully replicated to all replicas. It also might be unsafe altogether - see @nvanbenschoten's comment on this issue.

The second solution is easier, and should be possible as soon as follower reads are available, using the schema change infrastructure and an intermediate state similar to the intermediate WRITE_ONLY state used for index backfills. The rest of this issue proposes to use the second solution.

Details on immutable/mutable states and state change process

Table descriptors will be extended with a new field, mutability, that has three possible values:

  • mutable
    • Reads from a mutable table behave as normal.
    • Writes to a mutable table are permitted.
  • immutable-must-read-transactional
    • Reads from an immutable-must-read-transactional table behave as normal.
    • Writes to an immutable-must-read-transactional table are not permitted.
  • immutable
    • Reads from an immutable table are performed at a historical timestamp, t seconds in the past (where t is defined by the follower read safe past interval), and get served by a follower read
    • Writes to an immutable table are not permitted
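As a minimal sketch of the three states above (identifiers here are illustrative, not actual CockroachDB code), the read/write semantics of each state could be modeled as:

```go
package main

import "fmt"

// Mutability models the three proposed descriptor states. The names
// are illustrative stand-ins for the proposed descriptor field values.
type Mutability int

const (
	Mutable Mutability = iota
	ImmutableMustReadTransactional
	Immutable
)

// WritesAllowed reports whether writes are permitted in this state.
// Only the fully mutable state accepts writes.
func (m Mutability) WritesAllowed() bool {
	return m == Mutable
}

// UsesFollowerReads reports whether reads are served at a historical
// timestamp by follower replicas rather than at the leaseholder.
func (m Mutability) UsesFollowerReads() bool {
	return m == Immutable
}

func main() {
	for _, m := range []Mutability{Mutable, ImmutableMustReadTransactional, Immutable} {
		fmt.Printf("writes=%v followerReads=%v\n", m.WritesAllowed(), m.UsesFollowerReads())
	}
}
```

The intermediate state is the interesting one: it already rejects writes but still reads transactionally, which is what makes the timed transition below safe.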

By default, mutability is set to mutable. To transition a table from mutable to immutable, a user issues an ALTER TABLE t SET IMMUTABLE command (or similar - syntax to be decided later), which performs the following actions:

  1. Set table descriptor to immutable-must-read-transactional and wait until the cluster converges on the new descriptor.
  2. Wait t seconds to ensure that, once transitioned to immutable, all reads from t seconds ago occur before the last possible write to the reference table.
  3. Set table descriptor to immutable.

To transition a table from immutable to mutable, a user issues an ALTER TABLE t SET MUTABLE command, which performs the following actions:

  1. Set table descriptor to immutable-must-read-transactional and wait until the cluster converges on the new descriptor.
  2. Set table descriptor to mutable. (note we don't have to wait t seconds here, since we're moving in a "safe" direction, causing all future reads to be ordinary transactional reads.)
@jordanlewis jordanlewis added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-transactions Relating to MVCC and the transactional model. A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. labels May 31, 2018
@tbg
Member

tbg commented May 31, 2018

The immutable table may run the GC queue, which could make the historical timestamps unsafe. This is just a silly comment to make sure we're using a "relatively current" follower read timestamp, and not just some fixed past date.

One remaining problem is that when an immutable table needs to be updated, performance will fall off a cliff. But at least the operators get to choose when they do this, so hopefully it's workable. It also seems really hard to avoid that problem.

@nvanbenschoten
Member

Using inconsistent reads against the table. That's safe because we know that the table is immutable, implying that all of its replicas are up-to-date.

I don't think this is true. Inconsistent reads provide no staleness bounds whatsoever, so it doesn't follow that if a table is immutable then all of its replicas are up-to-date. Follower reads, however, provide a strict staleness bound and will perform the more expensive leaseholder read if the local follower is not sufficiently up-to-date.

One remaining problem is that when an immutable table needs to be updated, performance will fall off a cliff.

Performance will certainly drop, but it will be no worse than the case today where partitioned tables have foreign keys pointing at non-partitioned tables. Also, only latencies should increase in this scenario; it shouldn't create a throughput bottleneck that could suddenly topple a cluster. This is a much better situation to be in than the opposite, where changing an immutable table to mutable implies performing significantly more work and limiting available throughput.

@jordanlewis
Member Author

Inconsistent reads provide no staleness bounds whatsoever, so it doesn't follow that if a table is immutable then all of its replicas are up-to-date.

Updated the description to take this into account - thanks.

@bdarnell
Contributor

bdarnell commented Jun 1, 2018

Additionally, data from immutable tables can be cached aggressively, and the lookups themselves can be batched across transactions.

I'm concerned about the sharp performance difference of making an immutable table mutable. How can an admin tell whether they have the headroom to make the change?

It's worth considering other restrictions that could be imposed on a table descriptor that would allow foreign key-style reads to be sped up without needing to temporarily reverse those rules to make a change. For example, a "primary keys are append-only" rule would disallow deletions and changes to PK columns, and allow the existence of a PK to be cached and read non-transactionally for FK checks. If the local read says the FK doesn't exist, you could either fall back to a remote read or just fail the request (perhaps as an option on the FK constraint). Failing the request would mean that it takes some time after the reference table is updated before it can be used, but that would generally be OK and wouldn't allow data that violates the FK to be inserted.
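A sketch of the append-only idea, assuming a hypothetical consistent remote lookup against the leaseholder: because rows can never be deleted and PK columns can never change, a cached "exists" answer never goes stale, and only cache misses pay the cross-region cost.

```go
package main

import "fmt"

// fkChecker caches primary-key existence for an append-only reference
// table. A positive answer can be cached forever because deletions and
// PK updates are disallowed; a miss falls back to a consistent remote
// read. The remote func is a hypothetical leaseholder lookup.
type fkChecker struct {
	cache  map[string]bool
	remote func(pk string) bool
}

func (c *fkChecker) exists(pk string) bool {
	if c.cache[pk] {
		return true // cache hit: valid forever under append-only rules
	}
	if c.remote(pk) {
		c.cache[pk] = true
		return true
	}
	return false // could also fail the request outright, per the FK option
}

func main() {
	remoteCalls := 0
	c := &fkChecker{
		cache: map[string]bool{},
		remote: func(pk string) bool {
			remoteCalls++
			return pk == "90210" // pretend only this zipcode exists remotely
		},
	}
	fmt.Println(c.exists("90210"), c.exists("90210"), c.exists("00000"))
	fmt.Println("remote calls:", remoteCalls)
}
```

Only the first check of a given key and checks of absent keys go remote; negative results are deliberately not cached, since the key may be inserted later.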

The more precise the restrictions, the trickier it gets to use them for non-FK purposes (so maybe an "insert-only" flag would be better than one limited to PKs). But in any case, as an admin I'd be more comfortable using flags like "insert-only" instead of "usually immutable, but can be made mutable at high performance cost and risk".

@knz
Contributor

knz commented Jun 4, 2018

An alternative to the immutable-to-mutable transition:

  • keep the immutable data as-is

  • add two new (mutable) tables deleted_rows and new_or_changed_rows

  • implement a "smart view" which:

    • for read queries runs the query from new_or_changed_rows UNION immutable_rows EXCEPT deleted_rows
    • for delete queries adds the row to deleted_rows
    • for inserts/updates adds the row to new_or_changed_rows

Then we'd have a background job which creates an entire new table with the combination of all 3.

Then when the background job completes we replace the "smart view" by the new mutable descriptor.
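The read path of this "smart view" can be sketched as an overlay lookup: changed rows win, deleted rows mask the immutable base, and everything else comes from the immutable table (table contents are toy maps here, not real descriptors):

```go
package main

import "fmt"

// readRow resolves a key through the smart view:
// new_or_changed_rows UNION immutable_rows EXCEPT deleted_rows.
func readRow(key string, immutable, changed map[string]string, deleted map[string]bool) (string, bool) {
	if v, ok := changed[key]; ok {
		return v, true // inserts/updates shadow the immutable base
	}
	if deleted[key] {
		return "", false // deletions mask the immutable base
	}
	v, ok := immutable[key]
	return v, ok
}

func main() {
	immutable := map[string]string{"a": "1", "b": "2", "c": "3"}
	changed := map[string]string{"b": "20", "d": "4"}
	deleted := map[string]bool{"c": true}

	for _, k := range []string{"a", "b", "c", "d"} {
		v, ok := readRow(k, immutable, changed, deleted)
		fmt.Println(k, v, ok)
	}
}
```

The background job then amounts to materializing this same resolution over every key and swapping in the result as the new mutable table.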

@tbg
Member

tbg commented Jun 4, 2018

I like this proposal, especially with Ben's twist about making these tables insert-only.

If the local read says the FK doesn't exist, you could either fall back to a remote read or just fail the request (perhaps an option on the FK constraint. Failing the request would mean that it takes some time after the reference table is updated before it can be used, but that would generally be OK and wouldn't allow data that violates the FK to be inserted)

Falling back to the remote read sounds like the better option if we get to aggressively cache the result.

@knz what you're proposing seems to add a lot of complexity in SQL-land. And don't you still have the problem that you can't guarantee that a deletion doesn't mess up the internal consistency?

@a-robinson
Contributor

I don't think this is true. Inconsistent reads provide no staleness bounds whatsoever, so it doesn't follow that if a table is immutable then all of its replicas are up-to-date. Follower reads, however, provide a strict staleness bound and will perform the more expensive leaseholder read if the local follower is not sufficiently up-to-date.

Couldn't we get around that, though, by simply using the intermediate immutable-must-read-transactional state to wait until all replicas are up-to-date before changing to the immutable state and allowing inconsistent reads to start?

tbg added a commit to tbg/cockroach that referenced this issue Jun 4, 2018
Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads.

The key enabling technology is the propagation of **closed timestamp
heartbeats** from the range leaseholder to followers. The closed timestamp
heartbeat (CT heartbeat) is more than just a timestamp. It is a set of
conditions that, if met, guarantee that a follower replica has all state
necessary to satisfy reads at or before the CT.

Consistent historical reads are useful for analytics queries and in particular
allow such queries to be carried out more efficiently and, with appropriate
configuration, away from foreground traffic. But historical reads are also key
to a proposal for [reference-like
tables](cockroachdb#26301) aimed at cutting
down on foreign key check latencies particularly in geo-distributed clusters;
they help recover a reasonably recent consistent snapshot of a cluster after a
loss of quorum; and they are one of the ingredients for [Change Data
Capture](cockroachdb#25229).

Release note: None
@tbg
Member

tbg commented Jun 5, 2018

Hmm, that's not a bad idea, though it requires guarantees that are basically true all the time but not necessarily always. If a replica set is fully replicated once, can there be some scenario in which it becomes non-fully replicated later? Yes, that could happen if the replica set is in the middle of a replica change (and a preemptive snapshot that wasn't the most recent data is sent). The newly added replica could need a second to catch up. And even if there isn't a replica change in progress, could there be an old preemptive snapshot somewhere else that gets reused in a later replica change?

@nvanbenschoten
Member

Couldn't we get around that, though, by simply using the intermediate immutable-must-read-transactional state to wait until all replicas are up-to-date before changing to the mutable state and allowing inconsistent reads to start?

Yeah, I don't think this is insurmountable, but it would require facilities that aren't currently provided by our storage layer. First, we'd need to define what all replicas being "up-to-date" means. How would we track this? We don't have a good hook here. We could probably wait for all ranges to quiesce as an indication of all replicas being up to date, but this relies on breaking the abstraction that only a single table will ever send load to a given range. We'd then also need the second facility that @tschottdorf alluded to - a guarantee that once all replicas in a range have reached a certain log entry, no replica in that range will ever regress. I wasn't aware of cases that break this guarantee, but it's not one we explicitly make, and it sounds like that's because it doesn't hold in all cases.

The nice thing about using follower reads is that it provides both of these facilities while maintaining the abstraction of only thinking about the time domain in SQL and ignoring all replication concerns.

@justinj
Contributor

justinj commented Jun 5, 2018

🔬🐶 Would it be completely outlandish to treat these tables as schema elements rather than as data? I know schema changes aren't transactional, but I think this could be made to work using the usual sort of schema change state machine (obviously with very poor mutation performance), and there are already schema change mechanisms for maintaining consistency between replicas that could possibly be used for this.

tbg added a commit to tbg/cockroach that referenced this issue Jun 28, 2018
craig bot pushed a commit that referenced this issue Jul 18, 2018
26362: RFC: follower reads r=bdarnell,nvanbenschoten a=tschottdorf

NB: this is extracted from #21056; please don't add new commentary on the
tech note there.


27699: storage: fix stopper race in compactor r=petermattis a=tschottdorf

Starting workers without a surrounding task is unfortunately often not
the right thing to do when the worker accesses other state that might
become invalidated once the stopper begins to stop. In this particular
case, the compactor might end up accessing the engine even though it
had already been closed.

I wasn't able to repro this failure in the first place, but pretty sure
this:
Fixes #27232.

Release note: None

27704: issues: fix email fallback r=petermattis a=tschottdorf

This was not my email address.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg tbg added this to the 2.2 milestone Jul 22, 2018
@tbg tbg added A-coreperf and removed A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. A-kv-distribution Relating to rebalancing and leasing. A-kv-client Relating to the KV client and the KV interface. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 31, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@jordanlewis
Member Author

Coming back and looking at this again in anticipation of 2.2 work, I'm not sure how much extra performance this proposal would gain over naive follower reads. @tschottdorf care to comment? Would the follower reads implementation we're initially targeting already get us a good chunk of the possible gains here?

@bdarnell
Contributor

bdarnell commented Oct 8, 2018

They're fairly different. Follower reads (as currently planned) only take effect for clients that opt in to potentially-stale data. This makes them kind of useless for foreign key checks, which require up-to-date information. Immutable tables optimize for reads by placing the entire burden on the (infrequent) writers.

Instead of follower reads, immutable tables compete with quorum leases. If we supported quorum leases with a quorum equal to the entire group size, then we'd have what we want here: each member would be able to serve consistent reads, but writes would have to talk to 100% of the replicas to commit (or wait for a timeout to expire the lease of any non-responsive node).

@andy-kimball
Contributor

I think we should support follower reads for scenarios where clients opt in using AS OF SYSTEM TIME, but then implement something like #31955 for cases where consistency is required (or do quorum leases). The trouble with immutability is that it works well until the table needs to be updated. Even if that happened only infrequently, it could be quite disruptive to any running workloads, since the DBA would need to make the table mutable, make the update/insert/delete, then make the table immutable again. During that period of time, I'd expect to see latency spikes, since other transactions would be much more likely to block or retry.

@awoods187
Contributor

Closing this in favor of #31955


10 participants