
sql: implement immutable tables that automatically use follower reads #26301

Closed
jordanlewis opened this issue May 31, 2018 · 14 comments
Labels
A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@jordanlewis
Member

jordanlewis commented May 31, 2018

Background

It's very common in real-world database schemas to have reference tables - infrequently-updated tables, often relatively small, that are used as foreign key tables in applications. Some examples:

  • a table of zipcodes; zipcodes are very infrequently deleted or added
  • the items table from tpcc; items in tpcc are never updated
  • a table of historical closing prices in a financial database; this table is updated once a day, but not during the day

These kinds of tables present some issues for CockroachDB. In CockroachDB, doing a read from a table requires going to the table's leaseholder to make sure that reads are consistent. In an application that plays to CockroachDB's strengths, with many globally distributed datacenters, reference tables like this will cause poor performance in several situations:

  • inserts into tables that have foreign keys to a reference table will be at least as slow as a read to the leaseholder of the reference table, which might live across the globe
  • joins from tables to a reference table will require going to the leaseholder, which might live across the globe

This may be unacceptable for customers who desire low-latency reads against reference tables and are willing to sacrifice write performance against those tables.

We can support this in CockroachDB by providing users the ability to mark a table as immutable, and using that information to provide fast, transactional reads to any replica.

Proposed solution for CockroachDB: Immutable tables

Immutable tables are tables that cannot be modified. Tables can be set as immutable, and reset back to mutable, by performing a schema change with an ordinary ALTER TABLE command.

When a table is marked as immutable, it's possible to drastically improve read performance from it by altering the read semantics. There are (at least) two ways of doing this:

  1. Using inconsistent reads against the table. That's safe because we know that the table is immutable, implying that all of its replicas are up-to-date.
  2. Using follower reads (read from a follower with timestamp bound #16593) against the table, with a read timestamp somewhere in the past, divorced from the active transaction timestamp. That's safe because follower reads guarantee that follower replicas are up-to-date as of some interval of time in the past.

The first solution is harder to implement, because of the semantics of leaving the immutable state - we'd have to invent a way to wait until data has fully replicated to all replicas. It also might be unsafe altogether - see @nvanbenschoten's comment on this issue.

The second solution is easier, and should be possible as soon as follower reads are available, using the schema change infrastructure and an intermediate state similar to the intermediate WRITE_ONLY state used for index backfills. The rest of this issue proposes to use the second solution.

Details on immutable/mutable states and state change process

Table descriptors will be extended with a new field, mutability, that has three possible values:

  • mutable
    • Reads from a mutable table behave as normal.
    • Writes to a mutable table are permitted.
  • immutable-must-read-transactional
    • Reads from an immutable-must-read-transactional table behave as normal.
    • Writes to an immutable-must-read-transactional table are not permitted.
  • immutable
    • Reads from an immutable table are performed at a historical timestamp, t seconds in the past (where t is defined by the follower read safe past interval), and get served by a follower read
    • Writes to an immutable table are not permitted
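As a minimal sketch of the three states above (identifiers here are illustrative, not actual CockroachDB code), the read/write semantics of each state could be modeled as:

```go
package main

import "fmt"

// Mutability models the three proposed descriptor states. The names
// are illustrative stand-ins for the proposed descriptor field values.
type Mutability int

const (
	Mutable Mutability = iota
	ImmutableMustReadTransactional
	Immutable
)

// WritesAllowed reports whether writes are permitted in this state.
// Only the fully mutable state accepts writes.
func (m Mutability) WritesAllowed() bool {
	return m == Mutable
}

// UsesFollowerReads reports whether reads are served at a historical
// timestamp by follower replicas rather than at the leaseholder.
func (m Mutability) UsesFollowerReads() bool {
	return m == Immutable
}

func main() {
	for _, m := range []Mutability{Mutable, ImmutableMustReadTransactional, Immutable} {
		fmt.Printf("writes=%v followerReads=%v\n", m.WritesAllowed(), m.UsesFollowerReads())
	}
}
```

The intermediate state is the interesting one: it already rejects writes but still reads transactionally, which is what makes the timed transition below safe.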

By default, mutability is set to mutable. To transition a table from mutable to immutable, a user issues an ALTER TABLE t SET IMMUTABLE command (or similar - syntax to be decided later), which performs the following actions:

  1. Set table descriptor to immutable-must-read-transactional and wait until the cluster converges on the new descriptor.
  2. Wait t seconds to ensure that, once transitioned to immutable, all reads from t seconds ago occur before the last possible write to the reference table.
  3. Set table descriptor to immutable.

To transition a table from immutable to mutable, a user issues an ALTER TABLE t SET MUTABLE command, which performs the following actions:

  1. Set table descriptor to immutable-must-read-transactional and wait until the cluster converges on the new descriptor.
  2. Set table descriptor to mutable. (note we don't have to wait t seconds here, since we're moving in a "safe" direction, causing all future reads to be ordinary transactional reads.)
@jordanlewis jordanlewis added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-transactions Relating to MVCC and the transactional model. A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. labels May 31, 2018
@tbg
Member

tbg commented May 31, 2018

The immutable table may run the GC queue, which could make the historical timestamps unsafe. This is just a silly comment to make sure we're using a "relatively current" follower read timestamp, and not just some fixed past date.

One remaining problem is that when an immutable table needs to be updated, performance will fall off a cliff. But at least the operators get to choose when they do this, so hopefully it's workable. It also seems really hard to avoid that problem.

@nvanbenschoten
Member

Using inconsistent reads against the table. That's safe because we know that the table is immutable, implying that all of its replicas are up-to-date.

I don't think this is true. Inconsistent reads provide no staleness bounds whatsoever, so it doesn't follow that if a table is immutable then all of its replicas are up-to-date. Follower reads, however, provide a strict staleness bound and will perform the more expensive leaseholder read if the local follower is not sufficiently up-to-date.

One remaining problem is that when an immutable table needs to be updated, performance will fall off a cliff.

Performance will certainly drop, but it will be no worse than the case today where partitioned tables have foreign keys pointing at non-partitioned tables. Also, only latencies should increase in this scenario; it shouldn't create a throughput bottleneck that could suddenly topple a cluster. This is a much better situation to be in than the opposite, where changing an immutable table to mutable implies performing significantly more work and limiting available throughput.

@jordanlewis
Member Author

Inconsistent reads provide no staleness bounds whatsoever, so it doesn't follow that if a table is immutable then all of its replicas are up-to-date.

Updated the description to take this into account - thanks.

@bdarnell
Contributor

bdarnell commented Jun 1, 2018

Additionally, data from immutable tables can be cached aggressively, and the lookups themselves can be batched across transactions.

I'm concerned about the sharp performance difference of making an immutable table mutable. How can an admin tell whether they have the headroom to make the change?

It's worth considering other restrictions that could be imposed on a table descriptor that would allow foreign key-style reads to be sped up without needing to temporarily reverse those rules to make a change. For example, a "primary keys are append-only" rule would disallow deletions and changes to PK columns, and allow the existence of a PK to be cached and read non-transactionally for FK checks. If the local read says the FK doesn't exist, you could either fall back to a remote read or just fail the request (perhaps as an option on the FK constraint). Failing the request would mean that it takes some time after the reference table is updated before it can be used, but that would generally be OK and wouldn't allow data that violates the FK to be inserted.
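A sketch of the append-only idea, assuming a hypothetical consistent remote lookup against the leaseholder: because rows can never be deleted and PK columns can never change, a cached "exists" answer never goes stale, and only cache misses pay the cross-region cost.

```go
package main

import "fmt"

// fkChecker caches primary-key existence for an append-only reference
// table. A positive answer can be cached forever because deletions and
// PK updates are disallowed; a miss falls back to a consistent remote
// read. The remote func is a hypothetical leaseholder lookup.
type fkChecker struct {
	cache  map[string]bool
	remote func(pk string) bool
}

func (c *fkChecker) exists(pk string) bool {
	if c.cache[pk] {
		return true // cache hit: valid forever under append-only rules
	}
	if c.remote(pk) {
		c.cache[pk] = true
		return true
	}
	return false // could also fail the request outright, per the FK option
}

func main() {
	remoteCalls := 0
	c := &fkChecker{
		cache: map[string]bool{},
		remote: func(pk string) bool {
			remoteCalls++
			return pk == "90210" // pretend only this zipcode exists remotely
		},
	}
	fmt.Println(c.exists("90210"), c.exists("90210"), c.exists("00000"))
	fmt.Println("remote calls:", remoteCalls)
}
```

Only the first check of a given key and checks of absent keys go remote; negative results are deliberately not cached, since the key may be inserted later.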

The more precise the restrictions, the trickier it gets to use them for non-FK purposes (so maybe an "insert-only" flag would be better than one limited to PKs). But in any case, as an admin I'd be more comfortable using flags like "insert-only" instead of "usually immutable, but can be made mutable at high performance cost and risk".

@knz
Contributor

knz commented Jun 4, 2018

An alternative to the immutable-to-mutable transition:

  • keep the immutable data as-is

  • add two new (mutable) tables deleted_rows and new_or_changed_rows

  • implement a "smart view" which:

    • for read queries runs the query from new_or_changed_rows UNION immutable_rows EXCEPT deleted_rows
    • for delete queries adds the row to deleted_rows
    • for inserts/updates adds the row to new_or_changed_rows

Then we'd have a background job which creates an entire new table with the combination of all 3.

Then when the background job completes we replace the "smart view" by the new mutable descriptor.
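The read path of this "smart view" can be sketched as an overlay lookup: changed rows win, deleted rows mask the immutable base, and everything else comes from the immutable table (table contents are toy maps here, not real descriptors):

```go
package main

import "fmt"

// readRow resolves a key through the smart view:
// new_or_changed_rows UNION immutable_rows EXCEPT deleted_rows.
func readRow(key string, immutable, changed map[string]string, deleted map[string]bool) (string, bool) {
	if v, ok := changed[key]; ok {
		return v, true // inserts/updates shadow the immutable base
	}
	if deleted[key] {
		return "", false // deletions mask the immutable base
	}
	v, ok := immutable[key]
	return v, ok
}

func main() {
	immutable := map[string]string{"a": "1", "b": "2", "c": "3"}
	changed := map[string]string{"b": "20", "d": "4"}
	deleted := map[string]bool{"c": true}

	for _, k := range []string{"a", "b", "c", "d"} {
		v, ok := readRow(k, immutable, changed, deleted)
		fmt.Println(k, v, ok)
	}
}
```

The background job then amounts to materializing this same resolution over every key and swapping in the result as the new mutable table.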

@tbg
Member

tbg commented Jun 4, 2018

I like this proposal, especially with Ben's twist about making these tables insert-only.

If the local read says the FK doesn't exist, you could either fall back to a remote read or just fail the request (perhaps an option on the FK constraint. Failing the request would mean that it takes some time after the reference table is updated before it can be used, but that would generally be OK and wouldn't allow data that violates the FK to be inserted)

Falling back to the remote read sounds like the better option if we get to aggressively cache the result.

@knz what you're proposing seems to add a lot of complexity in SQL-land. And don't you still have the problem that you can't guarantee that a deletion doesn't mess up the internal consistency?

@a-robinson
Contributor

I don't think this is true. Inconsistent reads provide no staleness bounds whatsoever, so it doesn't follow that if a table is immutable then all of its replicas are up-to-date. Follower reads, however, provide a strict staleness bound and will perform the more expensive leaseholder read if the local follower is not sufficiently up-to-date.

Couldn't we get around that, though, by simply using the intermediate immutable-must-read-transactional state to wait until all replicas are up-to-date before changing to the immutable state and allowing inconsistent reads to start?

tbg added a commit to tbg/cockroach that referenced this issue Jun 4, 2018
Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads.

The key enabling technology is the propagation of **closed timestamp
heartbeats** from the range leaseholder to followers. The closed timestamp
heartbeat (CT heartbeat) is more than just a timestamp. It is a set of
conditions that, if met, guarantee that a follower replica has all state
necessary to satisfy reads at or before the CT.

Consistent historical reads are useful for analytics queries and in particular
allow such queries to be carried out more efficiently and, with appropriate
configuration, away from foreground traffic. But historical reads are also key
to a proposal for [reference-like
tables](cockroachdb#26301) aimed at cutting
down on foreign key check latencies particularly in geo-distributed clusters;
they help recover a reasonably recent consistent snapshot of a cluster after a
loss of quorum; and they are one of the ingredients for [Change Data
Capture](cockroachdb#25229).

Release note: None
@tbg
Member

tbg commented Jun 5, 2018

Hmm, that's not a bad idea, though it requires guarantees that are basically true all the time but not necessarily always. If a replica set is fully replicated once, can there be some scenario in which it becomes non-fully replicated later? Yes, that could happen if the replica set is in the middle of a replica change (and a preemptive snapshot that wasn't the most recent data is sent). The newly added replica could need a second to catch up. And even if there isn't a replica change in progress, could there be an old preemptive snapshot somewhere else that gets reused in a later replica change?

@nvanbenschoten
Member

Couldn't we get around that, though, by simply using the intermediate immutable-must-read-transactional state to wait until all replicas are up-to-date before changing to the mutable state and allowing inconsistent reads to start?

Yeah, I don't think this is insurmountable, but it would require facilities that aren't currently provided by our storage layer. First, we'd need to define what all replicas being "up-to-date" means. How would we track this? We don't have a good hook here. We could probably wait for all ranges to quiesce as an indication of all replicas being up to date, but this relies on breaking the abstraction that only a single table will ever send load to a given range. We'd then also need the second facility that @tschottdorf alluded to - a guarantee that once all replicas in a range have reached a certain log entry, no replica in that range will ever regress. I wasn't aware of cases that break this guarantee, but it's not one we explicitly make, and it sounds like that's because it doesn't hold in all cases.

The nice thing about using follower reads is that it provides both of these facilities while maintaining the abstraction of only thinking about the time domain in SQL and ignoring all replication concerns.

@justinj
Contributor

justinj commented Jun 5, 2018

🔬🐶 Would it be completely outlandish to treat these tables as schema elements rather than as data? I know schema changes aren't transactional, but I think this could be made to work using the usual sort of schema change state machine (obviously with very poor mutation performance), and there are already schema change mechanisms for maintaining consistency between replicas that could possibly be used for this.

tbg added a commit to tbg/cockroach that referenced this issue Jun 28, 2018
craig bot pushed a commit that referenced this issue Jul 18, 2018
26362: RFC: follower reads r=bdarnell,nvanbenschoten a=tschottdorf

NB: this is extracted from #21056; please don't add new commentary on the
tech note there.


27699: storage: fix stopper race in compactor r=petermattis a=tschottdorf

Starting workers without a surrounding task is unfortunately often not
the right thing to do when the worker accesses other state that might
become invalidated once the stopper begins to stop. In this particular
case, the compactor might end up accessing the engine even though it
had already been closed.

I wasn't able to repro this failure in the first place, but pretty sure
this:
Fixes #27232.

Release note: None

27704: issues: fix email fallback r=petermattis a=tschottdorf

This was not my email address.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg tbg added this to the 2.2 milestone Jul 22, 2018
@tbg tbg added A-coreperf and removed A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. A-kv-distribution Relating to rebalancing and leasing. A-kv-client Relating to the KV client and the KV interface. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 31, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@jordanlewis
Member Author

Coming back and looking at this again in anticipation of 2.2 work, I'm not sure how much extra performance this proposal would gain over naive follower reads. @tschottdorf care to comment? Would the follower reads implementation we're initially targeting already get us a good chunk of the possible gains here?

@bdarnell
Contributor

bdarnell commented Oct 8, 2018

They're fairly different. Follower reads (as currently planned) only take effect for clients that opt in to potentially-stale data. This makes them kind of useless for foreign key checks, which require up-to-date information. Immutable tables optimize for reads by placing the entire burden on the (infrequent) writers.

Instead of follower reads, immutable tables compete with quorum leases. If we supported quorum leases with a quorum equal to the entire group size, then we'd have what we want here: each member would be able to serve consistent reads, but writes would have to talk to 100% of the replicas to commit (or wait for a timeout to expire the lease of any non-responsive node).

@andy-kimball
Contributor

I think we should support follower reads for scenarios where clients opt in using AS OF SYSTEM TIME, but then implement something like #31955 for cases where consistency is required (or do quorum leases). The trouble with immutability is that it works well until the table needs to be updated. Even if that happened only infrequently, it could be quite disruptive to any running workloads, since the DBA would need to make the table mutable, make the update/insert/delete, then make the table immutable again. During that period of time, I'd expect to see latency spikes, since other transactions would be much more likely to block or retry.

@awoods187
Contributor

Closing this in favor of #31955


10 participants