sql: implement immutable tables that automatically use follower reads #26301
Comments
The immutable table may run the GC queue, which could make the historical timestamps unsafe. This is just a silly comment to make sure we're using a "relatively current" follower read timestamp, and not just some fixed past date. One remaining problem is that when an immutable table needs to be updated, performance will fall off a cliff. But at least the operators get to choose when they do this, so hopefully it's workable. It also seems really hard to avoid that problem.
I don't think this is true. Inconsistent reads provide no staleness bounds whatsoever, so it doesn't follow that if a table is immutable then all of its replicas are up-to-date. Follower reads, however, provide a strict staleness bound and will perform the more expensive leaseholder read if the local follower is not sufficiently up-to-date.
Performance will certainly drop, but it will be no worse than the case today when partitioned tables have foreign keys pointing at non-partitioned tables. Also, only latencies should increase in this scenario. It shouldn't create a throughput bottleneck that could suddenly topple a cluster. This is a much better situation to be in than the opposite, where changing an immutable table to mutable implies performing significantly more work and limiting available throughput.
Updated the description to take this into account - thanks.
Additionally, data from immutable tables can be cached aggressively, and the lookups themselves can be batched across transactions. I'm concerned about the sharp performance difference of making an immutable table mutable. How can an admin tell whether they have the headroom to make the change?

It's worth considering other restrictions that could be imposed on a table descriptor that would allow foreign key-style reads to be sped up without needing to reverse those rules temporarily to make a change. For example, a "primary keys are append-only" rule would disallow deletions and changes to PK columns, and allow the existence of a PK to be cached and read non-transactionally for FK checks. If the local read says the FK doesn't exist, you could either fall back to a remote read or just fail the request (perhaps an option on the FK constraint). Failing the request would mean that it takes some time after the reference table is updated before it can be used, but that would generally be OK and wouldn't allow data that violates the FK to be inserted.

The more precise the restrictions, the trickier it gets to use them for non-FK purposes (so maybe an "insert-only" flag would be better than one limited to PKs). But in any case, as an admin I'd be more comfortable using flags like "insert-only" instead of "usually immutable, but can be made mutable at high performance cost and risk".
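To make that idea concrete, here is a minimal Go sketch of a cached FK check under an append-only-PK rule. Everything in it is hypothetical and invented for illustration - the `pkCache` type, the `leaseholderExists` callback, and the `failFast` knob are not existing CockroachDB APIs:

```go
package fkcheck

import "sync"

// pkCache remembers primary keys that are known to exist. Under an
// "append-only primary keys" rule, a cached entry can never become
// stale: PKs are never deleted or modified, so existence is sticky.
type pkCache struct {
	mu   sync.RWMutex
	keys map[string]struct{}
}

func newPKCache() *pkCache {
	return &pkCache{keys: make(map[string]struct{})}
}

func (c *pkCache) has(key string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, ok := c.keys[key]
	return ok
}

func (c *pkCache) add(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.keys[key] = struct{}{}
}

// checkFK validates a foreign key reference. A cache hit is safe to
// trust non-transactionally. On a miss we either fall back to a
// (slow) leaseholder read, or fail fast if the FK constraint was
// configured to reject references not yet visible locally.
func (c *pkCache) checkFK(key string, failFast bool,
	leaseholderExists func(string) (bool, error)) (bool, error) {
	if c.has(key) {
		return true, nil
	}
	if failFast {
		return false, nil // treat as a violation; the client may retry later
	}
	ok, err := leaseholderExists(key)
	if err != nil {
		return false, err
	}
	if ok {
		c.add(key)
	}
	return ok, nil
}
```

The property that makes the cache safe is exactly the append-only rule: an existence fact, once observed, can never be invalidated.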
An alternative to the immutable-to-mutable transition:
Then we'd have a background job which creates an entire new table from the combination of all 3. When the background job completes, we replace the "smart view" with the new mutable descriptor.
I like this proposal, especially with Ben's twist about making these tables insert-only.
Falling back to the remote read sounds like the better option if we get to aggressively cache the result. @knz what you're proposing seems to add a lot of complexity in SQL-land. And don't you still have the problem that you can't guarantee that a deletion doesn't mess up the internal consistency?
Couldn't we get around that, though, by simply using the intermediate `immutable-must-read-transactional` state?
Follower reads are consistent reads at historical timestamps from follower replicas. They make the non-leader replicas in a range suitable sources for historical reads. The key enabling technology is the propagation of **closed timestamp heartbeats** from the range leaseholder to followers. The closed timestamp heartbeat (CT heartbeat) is more than just a timestamp. It is a set of conditions that, if met, guarantee that a follower replica has all state necessary to satisfy reads at or before the CT.

Consistent historical reads are useful for analytics queries and in particular allow such queries to be carried out more efficiently and, with appropriate configuration, away from foreground traffic. But historical reads are also key to a proposal for [reference-like tables](cockroachdb#26301) aimed at cutting down on foreign key check latencies, particularly in geo-distributed clusters; they help recover a reasonably recent consistent snapshot of a cluster after a loss of quorum; and they are one of the ingredients for [Change Data Capture](cockroachdb#25229).

Release note: None
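To make the "set of conditions" concrete: a follower deciding whether it can serve a read might check something like the following. This is a hedged sketch based only on the RFC summary above; the types and field names (`closedTS`, `epoch`, `minAppliedIdx`) are illustrative stand-ins, not CockroachDB's actual structures:

```go
package ctpolicy

// ctHeartbeat is a simplified stand-in for the closed timestamp
// heartbeat described in the RFC: a timestamp plus the conditions
// under which a follower may trust it.
type ctHeartbeat struct {
	closedTS      int64  // no new writes will commit at or below this time
	epoch         int64  // leaseholder epoch the heartbeat is valid for
	minAppliedIdx uint64 // log index the follower must have applied
}

// canServeFollowerRead reports whether a follower replica may serve a
// consistent read at readTS: the read must be at or below the closed
// timestamp, the heartbeat must come from the current lease epoch,
// and the follower must have caught up past the required log index.
// If any condition fails, the read is routed to the leaseholder.
func canServeFollowerRead(hb ctHeartbeat, readTS int64,
	replicaEpoch int64, appliedIdx uint64) bool {
	return readTS <= hb.closedTS &&
		replicaEpoch == hb.epoch &&
		appliedIdx >= hb.minAppliedIdx
}
```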
Hmm, that's not a bad idea, though it requires guarantees that hold almost all of the time, but not always. If a replica set is fully replicated once, can there be some scenario in which it becomes non-fully replicated later? Yes, that could happen if the replica set is in the middle of a replica change (and a preemptive snapshot that wasn't the most recent data is sent). The newly added replica could need a second to catch up. And even if there isn't a replica change in progress, could there be an old preemptive snapshot somewhere else that gets reused in a later replica change?
Yeah, I don't think this is insurmountable, but it would require facilities that aren't currently provided by our storage layer. First, we'd need to define what all replicas being "up-to-date" means. How would we track this? We don't have a good hook here. We could probably wait for all Ranges to quiesce as an indication of all replicas being up to date, but this relies on the assumption that only a single table will ever send load to a given range. We'd then also need the second facility that @tschottdorf alluded to - a guarantee that once all replicas in a range have reached a certain log entry, no replica in that range will ever regress. I wasn't aware of cases that break this guarantee, but it's not one we explicitly make, and it sounds like that's because it doesn't hold in all cases. The nice thing about using follower reads is that it provides both of these facilities while maintaining the abstraction of only thinking about the time domain in SQL and ignoring all replication concerns.
🔬🐶 Would it be completely outlandish to treat these tables as schema elements rather than as data? I know schema changes aren't transactional, but I think this could be made to work using the usual sort of schema change state machine (obviously with very poor mutation performance), and there are already schema change mechanisms for maintaining consistency between replicas that could possibly be used for this.
26362: RFC: follower reads r=bdarnell,nvanbenschoten a=tschottdorf NB: this is extracted from #21056; please don't add new commentary on the tech note there.
Coming back and looking at this again in anticipation of 2.2 work, I'm not sure how much extra performance this proposal would gain over naive follower reads. @tschottdorf care to comment? Would the follower reads implementation we're initially targeting already get us a good chunk of the possible gains here?
They're fairly different. Follower reads (as currently planned) only take effect for clients that opt in to potentially-stale data. This makes them kind of useless for foreign key checks, which require up-to-date information. Immutable tables optimize for reads by placing the entire burden on the (infrequent) writers. Instead of follower reads, immutable tables compete with quorum leases. If we supported quorum leases with a quorum equal to the entire group size, then we'd have what we want here: each member would be able to serve consistent reads, but writes would have to talk to 100% of the replicas to commit (or wait for a timeout to expire the lease of any non-responsive node).
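A rough illustration of that tradeoff, under the assumption (from the quorum leases literature) that a write cannot commit until every lease holder has acknowledged it, in addition to reaching a Raft majority. The helper names are invented, not CockroachDB code:

```go
package quorumlease

// writeAckQuorum is the number of replicas a write must reach under
// quorum leases: a Raft majority, or all lease holders, whichever is
// larger.
func writeAckQuorum(n, leaseHolders int) int {
	majority := n/2 + 1
	if leaseHolders > majority {
		return leaseHolders
	}
	return majority
}

// consistentReadSources is the number of replicas that can serve a
// consistent read locally: exactly the lease holders.
func consistentReadSources(leaseHolders int) int { return leaseHolders }

// Example: with n=5 and leaseHolders=5 (the "quorum equal to the
// entire group size" case above), every replica serves consistent
// reads, but writeAckQuorum(5, 5) == 5 - writes must talk to 100% of
// the replicas, or wait for an unresponsive node's lease to expire.
```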
I think we should support follower reads for scenarios where clients opt in using `AS OF SYSTEM TIME`.
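For example, an opted-in client might look like the following; the connection string and table name are illustrative, and the `'-10s'` offset asks for data at least ten seconds stale, which is what makes the query eligible to be served by a nearby follower instead of the (possibly remote) leaseholder:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// AS OF SYSTEM TIME opts this query in to reading slightly stale data.
	rows, err := db.Query(
		`SELECT i_id, i_name FROM item AS OF SYSTEM TIME '-10s' LIMIT 5`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int
		var name string
		if err := rows.Scan(&id, &name); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, name)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```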
Closing this in favor of #31955
Background
It's very common in real-world database schemas to have reference tables - infrequently-updated tables, often relatively small, that are used as foreign key tables in applications. Some examples:

- the `items` table from `tpcc`; items in `tpcc` are never updated

These kinds of tables present some issues for CockroachDB. In CockroachDB, doing a read from a table requires going to the table's leaseholder to make sure that the reads are consistent. In an application that plays to CockroachDB's strengths, with many globally distributed datacenters, reference tables like this will cause poor performance in several situations.
This may be unacceptable for customers who desire low-latency reads against reference tables and are willing to sacrifice write performance against those tables.
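To make the setup concrete, here is an illustrative schema (table and column names invented for this sketch) showing the pattern: every insert into the referencing table triggers a consistent FK read against the reference table's leaseholder, wherever it happens to live:

```go
package refschema

// schema shows the kind of layout this issue is about: a small,
// rarely-updated reference table that other tables point at via
// foreign keys. Today, each FK check on an insert into `orders`
// requires a consistent read of `currencies`, served by that
// table's leaseholder - possibly in a remote datacenter.
const schema = `
CREATE TABLE currencies (
    code STRING PRIMARY KEY,
    name STRING NOT NULL
);

CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    amount DECIMAL NOT NULL,
    currency STRING NOT NULL REFERENCES currencies (code)
);`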
We can support this in CockroachDB by providing users the ability to mark a table as immutable, and using that information to provide fast, transactional reads from any replica.
Proposed solution for CockroachDB: Immutable tables
Immutable tables are tables that cannot be modified. Tables can be set as immutable, and reset back to mutable, by performing a schema change with an ordinary `ALTER TABLE` command.

When a table is marked as immutable, it's possible to drastically improve read performance from it by altering the read semantics. There are (at least) two ways of doing this:

1. Serve reads from the nearest replica as non-transactional, inconsistent reads. That's safe because we know that the table is immutable, implying that all of its replicas are up-to-date.
2. Serve reads at a recent historical timestamp via follower reads.

The first solution is harder to implement, because of the semantics of leaving the immutable state - we'd have to invent a way to wait until data has fully replicated to all replicas. It also might be unsafe altogether - see @nvanbenschoten's comment on this issue.

The second solution is easier, and should be possible as soon as follower reads are available, using the schema change infrastructure and an intermediate state similar to the WRITE_ONLY state used for index backfills. The rest of this issue proposes to use the second solution.
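A minimal sketch of solution 2's effect on the read path, assuming a fixed cluster-wide safe interval (the constant here is a placeholder for a cluster setting, not a real one):

```go
package readpath

import "time"

// followerReadSafeInterval stands in for the "follower read safe
// past interval" referenced throughout this issue.
const followerReadSafeInterval = 30 * time.Second

// readTimestamp returns the timestamp at which reads of a table
// should be served. Reads of an immutable table are shifted into the
// past so they are eligible to be served by any follower; all other
// tables read at the current time as usual.
func readTimestamp(immutable bool, now time.Time) time.Time {
	if immutable {
		return now.Add(-followerReadSafeInterval)
	}
	return now
}
```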
Details on immutable/mutable states and state change process
Table descriptors will be extended with a new field, `mutability`, that has three possible values:

- `mutable`
  - Reads from a `mutable` table behave as normal.
  - Writes to a `mutable` table are permitted.
- `immutable-must-read-transactional`
  - Reads from an `immutable-must-read-transactional` table behave as normal.
  - Writes to an `immutable-must-read-transactional` table are not permitted.
- `immutable`
  - Reads from an `immutable` table are performed at a historical timestamp, `t` seconds in the past (where `t` is defined by the follower read safe past interval), and get served by a follower read.
  - Writes to an `immutable` table are not permitted.

By default, `mutability` is set to `mutable`. To transition a table from `mutable` to `immutable`, a user issues an `ALTER TABLE t SET IMMUTABLE` command (or similar - syntax to be decided later), which performs the following actions (both transitions are sketched in code after the lists):

1. Set the table descriptor's `mutability` to `immutable-must-read-transactional` and wait until the cluster converges on the new descriptor.
2. Wait `t` seconds to ensure that, once transitioned to `immutable`, all reads from `t` seconds ago occur before the last possible write to the reference table.
3. Set the table descriptor's `mutability` to `immutable`.

To transition a table from `immutable` to `mutable`, a user issues an `ALTER TABLE t SET MUTABLE` command, which performs the following actions:

1. Set the table descriptor's `mutability` to `immutable-must-read-transactional` and wait until the cluster converges on the new descriptor.
2. Set the table descriptor's `mutability` to `mutable`. (Note we don't have to wait `t` seconds here, since we're moving in a "safe" direction, causing all future reads to be ordinary transactional reads.)
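Putting the two transitions together, here is a hedged Go sketch of the proposed state machine. `tableDesc`, `waitForConvergence`, and the concrete value of `t` are stand-ins for the real schema-change machinery, not CockroachDB code:

```go
package mutability

import (
	"context"
	"time"
)

// Mutability mirrors the proposed descriptor field and its three values.
type Mutability int

const (
	Mutable Mutability = iota
	ImmutableMustReadTransactional
	Immutable
)

type tableDesc struct {
	mutability Mutability
}

// t is the follower read safe past interval from the proposal.
const t = 30 * time.Second

// waitForConvergence stands in for the schema-change machinery that
// publishes a new descriptor version and waits until every node has
// adopted it (as index backfills do with the WRITE_ONLY state).
func waitForConvergence(ctx context.Context, d *tableDesc) error { return nil }

// setImmutable implements ALTER TABLE ... SET IMMUTABLE.
func setImmutable(ctx context.Context, d *tableDesc) error {
	// 1. Block writes while reads stay transactional.
	d.mutability = ImmutableMustReadTransactional
	if err := waitForConvergence(ctx, d); err != nil {
		return err
	}
	// 2. Wait out the safe interval so that every historical read at
	// now()-t lands after the last possible write to the table.
	time.Sleep(t)
	// 3. Reads may now be served t seconds in the past by followers.
	d.mutability = Immutable
	return waitForConvergence(ctx, d)
}

// setMutable implements ALTER TABLE ... SET MUTABLE. No wait is
// needed: moving reads back to the present is the "safe" direction.
func setMutable(ctx context.Context, d *tableDesc) error {
	d.mutability = ImmutableMustReadTransactional
	if err := waitForConvergence(ctx, d); err != nil {
		return err
	}
	d.mutability = Mutable
	return waitForConvergence(ctx, d)
}
```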