
opt: Support consistent, low-latency reads anywhere in the world, by making the optimizer latency-aware #31955

Closed
andy-kimball opened this issue Oct 28, 2018 · 16 comments
Assignees
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@andy-kimball
Contributor

This idea was suggested by @drewdeally, who is already using it to address real customer scenarios. However, it currently requires application changes and manual effort. A much better experience could be made possible by adding optimizer support.

App Scenario
Most applications have write-rarely, read-mostly tables that contain "reference data". Reference data is often needed globally, and yet must have low read latency, as it is frequently joined with other tables that are more dynamic. Examples of reference data are a zip code table, a product catalog table, or a holiday calendar table.

Problem
In CockroachDB, a table can be replicated any number of times. However, for any given range of data within the table, only one node (the leaseholder) can serve consistent data to readers. This presents a difficult problem in geo-distributed scenarios; while nodes located near the leaseholder have low read latency, nodes further away can have latencies >100ms (e.g. US to Asia latency). This means that if a query joins to a reference table with a remote leaseholder, the query's latency can never be less than the RTT to that leaseholder.

Solution
CockroachDB already allows table indexes to be pinned to different geo localities (e.g. Europe, US). Therefore, multiple identical indexes can be defined on the same table, where each index is pinned to a different locality.

CREATE INDEX us_idx ON products (id, name);
CREATE INDEX eu_idx ON products (id, name);
CREATE INDEX cn_idx ON products (id, name);

Then, each index is pinned to nodes in the corresponding locality, using replication zone configuration (requires Enterprise license to create per-index zones).
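As a sketch, the per-index pinning might look like the following (region names are assumptions, and the exact statement has varied across versions; in 2.2-era builds the equivalent was spelled `ALTER INDEX ... EXPERIMENTAL CONFIGURE ZONE`):

```sql
-- Illustrative only: constrain each duplicate index's replicas to one
-- region, and prefer holding its lease there. Region names are made up.
ALTER INDEX products@us_idx CONFIGURE ZONE USING
    constraints = '[+region=us-east1]',
    lease_preferences = '[[+region=us-east1]]';

ALTER INDEX products@eu_idx CONFIGURE ZONE USING
    constraints = '[+region=europe-west1]',
    lease_preferences = '[[+region=europe-west1]]';

ALTER INDEX products@cn_idx CONFIGURE ZONE USING
    constraints = '[+region=asia-east1]',
    lease_preferences = '[[+region=asia-east1]]';
```

The lease preference is what matters for read latency: replicas alone don't help until the leaseholder, which serves consistent reads, also lands in the region.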

Now queries that are run in the US can join to products@us_idx, while queries run in Europe can join to products@eu_idx. Both queries experience local latencies. Furthermore, the reads are consistent, meaning that they always see the latest updates to the products table.
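Concretely, the manual workaround relies on CockroachDB's `table@index` hint syntax; a hypothetical `orders` table joining to the reference data might be queried like this, with a different statement hard-coded per regional deployment:

```sql
-- US deployment of the application (orders is a hypothetical table):
SELECT o.id, p.name
FROM orders AS o
JOIN products@us_idx AS p ON p.id = o.product_id;

-- The same logical query, as the EU deployment must write it:
SELECT o.id, p.name
FROM orders AS o
JOIN products@eu_idx AS p ON p.id = o.product_id;
```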

The tradeoff for the low-latency reads is additional secondary indexes that must be stored and updated when writes occur.

Optimizer Support
The manual version of this capability requires the application to explicitly specify the right index to join against. This requires the application to know which region it's running in, and then to generate a different version of the query based on that.

The optimizer could be upgraded to incorporate latencies into its cost model. When multiple identical indexes are available, the optimizer would select the index with the nearest leaseholder. The same query could then use a different index depending on which gateway node received it. No changes to the application would be required; it can remain locality-agnostic.
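With such a cost model, the application could ship a single locality-agnostic statement and leave the index choice to the optimizer (the `orders` table here is a hypothetical example):

```sql
-- One query text for every region; the optimizer would pick us_idx,
-- eu_idx, or cn_idx based on which index's leaseholder is nearest
-- the gateway node that received the query.
SELECT o.id, p.name
FROM orders AS o
JOIN products AS p ON p.id = o.product_id;
```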

CC: @petermattis, @RaduBerinde

@andy-kimball andy-kimball added the A-sql-optimizer SQL logical planning and optimizations. label Oct 28, 2018
@andy-kimball andy-kimball self-assigned this Oct 28, 2018
@andy-kimball
Contributor Author

CC: @awoods187

@tbg
Member

tbg commented Oct 28, 2018

Related: #26301. I also thought the one-index-per-zone idea had been discussed there, but now I don't see it.

@drewdeally

drewdeally commented Oct 28, 2018 via email

@petermattis petermattis added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Oct 28, 2018
@awoods187
Contributor

@jordanlewis can you weigh in here in the context of #26301 ? @andy-kimball we've put #26301 on the SQL execution roadmap for 2.2. Do you think we need both?

@andreimatei
Contributor

It seems to me that the "multiple indexes" idea is just papering over the lack of quorum reads (or of some more flexible read-lease system where multiple nodes can serve reads on a single index). Doesn't it seem that we really ought to be solving this problem at a lower, infrastructure level rather than have SQL work around it?

@andy-kimball
Contributor Author

@andreimatei, there are a couple of difficult problems that come to mind:

  1. Adding quorum reads is a major project that I assume would require big modifications to Raft (or discarding Raft completely if we went towards more of an AcidLib approach). I assume we'd need multiple releases to make this a reality.
  2. Even if we did #1, we'd still be left with the difficult problem of trying to make it usable for our customers. We already have some very complex abstractions and configurations they need to understand (zones, partitions, constraints, etc.) in order to set up a geo-distributed config. Adding more configuration possibilities at the replication level would require us either to make this even more complex by adding new abstractions, or to somehow modify the existing abstractions to incorporate the new info.

Also, the question here is not whether to use multiple indexes in this way; they're already being used this way in the field. The question is whether we recognize what's already happening, and take steps to make it easier. Furthermore, I think we eventually want to make the optimizer latency-aware independent of this issue, so I don't believe that would be wasted work, even if we implemented quorum reads.

Having recently done some experimentation with the geo-distributed features in CockroachDB, I think we're in danger of reaching the "abstraction breaking point", where the abstractions have become complex enough that the average developer can't understand how to effectively use them. If we reach that breaking point, only big companies with specially trained DBAs or specialized consulting teams can put together solutions. This is another reason I don't like "immutable tables", because it requires adding yet another abstraction on top of an already complex system. Instead, we should more fully exploit the surface area of the abstractions we already have.

@andreimatei
Contributor

I hear some of the things you're saying, and it might come down to a judgment call, but the more I think about it, the more I personally lean towards saying that we'd be well advised to address this at the KV level. I think that would result in a more usable, understandable, and easy-to-describe/model product.

Some responses:

Even if we did #1, we'd still be left with the difficult problem of trying to make it useable for our customers. We already have some very complex abstractions and configurations they need to understand ...

I see this as an argument against starting to push people on a large scale towards multiple indexes like you're proposing. Yes, this proposal doesn't technically introduce any new concepts, but it sure does introduce a bunch of complexity for users. Multiple indexes with the same data? That's a mind-blowing concept. So I need to create the indexes, then muck with some new zone configs, then somehow ensure that all the pieces of the puzzle (index, zone config, optimizer using the right index) fall into place. And then if one of my regions goes away, what then? I need to remember to delete an index, do something with the zone configs, etc. To make this kind of stuff semi-seamless, I think we'd need some more serious integration of the feature into SQL (e.g. find a way to tie all the indexes and their zone configs into a table zone config, or something).

Contrast with having a KV-level feature that, one way or another, allows multiple nodes to serve reads for a range. You might call this a "new abstraction", but I'd call it a natural extension of the existing model, where people currently need to understand that one node is special. Multiple nodes serving reads makes the model simpler, not more complicated. One already needs to understand which nodes are involved in a write; there's not much new stuff to understand.

Also, the question here is not whether to use multiple indexes in this way; they're already being used this way in the field. The question is whether we recognize what's already happening, and take steps to make it easier. Furthermore, I think we eventually want to make the optimizer latency-aware independent of this issue, so I don't believe that would be wasted work, even if we implemented quorum reads.

Agreed. If the only thing being discussed here is whether the optimizer should understand index data placement, then I definitely wouldn't argue against that. But if that's all there is to this issue, then we might want to rename it more narrowly :P

This is another reason I don't like "immutable tables", because it requires adding yet another abstraction on top of an already complex system

This I tend to agree with, at least w.r.t. the form of that proposal that I remember.

@andy-kimball andy-kimball changed the title opt: Support consistent, low-latency reads anywhere in the world opt: Support consistent, low-latency reads anywhere in the world, by making the optimizer latency-aware Oct 30, 2018
@andy-kimball
Contributor Author

As you know, I've always been an advocate of enabling multiple read replicas at the consensus level. I agree that we can overcome the challenges I list (significant changes to core and reconciliation with existing abstractions). However, I also believe this will be a major effort that requires multiple releases. Do you agree or disagree with that?

I don't believe there needs to be an "either-or" choice here, as you seem to be arguing. Longer-term, we should support multiple read replicas at the consensus level. But I think it's a mistake not to simultaneously look at the shorter term, specifically at how the product is already being used and how we might better support that usage at a much lower development cost than the longer-term approach would require. Furthermore, if this alternate approach allowed us to avoid a problematic stop-gap feature (immutable tables), that makes an even stronger argument that we should consider it.

I also modified the title of the issue to make it more clear that this issue is about adding latency awareness in the optimizer.

@andreimatei
Contributor

However, I also believe this will be a major effort that requires multiple releases. Do you agree or disagree with that?

I don't know, to be honest; I haven't thought about it enough to have a good opinion. I've now been thinking about it for 3 minutes and I don't know exactly how I would do it, so it certainly doesn't appear to be the easiest thing. "Multiple releases" still sounds high to me, though.

I don't believe there needs to be an "either-or" choice here, as you seem to be arguing.

I think we're not exactly in disagreement. If the only thing being discussed is whether the optimizer should be aware of index locality, then that's unequivocally a good thing. If, on the other hand, we were to do more to make this multiple-indexes trick a first-class citizen (e.g. adding dedicated support in index-creation syntax, in the table zone-config language, or even perhaps in certain types of documentation), there'd probably be a line after which I, for one, would get off the bus.

Furthermore, if this alternate approach allowed us to avoid a problematic stop-gap feature (immutable tables), that makes an even stronger argument we should consider it.

Well, here I start feeling uneasy; yours is a political, not a technical, statement, and it suggests moving these indexes into "first class citizen" territory. The particular problem of small, read-mostly, non-partitionable tables used as FKs by partitioned tables is, I think, fairly widespread and acute, and it deserves a documented solution in the relatively short term. I'm not convinced the solution we advertise for distributing these tables should be the definition of multiple indexes on them. I also don't particularly like the proposal in #26301. I have my own controversial idea on the topic, but I'll discuss it separately.

@andreimatei
Contributor

andreimatei commented Nov 2, 2018

TLDR: I've done a 180 and now love this idea. Except that I think the indexes in question should be seen as an implementation detail and hidden from the user. Read requirements should be expressed in a table's zone config, and indexes over all the columns created automatically (and hidden).

However, I also believe this will be a major effort that requires multiple releases. Do you agree or disagree with that?

After talking about it a little over dinner, I now think that this would indeed be a bigger deal than I was inclined to believe before. I now think that Raft, with its majority vote focus, is fundamentally unsuitable for what we'd need.


So then I did some soul-searching to figure out what exactly about this present proposal rubs me the wrong way. The root of the problem, as I see it, is that multiple indexes containing the same data is not something one should have to create. An index, as a first-class SQL concept, is defined by its data. How can one have multiple indexes with the same data? That can't be what one wants. Plus a million other problems fall out of this awkward construct: what happens when the cluster's localities change? What happens when one wants to add a column to a table whose indexes are supposed to contain all the columns?

But then I learned to stop worrying and love this idea, but with a twist. Hear me out.
The focus on indexes, from a user perspective, is wrong, I believe. When talking about reference tables, what does the user want? She wants a non-partitioned table on which reads can be served by every locality in the cluster. Nothing comes for free, so something needs to be traded in: slower/less available writes. Generally the user doesn't want any indexes on this table (other than the PK); this table is referenced by FKs, not queried directly.

Unfortunately our KV module doesn't provide such a thing, and therein lies the tragedy. It does, however, provide, by God, distributed transactions. And, of course, this is what this proposal tries to exploit: it wants to use indexes as an awkward (in my opinion) way of distributing the writes to the table and replicating all the table's data.

As I now see it, the idea is awkward from the user's perspective, but not from the system's perspective. From a software engineering standpoint, the KV layer provides the abstraction that it does, with the limitations that we know. It thus forces the upper layers to work around them by explicitly writing the data in multiple places. And that's fine. But just because the upper layers of crdb need to work around the KV limitations doesn't mean that the user needs to work around them.

So let's take just one more step; a natural one. Instead of asking the user to define 10 identical indexes containing all the table's columns, how about we let the user express the distribution requirements in a natural way, and then we do all the work for them and keep the dirty laundry inside? That is, the user would express in a table's zone config the requirements for where reads should be served from, and crdb would automagically take advantage of our distributed transactions to write the data in multiple places. An easy way to do that without changing too much is by creating a bunch of indexes. These indexes would be generally hidden from the user (they wouldn't show up in SHOW CREATE TABLE, for example). Their purpose would be properly understood by crdb, and so they'd be automatically updated when the table's schema changes (e.g. a new column is added), when the table's zone config changes, or, if we're smart enough and make the zone configs expressive enough, even when nodes with new localities are added to the cluster.

So basically, before, I was expressing concerns that this proposal is a slippery slope and we might go too far. But now I think I see the light: it wasn't going far enough! Asking the user to create a bunch of identical indexes is the uncanny valley. If these indexes are seen just as an implementation detail of the goal of distributing writes, and so hidden from the user instead of in the user's face, then conceptually this makes perfect sense.

@andreimatei
Contributor

One thing that occurred to me is that, besides teaching the optimizer which index to use for select queries, we also need to teach the execution engine which index to use for FK checks on insert queries.

@petermattis
Collaborator

One thing that occurred to me is that, besides teaching the optimizer which index to use for select queries, we also need to teach the execution engine which index to use for FK checks on insert queries.

In 2.2, the optimizer will be planning DML statements, so this should fall out naturally (🤞).

@andreimatei
Contributor

I was discussing these issues with @knz and he brought up a thought that I've been having as well: we'd probably be well advised to move the SQL module away from considering a PK special in any way. Currently, the distinction between a table's PK and its indexes is pervasive in the SQL code. The PK has a different data layout than the indexes (e.g. it alone supports column families), and it alone is considered for FK checks and for things like index joins (unless this changed recently), but there's no particular reason for any of that other than path-dependent evolution.

We already have major problems because of this narrow-mindedness: we can't change a table's PK. The code just can't do it; the PK is too special.

I think we'd be well advised to move towards removing everything that makes a PK special; SQL should think of a table as a set of indexes, each with its own columns, data layout, replication, and leasing constraints.

@andy-kimball
Contributor Author

This is part of the 19.1 release, for both read-only and mutation statements. We still require the user to create identical indexes manually, so in a future release we should consider how to make this more intuitive/automatic. I'm closing this issue, since the original goal has been reached. See our docs for more information.

@andreimatei
Contributor

I think @RaduBerinde was just telling me that FK checks do not yet choose what index to use. No?

@RaduBerinde
Member

This was mostly about reads which don't involve any FK work.
