sql,storage: splits may split SQL rows #43094

tbg · 2019-12-10T19:56:36Z

SQL currently does not tolerate it when the column families of a row span more than one range. This restriction seems superficially reasonable: splitting a column family over a range is not generally desired, and so it makes sense to avoid it; that the SQL layer crashes if it happens seems less due to some deep restriction rather than implicit assumptions. Besides, we do rely on this invariant for some less interesting items such as counting the number of rows in a backup (or something like that, @dt knows the details).

Unfortunately, it is very hard to uphold this invariant since splitting is inherently a KV operation: it operates on an arbitrary roachpb.RKey which has no SQL schema information attached to it. SQL keys are not self-describing: one can not look at a key and decide whether it could possibly split a SQL row between two column families. Essentially this means that the caller to AdminSplit must promise that the split key is "safe", i.e. that in no possible world is or could there ever be a row whose column families fall on opposite sides of the split. It's hopefully clear that from an architectural point of view, this is already not a good idea, and predictably, it has brought us into a number of pickles, most recently in #43078.

It also becomes a real problem since the KV layer issues splits autonomously based on size and load.

For size-based splits, we do an awkward dance which mostly works out because we can orient ourselves around the keys that actually exist in the range, and (basically) make the assumption that they're all SQL row keys. Under this assumption, we can read the column family suffix and determine whether the split is safe.

For load-based splits, it's more complicated. The split point for load-based splits is determined by sampling from the spans of incoming requests. These keys are more often than not real SQL row keys because when SQL scans a full row, it does not construct the full key - it omits the column prefix. Concretely, in kv, rows are encoded on disk as /Table/53/1/<id>/0 but the SQL layer will read them via a scan starting at /Table/53/1/<id>.
The latest iteration on this layer cake of hacks calls EnsureSafeSplitKey and on errors assumes that the key is not between rows. This should mostly be true, but it's a very subtle invariant that, even if not violated in practice, emits a pungent smell that is hard to ignore. As far as KV is concerned, a client may as well issue a number of scans starting at /Table/53/1/<id>/0/somebogussuffix from which you can't decode the column family ID but which squarely splits the row for id in half (if it has more than one column family, something not true in KV).

See:

#42056 - general trove of information
#16344 - old attempt at cleaning this up
#42833 - first recent really broken attempt
#43078 - second less, but still broken, attempt

Since this area of code is active now, we should figure out how to fix the mess we got ourselves into.

A few options come to mind:

drop any guarantees that prevent splitting between column families. Rationale: as long as KV is in charge of splitting, it can't guarantee it without subtle and historically repeatedly violated invariants. Better rip off the band aid now.
put a higher-level system in charge of splitting. That system needs to know about SQL schemas etc and will only ever split at "safe" keys. Rationale: it's the sane way to get the invariant that is currently being assumed.
keep splitting in KV, but only ever split at existing keys. Rationale: it's a more reasonable assumption that if the SQL layer is using a range for table data, there can't be any "non-table data" interspersed with rows. Con: it's an awkward restriction that is sure to be hard to work with in practice, for example we can't split off empty tables any more.

Curious about other options. Of the ones listed here, in the short term, the first one seems like the only reasonable one.

Someone should also think through interleaved tables and whether they can add a new toxin to the mix. I hope not because (or so I assume) interleaved tables can never "partition" the columns of a parent row (?).

@nvanbenschoten could you bring this issue to the next KV weekly/generally drive it forward?

Jira issue: CRDB-5298

Jan 2024, combing related issue: #109857 into this one. It's text was:

EnsureSafeSplitKey, and the GetRowPrefixLength function that implements it, today typically insists that a key have what appears to be a valid column family ID as its suffix, so that it can decode its length then use that to truncate the family ID from the row key and return it. It has special cases for keys that consist of only a table ID, or only a table and index ID, which it also considers to be have a valid "RowPrefixLength" of their entire length.

However consider a table with a primary key like (region string, id int). A key like /Table/101/1/"us-east" is very much a valid split key for this table -- probably a very good one at that -- but it does not end in a valid column family ID+length, but also does not consist of only a table ID or table and index ID, so it causes EnsureSafeSplitKey to return an error.

Should EnsureSafeSplitKey instead assume that any key which does not appear to end in a column family ID is a safe split key?

Doing so would more correctly handle prefixes like this, particularly when handling compound primary keys (though like the current implementation, it would still suffer from cases of mistaken identity where an indexed integer value is mistaken for a column family ID). However, if someone were to append bytes to a column family key - by calling Next or PrefixEnd -- the resulting key, which is not a safe split keys it still falls between two column family keys -- would no longer appear to be a column family key, and thus would be erroneously considered to be a safe split key. Thus switching the default assumption of EnsureSafeSplitKey, to be that a key is a safe split key unless it appears to end with a column family ID, would require being sure that they keys passed to it are not derived by calling Next or PrefixEnd on real keys.

A more ambitious option might be to fix the underlying problem of being unable to tell a when a key contains a column family or not, by switching the column family ID to be encoded with a unique wire tag instead of Int. However unless all KVs in all clusters were actively rewritten in a migration to use the new tag, we would still need to fall back to the current guessing approach, so this may not be worth the churn.

The text was updated successfully, but these errors were encountered:

jordanlewis · 2019-12-10T20:13:10Z

What are the concrete consequences of accidentally splitting rows across ranges? Can we work to prevent those consequences instead?

Given that by definition it'll be rare to have a row split across two ranges, is it really that big of a deal to lift this restriction? From SQL execution's perspective, Ranges are completely abstracted away - we do all of our operations via interfaces that concatenate data from multiple ranges as you'd expect.

I think given how painful this is to fix, we ought to consider going in the other direction and loosening our requirements to not have split rows.

tbg · 2019-12-11T12:26:11Z

@jordanlewis what you're suggesting is option 1 and it's my favorite as well. All of the consequences live above KV.

See @dt's comment for bulk i/o and scroll down from there to see the one place we found where SQL makes assumptions (~ kvfetcher considers row to be done when hitting a range boundary)

jordanlewis · 2019-12-16T15:28:03Z

Ah I see. I think we could probably fix this then...

lunevalex · 2020-08-10T17:15:57Z

@tbg @jordanlewis now that we increased the range size to 512MB, do we still want to consider doing this at some point in the future?

ajwerner · 2020-08-10T20:29:17Z

Another proposal that isn't in the list is to propagate information to KV about valid split points within key ranges. As we look towards 21.1 we are looking towards revamping the ways in which KV is informed about zone configs and split points. Today we propagate the entire system config which is not scalable and poorly engineered. Perhaps it wouldn't be so onerous for the system config (or something around it) to indicate a boolean function of a key to determine whether it forms a valid split point.

tbg · 2020-08-18T10:02:03Z

Yes, there's still a problem here. It's just that the duct tape is working at the moment.

tbg · 2023-05-17T16:48:20Z

We now hit this in production again: https://github.com/cockroachlabs/support/issues/2302

Previously, the load based range splitter could suggest split keys which were in-between SQL rows. This violated assumptions made in SQL code, which require that rows are never split across ranges. This patch updates the unweighted split finder to only retain safe sample keys. The weighted split finder already does this (e4f003b). Note that the weighted split finder is used by default (>=23.1), whilst the unweighted split finder is used when `kv.allocator.load_based_rebalancing.objective` is set to `qps` (default `cpu`). Informs: cockroachdb#43094 Fixes: cockroachdb#103483 Release note: None

michae2 · 2023-12-21T17:27:57Z

Can we close this, given that #103483 was fixed?

dt · 2024-01-02T22:48:46Z

I think this is still relevant / open.

We've seen a new manifestation of this bug in almost every release I can remember, either in restore, load-based-splits, schema changes, etc, since the underlying issue -- that a kv split can be created between KVs of a row -- still exists. I'd argue that #103483 was closed since we fixed an instance of the bug, but the underlying bug in the design persists and I think this issue is the closest we have to tracking that?

#109857 we could probably fold into this.

dt · 2024-01-02T22:51:23Z

I wonder if we should fold #112702 into this too.

tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 10, 2019

tbg assigned nvanbenschoten Dec 10, 2019

solongordon mentioned this issue Jan 23, 2020

sql: fakeSpanResolver can "split" SQL rows across nodes #44323

Closed

lunevalex added the A-kv-distribution Relating to rebalancing and leasing. label Aug 10, 2020

jlinder added T-sql-queries SQL Queries Team T-kv KV Team labels Jun 16, 2021

nvanbenschoten removed their assignment May 17, 2023

irfansharif added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label May 17, 2023

kvoli mentioned this issue May 18, 2023

split: [DNM] only retain safe split keys #103565

Closed

michae2 mentioned this issue Dec 4, 2023

keys: EnsureSafeSplitKey is too strict #109857

Closed

michae2 added this to SQL Queries Dec 4, 2023

github-project-automation bot moved this to Triage in SQL Queries Dec 4, 2023

michae2 moved this from Triage to New Backlog in SQL Queries Dec 4, 2023

michae2 added the P-3 Issues/test failures with no fix SLA label Dec 21, 2023

exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Mar 15, 2024

yuzefovich added the T-sql-queries SQL Queries Team label May 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql,storage: splits may split SQL rows #43094

sql,storage: splits may split SQL rows #43094

tbg commented Dec 10, 2019 •

edited by dt

Loading

jordanlewis commented Dec 10, 2019

tbg commented Dec 11, 2019

jordanlewis commented Dec 16, 2019

lunevalex commented Aug 10, 2020

ajwerner commented Aug 10, 2020

tbg commented Aug 18, 2020

tbg commented May 17, 2023 •

edited by dt

Loading

michae2 commented Dec 21, 2023

dt commented Jan 2, 2024

dt commented Jan 2, 2024

sql,storage: splits may split SQL rows #43094

sql,storage: splits may split SQL rows #43094

Comments

tbg commented Dec 10, 2019 • edited by dt Loading

jordanlewis commented Dec 10, 2019

tbg commented Dec 11, 2019

jordanlewis commented Dec 16, 2019

lunevalex commented Aug 10, 2020

ajwerner commented Aug 10, 2020

tbg commented Aug 18, 2020

tbg commented May 17, 2023 • edited by dt Loading

michae2 commented Dec 21, 2023

dt commented Jan 2, 2024

dt commented Jan 2, 2024

tbg commented Dec 10, 2019 •

edited by dt

Loading

tbg commented May 17, 2023 •

edited by dt

Loading