Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql,storage: splits may split SQL rows #43094

Open
tbg opened this issue Dec 10, 2019 · 10 comments
Open

sql,storage: splits may split SQL rows #43094

tbg opened this issue Dec 10, 2019 · 10 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-kv KV Team T-sql-queries SQL Queries Team

Comments

@tbg
Copy link
Member

tbg commented Dec 10, 2019

SQL currently does not tolerate it when the column families of a row span more than one range. This restriction seems superficially reasonable: splitting a column family over a range is not generally desired, and so it makes sense to avoid it; that the SQL layer crashes if it happens seems less due to some deep restriction rather than implicit assumptions. Besides, we do rely on this invariant for some less interesting items such as counting the number of rows in a backup (or something like that, @dt knows the details).

Unfortunately, it is very hard to uphold this invariant since splitting is inherently a KV operation: it operates on an arbitrary roachpb.RKey which has no SQL schema information attached to it. SQL keys are not self-describing: one can not look at a key and decide whether it could possibly split a SQL row between two column families. Essentially this means that the caller to AdminSplit must promise that the split key is "safe", i.e. that in no possible world is or could there ever be a row whose column families fall on opposite sides of the split. It's hopefully clear that from an architectural point of view, this is already not a good idea, and predictably, it has brought us into a number of pickles, most recently in #43078.

It also becomes a real problem since the KV layer issues splits autonomously based on size and load.

For size-based splits, we do an awkward dance which mostly works out because we can orient ourselves around the keys that actually exist in the range, and (basically) make the assumption that they're all SQL row keys. Under this assumption, we can read the column family suffix and determine whether the split is safe.

For load-based splits, it's more complicated. The split point for load-based splits is determined by sampling from the spans of incoming requests. These keys are more often than not real SQL row keys because when SQL scans a full row, it does not construct the full key - it omits the column prefix. Concretely, in kv, rows are encoded on disk as /Table/53/1/<id>/0 but the SQL layer will read them via a scan starting at /Table/53/1/<id>.
The latest iteration on this layer cake of hacks calls EnsureSafeSplitKey and on errors assumes that the key is not between rows. This should mostly be true, but it's a very subtle invariant that, even if not violated in practice, emits a pungent smell that is hard to ignore. As far as KV is concerned, a client may as well issue a number of scans starting at /Table/53/1/<id>/0/somebogussuffix from which you can't decode the column family ID but which squarely splits the row for id in half (if it has more than one column family, something not true in KV).

See:

#42056 - general trove of information
#16344 - old attempt at cleaning this up
#42833 - first recent really broken attempt
#43078 - second less, but still broken, attempt

Since this area of code is active now, we should figure out how to fix the mess we got ourselves into.

A few options come to mind:

  • drop any guarantees that prevent splitting between column families. Rationale: as long as KV is in charge of splitting, it can't guarantee it without subtle and historically repeatedly violated invariants. Better rip off the band aid now.
  • put a higher-level system in charge of splitting. That system needs to know about SQL schemas etc and will only ever split at "safe" keys. Rationale: it's the sane way to get the invariant that is currently being assumed.
  • keep splitting in KV, but only ever split at existing keys. Rationale: it's a more reasonable assumption that if the SQL layer is using a range for table data, there can't be any "non-table data" interspersed with rows. Con: it's an awkward restriction that is sure to be hard to work with in practice, for example we can't split off empty tables any more.

Curious about other options. Of the ones listed here, in the short term, the first one seems like the only reasonable one.

Someone should also think through interleaved tables and whether they can add a new toxin to the mix. I hope not because (or so I assume) interleaved tables can never "partition" the columns of a parent row (?).

@nvanbenschoten could you bring this issue to the next KV weekly/generally drive it forward?

Jira issue: CRDB-5298

Jan 2024, combing related issue: #109857 into this one. It's text was:

EnsureSafeSplitKey, and the GetRowPrefixLength function that implements it, today typically insists that a key have what appears to be a valid column family ID as its suffix, so that it can decode its length then use that to truncate the family ID from the row key and return it. It has special cases for keys that consist of only a table ID, or only a table and index ID, which it also considers to be have a valid "RowPrefixLength" of their entire length.

However consider a table with a primary key like (region string, id int). A key like /Table/101/1/"us-east" is very much a valid split key for this table -- probably a very good one at that -- but it does not end in a valid column family ID+length, but also does not consist of only a table ID or table and index ID, so it causes EnsureSafeSplitKey to return an error.

Should EnsureSafeSplitKey instead assume that any key which does not appear to end in a column family ID is a safe split key?

Doing so would more correctly handle prefixes like this, particularly when handling compound primary keys (though like the current implementation, it would still suffer from cases of mistaken identity where an indexed integer value is mistaken for a column family ID). However, if someone were to append bytes to a column family key - by calling Next or PrefixEnd -- the resulting key, which is not a safe split keys it still falls between two column family keys -- would no longer appear to be a column family key, and thus would be erroneously considered to be a safe split key. Thus switching the default assumption of EnsureSafeSplitKey, to be that a key is a safe split key unless it appears to end with a column family ID, would require being sure that they keys passed to it are not derived by calling Next or PrefixEnd on real keys.

A more ambitious option might be to fix the underlying problem of being unable to tell a when a key contains a column family or not, by switching the column family ID to be encoded with a unique wire tag instead of Int. However unless all KVs in all clusters were actively rewritten in a migration to use the new tag, we would still need to fall back to the current guessing approach, so this may not be worth the churn.

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 10, 2019
@jordanlewis
Copy link
Member

What are the concrete consequences of accidentally splitting rows across ranges? Can we work to prevent those consequences instead?

Given that by definition it'll be rare to have a row split across two ranges, is it really that big of a deal to lift this restriction? From SQL execution's perspective, Ranges are completely abstracted away - we do all of our operations via interfaces that concatenate data from multiple ranges as you'd expect.

I think given how painful this is to fix, we ought to consider going in the other direction and loosening our requirements to not have split rows.

@tbg
Copy link
Member Author

tbg commented Dec 11, 2019

@jordanlewis what you're suggesting is option 1 and it's my favorite as well. All of the consequences live above KV.

See @dt's comment for bulk i/o and scroll down from there to see the one place we found where SQL makes assumptions (~ kvfetcher considers row to be done when hitting a range boundary)

@jordanlewis
Copy link
Member

Ah I see. I think we could probably fix this then...

@lunevalex
Copy link
Collaborator

@tbg @jordanlewis now that we increased the range size to 512MB, do we still want to consider doing this at some point in the future?

@ajwerner
Copy link
Contributor

Another proposal that isn't in the list is to propagate information to KV about valid split points within key ranges. As we look towards 21.1 we are looking towards revamping the ways in which KV is informed about zone configs and split points. Today we propagate the entire system config which is not scalable and poorly engineered. Perhaps it wouldn't be so onerous for the system config (or something around it) to indicate a boolean function of a key to determine whether it forms a valid split point.

@tbg
Copy link
Member Author

tbg commented Aug 18, 2020

Yes, there's still a problem here. It's just that the duct tape is working at the moment.

@jlinder jlinder added T-sql-queries SQL Queries Team T-kv KV Team labels Jun 16, 2021
@tbg
Copy link
Member Author

tbg commented May 17, 2023

We now hit this in production again: https://github.com/cockroachlabs/support/issues/2302

@nvanbenschoten nvanbenschoten removed their assignment May 17, 2023
@irfansharif irfansharif added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label May 17, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue May 18, 2023
Previously, the load based range splitter could suggest split keys which
were in-between SQL rows. This violated assumptions made in SQL code,
which require that rows are never split across ranges.

This patch updates the unweighted split finder to only retain safe
sample keys. The weighted split finder already does this (e4f003b). Note
that the weighted split finder is used by default (>=23.1), whilst the
unweighted split finder is used when
`kv.allocator.load_based_rebalancing.objective` is set to `qps` (default
`cpu`).

Informs: cockroachdb#43094
Fixes: cockroachdb#103483

Release note: None
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Dec 4, 2023
@michae2 michae2 moved this from Triage to New Backlog in SQL Queries Dec 4, 2023
@michae2 michae2 added the P-3 Issues/test failures with no fix SLA label Dec 21, 2023
@michae2
Copy link
Collaborator

michae2 commented Dec 21, 2023

Can we close this, given that #103483 was fixed?

@dt
Copy link
Member

dt commented Jan 2, 2024

I think this is still relevant / open.

We've seen a new manifestation of this bug in almost every release I can remember, either in restore, load-based-splits, schema changes, etc, since the underlying issue -- that a kv split can be created between KVs of a row -- still exists. I'd argue that #103483 was closed since we fixed an instance of the bug, but the underlying bug in the design persists and I think this issue is the closest we have to tracking that?

#109857 we could probably fold into this.

@dt
Copy link
Member

dt commented Jan 2, 2024

I wonder if we should fold #112702 into this too.

@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Mar 15, 2024
@yuzefovich yuzefovich added the T-sql-queries SQL Queries Team label May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-kv KV Team T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests

10 participants