-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc: SQL syntax for splitting and relocating ranges #14146
Conversation
I'm not terribly fond of some of the naming ( Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 30 at r1 (raw file):
Could docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):
Is the relocation a one-off perturbation of replica/leaseholder placement? If you try and place all of the leaseholders on a single node, Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 21 at r1 (raw file):
nit: backup doesn't do any of this. you could use "restore" throughout if you like docs/RFCS/split_sql_syntax.md, line 25 at r1 (raw file):
this particular algorithm wasn't rigorously tested. we should make sure it's necessary before committing to it docs/RFCS/split_sql_syntax.md, line 96 at r1 (raw file):
restore wants to rebalance both replicas and leases. it mostly doesn't care about the specifics, but will eventually want to restore "close" to wherever the backup data is. how do you imagine this working here? your TODO below about maybe specifying zone targets might be the answer to both questions Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 25 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
sure, the main point is to avoid choosing a syntax which forces us to split sequentially docs/RFCS/split_sql_syntax.md, line 30 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
I don't see why not. But it won't help the restore usecase where we already have keys. docs/RFCS/split_sql_syntax.md, line 96 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
yes I think it would help if the relocation string is expressive enough. it could even have a way to specify "relocate at random" docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
In test clus ters that are set up using "manual replication" this won't be a problem. Comments from Reviewable |
docs/RFCS/split_sql_syntax.md
Outdated
For debugging/testing purposes, we will also need a `pretty_key` function that | ||
takes a `BYTES` key generated by `key_prefix` and returns a pretty-printed string. | ||
|
||
### 2. `KVSPLIT` statement ### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend having these two statements as a special form of ALTER
. I have seen clients classify DDL vs non-DDL statemetns by looking at the first word, and a new word is asking for trouble.
Probably this new KVSPLIT
will simply end up replacing our current ALTER ... SPLIT
. I think this would be reasonable. Like Peter suggests, our current SPLIT node can be extended with a SELECT clause, the grammar allows this and the corresponding changes would not be too complicated.
docs/RFCS/split_sql_syntax.md
Outdated
- Start Date: 2017-03-14 | ||
- Authors: Radu Berinde | ||
- RFC PR: (PR # after acceptance of initial draft) | ||
- Cockroach Issue: #13665 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also extends/supersedes #8990.
docs/RFCS/split_sql_syntax.md
Outdated
`BYTES` keys and executes the splits. Examples: | ||
|
||
```sql | ||
KVSPLIT VALUES ('..'), ('..') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please clarify the meaning of ..
docs/RFCS/split_sql_syntax.md
Outdated
|
||
The order in which we perform splits matters for achieving parallelism: if we | ||
execute the splits in order, we can only do one split at a time because every | ||
split runs on an existing range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By this argument I think we should first split on the 2nd and next-to-last key first, and only then the middle keys in parallel, otherwise there would be many operations splitting on the sides where the ranges are touching already existing ranges.
docs/RFCS/split_sql_syntax.md
Outdated
`KVRELOCATE <select_clause>` | ||
|
||
The `KVRELOCATE` statement takes a select clause that returns two columns: a | ||
`BYTES` key column, and a `STRING` relocation string. The string is a comma |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend an []int
relocation array. This can then be generated with array_agg
if needed.
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file): Previously, knz (kena) wrote…
A big distinction between the existing The case where we have keys is restore. docs/RFCS/split_sql_syntax.md, line 79 at r1 (raw file): Previously, knz (kena) wrote…
the values are (raw binary) keys, I don't think an actual example of an escaped binary key would be more helpful :) What should I put in there to make it more clear? Comments from Reviewable |
Reviewed 1 of 1 files at r1. docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file): Previously, RaduBerinde wrote…
Oh I see! Once this is implemented I would recommend extending the existing SPLIT to also support SELECT clauses, in addition. docs/RFCS/split_sql_syntax.md, line 79 at r1 (raw file): Previously, RaduBerinde wrote…
Oh just add a SQL comment in your example:
Comments from Reviewable |
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file): Previously, knz (kena) wrote…
One option would be to remove BTW I still like the idea of putting the new statements under Comments from Reviewable |
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 21 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
Does restore need to split a table before the table descriptor is restored? Some of the discussion seems to imply yes, but if the table descriptor has been restored (perhaps without a name), then docs/RFCS/split_sql_syntax.md, line 25 at r1 (raw file): Previously, RaduBerinde wrote…
FYI, I tested this in docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file): Previously, RaduBerinde wrote…
Or docs/RFCS/split_sql_syntax.md, line 93 at r1 (raw file): Previously, knz (kena) wrote…
I'm not following your reasoning here, @knz. docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, RaduBerinde wrote…
How long would the change "stick"? Seems difficult to make that work with automatic rebalancing. It might be useful to add a knob to enable/disable rebalancing on real clusters for ease of testing. Comments from Reviewable |
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 30 at r1 (raw file): Previously, RaduBerinde wrote…
With restore, we don't just have keys, we have split points. Wouldn't restore just split at the same points that the backed-up table was split? (it would probably perform those splits according to the binary-search-like algorithm above to maximize parallelism, but it's not making decisions about where to split) docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
But restore also does a lot of custom KV-level stuff; its splitting functionality doesn't necessarily need to be exposed as SQL commands. I think that splits performed via SQL would be best expressed as PK (or other index) values, although I'm willing to be convinced here. docs/RFCS/split_sql_syntax.md, line 79 at r1 (raw file): Previously, RaduBerinde wrote…
You could show examples with docs/RFCS/split_sql_syntax.md, line 81 at r1 (raw file):
This reads as pretty magical. In user terms, I think the most common use case for pre-splitting will be "my keys are/will be uniformly distributed across either the entire bytes keyspace or the integer range from 1 to N, and I want to produce K splits". How would that translate to the proposed syntax, and should we consider a shorthand syntax for this case? docs/RFCS/split_sql_syntax.md, line 93 at r1 (raw file): Previously, knz (kena) wrote…
One point to remember here is that the cost of a split is (roughly) proportional to the size of the left side (or the smaller side with a small change to the code. We currently hard-code the assumption that the left side is smaller because that's how the docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Outside of distsql testing, do we really want (or want to offer) direct control over replication like this? I don't think we do, and the best end-user use for a "relocate" command would be more of a "scatter" to force the just-split ranges to be distributed across all eligible nodes (perhaps we could do this by adding 32MB to the effective size of each range and let the rebalancing system do its thing automatically. Comments from Reviewable |
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 93 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
This is true, though the size of the ranges is immaterial of we're pre-splitting an empty table. docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Note that restore also wants to wait for the relocation to finish. It is insufficient to ask the system to "scatter" the ranges and then immediately proceed with the restore. Comments from Reviewable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat.
docs/RFCS/split_sql_syntax.md
Outdated
INDEX vw (v, w) | ||
) | ||
|
||
SELECT key_prefix('t', 1, 2) -- returns /t/primary/1/2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key_prefix
just sounds too generic for what this is actually doing. Not sure what @petermattis had in mind with his comment, but I'd prefer renaming as follows:
key_prefix
–> distkv_key_prefix
pretty_key
–> distkv_pretty_key
KVSPLIT
–> DISTKV_SPLIT
KVRELOCATE
–> DISTKV_RELOCATE
This has the advantage of introducing a new namespace which is convincingly outside of any standard SQL dialect, while not being confusingly non-specific like internal
or the like.
Thanks for the feedback everyone! Given your comments, I'm inclined to rework this to drop the functionality for splitting on keys (and instead augment the existing Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
@petermattis The intention was that @bdarnell - We can add syntax for triggering a "scatter". But are you suggesting we drop the direct control syntax altogether? (if yes, what is the alternative proposal for achieving the test setup goals?) Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, RaduBerinde wrote…
I think the direct control syntax will be useful for testing. If we get anxious about exposing this to a user we can enable it via an env var and default it to disabled. Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):
Yeah, I'd rather not expose things like node/store IDs (these should really be store IDs, not node IDs) or even the proposed Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I don't think Restore wants to (needs to?) scatter newly created empty ranges (both the replicas and leaseholders) and wait for the rebalancing to finish before starting the restore of actual data. This doesn't need a SQL interface, but it would be nice to use the same mechanism that DistSQL testing is using. Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
I agree that we should expose some way of scattering a group of replicas; my objection is to the version of the command that gives the user direct control over where those replicas (temporarily) end up. Comments from Reviewable |
docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I think having direct control (to be able to reproduce the exact data distribution pattern) can be very useful to test specific cases, and (perhaps more importantly) for having reproducible benchmarks. There's much more noise in comparing the before and after of a change if the data pattern is randomized every time. I have absolutely no problem with hiding this command behind an env or session var. Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 93 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
right, this is the advantage of separating splitting from relocating - you don't need to worry about the ordering of split wrt data movement docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file): Previously, RaduBerinde wrote…
On the "how do we get new ranges to stick" question, the problem, I guess, is that, when confronted with large numbers of new ranges over a short time period, the replication and lease rebalancing queues can get confused and start doing unstable things because they don't have up-to-date info about the other nodes. One solution, which I think might have been implied here, is that, when we use this functionality for tests and benchmarks, we disable these queues. Comments from Reviewable |
f140902
to
0298a84
Compare
Review status: 0 of 1 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 65 at r2 (raw file): Previously, knz (kena) wrote…
Ack. Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed. docs/RFCS/split_sql_syntax.md, line 21 at r2 (raw file):
After viewing this RFC, I feel very confident that customers are going to use all three features! so we docs/RFCS/split_sql_syntax.md, line 42 at r2 (raw file):
So a User can use this feature? Or is it enabled via an env var? docs/RFCS/split_sql_syntax.md, line 99 at r2 (raw file):
It's not clear what, "ranges to be relocated to a random set of replicas" means? Comments from Reviewable |
d82d286
to
50a7374
Compare
Updated, thanks for the comments! I wasn't 100% sure this warrants an RFC but it absolutely did. Lesson learned: when in doubt, RFC! Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed. docs/RFCS/sql_split_syntax.md, line 81 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I think that it is hard to come up with a general syntax which generates split points as sql values and works for all types; if the commands were key-oriented, that would be simpler. For a simple case like an integer column, it would be something along the lines of docs/RFCS/sql_split_syntax.md, line 42 at r2 (raw file): Previously, vivekmenezes wrote…
Yes, it can be used. docs/RFCS/sql_split_syntax.md, line 54 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote…
Fixed. docs/RFCS/sql_split_syntax.md, line 99 at r2 (raw file): Previously, vivekmenezes wrote…
Rephrased. docs/RFCS/sql_split_syntax.md, line 134 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions. docs/RFCS/sql_split_syntax.md, line 105 at r3 (raw file):
I added some info about Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions, some commit checks pending. docs/RFCS/sql_split_syntax.md, line 105 at r3 (raw file): Previously, RaduBerinde wrote…
Yeah, I think we should just ignore the error. We should probably just make this change in the core: make AdminSplit a no-op when there is already a split at the requested point. Comments from Reviewable |
docs/RFCS/sql_split_syntax.md, line 105 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
👍 Comments from Reviewable |
This error is not useful; indeed many callers go through hoops to ignore it. Extending SPLIT AT to handle multiple split points makes this error even more annoying. Removing this error in favor of a silent no-op (based on discussion in cockroachdb#14146). The backup test is changed to read the meta descriptor instead on relying on the old behavior to verify splits.
Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed. docs/RFCS/sql_split_syntax.md, line 81 at r1 (raw file): Previously, RaduBerinde wrote…
Are there any thoughts/suggestions what we can improve in this RFC for this use case? docs/RFCS/sql_split_syntax.md, line 113 at r1 (raw file): Previously, andreimatei (Andrei Matei) wrote…
What do folks think about this? Is this something we should explore as part of this RFC? Comments from Reviewable |
This error is not useful; indeed many callers go through hoops to ignore it. Extending SPLIT AT to handle multiple split points makes this error even more annoying. Removing this error in favor of a silent no-op (based on discussion in cockroachdb#14146). The backup test is changed to read the meta descriptor instead on relying on the old behavior to verify splits.
`SPLIT AT` now takes an arbitrary select statement. Existing uses must switch to using `VALUES`; e.g. `ALTER TABLE t SPLIT AT (x, y)` becomes `ALTER TABLE t SPLIT AT VALUES (x, y)`. Part of cockroachdb#13665, implements part of RFC cockroachdb#14146.
Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed. docs/RFCS/sql_split_syntax.md, line 81 at r1 (raw file): Previously, RaduBerinde wrote…
I was thinking that if there were a good way to specify the distribution of keys (perhaps via a function to generate random keys from the expected distribution), you could pass that function and the desired number of splits. But while that works well for UUIDs, it's not much better than the existing proposal ( docs/RFCS/sql_split_syntax.md, line 113 at r1 (raw file): Previously, RaduBerinde wrote…
I think that for a random scattering of nodes, we don't care how long the changes stick because the user has no expectations about the end result (and there is no pressure that's going to force these ranges to get bunched back together once they've moved). For testing-only direct control, I think we should probably just disable those queues for those tests. I don't think there's a need at this point to give fine-grained control to disable the queues for certain ranges or for certain time limits. Comments from Reviewable |
Thanks! |
`SPLIT AT` now takes an arbitrary select statement. Existing uses must switch to using `VALUES`; e.g. `ALTER TABLE t SPLIT AT (x, y)` becomes `ALTER TABLE t SPLIT AT VALUES (x, y)`. Part of cockroachdb#13665, implements part of RFC cockroachdb#14146.
docs/RFCS/sql_split_syntax.md, line 113 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I had switched the syntax to take a list of store IDs but I guess I made the assumption that store IDs are unique across hosts? Is that the case? If not, what would be the correct format? A list of "nodeID:storeID" pairs? Comments from Reviewable |
docs/RFCS/sql_split_syntax.md, line 113 at r1 (raw file): Previously, RaduBerinde wrote…
Ah, they are, never mind. Comments from Reviewable |
This change is