rfc: SQL syntax for splitting and relocating ranges #14146

RaduBerinde · 2017-03-14T21:30:38Z

petermattis · 2017-03-14T21:52:57Z

I'm not terribly fond of some of the naming (KVSPLIT, KVRELOCATE, key_prefix), but the functionality all seems well thought out.

Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 30 at r1 (raw file):

more usecases for pre-splitting ranges (for benchmarks, importing, etc).

We currently have the `ALTER TABLE/INDEX SPLIT AT` statement which allows

Could ALTER {TABLE,INDEX} SPLIT AT be extended to also accept a select expression? Seems feasible, but only briefly looked at the grammar.

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

TODO: a better, more extensive format? more flexibility in specifying targets
(e.g. zones)?

Is the relocation a one-off perturbation of replica/leaseholder placement? If you try and place all of the leaseholders on a single node, replicateQueue will then rebalance. Is this a problem?

Comments from Reviewable

danhhz · 2017-03-14T21:59:30Z

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 21 at r1 (raw file):

the replicas and range leaders of each split.

The secondary motivation is for backup/restore, which needs to introduce a

nit: backup doesn't do any of this. you could use "restore" throughout if you like

docs/RFCS/split_sql_syntax.md, line 25 at r1 (raw file):

follows: the keys are sorted; a split is introduced on the middle key, then
the splits to the left and respectively to the right are processed recursively
(in parallel). We want the new syntax to implement this algorithm so that

this particular algorithm wasn't rigorously tested. we should make sure it's necessary before committing to it
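
As a rough illustration of the ordering quoted above, the recursion can be flattened into rounds, where every split in a round lands in a different range and could therefore run in parallel. This is only a sketch under the assumptions stated in the quoted RFC text; `split_rounds` is a hypothetical helper, not code from the repository.

```python
def split_rounds(keys):
    """Group split keys into rounds; splits in one round hit disjoint ranges."""
    keys = sorted(keys)
    rounds = []
    # Each pending item is a half-open index interval [lo, hi) of keys
    # that still need a split inside one not-yet-divided range.
    pending = [(0, len(keys))]
    while pending:
        current, next_pending = [], []
        for lo, hi in pending:
            if lo >= hi:
                continue
            mid = (lo + hi) // 2
            current.append(keys[mid])           # split on the middle key
            next_pending.append((lo, mid))      # keys left of the split
            next_pending.append((mid + 1, hi))  # keys right of the split
        if current:
            rounds.append(current)
        pending = next_pending
    return rounds
```

For seven keys `1..7` this yields the rounds `[[4], [2, 6], [1, 3, 5, 7]]`: three rounds instead of seven sequential splits.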

docs/RFCS/split_sql_syntax.md, line 96 at r1 (raw file):

### 3. `KVRELOCATE` statement ###

restore wants to rebalance both replicas and leases. it mostly doesn't care about the specifics, but will eventually want to restore "close" to wherever the backup data is. how do you imagine this working here?

your TODO below about maybe specifying zone targets might be the answer to both questions

RaduBerinde · 2017-03-14T22:38:20Z

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 25 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

this particular algorithm wasn't rigorously tested. we should make sure it's necessary before committing to it

sure, the main point is to avoid choosing a syntax which forces us to split sequentially

docs/RFCS/split_sql_syntax.md, line 30 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Could ALTER {TABLE,INDEX} SPLIT AT be extended to also accept a select expression? Seems feasible, but only briefly looked at the grammar.

I don't see why not. But it won't help the restore usecase where we already have keys.

docs/RFCS/split_sql_syntax.md, line 96 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

restore wants to rebalance both replicas and leases. it mostly doesn't care about the specifics, but will eventually want to restore "close" to wherever the backup data is. how do you imagine this working here?

your TODO below about maybe specifying zone targets might be the answer to both questions

yes I think it would help if the relocation string is expressive enough. it could even have a way to specify "relocate at random"

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is the relocation a one-off perturbation of replica/leaseholder placement? If you try and place all of the leaseholders on a single node, replicateQueue will then rebalance. Is this a problem?

In test clusters that are set up using "manual replication" this won't be a problem.
For restore, we want the system to subsequently rebalance as needed. I'm sure there are other usecases where we may want to make the change "stick"; that could be specified in the relocation string if we have a way of implementing it.

knz · 2017-03-14T22:35:34Z

docs/RFCS/split_sql_syntax.md

+For debugging/testing purposes, we will also need a `pretty_key` function that
+takes a `BYTES` key generated by `key_prefix` and returns a pretty-printed string.
+
+### 2. `KVSPLIT` statement ###


I would recommend having these two statements as a special form of ALTER. I have seen clients classify DDL vs non-DDL statements by looking at the first word, and a new word is asking for trouble.

Probably this new KVSPLIT will simply end up replacing our current ALTER ... SPLIT. I think this would be reasonable. Like Peter suggests, our current SPLIT node can be extended with a SELECT clause, the grammar allows this and the corresponding changes would not be too complicated.

knz · 2017-03-14T22:35:56Z

docs/RFCS/split_sql_syntax.md

+- Start Date: 2017-03-14
+- Authors: Radu Berinde
+- RFC PR: (PR # after acceptance of initial draft)
+- Cockroach Issue: #13665


Also extends/supersedes #8990.

knz · 2017-03-14T22:36:56Z

docs/RFCS/split_sql_syntax.md

+`BYTES` keys and executes the splits. Examples:
+
+```sql
+KVSPLIT VALUES ('..'), ('..')


Please clarify the meaning of ..

knz · 2017-03-14T22:38:36Z

docs/RFCS/split_sql_syntax.md

+
+The order in which we perform splits matters for achieving parallelism: if we
+execute the splits in order, we can only do one split at a time because every
+split runs on an existing range.


By this argument I think we should first split on the 2nd and next-to-last key first, and only then the middle keys in parallel, otherwise there would be many operations splitting on the sides where the ranges are touching already existing ranges.

knz · 2017-03-14T22:39:37Z

docs/RFCS/split_sql_syntax.md

+`KVRELOCATE <select_clause>`
+
+The `KVRELOCATE` statement takes a select clause that returns two columns: a
+`BYTES` key column, and a `STRING` relocation string. The string is a comma


I recommend an []int relocation array. This can then be generated with array_agg if needed.

RaduBerinde · 2017-03-14T23:28:31Z

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file):

Previously, knz (kena) wrote…

I would recommend having these two statements as a special form of ALTER. I have seen clients classify DDL vs non-DDL statements by looking at the first word, and a new word is asking for trouble.

Probably this new KVSPLIT will simply end up replacing our current ALTER ... SPLIT. I think this would be reasonable. Like Peter suggests, our current SPLIT node can be extended with a SELECT clause, the grammar allows this and the corresponding changes would not be too complicated.

A big distinction between the existing SPLIT AT and the proposal is the possibility of splitting at keys (rather than PK values, or index columns). That would make it a general operation rather than something that applies to a table or index (like SPLIT AT).

The case where we have keys is restore.

docs/RFCS/split_sql_syntax.md, line 79 at r1 (raw file):

Previously, knz (kena) wrote…

Please clarify the meaning of ..

the values are (raw binary) keys, I don't think an actual example of an escaped binary key would be more helpful :) What should I put in there to make it more clear?

knz · 2017-03-14T23:53:07Z

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file):

Previously, RaduBerinde wrote…

A big distinction between the existing SPLIT AT and the proposal is the possibility of splitting at keys (rather than PK values, or index columns). That would make it a general operation rather than something that applies to a table or index (like SPLIT AT).

The case where we have keys is restore.

Oh I see!
Then yes a new KVSPLIT alternative is warranted.

Once this is implemented I would recommend extending the existing SPLIT to also support SELECT clauses, in addition.

docs/RFCS/split_sql_syntax.md, line 79 at r1 (raw file):

Previously, RaduBerinde wrote…

the values are (raw binary) keys, I don't think an actual example of an escaped binary key would be more helpful :) What should I put in there to make it more clear?

Oh just add a SQL comment in your example:

KVSPLIT VALUES (b'....') -- can use the byte representation of a KV key directly

RaduBerinde · 2017-03-15T00:01:00Z

Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file):

Previously, knz (kena) wrote…

Oh I see!
Then yes a new KVSPLIT alternative is warranted.

Once this is implemented I would recommend extending the existing SPLIT to also support SELECT clauses, in addition.

One option would be to remove SPLIT AT altogether, to avoid having two ways of doing the same thing (ALTER TABLE t SPLIT AT SELECT x, y, z FROM ... would be equivalent to KVSPLIT SELECT key_prefix('t', x, y, z) FROM ...).

BTW I still like the idea of putting the new statements under ALTER, maybe ALTER KV SPLIT, ALTER KV RELOCATE?

petermattis · 2017-03-15T00:41:18Z

Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 21 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: backup doesn't do any of this. you could use "restore" throughout if you like

Does restore need to split a table before the table descriptor is restored? Some of the discussion seems to imply yes, but if the table descriptor has been restored (perhaps without a name), then ALTER TABLE [<id>] SPLIT AT would work.

docs/RFCS/split_sql_syntax.md, line 25 at r1 (raw file):

Previously, RaduBerinde wrote…

sure, the main point is to avoid choosing a syntax which forces us to split sequentially

FYI, I tested this in kv and it reduced the time to create a large number of splits by ~10x. For some reason I can't remember I never sent out that PR, though.

docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file):

Previously, RaduBerinde wrote…

One option would be to remove SPLIT AT altogether, to avoid having two ways of doing the same thing (ALTER TABLE t SPLIT AT SELECT x, y, z FROM ... would be equivalent to KVSPLIT SELECT key_prefix('t', x, y, z) FROM ...).

BTW I still like the idea of putting the new statements under ALTER, maybe ALTER KV SPLIT, ALTER KV RELOCATE?

Or ALTER crdb_internal.kv {SPLIT,RELOCATE}. I'm not sure why we'd need it, but having a crdb_internal.kv virtual table which exposed the entire KV space could be interesting.

docs/RFCS/split_sql_syntax.md, line 93 at r1 (raw file):

Previously, knz (kena) wrote…

By this argument I think we should first split on the 2nd and next-to-last key first, and only then the middle keys in parallel, otherwise there would be many operations splitting on the sides where the ranges are touching already existing ranges.

I'm not following your reasoning here, @knz.

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, RaduBerinde wrote…

In test clusters that are set up using "manual replication" this won't be a problem.
For restore, we want the system to subsequently rebalance as needed. I'm sure there are other usecases where we may want to make the change "stick"; that could be specified in the relocation string if we have a way of implementing it.

How long would the change "stick"? Seems difficult to make that work with automatic rebalancing. It might be useful to add a knob to enable/disable rebalancing on real clusters for ease of testing.

bdarnell · 2017-03-15T02:40:26Z

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 30 at r1 (raw file):

Previously, RaduBerinde wrote…

I don't see why not. But it won't help the restore usecase where we already have keys.

With restore, we don't just have keys, we have split points. Wouldn't restore just split at the same points that the backed-up table was split? (it would probably perform those splits according to the binary-search-like algorithm above to maximize parallelism, but it's not making decisions about where to split)

docs/RFCS/split_sql_syntax.md, line 71 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Or ALTER crdb_internal.kv {SPLIT,RELOCATE}. I'm not sure why we'd need it, but having a crdb_internal.kv virtual table which exposed the entire KV space could be interesting.

But restore also does a lot of custom KV-level stuff; its splitting functionality doesn't necessarily need to be exposed as SQL commands. I think that splits performed via SQL would be best expressed as PK (or other index) values, although I'm willing to be convinced here.

docs/RFCS/split_sql_syntax.md, line 79 at r1 (raw file):

Previously, RaduBerinde wrote…

the values are (raw binary) keys, I don't think an actual example of an escaped binary key would be more helpful :) What should I put in there to make it more clear?

You could show examples with key_prefix as below (this would show the comma-separated values approach as opposed to the selects below).

docs/RFCS/split_sql_syntax.md, line 81 at r1 (raw file):

KVSPLIT VALUES ('..'), ('..')
KVSPLIT SELECT key_prefix('t', 1, 2)
KVSPLIT SELECT key_prefix('t', i*10) FROM GENERATE_SERIES(1, 5) AS x(i)

This reads as pretty magical. In user terms, I think the most common use case for pre-splitting will be "my keys are/will be uniformly distributed across either the entire bytes keyspace or the integer range from 1 to N, and I want to produce K splits". How would that translate to the proposed syntax, and should we consider a shorthand syntax for this case?
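
To make the "uniformly distributed across the entire bytes keyspace" case concrete, here is a small sketch of how evenly spaced split keys could be computed. It assumes keys are compared as raw bytes and that fixed-width prefixes are acceptable split points; `uniform_bytes_splits` and its `width` parameter are hypothetical and not part of the proposal.

```python
def uniform_bytes_splits(k, width=2):
    """Return k evenly spaced split keys over all `width`-byte prefixes.

    k split keys divide the keyspace into k + 1 ranges of roughly equal span.
    """
    space = 256 ** width
    return [(i * space // (k + 1)).to_bytes(width, "big") for i in range(1, k + 1)]
```

For example, `uniform_bytes_splits(3, width=1)` returns `[b"\x40", b"\x80", b"\xc0"]`, which cuts the one-byte prefix space into four quarters.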

docs/RFCS/split_sql_syntax.md, line 93 at r1 (raw file):

Previously, knz (kena) wrote…

By this argument I think we should first split on the 2nd and next-to-last key first, and only then the middle keys in parallel, otherwise there would be many operations splitting on the sides where the ranges are touching already existing ranges.

One point to remember here is that the cost of a split is (roughly) proportional to the size of the left side (or the smaller side with a small change to the code. We currently hard-code the assumption that the left side is smaller because that's how the splitQueue works). So splitting into equal parts is ideal for maximizing parallelism, but splitting from the left is the best for throughput. Some hybrid solution is probably going to be the best overall.
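
The trade-off described here can be illustrated with a toy cost model. It assumes the data is spread evenly over equally sized pieces and that a split costs exactly the size of its left side; both functions are hypothetical and only meant to show why total work and parallelism pull in different directions.

```python
def sequential_cost(pieces, piece_size=1):
    """Left-to-right: every split has a single finished piece on its left."""
    splits = pieces - 1
    return {"work": splits * piece_size, "rounds": splits}


def bisection_cost(pieces, piece_size=1):
    """Recursive halving: the left side is half of the range being split."""
    work, rounds = 0, 0
    level = [pieces] if pieces > 1 else []  # ranges (in pieces) still to split
    while level:
        nxt = []
        for n in level:
            left = n // 2
            work += left * piece_size
            nxt += [left, n - left]
        level = [n for n in nxt if n > 1]
        rounds += 1  # all splits in one level can run in parallel
    return {"work": work, "rounds": rounds}
```

For 8 pieces, the sequential order does 7 units of work in 7 rounds, while bisection does 12 units of work in 3 rounds: less total throughput cost for the former, more parallelism for the latter.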

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

How long would the change "stick"? Seems difficult to make that work with automatic rebalancing. It might be useful to add a knob to enable/disable rebalancing on real clusters for ease of testing.

Outside of distsql testing, do we really want (or want to offer) direct control over replication like this? I don't think we do, and the best end-user use for a "relocate" command would be more of a "scatter" to force the just-split ranges to be distributed across all eligible nodes (perhaps we could do this by adding 32MB to the effective size of each range and let the rebalancing system do its thing automatically).

petermattis · 2017-03-15T12:16:42Z

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 93 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

One point to remember here is that the cost of a split is (roughly) proportional to the size of the left side (or the smaller side with a small change to the code. We currently hard-code the assumption that the left side is smaller because that's how the splitQueue works). So splitting into equal parts is ideal for maximizing parallelism, but splitting from the left is the best for throughput. Some hybrid solution is probably going to be the best overall.

This is true, though the size of the ranges is immaterial if we're pre-splitting an empty table.

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Outside of distsql testing, do we really want (or want to offer) direct control over replication like this? I don't think we do, and the best end-user use for a "relocate" command would be more of a "scatter" to force the just-split ranges to be distributed across all eligible nodes (perhaps we could do this by adding 32MB to the effective size of each range and let the rebalancing system do its thing automatically).

Note that restore also wants to wait for the relocation to finish. It is insufficient to ask the system to "scatter" the ranges and then immediately proceed with the restore.

spencerkimball

Neat.

spencerkimball · 2017-03-15T13:37:46Z

docs/RFCS/split_sql_syntax.md

+  INDEX vw (v, w)
+)
+
+SELECT key_prefix('t', 1, 2)     -- returns /t/primary/1/2


key_prefix just sounds too generic for what this is actually doing. Not sure what @petermattis had in mind with his comment, but I'd prefer renaming as follows:

key_prefix –> distkv_key_prefix
pretty_key –> distkv_pretty_key
KVSPLIT –> DISTKV_SPLIT
KVRELOCATE –> DISTKV_RELOCATE

This has the advantage of introducing a new namespace which is convincingly outside of any standard SQL dialect, while not being confusingly non-specific like internal or the like.

RaduBerinde · 2017-03-15T15:28:35Z

Thanks for the feedback everyone! Given your comments, I'm inclined to rework this to drop the functionality for splitting on keys (and instead augment the existing SPLIT AT); restore can continue to use its current code or it can switch to using PK split points.

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Note that restore also wants to wait for the relocation to finish. It is insufficient to ask the system to "scatter" the ranges and then immediately proceed with the restore.

@petermattis The intention was that KVRELOCATE would wait until the relocations are performed. I should specify this.

@bdarnell - We can add syntax for triggering a "scatter". But are you suggesting we drop the direct control syntax altogether? (if yes, what is the alternative proposal for achieving the test setup goals?)

petermattis · 2017-03-16T02:13:38Z

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, RaduBerinde wrote…

@petermattis The intention was that KVRELOCATE would wait until the relocations are performed. I should specify this.

@bdarnell - We can add syntax for triggering a "scatter". But are you suggesting we drop the direct control syntax altogether? (if yes, what is the alternative proposal for achieving the test setup goals?)

I think the direct control syntax will be useful for testing. If we get anxious about exposing this to a user we can enable it via an env var and default it to disabled.

bdarnell · 2017-03-16T03:33:38Z

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

But are you suggesting we drop the direct control syntax altogether? (if yes, what is the alternative proposal for achieving the test setup goals?)

Yeah, I'd rather not expose things like node/store IDs (these should really be store IDs, not node IDs) or even the proposed key_prefix() to users via SQL. I thought TestCluster was the solution for DistSQL testing. Is that not enough? Does it need a SQL interface? I guess I'm OK with that, as long as we can come up with sensible definitions of what this means and how long the change sticks, and bury it so users won't stumble across it and try to micromanage their replication.

petermattis · 2017-03-16T14:50:26Z

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

But are you suggesting we drop the direct control syntax altogether? (if yes, what is the alternative proposal for achieving the test setup goals?)

Yeah, I'd rather not expose things like node/store IDs (these should really be store IDs, not node IDs) or even the proposed key_prefix() to users via SQL. I thought TestCluster was the solution for DistSQL testing. Is that not enough? Does it need a SQL interface? I guess I'm OK with that, as long as we can come up with sensible definitions of what this means and how long the change sticks, and bury it so users won't stumble across it and try to micromanage their replication.

I don't think TestCluster is sufficient for various manual testing situations. Consider a user who has a small-ish table that isn't adequately distributed to show a benefit to DistSQL. I'd like to give them commands to manually split and "scatter" the replicas. I imagine we'll want to do this ourselves during performance testing. For reproducibility, the scattering should be deterministic.

Restore wants to (needs to?) scatter newly created empty ranges (both the replicas and leaseholders) and wait for the rebalancing to finish before starting the restore of actual data. This doesn't need a SQL interface, but it would be nice to use the same mechanism that DistSQL testing is using.
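
One way to picture a deterministic scatter is a seeded assignment of replicas to stores, so the same inputs always produce the same layout. This is a sketch only: it ignores zone configs and says nothing about how the allocator actually places replicas, and `deterministic_scatter` with its parameters is hypothetical.

```python
import random


def deterministic_scatter(range_ids, store_ids, replicas=3, seed=0):
    """Pick `replicas` distinct stores per range; the first is the leaseholder.

    The same (range_ids, store_ids, replicas, seed) always gives the same
    placement, which is what makes benchmark runs comparable.
    """
    rng = random.Random(seed)
    stores = sorted(store_ids)
    return {rid: rng.sample(stores, replicas) for rid in sorted(range_ids)}
```

Calling it twice with the same seed returns identical placements; changing the seed gives a different but equally reproducible layout.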

bdarnell · 2017-03-16T14:55:34Z

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't think TestCluster is sufficient for various manual testing situations. Consider a user who has a small-ish table that isn't adequately distributed to show a benefit to DistSQL. I'd like to give them commands to manually split and "scatter" the replicas. I imagine we'll want to do this ourselves during performance testing. For reproducibility, the scattering should be deterministic.

Restore wants to (needs to?) scatter newly created empty ranges (both the replicas and leaseholders) and wait for the rebalancing to finish before starting the restore of actual data. This doesn't need a SQL interface, but it would be nice to use the same mechanism that DistSQL testing is using.

I agree that we should expose some way of scattering a group of replicas; my objection is to the version of the command that gives the user direct control over where those replicas (temporarily) end up.

RaduBerinde · 2017-03-16T16:05:07Z

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I agree that we should expose some way of scattering a group of replicas; my objection is to the version of the command that gives the user direct control over where those replicas (temporarily) end up.

I think having direct control (to be able to reproduce the exact data distribution pattern) can be very useful to test specific cases, and (perhaps more importantly) for having reproducible benchmarks. There's much more noise in comparing the before and after of a change if the data pattern is randomized every time.

I have absolutely no problem with hiding this command behind an env or session var.

andreimatei · 2017-03-16T16:55:04Z

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 93 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is true, though the size of the ranges is immaterial if we're pre-splitting an empty table.

right, this is the advantage of separating splitting from relocating - you don't need to worry about the ordering of split wrt data movement

docs/RFCS/split_sql_syntax.md, line 113 at r1 (raw file):

Previously, RaduBerinde wrote…

I think having direct control (to be able to reproduce the exact data distribution pattern) can be very useful to test specific cases, and (perhaps more importantly) for having reproducible benchmarks. There's much more noise in comparing the before and after of a change if the data pattern is randomized every time.

I have absolutely no problem with hiding this command behind an env or session var.

On the "how do we get new ranges to stick" question, the problem, I guess, is that, when confronted with large numbers of new ranges over a short time period, the replication and lease rebalancing queues can get confused and start doing unstable things because they don't have up-to-date info about the other nodes. One solution, which I think might have been implied here, is that, when we use this functionality for tests and benchmarks, we disable these queues.
Perhaps another option would be to teach the queues to simply ignore a key span (comprising all the new ranges of a table, say) for some configurable number of minutes - until we expect things to stabilize to the point where, if the scatter is even, the queues realize that there's no work for them to do. We could have a new rpc saying "ignore [keyStart->keyEnd) for rebalancing purposes for 5 minutes".
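
The "ignore this span for N minutes" idea could look roughly like the following. It is a minimal in-memory sketch, assuming the queues consult a per-node list before acting on a range; `RebalanceIgnoreList` and its methods are hypothetical and do not correspond to any existing RPC or queue API.

```python
import time


class RebalanceIgnoreList:
    """Tracks key spans [start, end) that rebalancing should skip for a while."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._spans = []  # list of (start_key, end_key, expires_at)

    def ignore(self, start_key, end_key, seconds):
        """Ask the queues to leave [start_key, end_key) alone for `seconds`."""
        self._spans.append((start_key, end_key, self._clock() + seconds))

    def should_skip(self, key):
        """Return True if `key` falls inside a span that has not yet expired."""
        now = self._clock()
        self._spans = [s for s in self._spans if s[2] > now]  # drop expired spans
        return any(start <= key < end for start, end, _ in self._spans)
```

A caller would register the span covering a freshly split table with `ignore(start, end, 300)` and the queues would check `should_skip(key)` before processing a range; after five minutes the entry expires on its own.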

petermattis · 2017-03-17T15:16:44Z

Review status: 0 of 1 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 65 at r2 (raw file):

Previously, knz (kena) wrote…

Only the new syntax would be unambiguously able to support both constants (with VALUES) and select clauses.
I am confident that the hassle to adapt the load generators and jepsen tests will be smaller than the one needed to make the grammar comply.

Ack.

vivekmenezes · 2017-03-17T16:28:13Z

Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.

docs/RFCS/split_sql_syntax.md, line 21 at r2 (raw file):

The main motivation is to allow setting up tests, benchmarks, and reproducible
testbeds, especially for DistSQL. One set of some sample tests that we want to

After viewing this RFC, I feel very confident that customers are going to use all three features! So we
had better either not expose the features to customers or make sure they are well documented.

docs/RFCS/split_sql_syntax.md, line 42 at r2 (raw file):

hardcoding each split in the test file; and we want to be able to easily change
the test table sizes. In addition, the `SPLIT AT` statements don't support
control of replication.

So a user can use this feature? Or is it enabled via an env var?

docs/RFCS/split_sql_syntax.md, line 99 at r2 (raw file):

A new pair of statements similar to `SPLIT AT` are introduced. Each has two
forms. The first form causes all the ranges for that table or index to be
relocated to a random set of replicas (in accordance with the zone config):

It's not clear what "ranges to be relocated to a random set of replicas" means.

RaduBerinde · 2017-03-17T21:00:43Z

Updated, thanks for the comments! I wasn't 100% sure this warrants an RFC but it absolutely did. Lesson learned: when in doubt, RFC!

Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.

docs/RFCS/sql_split_syntax.md, line 81 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This reads as pretty magical. In user terms, I think the most common use case for pre-splitting will be "my keys are/will be uniformly distributed across either the entire bytes keyspace or the integer range from 1 to N, and I want to produce K splits". How would that translate to the proposed syntax, and should we consider a shorthand syntax for this case?

I think that it is hard to come up with a general syntax which generates split points as sql values and works for all types; if the commands were key-oriented, that would be simpler.

For a simple case like an integer column, it would be something along the lines of ALTER TABLE t SPLIT AT (SELECT (i*N/K)::int FROM GENERATE_SERIES(1,K)); ALTER TABLE t SCATTER.
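
The arithmetic behind that `SELECT` can be checked with a short sketch. It assumes integer keys in `1..N` and uses floor division, whereas the SQL cast in the statement above may round instead of truncating; `uniform_int_splits` is a hypothetical helper.

```python
def uniform_int_splits(n, k):
    """Split points i*n/k for i in 1..k, mirroring GENERATE_SERIES(1, K)."""
    return [i * n // k for i in range(1, k + 1)]
```

With `n=100` and `k=4` this gives `[25, 50, 75, 100]`. Note that the last point is `N` itself, so a series ending at `K-1` would avoid a split at the very end of the key range.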

docs/RFCS/sql_split_syntax.md, line 42 at r2 (raw file):

Previously, vivekmenezes wrote…

So a user can use this feature? Or is it enabled via an env var?

Yes, it can be used.

docs/RFCS/sql_split_syntax.md, line 54 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think you're missing SPLIT AT from the above.

Fixed.

docs/RFCS/sql_split_syntax.md, line 99 at r2 (raw file):

Previously, vivekmenezes wrote…

It's not clear what "ranges to be relocated to a random set of replicas" means.

Rephrased.

docs/RFCS/sql_split_syntax.md, line 134 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd be fine without the session var if it were renamed to something like ALTER TABLE TESTING_RELOCATE (and left undocumented).

Done.

RaduBerinde · 2017-03-19T00:56:44Z

Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions.

docs/RFCS/sql_split_syntax.md, line 105 at r3 (raw file):

If a split already exists, `SPLIT AT` returns an error.

TBD: what is the correct behavior with multiple split points? Ignore this error?

I added some info about SPLIT AT return values and semantics. Any thoughts on "range already split" errors?

bdarnell · 2017-03-19T00:58:57Z

Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions, some commit checks pending.

docs/RFCS/sql_split_syntax.md, line 105 at r3 (raw file):

Previously, RaduBerinde wrote…

I added some info about SPLIT AT return values and semantics. Any thoughts on "range already split" errors?

Yeah, I think we should just ignore the error. We should probably just make this change in the core: make AdminSplit a no-op when there is already a split at the requested point.
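
The no-op behavior suggested here can be modeled with a toy keyspace. This is an illustration of the idempotent semantics only, not the real `AdminSplit` code path; `RangeMap` and its `admin_split` method are hypothetical.

```python
import bisect


class RangeMap:
    """Toy model of a keyspace: a sorted list of split boundaries."""

    def __init__(self):
        self.boundaries = []

    def admin_split(self, key):
        """Split at `key`. Returns True for a new split, False if it existed."""
        i = bisect.bisect_left(self.boundaries, key)
        if i < len(self.boundaries) and self.boundaries[i] == key:
            return False  # already split here: silent no-op, not an error
        self.boundaries.insert(i, key)
        return True
```

With this behavior, a statement that lists many split points can be retried or can overlap with existing splits without the caller having to filter out "range already split" errors.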

RaduBerinde · 2017-03-19T01:07:49Z

docs/RFCS/sql_split_syntax.md, line 105 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, I think we should just ignore the error. We should probably just make this change in the core: make AdminSplit a no-op when there is already a split at the requested point.

👍

This error is not useful; indeed many callers jump through hoops to ignore it. Extending SPLIT AT to handle multiple split points makes this error even more annoying. Removing this error in favor of a silent no-op (based on discussion in cockroachdb#14146). The backup test is changed to read the meta descriptor instead of relying on the old behavior to verify splits.

RaduBerinde · 2017-03-20T15:20:23Z

Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.

docs/RFCS/sql_split_syntax.md, line 81 at r1 (raw file):

Previously, RaduBerinde wrote…

I think that it is hard to come up with a general syntax that generates split points as SQL values and works for all types; if the commands were key-oriented, that would be simpler.

For a simple case like an integer column, it would be something along the lines of `ALTER TABLE t SPLIT AT (SELECT (i*N/K)::int FROM GENERATE_SERIES(1,K)); ALTER TABLE t SCATTER`.

Are there any thoughts/suggestions on what we can improve in this RFC for this use case?
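To make the arithmetic in the integer-column example concrete, here is a small Python sketch of the values that the `(i*N/K)::int` expression would produce for `i` in 1..K. The function name `int_split_points` is made up for illustration, and floor division is an assumption standing in for the SQL cast, whose rounding may differ when `i*N` is not a multiple of `K`.

```python
def int_split_points(n, k):
    """Evenly spaced split points i*n/k for i in 1..k (floor division)."""
    return [i * n // k for i in range(1, k + 1)]


# Keys up to N=1000 split at K=4 evenly spaced points.
print(int_split_points(1000, 4))  # [250, 500, 750, 1000]
```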

docs/RFCS/sql_split_syntax.md, line 113 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

On the "how do we get new ranges to stick" question, the problem, I guess, is that, when confronted with large numbers of new ranges over a short time period, the replication and lease rebalancing queues can get confused and start doing unstable things because they don't have up-to-date info about the other nodes. One solution, which I think might have been implied here, is that, when we use this functionality for tests and benchmarks, we disable these queues.
Perhaps another option would be to teach the queues to simply ignore a key span (comprising all the new ranges of a table, say) for some configurable number of minutes, until we expect things to stabilize to the point where, if the scatter is even, the queues realize that there's no work for them to do. We could have a new RPC saying "ignore [keyStart->keyEnd) for rebalancing purposes for 5 minutes".

What do folks think about this? Is this something we should explore as part of this RFC?
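For reference, the "ignore a key span for N minutes" idea quoted above could look roughly like the following Python sketch. Everything here is hypothetical (the `IgnoredSpans` class, its method names, and the injectable clock); it only illustrates a half-open span `[start, end)` with an expiry, not how the replication or lease queues are actually structured.

```python
import time


class IgnoredSpans:
    """Toy registry of key spans that rebalancing should skip for a while."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._spans = []  # list of (start_key, end_key, expires_at)

    def ignore(self, start_key, end_key, ttl_seconds):
        """Ask queues to skip [start_key, end_key) for ttl_seconds."""
        self._spans.append((start_key, end_key, self._clock() + ttl_seconds))

    def should_skip(self, key):
        """Return True if `key` falls in a span that has not yet expired."""
        now = self._clock()
        # Drop expired entries so the list does not grow without bound.
        self._spans = [s for s in self._spans if s[2] > now]
        return any(start <= key < end for start, end, _ in self._spans)


spans = IgnoredSpans()
spans.ignore(b"/table/51", b"/table/52", ttl_seconds=5 * 60)
print(spans.should_skip(b"/table/51/1"))
```

A queue would consult `should_skip` before processing a range; once the TTL passes, the span drops out and normal rebalancing resumes.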

`SPLIT AT` now takes an arbitrary select statement. Existing uses must switch to using `VALUES`; e.g. `ALTER TABLE t SPLIT AT (x, y)` becomes `ALTER TABLE t SPLIT AT VALUES (x, y)`. Part of cockroachdb#13665, implements part of RFC cockroachdb#14146.

bdarnell · 2017-03-20T20:49:20Z

Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.

docs/RFCS/sql_split_syntax.md, line 81 at r1 (raw file):

Previously, RaduBerinde wrote…

Are there any thoughts/suggestions on what we can improve in this RFC for this use case?

I was thinking that if there were a good way to specify the distribution of keys (perhaps via a function to generate random keys from the expected distribution), you could pass that function and the desired number of splits. But while that works well for UUIDs, it's not much better than the existing proposal (`SPLIT AT SELECT uuid_v4() FROM generate_series(1, k)`), and it doesn't generalize very well (using the same pattern for `unique_rowid()` wouldn't make much sense). So I think the proposal is fine and we'll just need to provide some cookbook examples for common patterns.
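As a rough illustration of the "key generator plus number of splits" idea, here is a Python sketch; `sample_split_points` is a made-up name and nothing here is CockroachDB API. It draws `k` keys from a caller-supplied generator and sorts them, which is the client-side analogue of selecting `uuid_v4()` over `generate_series(1, k)`.

```python
import uuid


def sample_split_points(key_generator, k):
    """Draw k keys from the expected key distribution and sort them."""
    return sorted(key_generator() for _ in range(k))


# Random UUIDs are uniformly distributed, so the sampled keys spread
# roughly evenly over the key space.
points = sample_split_points(uuid.uuid4, 8)
for p in points:
    print(p)
```

The same helper with a sequential generator would put every split point near the current end of the key space, which is why the pattern does not carry over to something like `unique_rowid()`.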

docs/RFCS/sql_split_syntax.md, line 113 at r1 (raw file):

Previously, RaduBerinde wrote…

What do folks think about this? Is this something we should explore as part of this RFC?

I think that for a random scattering of nodes, we don't care how long the changes stick because the user has no expectations about the end result (and there is no pressure that's going to force these ranges to get bunched back together once they've moved). For testing-only direct control, I think we should probably just disable those queues for those tests. I don't think there's a need at this point to give fine-grained control to disable the queues for certain ranges or for certain time limits.

RaduBerinde · 2017-03-20T21:15:09Z

Thanks!

RaduBerinde · 2017-03-22T13:17:38Z

docs/RFCS/sql_split_syntax.md, line 113 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think that for a random scattering of nodes, we don't care how long the changes stick because the user has no expectations about the end result (and there is no pressure that's going to force these ranges to get bunched back together once they've moved). For testing-only direct control, I think we should probably just disable those queues for those tests. I don't think there's a need at this point to give fine-grained control to disable the queues for certain ranges or for certain time limits.

I had switched the syntax to take a list of store IDs but I guess I made the assumption that store IDs are unique across hosts? Is that the case? If not, what would be the correct format? A list of "nodeID:storeID" pairs?

RaduBerinde · 2017-03-22T13:27:34Z

docs/RFCS/sql_split_syntax.md, line 113 at r1 (raw file):

Previously, RaduBerinde wrote…

I had switched the syntax to take a list of store IDs but I guess I made the assumption that store IDs are unique across hosts? Is that the case? If not, what would be the correct format? A list of "nodeID:storeID" pairs?

Ah, they are, never mind.

RaduBerinde requested review from danhhz, andreimatei, knz, asubiotto, bdarnell and petermattis March 14, 2017 21:30

knz reviewed Mar 14, 2017

spencerkimball reviewed Mar 15, 2017

RaduBerinde force-pushed the rfc-split branch 6 times, most recently from f140902 to 0298a84 Compare March 16, 2017 19:43

RaduBerinde force-pushed the rfc-split branch 3 times, most recently from d82d286 to 50a7374 Compare March 17, 2017 21:00

RaduBerinde force-pushed the rfc-split branch from 50a7374 to c16fa7f Compare March 19, 2017 00:53

rfc: SQL syntax for splitting and relocating ranges

98b9d18

RaduBerinde force-pushed the rfc-split branch from c16fa7f to 98b9d18 Compare March 19, 2017 00:56

RaduBerinde mentioned this pull request Mar 20, 2017

sql, storage: tolerate existing splits in AdminSplit and SPLIT AT #14273

Merged

RaduBerinde mentioned this pull request Mar 20, 2017

sql: change SPLIT AT to take a select statement #14281

Merged

RaduBerinde merged commit 6e580a1 into cockroachdb:master Mar 20, 2017

RaduBerinde deleted the rfc-split branch March 20, 2017 21:45

RaduBerinde mentioned this pull request Mar 22, 2017

distsql: testing infrastructure #13665

Closed

petermattis mentioned this pull request Mar 29, 2017

storage: mechanism for more aggressive rebalancing after splits #10967

Closed

RaduBerinde changed the title ~~rfc: SQL syntax for splitting a relocating ranges~~ rfc: SQL syntax for splitting and relocating ranges Mar 29, 2017

RaduBerinde mentioned this pull request Apr 11, 2017

sql: basic SCATTER implementation #14796

Merged
