-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add domiciling doc with PLACEMENT RESTRICTED #12845
Add domiciling doc with PLACEMENT RESTRICTED #12845
Conversation
Files changed:
|
✅ Netlify Preview
To edit notification comments on pull requests, go to your Netlify site settings. |
476dae3
to
539854a
Compare
60fc000
to
738472e
Compare
Hey all, Since there are multiple POVs around multiregion, domiciling, If you disagree with my reasoning for adding and/or just don't
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 11 of 11 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @knz, @RichardJCai, @rmloveland, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):
Using CockroachDB as part of your approach to data domiciling has several limitations: - Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes.
Is this usage of "data that is indexed" sufficiently clear? Might this be read as forbidding all indexes on a table that must be domiciled? I would say "columns that are indexed" instead of "data" to make it clear that we're really making a column-level statement here.
"The replicas underlying the index" are domiciled as expected. What is distributed across the cluster is metadata about the index including a sample of values from the index.
Whether or not this is a problem for domiciling is going to depend on the specific law/regulations involved. This is the only place in this set of docs where we "recommend" something; everywhere else we just present the facts of what happens. Do we want to make this recommendation here?
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
- Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes. - Some metadata about the objects in a schema may be stored in [system ranges](architecture/distribution-layer.html#meta-ranges), system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism. This might result in potential "leakage" outside of the desired domicile if your schema includes primary keys, table names, etc., that may reveal information about their contents.
This feels like it's mainly redundant with the previous warning, except for the comment about table names. As I said in DOC-1242, I don't understand what this is trying to say when it comes to table names.
v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):
{% include_cached copy-clipboard.html %} ~~~ sql show range from table users for row ('europe-west1', 'amsterdam', 'ae147ae1-47ae-4800-8000-000000000022');
Digging down to the location of a single row seems like an unnecessarily complex journey. It seems like the main flow of the document should be to identify the database with constraint violations and then apply PLACEMENT RESTRICTED.
Finding the location of a single row may be an interesting exercise (to me it's interesting mainly to see how poor the visibility is for all of this - verifying that the violating replicas are non-voting is especially opaque), but is this something users really want to learn how to do? It seems like it should be something like an appendix if it's included at all.
v21.2/data-domiciling-with-separate-databases.md, line 2 at r1 (raw file):
--- title: Data Domiciling with Separate Databases
Is this page worth keeping? When would someone use this instead of PLACEMENT RESTRICTED?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM.
I do wish we had an easier way to verify our data domiciling solutions without having to do a SHOW RANGES FOR ROW
query
This page has instructions for data domiciling in [multi-region clusters](multiregion-overview.html) using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. At a high level, this process involves: | ||
|
||
1. Controlling the placement of specific row or table data using regional or global tables - specifically, the [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables), [`REGIONAL BY TABLE`](multiregion-overview.html#regional-tables), and [`GLOBAL`](multiregion-overview.html#global-tables) table localities. | ||
1. Further restricting where the data in those regional tables is stored using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement, which constrains the replicas for a row or table to be stored in only the [home regions](set-locality.html#crdb_region) associated with those rows or tables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think the set locality portion doesn't exactly explain what a home region is either. Is there an appendix or something that specifies what exactly a home region is?
|
||
### Step 1. Start a simulated multi-region cluster | ||
|
||
{% include {{page.version.version}}/sql/start-a-multi-region-demo-cluster.md %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: cockroach demo --global --nodes 9 --no-example-database --insecure
may be slow right now due to cockroachdb/cockroach#76305, should we also include the --multitenant=false
flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 11 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):
Using CockroachDB as part of your approach to data domiciling has several limitations: - Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes.
Is this really true? At least for RBR tables, we partition indexes in the same way as we partition the base table (and apply the same zone configs). The end result is that the data should be domiciled properly (provided PLACEMENT RESTRICTED
is used).
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
- Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes. - Some metadata about the objects in a schema may be stored in [system ranges](architecture/distribution-layer.html#meta-ranges), system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism. This might result in potential "leakage" outside of the desired domicile if your schema includes primary keys, table names, etc., that may reveal information about their contents.
Nit: Instead of "that may reveal information about their contents", does it make more sense to say "that may reveal sensitive information" or "that may reveal personally identifiable information".
_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):
- Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes. - Some metadata about the objects in a schema may be stored in [system ranges](architecture/distribution-layer.html#meta-ranges), system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism. This might result in potential "leakage" outside of the desired domicile if your schema includes primary keys, table names, etc., that may reveal information about their contents. - [Zone configs](configure-replication-zones.html) can be used for data placement but these features were historically built for performance, not for domiciling. Further, the way CockroachDB performs replica placement is currently heuristic-based, so the zone config settings may not be obeyed at all times.
I'm not certain of this, but I worry that this may be an overstatement. I thought that if you specify hard constraints, that if a zone config can't be satisfied, the system won't place a replica that violates the constraints.
_includes/v21.2/sql/data-domiciling-limitations.md, line 7 at r1 (raw file):
- [Zone configs](configure-replication-zones.html) can be used for data placement but these features were historically built for performance, not for domiciling. Further, the way CockroachDB performs replica placement is currently heuristic-based, so the zone config settings may not be obeyed at all times. - Make sure your [log files](logging-overview.html) are stored according to the legal requirements of the jurisdictions you're operating in. For example, it may be easiest to store your log files for a multi-region cluster in the locality with the most restrictive subset of regulations. - If you start a node with a [`--locality`](cockroach-start.html#locality) flag that says the node is in region _A_, but the node is actually running in some region _B_, this approach will not work. A CockroachDB node only knows its locality based on the text supplied to the `--locality` flag; it can not ensure that it is actually running in that physical location.
The this in "this approach" is a bit ambiguous. I think you mean "data domiciling based on the inferred node placement", or something like that.
_includes/v21.2/sql/multiregion-movr-global.md, line 8 at r1 (raw file):
~~~ Next, alter the `user_promo_codes` table to have a foreign key into the `promo_codes` table. This step is necessary to modify the MovR schema design to take full advantage of the multi-region features in v21.1+.
I find the second sentence here a bit confusing. How does this allow users to take full advantage of the new abstractions?
Also, Nit: I'd drop the 21.1+ from this sentence, as I'm assuming this will only show up in future versions of the docs.
_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 1 at r1 (raw file):
All of the tables except `promo_codes` are geographically specific, and updated very frequently. For these tables, the right table locality for optimizing access to their data is [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables).
Nit: are the tables geographically specific? Or do the tables contain rows which are partitioned by geo (or by region)?
_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 21 at r1 (raw file):
END ) STORED; ALTER TABLE rides ALTER COLUMN REGION SET NOT NULL;
Nit: for bonus points you could make this column hidden. Same comment for the rest of the tables.
v21.2/data-domiciling.md, line 18 at r1 (raw file):
- [Data Domiciling with `PLACEMENT RESTRICTED`](data-domiciling-with-placement-restricted.html) - [Data Domiciling with Separate Databases](data-domiciling-with-separate-databases.html) - [Install CockroachDB](install-cockroachdb.html)
Nit: This one seems out of place to me. Is there a reason you added install in here?
v21.2/data-domiciling-with-placement-restricted.md, line 13 at r1 (raw file):
This page has instructions for data domiciling in [multi-region clusters](multiregion-overview.html) using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. At a high level, this process involves: 1. Controlling the placement of specific row or table data using regional or global tables - specifically, the [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables), [`REGIONAL BY TABLE`](multiregion-overview.html#regional-tables), and [`GLOBAL`](multiregion-overview.html#global-tables) table localities.
Isn't it the two REGIONAL table types which control data placement? I'd think that with GLOBAL
tables, data placement is not controlled.
v21.2/data-domiciling-with-placement-restricted.md, line 14 at r1 (raw file):
Previously, RichardJCai (Richard Cai) wrote…
nit: I think the set locality portion doesn't exactly explain what a home region is either. Is there an appendix or something that specifies what exactly a home region is?
Nit: should this be "partition or table" instead of "row or table"?
v21.2/data-domiciling-with-placement-restricted.md, line 27 at r1 (raw file):
## Example In the following example, you will go through the process of configuring the [MovR](movr.html) data set using [multi-region SQL statements](multiregion-overview.html). Then, as part of implementing a data domiciling strategy, you will apply stricter-than-default replica placement settings using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. Finally, you will verify that the resulting replica placements are as expected using a combination of [replication reports](query-replication-reports.html), the [`SHOW RANGE FOR ROW`](show-range-for-row.html) statement, and queries against the [`crdb_internal.ranges_no_leases`](crdb-internal.html#ranges_no_leases) table.
Nit: stricter-than-default seems a bit awkward to me. Any chance you considered "restricted replica settings"?
v21.2/data-domiciling-with-placement-restricted.md, line 29 at r1 (raw file):
In the following example, you will go through the process of configuring the [MovR](movr.html) data set using [multi-region SQL statements](multiregion-overview.html). Then, as part of implementing a data domiciling strategy, you will apply stricter-than-default replica placement settings using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. Finally, you will verify that the resulting replica placements are as expected using a combination of [replication reports](query-replication-reports.html), the [`SHOW RANGE FOR ROW`](show-range-for-row.html) statement, and queries against the [`crdb_internal.ranges_no_leases`](crdb-internal.html#ranges_no_leases) table. For the purposes of this example, the data domiciling requirement is to configure a multi-region deployment of the [MovR database](movr.html) such that data for EMEA-based users, vehicles, etc. is being stored on CockroachDB nodes running in EMEA localities.
Nit: EMEA, or EU?
v21.2/data-domiciling-with-placement-restricted.md, line 33 at r1 (raw file):
Previously, RichardJCai (Richard Cai) wrote…
nit:
cockroach demo --global --nodes 9 --no-example-database --insecure
may be slow right now due to cockroachdb/cockroach#76305, should we also include the--multitenant=false
flag?
I doubt we'll go out with multi-tenant demo as the default. We'll probably revert it pretty soon.
v21.2/data-domiciling-with-placement-restricted.md, line 76 at r1 (raw file):
Next, run a [replication report](query-replication-reports.html) to see which ranges are still not in compliance with your desired domiciling: that data on EMEA-based entities (users, etc.) does not leave EMEA-based nodes. With the default settings, you should expect some replicas in the cluster to be violating this constraint. This is because [non-voting replicas](architecture/replication-layer.html#non-voting-replicas) are enabled by default in [multi-region clusters](multiregion-overview.html) to enable stale reads of data in [regional tables](regional-tables.html) from outside those tables' [home regions](set-locality.html#crdb_region). For many use cases, this ispreferred, but it keeps you from meeting the domiciling requirements for this example.
Why do non-voters prevent us from satisfying the constraints?
Nit: is preferred
v21.2/data-domiciling-with-placement-restricted.md, line 91 at r1 (raw file):
~~~ Based on this output, you can see that plenty of replicas are out of compliance (see the `violating_ranges` column) for the reason described above: the presence of non-voting replicas in other regions to enable fast stale reads from those regions.
Is this just a temporary statement because the rebalancer hasn't completed its work yet?
v21.2/data-domiciling-with-placement-restricted.md, line 141 at r1 (raw file):
~~~ This output shows that the `movr` database has ranges out of compliance, which you saw previously. Unfortunately, this output does not contain the table or index names due to a limitation of the replication reports: non-voting replicas are not associated with any tables or indexes by the reports. For more information, see [cockroachdb/cockroach#75821](https://github.com/cockroachdb/cockroach/issues/75821).
Nit: I don't necessarily have a solution for this problem, but referencing issues in our documentation seems a bit messy to me. Is there some way around us airing our dirty laundry here? Maybe just remove this paragraph?
v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Digging down to the location of a single row seems like an unnecessarily complex journey. It seems like the main flow of the document should be to identify the database with constraint violations and then apply PLACEMENT RESTRICTED.
Finding the location of a single row may be an interesting exercise (to me it's interesting mainly to see how poor the visibility is for all of this - verifying that the violating replicas are non-voting is especially opaque), but is this something users really want to learn how to do? It seems like it should be something like an appendix if it's included at all.
+1
v21.2/data-domiciling-with-separate-databases.md, line 54 at r1 (raw file):
You will need to make sure that user data associated with EU users is only added to the `eu_users` database. How exactly you will accomplish that is beyond the scope of this document, but you will likely need to add some logic to your application and/or to your load balancing infrastructure to make sure that when your application code is [inserting](insert.html) or [updating](update.html) EU user data, the data only ever hits the `eu_users` database. For example, you can set the target database in your [connection string](connection-parameters.html). For example:
Nit: You say that accomplishing this is beyond the scope of this document, and then go onto describe some of the steps that are required 😄
Maybe here you should say here "The complete steps to ensure this are beyond the scope of this document..."
v21.2/data-domiciling-with-separate-databases.md, line 77 at r1 (raw file):
{% include {{page.version.version}}/sql/data-domiciling-limitations.md %} - Cross-region writes are slower than intra-region writes. This may be an issue depending on your application's performance needs, since following the advice above would result in having different databases' data stored in different regions.
Hmm, is this really an artifact of the separate databases approach? Or is it relevant with any MR setup?
I'd also think that we may want to talk about the cross database restrictions (like having to enable cross-db FKs and views manually)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took me so long to get to this. Looks good overall. Some minor comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the reviews Ben, Richard, and Adam! I have revised stuff and also replied to some comments in a few areas. PTAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):
Updated to reference columns being indexed, and correct the claims about what is distributed across the cluster, and to remove the recommendation.
Data domiciling does not work on columns that are indexed, since metadata about the index is distributed across the nodes in the cluster, including a sample of values from the index.
The recommendation in particular was based on notes from a convo I had with Daniel Holt about domiciling stuff, but I may have misheard / misunderstood him, and also that would probably not make it through the legal review that is coming after this one :-)
_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Is this really true? At least for RBR tables, we partition indexes in the same way as we partition the base table (and apply the same zone configs). The end result is that the data should be domiciled properly (provided
PLACEMENT RESTRICTED
is used).
(I assume your question here is for Ben)
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
Agree it's a bit redundant but I would argue that some repetition could be helpful since an average reader who's not a CRDB engineer might not deduce all of these consequences from the statement above. Also it lets us link to some specific features/docs that folks who care about domiciling should be familiar with
I removed the sentence about table names because I don't fully understand what it was trying to say either - and it may be superseded by the note about specifying where log files are stored, since that was something else Daniel Holt mentioned in a conversation that I think (?) takes care of any concerns in this approximate area
Resulting edit is
- Additional metadata about the objects in a schema may be stored in system ranges, system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the multi-region SQL statements, or the low-level zone configs mechanism.
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: Instead of "that may reveal information about their contents", does it make more sense to say "that may reveal sensitive information" or "that may reveal personally identifiable information".
Maybe, but if we stick with the removal of this sentence altogether based on Ben's feedback above, do you agree with what I wrote above that this is basically covered by storing one's log files carefully?
_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
I'm not certain of this, but I worry that this may be an overstatement. I thought that if you specify hard constraints, that if a zone config can't be satisfied, the system won't place a replica that violates the constraints.
I definitely don't know the answer to this, but it's basically what @knz wrote in DOC-1242 so I added it. From a docs POV it is quite useful as context so the reader knows these features were def not built ground-up with these use cases in mind, so there is more work to do here / this should only be part of the solution
_includes/v21.2/sql/data-domiciling-limitations.md, line 7 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
The this in "this approach" is a bit ambiguous. I think you mean "data domiciling based on the inferred node placement", or something like that.
Thank you! that is a superior formulation, went with it!
_includes/v21.2/sql/multiregion-movr-global.md, line 8 at r1 (raw file):
Agreed - this is copypasta from the original docs broken out into this new "include" file, so I didn't edit it
Updated to the following here and in v21.1 docs - what do you think?
Next, alter the
user_promo_codes
table to have a foreign key into the globalpromo_codes
table. This will enable fast reads of thepromo_codes.code
column from any region in the cluster.
_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 1 at r1 (raw file):
Aha! this is also copypasta from the original doc and we should update it
Updated to the below here and in v21.1 docs
All of the tables except
promo_codes
contain rows which are partitioned by region ...
_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 21 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: for bonus points you could make this column hidden. Same comment for the rest of the tables.
I will defer since I don't wanna re-test all this right now - filed https://cockroachlabs.atlassian.net/browse/DOC-2660
(unfortunately we don't have automatic testing of the SQL in our docs - see #7067)
v21.2/data-domiciling.md, line 18 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: This one seems out of place to me. Is there a reason you added install in here?
IDK, probably some neurons misfiring - removed it!
v21.2/data-domiciling-with-placement-restricted.md, line 13 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Isn't it the two REGIONAL table types which control data placement? I'd think that with
GLOBAL
tables, data placement is not controlled.
Indeed - removed the mention of global tables from this bullet
v21.2/data-domiciling-with-placement-restricted.md, line 14 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: should this be "partition or table" instead of "row or table"?
Thanks @RichardJCai , filed https://cockroachlabs.atlassian.net/browse/DOC-2661 to address this shortcoming. It's sort of scattered around on different pages and could be way more explicit
Adam - thanks, updated to "partition or table"
v21.2/data-domiciling-with-placement-restricted.md, line 27 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: stricter-than-default seems a bit awkward to me. Any chance you considered "restricted replica settings"?
Ya that is awkward - updated to use your suggested term, thanks
v21.2/data-domiciling-with-placement-restricted.md, line 29 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: EMEA, or EU?
I went with EMEA because when I spoke to Daniel Holt he said EMEA a lot? So I kinda had that in my notes. But yeah I think EU specifically is what is meant here - went ahead and updated all uses to EU
v21.2/data-domiciling-with-placement-restricted.md, line 33 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
I doubt we'll go out with multi-tenant demo as the default. We'll probably revert it pretty soon.
Thanks for the clarification. UX-wise I would not expect as a normie to have to care about what CRDB is doing internally w.r.t multi-tenancy when I'm just doing muh cockroach demo
etc so I'm glad that default will change.
But I will admit to some sadness that I will ever have to care about that flag at all (not as an employee who wants/needs to support our m-t. strategy but as in this instance an avg user of the binary who is not doing a fancy multi-tenant deployment). But I will save my moaning for https://cockroachlabs.atlassian.net/browse/DOC-1181 :-)
v21.2/data-domiciling-with-placement-restricted.md, line 76 at r1 (raw file):
Fixed the nit thanks!
Why do non-voters prevent us from satisfying the constraints?
Since above we say
For the purposes of this example, the data domiciling requirement is to configure a multi-region deployment of the MovR database such that data for EU-based users, vehicles, etc. is being stored on CockroachDB nodes running in EU localities.
my understanding is that the non-voters are stored on non-EU nodes to enable stale reads even though from my user POV I have said "store these rows in the EU" via REGIONAL BY ROW, which is causing me to not meet that requirement
... all of which will make me sad and/or cost me $ potentially?
Sorry if I'm misunderstanding though!
Or are you saying we should rephrase this to use a different term than "satisfying the constraints"? we could also re-use the term from the goal statement and say "prevents us from meeting the requirement"
Or are you saying we should explicate more by adding "... since the non-voters are stored on non-EU nodes to enable fast stale reads" or something similar?
v21.2/data-domiciling-with-placement-restricted.md, line 91 at r1 (raw file):
Apparently not: I tested this by running everything up to this point and then going to lunch. When I came back I ran this. Output is ~the same. I really think it's the non-voters (which seems to be supported by the replica_localities
in the queries further below)
> SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;
zone_id | subzone_id | type | config | report_id | violation_start | violating_ranges
----------+------------+------------+------------------------+-----------+-------------------------------+-------------------
52 | 0 | constraint | +region=europe-west1:1 | 1 | 2022-02-16 17:42:20.343827+00 | 16
52 | 0 | constraint | +region=us-east1:1 | 1 | 2022-02-16 17:40:20.338341+00 | 72
52 | 0 | constraint | +region=us-west1:1 | 1 | 2022-02-16 17:40:20.338341+00 | 72
(3 rows)
v21.2/data-domiciling-with-placement-restricted.md, line 141 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: I don't necessarily have a solution for this problem, but referencing issues in our documentation seems a bit messy to me. Is there some way around us airing our dirty laundry here? Maybe just remove this paragraph?
I don't really like it either but I think our docs need to at least address some potential user surprise by explaining even at a high level why some of these features are not super tightly woven together yet (despite our best efforts intentions etc). I initially added the issue link because we sometimes add issue links to things that are "known limitations" a la http://www.cockroachlabs.com/docs/dev/known-limitations.html
During testing/writing, I was def struck by the fact that this query returned NULLs here, so I think we need to at least say something about it and thus hesitant to remove the paragraph. I removed the issue link though! What do you think?
v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
+1
I agree it's complex but I really believe we need to show how to verify that the DB is doing what we've just asked it to do, otherwise I think we're only taking folks on half the journey. Especially since the replication reports are still experimental and haven't been updated with all the non-voting replica stuff (for example see cockroachdb/cockroach#69335 and cockroachdb/cockroach#75821)
So if I'm on the line at my job for implementing domiciling and ensuring compliance and thus possibly $$$ / legal ramifications I don't really fully trust just those reports
and the first thing I wanted to do after typing PLACEMENT RESTRICTED and checking the reports was to find out "how do I additionally verify this thing is actually putting the stuff where it's supposed to, given that these reports are experimental and maybe not getting updated etc?"
esp since some laws talk about fines per violation (IANAL etc) - to me that means I really do need to know about how to check any given row at any time, whether for some kind of finer-grained reporting, or just in case someone asks "how do you really know where this data is?"
that's why I went to SHOW RANGE FOR ROW - but that's also experimental
then when I was asking in #kv about how to verify the replica placement beyond the experimental reports and SHOW RANGE] @aayushshah15 recommended doing the crdb_internal
query in addition to the above
This feels pretty belt and suspenders and overall not great but in my head the thinking was "as a user, I'm on the line for this, these reports and the SHOW RANGE are experimental and not necessarily updated/super supported so I need to really make sure about this replica placement using whatever I can find" - and these were the tools that I could find
v21.2/data-domiciling-with-separate-databases.md, line 2 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Is this page worth keeping? When would someone use this instead of PLACEMENT RESTRICTED?
That's a good question. Maybe not? I don't know the answer to why someone would use this instead. Was hesitant to remove since it is an option that exists and is already documented and was vetted by Legal team, so I figured it might be worth keeping around without changing it in case it's needed
I think removing it is also fine tho - @vy-ton do you have an opinion about whether we should keep/remove this? I erred on side of keeping since we already had it as an option and it was reviewed etc. It may be the "bad" option but it's an option
v21.2/data-domiciling-with-separate-databases.md, line 54 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: You say that accomplishing this is beyond the scope of this document, and then go onto describe some of the steps that are required 😄
Maybe here you should say here "The complete steps to ensure this are beyond the scope of this document..."
LOL - updated to use your suggested phrase
v21.2/data-domiciling-with-separate-databases.md, line 77 at r1 (raw file):
Hmm, is this really an artifact of the separate databases approach? Or is it relevant with any MR setup?
If the latter I can move it back to the general limitations. Though I think maybe it makes sense to remove this since as you say it's just kind of generally true about MR things. I went with removal since it doesn't seem super domiciling-specific - any objection?
I'd also think that we may want to talk about the cross database restrictions (like having to enable cross-db FKs and views manually)
Ideally but I think I'd rather not add much to this text since it's already had legal vetting and/or we may remove it now (see above convo)
FWIW we do have that info in our FK docs for if/when folks hit the cross-db issue (again not ideal but I'm trying not to touch this one hardly at all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comments. I think I'm mostly in a good place right now. Just a couple of remaining comments open.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
(I assume your question here is for Ben)
Sorry, my question was written before Ben's comments, but only committed after his. I don't think we need to keep sensitive data outside of indexes, as they should be domiciled correctly, at least when using RBR tables.
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Maybe, but if we stick with the removal of this sentence altogether based on Ben's feedback above, do you agree with what I wrote above that this is basically covered by storing one's log files carefully?
Removing the sentence works for me.
Where do you talk about storing the logs carefully? I can't seem to locate that part.
_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
I definitely don't know the answer to this, but it's basically what @knz wrote in DOC-1242 so I added it. From a docs POV it is quite useful as context so the reader knows these features were def not built ground-up with these use cases in mind, so there is more work to do here / this should only be part of the solution
I think saying that "the zone config settings may not be obeyed at all times" is still an overstatement. It may be better to say that "zone config settings may leave some decision making capacity to the system, and it may not place replicas according to your domiciling needs" (or something like that). Then, we can link to this for more detail: https://www.cockroachlabs.com/docs/v21.2/configure-replication-zones#types-of-constraints.
_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 21 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
I will defer since I don't wanna re-test all this right now - filed https://cockroachlabs.atlassian.net/browse/DOC-2660
(unfortunately we don't have automatic testing of the SQL in our docs - see #7067)
OK, that's fair.
v21.2/data-domiciling.md, line 18 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
IDK, probably some neurons misfiring - removed it!
:-)
v21.2/data-domiciling-with-placement-restricted.md, line 29 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
I went with EMEA because when I spoke to Daniel Holt he said EMEA a lot? So I kinda had that in my notes. But yeah I think EU specifically is what is meant here - went ahead and updated all uses to EU
SGTM. I know that the terms are used interchangeably a lot, but if we're dealing with only EU regions, better to be specific. Thanks!
v21.2/data-domiciling-with-placement-restricted.md, line 33 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Thanks for the clarification. UX-wise I would not expect as a normie to have to care about what CRDB is doing internally w.r.t multi-tenancy when I'm just doing muh
cockroach demo
etc so I'm glad that default will change.But I will admit to some sadness that I will ever have to care about that flag at all (not as an employee who wants/needs to support our m-t. strategy but as in this instance an avg user of the binary who is not doing a fancy multi-tenant deployment). But I will save my moaning for https://cockroachlabs.atlassian.net/browse/DOC-1181 :-)
The average users shouldn't have to worry about it at all. We'll revert the default to false shortly, and then it'll be of no interest to most customers. Eventually, we'll change the default to true, and the hope is that customers will never have to disable it.
v21.2/data-domiciling-with-placement-restricted.md, line 76 at r1 (raw file):
Sorry, wasn't clear. I was saying this:
Or are you saying we should rephrase this to use a different term than "satisfying the constraints"? we could also re-use the term from the goal statement and say "prevents us from meeting the requirement"
I see it's been addressed above. Thanks.
v21.2/data-domiciling-with-placement-restricted.md, line 91 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Apparently not: I tested this by running everything up to this point and then going to lunch. When I came back I ran this. Output is ~the same. I really think it's the non-voters (which seems to be supported by the
replica_localities
in the queries further below)> SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0; zone_id | subzone_id | type | config | report_id | violation_start | violating_ranges ----------+------------+------------+------------------------+-----------+-------------------------------+------------------- 52 | 0 | constraint | +region=europe-west1:1 | 1 | 2022-02-16 17:42:20.343827+00 | 16 52 | 0 | constraint | +region=us-east1:1 | 1 | 2022-02-16 17:40:20.338341+00 | 72 52 | 0 | constraint | +region=us-west1:1 | 1 | 2022-02-16 17:40:20.338341+00 | 72 (3 rows)
It strikes me as odd that the non-voters would cause violations. We might need someone from KV to confirm that this is expected. Maybe @aayushshah15?
v21.2/data-domiciling-with-placement-restricted.md, line 141 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
I don't really like it either but I think our docs need to at least address some potential user surprise by explaining even at a high level why some of these features are not super tightly woven together yet (despite our best efforts intentions etc). I initially added the issue link because we sometimes add issue links to things that are "known limitations" a la http://www.cockroachlabs.com/docs/dev/known-limitations.html
During testing/writing, I was def struck by the fact that this query returned NULLs here, so I think we need to at least say something about it and thus hesitant to remove the paragraph. I removed the issue link though! What do you think?
Sure, this looks better. Might want to change to "due to a current limitation of the replication reports".
v21.2/data-domiciling-with-separate-databases.md, line 54 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
LOL - updated to use your suggested phrase
LGTM.
v21.2/data-domiciling-with-separate-databases.md, line 77 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Hmm, is this really an artifact of the separate databases approach? Or is it relevant with any MR setup?
If the latter I can move it back to the general limitations. Though I think maybe it makes sense to remove this since as you say it's just kind of generally true about MR things. I went with removal since it doesn't seem super domiciling-specific - any objection?
I'd also think that we may want to talk about the cross database restrictions (like having to enable cross-db FKs and views manually)
Ideally but I think I'd rather not add much to this text since it's already had legal vetting and/or we may remove it now (see above convo)
FWIW we do have that info in our FK docs for if/when folks hit the cross-db issue (again not ideal but I'm trying not to touch this one hardly at all)
For simplicity, I'm fine removing and leaning on the existing documentation for the limitations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again Adam! I updated everything I could based on your latest feedback
Think we're still waiting on the one question about the non-voters' status w.r.t. replication reports (FWIW it makes sense to me as a user that they would be considered violating within that context, but I defer to y'all on the technical matters/design)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Sorry, my question was written before Ben's comments, but only committed after his. I don't think we need to keep sensitive data outside of indexes, as they should be domiciled correctly, at least when using RBR tables.
OK thanks for clarifying - I've removed that bullet
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Removing the sentence works for me.
Where do you talk about storing the logs carefully? I can't seem to locate that part.
Cool!
Re: storing logs, a couple bullets down we say "make sure your log files are stored according to the legal requirements of ..."
_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
I think saying that "the zone config settings may not be obeyed at all times" is still an overstatement. It may be better to say that "zone config settings may leave some decision making capacity to the system, and it may not place replicas according to your domiciling needs" (or something like that). Then, we can link to this for more detail: https://www.cockroachlabs.com/docs/v21.2/configure-replication-zones#types-of-constraints.
Thanks for clarifying. I have updated to use that phrasing, and added the link to the zone config docs
v21.2/data-domiciling-with-placement-restricted.md, line 29 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
SGTM. I know that the terms are used interchangeably a lot, but if we're dealing with only EU regions, better to be specific. Thanks!
Thanks for pointing it out - I don't have a lot of expertise in such terms, so agree re: specificity
v21.2/data-domiciling-with-placement-restricted.md, line 33 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
The average users shouldn't have to worry about it at all. We'll revert the default to false shortly, and then it'll be of no interest to most customers. Eventually, we'll change the default to true, and the hope is that customers will never have to disable it.
👍
v21.2/data-domiciling-with-placement-restricted.md, line 76 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Sorry, wasn't clear. I was saying this:
Or are you saying we should rephrase this to use a different term than "satisfying the constraints"? we could also re-use the term from the goal statement and say "prevents us from meeting the requirement"
I see it's been addressed above. Thanks.
Oh ok - thanks for clarifying!
v21.2/data-domiciling-with-placement-restricted.md, line 141 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Sure, this looks better. Might want to change to "due to a current limitation of the replication reports".
Updated to say "current limitation"
v21.2/data-domiciling-with-separate-databases.md, line 77 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
For simplicity, I'm fine removing and leaning on the existing documentation for the limitations.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there :-)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Cool!
Re: storing logs, a couple bullets down we say "make sure your log files are stored according to the legal requirements of ..."
I think it might still be a good idea to have a parenthetical in the first bullet where we give some examples of what attributes of a schema are stored in metadata (table and index names, PK info, etc). Aside from that, I'm good here.
_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Thanks for clarifying. I have updated to use that phrasing, and added the link to the zone config docs
LGTM. One nit: I'd change "...so the zone config setting may leave" to "...and the zone config setting may leave". The fact that ZCs are heuristic based isn't the reason that the domiciling requirements might not be met, it's because ZCs can either be fully restricted or partially restricted. The latter leaves the door open for the system to make some decisions on its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, made the couple of requested tweaks :-)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
I think it might still be a good idea to have a parenthetical in the first bullet where we give some examples of what attributes of a schema are stored in metadata (table and index names, PK info, etc). Aside from that, I'm good here.
Added back "For example, this metadata may include table and index names and primary key information"
_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
LGTM. One nit: I'd change "...so the zone config setting may leave" to "...and the zone config setting may leave". The fact that ZCs are heuristic based isn't the reason that the domiciling requirements might not be met, it's because ZCs can either be fully restricted or partially restricted. The latter leaves the door open for the system to make some decisions on its own.
Updated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still one partially unresolved issue about how to present the details in step 3. You might need to get a follow-up from @bdarnell on his thoughts there.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
v21.2/data-domiciling-with-placement-restricted.md, line 97 at r4 (raw file):
{{site.data.alerts.end}} Next, run the query suggested in [the Replication Reports documentation](query-replication-reports.html#find-out-which-of-your-tables-have-a-constraint-violation) that should show which database and table names contain the `violating_ranges`.
In regards to Ben's comment below, I'm wondering if for the sake of this example, we want to this part (and everything that follows) in step 3 out to a separate document? That way we'd have the details there if, like you say, customers are required to validate replica placement, but it's not cluttering up the example for those that aren't as paranoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Agree it's a bit redundant but I would argue that some repetition could be helpful since an average reader who's not a CRDB engineer might not deduce all of these consequences from the statement above. Also it lets us link to some specific features/docs that folks who care about domiciling should be familiar with
I removed the sentence about table names because I don't fully understand what it was trying to say either - and it may be superseded by the note about specifying where log files are stored, since that was something else Daniel Holt mentioned in a conversation that I think (?) takes care of any concerns in this approximate area
Resulting edit is
- Additional metadata about the objects in a schema may be stored in system ranges, system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the multi-region SQL statements, or the low-level zone configs mechanism.
In r4 I still see "this metadata may include table and index names".
I suspect that all the talk about table names is about some confusion that we have internally about the difference between personally-identifiable information and the broader category of "user information". For example, if Movr is a CockroachCloud customer and they're planning to launch a new food delivery app, they'll start to have tables named things like "restaurants". That's sensitive confidential information that could be revealed by a simple listing of table names, so for our CockroachCloud operations we must treat such metadata as private, hide it from our SREs to the extent possible, etc. But the table names alone are not PII, and they wouldn't generally trigger data domiciling laws like GDPR (nor does most metadata)
What's important for domiciling purposes is the data itself, not the metadata, and this is insufficiently clear in your "resulting edit". We need to be clear that a subset (or a sample) of data from indexed columns may appear in meta ranges or other system tables.
_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r4 (raw file):
- Additional metadata about the objects in a schema may be stored in [system ranges](architecture/distribution-layer.html#meta-ranges), system tables, etc. For example, this metadata may include table and index names, and primary key information. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism. - [Zone configs](configure-replication-zones.html) can be used for data placement but these features were historically built for performance, not for domiciling. Further, the way CockroachDB performs replica placement is currently heuristic-based, and the zone config settings may leave some decision making capacity to the system, and it may not place replicas according to your domiciling needs. For more information, see [Configure Replication Zones](configure-replication-zones.html#types-of-constraints). - Make sure your [log files](logging-overview.html) are stored according to the legal requirements of the jurisdictions you're operating in. For example, it may be easiest to store your log files for a multi-region cluster in the locality with the most restrictive subset of regulations.
If there is a "most restrictive subset", you'd probably just put all your data there, not just logs. The premise of this domiciling functionality is that the various domiciling laws are mutually exclusive - european users' data must be in europe, users from region X must be in that region, etc.
Instead of centralizing logs in one region, you're more likely to want to keep logs in the region where they were generated. The is some cross-region leakage just like with system tables, but the vast majority of user data that makes it into the logs is going to be homed in that region. If that's not strong enough for you, you can use the log redaction functionality to strip all raw data from the logs, or limit your log retention entirely.
v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):
So if I'm on the line at my job for implementing domiciling and ensuring compliance and thus possibly $$$ / legal ramifications I don't really fully trust just those reports
But is this really better? You pick one arbitrary row and show where it is. But how does this give me confidence that all my data is in the right place? If I don't trust a report that's designed to report constraint violations, how am I going to duct tape a bunch of queries together to get something I trust more?
My suggestion is to move all the stuff about finding the location of a single row out of this page and to the architecture section about the distribution layer. That seems like a better place for this kind of under-the-hood optional detail.
v21.2/data-domiciling-with-separate-databases.md, line 2 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
That's a good question. Maybe not? I don't know the answer to why someone would use this instead. Was hesitant to remove since it is an option that exists and is already documented and was vetted by Legal team, so I figured it might be worth keeping around without changing it in case it's needed
I think removing it is also fine tho - @vy-ton do you have an opinion about whether we should keep/remove this? I erred on side of keeping since we already had it as an option and it was reviewed etc. It may be the "bad" option but it's an option
I would at least remove it from the main sidebar (or push it down to an "archive" section at the bottom so it's not as prominent as the new page). I think we could remove it completely as long as the page is still there in the 21.1 version of the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Adam and Ben. Made more updates based on your comments below. In particular removed a bunch of the "show range for a single row" stuff, and moved all back to a single page so we no longer show the weaker "separate databases" approach in v21.2+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, and @vy-ton)
_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
In r4 I still see "this metadata may include table and index names".
Sorry - should be fixed now to say what it has in the quote
I suspect that all the talk about table names is about some confusion ...
Yes, definitely confusion on my end - thanks for explaining this
What's important for domiciling purposes is the data itself, not the metadata, and this is insufficiently clear in your "resulting edit". We need to be clear that a subset (or a sample) of data from indexed columns may appear in meta ranges or other system tables.
Updated to the below. Does this capture what you are saying properly?
"When columns are indexed, a subset (or sample) of data from the indexed columns may appear in meta ranges or other system tables. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the multi-region SQL statements, or the low-level zone configs mechanism."
_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
If there is a "most restrictive subset", you'd probably just put all your data there, not just logs. The premise of this domiciling functionality is that the various domiciling laws are mutually exclusive - european users' data must be in europe, users from region X must be in that region, etc.
Instead of centralizing logs in one region, you're more likely to want to keep logs in the region where they were generated. The is some cross-region leakage just like with system tables, but the vast majority of user data that makes it into the logs is going to be homed in that region. If that's not strong enough for you, you can use the log redaction functionality to strip all raw data from the logs, or limit your log retention entirely.
Thanks, updated the logging bullet to be basically your comment here. What do you think?
"If your log files are kept in the region where they were generated, there is some cross-region leakage (as with the system tables described previously), but the majority of user data that makes it into the logs is going to be homed in that region. If that's not strong enough, you can use the log redaction functionality to strip all raw data from the logs. You can also limit your log retention entirely."
v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):
If I don't trust a report that's designed to report constraint violations, how am I going to duct tape a bunch of queries together to get something I trust more?
Good point - guess I was down the rabbit hole a bit there. I have removed the stuff about finding a single row and just left in the replication reports. Please let me know what you think of this. It's definitely shorter now and hopefully clearer too
My suggestion is to move all the stuff about finding the location of a single row out of this page and to the architecture section about the distribution layer. That seems like a better place for this kind of under-the-hood optional detail.
Makes sense - removed all the stuff about finding the location of a single row and filed https://cockroachlabs.atlassian.net/browse/DOC-2797 to capture maybe adding more docs on this. We already have docs on SHOW RANGE FOR ROW so it's not clear if we will want to bother adding the info about querying crdb_internal.ranges_no_leases
or not. According to the crdb_internal docs that table is not supported for prod use
v21.2/data-domiciling-with-placement-restricted.md, line 97 at r4 (raw file):
Previously, ajstorm (Adam Storm) wrote…
In regards to Ben's comment below, I'm wondering if for the sake of this example, we want to this part (and everything that follows) in step 3 out to a separate document? That way we'd have the details there if, like you say, customers are required to validate replica placement, but it's not cluttering up the example for those that aren't as paranoid.
Based on Ben's comment, I removed all the stuff about finding the replica for a single row. Do you think that's an acceptable middle ground? I think it def makes things much shorter and clearer while still giving a user the tools to check stuff with replication reports
v21.2/data-domiciling-with-separate-databases.md, line 2 at r1 (raw file):
I think we could remove it completely as long as the page is still there in the 21.1 version of the docs.
That works for me. I deleted it from the v21.2 docs (it's still there in 21.1) and moved all this info back to the original single 'Data Domiciling with CockroachDB' page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
v21.2/data-domiciling.md, line 255 at r5 (raw file):
Using CockroachDB as part of your approach to data domiciling has several limitations: - When columns are [indexed](indexes.html), a subset (or sample) of data from the indexed columns may appear in [meta ranges](architecture/distribution-layer.html#meta-ranges) or other system tables. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism.
Pick one of "subset" or "sample"; i didn't mean for you to take that part verbatim :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
v21.2/data-domiciling-with-placement-restricted.md, line 97 at r4 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Based on Ben's comment, I removed all the stuff about finding the replica for a single row. Do you think that's an acceptable middle ground? I think it def makes things much shorter and clearer while still giving a user the tools to check stuff with replication reports
Yeah, that SGTM too, and is similar to what I was suggesting we remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 11 files at r1, 2 of 7 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
v21.2/data-domiciling.md, line 3 at r5 (raw file):
--- title: Data Domiciling with CockroachDB summary: Learn how to implement data domiciling using CockroachDB's multi-region SQL capabilities and the ALTER DATABASE ... PLACEMENT RESTRICTED statement.
Do we want to be even more explicit and say "a part of your dd approach"
v21.2/data-domiciling.md, line 29 at r5 (raw file):
This page assumes you are already familiar with: - CockroachDB's [multi-region SQL abstractions](multiregion-overview.html).
Do we need to say something like, if you arent using the MR abstractions this approach won't help?
v21.2/data-domiciling.md, line 81 at r5 (raw file):
### Step 3. View noncompliant replicas Next, run a [replication report](query-replication-reports.html) to see which ranges are still not in compliance with your desired domiciling: that data on EU-based entities (users, etc.) does not leave EU-based nodes.
Do we need to explain that it takes time to move leaseholders?
v21.2/data-domiciling.md, line 152 at r5 (raw file):
### Step 4. Apply stricter replica placement settings To ensure that data on EU-based users, vehicles, etc. from [`REGIONAL BY ROW` tables](regional-tables.html#regional-by-row-tables) is stored only on EU-based nodes in the cluster, you must disable the use of [non-voting replicas](architecture/replication-layer.html#non-voting-replicas) on all of these tables. You can do this using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement.
Is it worth mentioning that this cluster setting will do this for all regions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ben! Made that last small update.
Thanks for looking Andy! Made some updates based on your comments - PTAL
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @awoods187, @bdarnell, @knz, @RichardJCai, and @vy-ton)
v21.2/data-domiciling.md, line 3 at r5 (raw file):
Previously, awoods187 (Andy Woods) wrote…
Do we want to be even more explicit and say "a part of your dd approach"
Makes sense! How about "Learn how to use CockroachDB's multi-region SQL capabilities and the ALTER DATABASE ... PLACEMENT RESTRICTED statement as part of your data domiciling approach"
v21.2/data-domiciling.md, line 29 at r5 (raw file):
Previously, awoods187 (Andy Woods) wrote…
Do we need to say something like, if you arent using the MR abstractions this approach won't help?
I added a followup sentence "If you are not using them, the instructions on this page will not apply." What do you think?
v21.2/data-domiciling.md, line 81 at r5 (raw file):
Previously, awoods187 (Andy Woods) wrote…
Do we need to explain that it takes time to move leaseholders?
I added a little paragraph with some words about this and a link to the 'Performance considerations' section of Migrate to Multi-region SQL
let me know what you think!
v21.2/data-domiciling.md, line 152 at r5 (raw file):
Previously, awoods187 (Andy Woods) wrote…
Is it worth mentioning that this cluster setting will do this for all regions?
Do you mean in the sense that it will turn off non-voting replicas for all regional tables in the entire database?
If so I tweaked the end of the sentence slightly to say "you must disable ... non-voting replicas on all of the regional tables in this database"
Does that capture what you mean?
v21.2/data-domiciling.md, line 255 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Pick one of "subset" or "sample"; i didn't mean for you to take that part verbatim :)
Ha ok :-) went with "subset"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
v21.2/data-domiciling.md, line 3 at r5 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Makes sense! How about "Learn how to use CockroachDB's multi-region SQL capabilities and the ALTER DATABASE ... PLACEMENT RESTRICTED statement as part of your data domiciling approach"
LGTM
v21.2/data-domiciling.md, line 152 at r5 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Do you mean in the sense that it will turn off non-voting replicas for all regional tables in the entire database?
If so I tweaked the end of the sentence slightly to say "you must disable ... non-voting replicas on all of the regional tables in this database"
Does that capture what you mean?
yes, exactly
5929010
to
87dd1a1
Compare
@ericharmeling this has been through technical and legal review and is RFAL from a (your) docs team POV - let me know if questions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I have a couple of non-blocking suggestions. Otherwise, LGTM.
Also - don't forget to copy over to 22.1 too.
Reviewed 2 of 7 files at r2, 1 of 7 files at r5, 7 of 7 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)
v21.2/data-domiciling.md, line 20 at r7 (raw file):
This page has instructions for data domiciling in [multi-region clusters](multiregion-overview.html) using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. At a high level, this process involves: 1. Controlling the placement of specific row or table data using regional tables - specifically, the [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables), [`REGIONAL BY TABLE`](multiregion-overview.html#regional-tables).
`
Suggestion:
Controlling the placement of specific row or table data using regional tables with the [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables) and [`REGIONAL BY TABLE`](multiregion-overview.html#regional-tables) clauses.
v21.2/data-domiciling.md, line 150 at r7 (raw file):
~~~ This output shows that the `movr` database has ranges out of compliance, which you saw previously. Unfortunately, this output does not contain the table or index names due to a current limitation of the replication reports: non-voting replicas are not associated with any tables or indexes by the reports.
Should this replication report limitation be listed in the Limitation section of this page? If so, I suggest linking to that section, and adding the limitation in a new bullet point.
If you think the limitation should instead be listed on a different page (the replication layer architecture page, for example), then perhaps add the limitation to that page and have the ref link to there?
Suggestion:
[limitation](#limitations)
v21.2/data-domiciling.md, line 261 at r7 (raw file):
- If your [log files](logging-overview.html) are kept in the region where they were generated, there is some cross-region leakage (like the system tables described previously), but the majority of user data that makes it into the logs is going to be homed in that region. If that's not strong enough, you can use the [log redaction functionality](configure-logs.html#redact-logs) to strip all raw data from the logs. You can also limit your log retention entirely. - If you start a node with a [`--locality`](cockroach-start.html#locality) flag that says the node is in region _A_, but the node is actually running in some region _B_, data domiciling based on the inferred node placement will not work. A CockroachDB node only knows its locality based on the text supplied to the `--locality` flag; it can not ensure that it is actually running in that physical location.
See my comment above re: the replication reports limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review Eric!
I made that one edit you suggested, and I would like to punt the replication reports thing to be part of addressing DOC-2864 for reasons mentioned below
(And yes, will def copy over to v22.1- thanks for the reminder! have that on the ol' todo list - didn't want to add until after all was approved so folks don't have to scroll through)
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @awoods187, @bdarnell, @ericharmeling, @knz, @RichardJCai, and @vy-ton)
v21.2/data-domiciling.md, line 152 at r5 (raw file):
Previously, awoods187 (Andy Woods) wrote…
yes, exactly
Done.
v21.2/data-domiciling.md, line 20 at r7 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
`
✔️
v21.2/data-domiciling.md, line 150 at r7 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
Should this replication report limitation be listed in the Limitation section of this page? If so, I suggest linking to that section, and adding the limitation in a new bullet point.
If you think the limitation should instead be listed on a different page (the replication layer architecture page, for example), then perhaps add the limitation to that page and have the ref link to there?
Good point. We have an issue https://cockroachlabs.atlassian.net/browse/DOC-2864 to update the replication report docs to note this limitation
I'd rather not list the replication report limitation as a limitation on this page since the limitations section on this page is meant to refer to limitations of the approach to data domiciling described here, not limitations of that one sub-tool (the replication reports) we are using to check our work
... but I'm totes into cross-linking this paragraph with the doc update that will fix DOC-2864 by mentioning that limitation of the replication reports, as soon as that gets in
let me know if that makes sense or if you strongly disagree
v21.2/data-domiciling.md, line 261 at r7 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
See my comment above re: the replication reports limitation.
responded above!
87dd1a1
to
41630ff
Compare
Fixes DOC-1178 Fixes DOC-1242 Summary of changes: - Update the 'Data Domiciling' page with info that explains how to accomplish domiciling with the 'ALTER DATABASE ... PLACEMENT RESTRICTED' feature, and how to check that replicas are where they are supposed to be using the Replication Reports. - As a bonus, break out some info from the multi-region latency demo tutorial into includes, since we are now also using them from an example on the updated domiciling page. - Apply these changes to both v21.2 and v22.1.
41630ff
to
8a1fa31
Compare
Fixes DOC-1178
Fixes DOC-1242
Summary of changes:
Break out a new 'Data Domiciling Overview' page with some basic info, that
then links out to:
A new 'Data Domiciling with PLACEMENT RESTRICTED' page that explains
~how to accomplish domiciling with this feature, and how to check
that replicas are where they are supposed to be
A 'Data Domiciling with separate databases' page that is a lightly
copy-edited version of the exisiting domiciling page