Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add domiciling doc with PLACEMENT RESTRICTED #12845

Merged

Conversation

rmloveland
Copy link
Contributor

@rmloveland rmloveland commented Jan 26, 2022

Fixes DOC-1178

Fixes DOC-1242

Summary of changes:

  • Break out a new 'Data Domiciling Overview' page with some basic info, that
    then links out to:

    • A new 'Data Domiciling with PLACEMENT RESTRICTED' page that explains
      ~how to accomplish domiciling with this feature, and how to check
      that replicas are where they are supposed to be

    • A 'Data Domiciling with separate databases' page that is a lightly
      copy-edited version of the exisiting domiciling page

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@github-actions
Copy link

github-actions bot commented Jan 26, 2022

@netlify
Copy link

netlify bot commented Jan 26, 2022

Netlify Preview

Name Link
🔨 Latest commit 8a1fa31
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/623dd63de522df0009eef7f6
😎 Deploy Preview https://deploy-preview-12845--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@rmloveland rmloveland force-pushed the 20220120-data-domiciling-with-placement-restricted branch 3 times, most recently from 476dae3 to 539854a Compare February 1, 2022 20:34
@rmloveland rmloveland force-pushed the 20220120-data-domiciling-with-placement-restricted branch 3 times, most recently from 60fc000 to 738472e Compare February 8, 2022 19:17
@rmloveland rmloveland marked this pull request as ready for review February 8, 2022 19:18
@rmloveland rmloveland changed the title Add domiciling doc with PLACEMENT RESTRICTED (WIP) Add domiciling doc with PLACEMENT RESTRICTED Feb 8, 2022
@rmloveland
Copy link
Contributor Author

Hey all,

Since there are multiple POVs around multiregion, domiciling,
replicas, etc., for each of you there's a little note below re:
why I added you to this review.

If you disagree with my reasoning for adding and/or just don't
have time to look at this, please feel free to remove yourself
from the review - thanks!

  • Vy
    • FYI for overall progress/status as PM
  • Adam
    • Want your feedback since it touches multiregion/domiciling stuff
  • Richard
    • Want your feedback since it touches multiregion/domiciling stuff
  • knz
    • To ensure this addresses your comments in DOC-1242 re: sufficiently disclaiming capabilities, doesn't work on indexes
  • Ben
    • To ensure this addresses your comments in DOC-1242 re: sufficient disclaimer, zone config heuristics
  • Aayush
    • Since this refers to replica placement/allocator stuff in general and uses your recommended tactic of querying crdb_internal.ranges_no_leases to check replica placement
  • Andrei
    • Since this uses replication reports which you worked on way back when

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @knz, @RichardJCai, @rmloveland, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):

Using CockroachDB as part of your approach to data domiciling has several limitations:

- Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes.

Is this usage of "data that is indexed" sufficiently clear? Might this be read as forbidding all indexes on a table that must be domiciled? I would say "columns that are indexed" instead of "data" to make it clear that we're really making a column-level statement here.

"The replicas underlying the index" are domiciled as expected. What is distributed across the cluster is metadata about the index including a sample of values from the index.

Whether or not this is a problem for domiciling is going to depend on the specific law/regulations involved. This is the only place in this set of docs where we "recommend" something; everywhere else we just present the facts of what happens. Do we want to make this recommendation here?


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

- Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes.
- Some metadata about the objects in a schema may be stored in [system ranges](architecture/distribution-layer.html#meta-ranges), system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism. This might result in potential "leakage" outside of the desired domicile if your schema includes primary keys, table names, etc., that may reveal information about their contents.

This feels like it's mainly redundant with the previous warning, except for the comment about table names. As I said in DOC-1242, I don't understand what this is trying to say when it comes to table names.


v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):

{% include_cached copy-clipboard.html %}
~~~ sql
show range from table users for row ('europe-west1', 'amsterdam', 'ae147ae1-47ae-4800-8000-000000000022');

Digging down to the location of a single row seems like an unnecessarily complex journey. It seems like the main flow of the document should be to identify the database with constraint violations and then apply PLACEMENT RESTRICTED.

Finding the location of a single row may be an interesting exercise (to me it's interesting mainly to see how poor the visibility is for all of this - verifying that the violating replicas are non-voting is especially opaque), but is this something users really want to learn how to do? It seems like it should be something like an appendix if it's included at all.


v21.2/data-domiciling-with-separate-databases.md, line 2 at r1 (raw file):

---
title: Data Domiciling with Separate Databases

Is this page worth keeping? When would someone use this instead of PLACEMENT RESTRICTED?

Copy link

@RichardJCai RichardJCai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

I do wish we had an easier way to verify our data domiciling solutions without having to do a SHOW RANGES FOR ROW query

This page has instructions for data domiciling in [multi-region clusters](multiregion-overview.html) using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. At a high level, this process involves:

1. Controlling the placement of specific row or table data using regional or global tables - specifically, the [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables), [`REGIONAL BY TABLE`](multiregion-overview.html#regional-tables), and [`GLOBAL`](multiregion-overview.html#global-tables) table localities.
1. Further restricting where the data in those regional tables is stored using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement, which constrains the replicas for a row or table to be stored in only the [home regions](set-locality.html#crdb_region) associated with those rows or tables.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the set locality portion doesn't exactly explain what a home region is either. Is there an appendix or something that specifies what exactly a home region is?


### Step 1. Start a simulated multi-region cluster

{% include {{page.version.version}}/sql/start-a-multi-region-demo-cluster.md %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cockroach demo --global --nodes 9 --no-example-database --insecure may be slow right now due to cockroachdb/cockroach#76305, should we also include the --multitenant=false flag?

Copy link
Contributor

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 11 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):

Using CockroachDB as part of your approach to data domiciling has several limitations:

- Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes.

Is this really true? At least for RBR tables, we partition indexes in the same way as we partition the base table (and apply the same zone configs). The end result is that the data should be domiciled properly (provided PLACEMENT RESTRICTED is used).


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

- Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes.
- Some metadata about the objects in a schema may be stored in [system ranges](architecture/distribution-layer.html#meta-ranges), system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism. This might result in potential "leakage" outside of the desired domicile if your schema includes primary keys, table names, etc., that may reveal information about their contents.

Nit: Instead of "that may reveal information about their contents", does it make more sense to say "that may reveal sensitive information" or "that may reveal personally identifiable information".


_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):

- Data domiciling does not work on data that is [indexed](indexes.html), since the replicas underlying the index are distributed across the nodes in the cluster. Therefore, we recommend keeping data that must be domiciled out of indexes.
- Some metadata about the objects in a schema may be stored in [system ranges](architecture/distribution-layer.html#meta-ranges), system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism. This might result in potential "leakage" outside of the desired domicile if your schema includes primary keys, table names, etc., that may reveal information about their contents.
- [Zone configs](configure-replication-zones.html) can be used for data placement but these features were historically built for performance, not for domiciling. Further, the way CockroachDB performs replica placement is currently heuristic-based, so the zone config settings may not be obeyed at all times.

I'm not certain of this, but I worry that this may be an overstatement. I thought that if you specify hard constraints, that if a zone config can't be satisfied, the system won't place a replica that violates the constraints.


_includes/v21.2/sql/data-domiciling-limitations.md, line 7 at r1 (raw file):

- [Zone configs](configure-replication-zones.html) can be used for data placement but these features were historically built for performance, not for domiciling. Further, the way CockroachDB performs replica placement is currently heuristic-based, so the zone config settings may not be obeyed at all times.
- Make sure your [log files](logging-overview.html) are stored according to the legal requirements of the jurisdictions you're operating in. For example, it may be easiest to store your log files for a multi-region cluster in the locality with the most restrictive subset of regulations.
- If you start a node with a [`--locality`](cockroach-start.html#locality) flag that says the node is in region _A_, but the node is actually running in some region _B_, this approach will not work. A CockroachDB node only knows its locality based on the text supplied to the `--locality` flag; it can not ensure that it is actually running in that physical location.

The this in "this approach" is a bit ambiguous. I think you mean "data domiciling based on the inferred node placement", or something like that.


_includes/v21.2/sql/multiregion-movr-global.md, line 8 at r1 (raw file):

~~~

Next, alter the `user_promo_codes` table to have a foreign key into the `promo_codes` table. This step is necessary to modify the MovR schema design to take full advantage of the multi-region features in v21.1+.

I find the second sentence here a bit confusing. How does this allow users to take full advantage of the new abstractions?

Also, Nit: I'd drop the 21.1+ from this sentence, as I'm assuming this will only show up in future versions of the docs.


_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 1 at r1 (raw file):

All of the tables except `promo_codes` are geographically specific, and updated very frequently. For these tables, the right table locality for optimizing access to their data is [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables).

Nit: are the tables geographically specific? Or do the tables contain rows which are partitioned by geo (or by region)?


_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 21 at r1 (raw file):

      END
    ) STORED;
    ALTER TABLE rides ALTER COLUMN REGION SET NOT NULL;

Nit: for bonus points you could make this column hidden. Same comment for the rest of the tables.


v21.2/data-domiciling.md, line 18 at r1 (raw file):

- [Data Domiciling with `PLACEMENT RESTRICTED`](data-domiciling-with-placement-restricted.html)
- [Data Domiciling with Separate Databases](data-domiciling-with-separate-databases.html)
- [Install CockroachDB](install-cockroachdb.html)

Nit: This one seems out of place to me. Is there a reason you added install in here?


v21.2/data-domiciling-with-placement-restricted.md, line 13 at r1 (raw file):

This page has instructions for data domiciling in [multi-region clusters](multiregion-overview.html) using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. At a high level, this process involves:

1. Controlling the placement of specific row or table data using regional or global tables - specifically, the [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables), [`REGIONAL BY TABLE`](multiregion-overview.html#regional-tables), and [`GLOBAL`](multiregion-overview.html#global-tables) table localities.

Isn't it the two REGIONAL table types which control data placement? I'd think that with GLOBAL tables, data placement is not controlled.


v21.2/data-domiciling-with-placement-restricted.md, line 14 at r1 (raw file):

Previously, RichardJCai (Richard Cai) wrote…

nit: I think the set locality portion doesn't exactly explain what a home region is either. Is there an appendix or something that specifies what exactly a home region is?

Nit: should this be "partition or table" instead of "row or table"?


v21.2/data-domiciling-with-placement-restricted.md, line 27 at r1 (raw file):

## Example

In the following example, you will go through the process of configuring the [MovR](movr.html) data set using [multi-region SQL statements](multiregion-overview.html). Then, as part of implementing a data domiciling strategy, you will apply stricter-than-default replica placement settings using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. Finally, you will verify that the resulting replica placements are as expected using a combination of [replication reports](query-replication-reports.html), the [`SHOW RANGE FOR ROW`](show-range-for-row.html) statement, and queries against the [`crdb_internal.ranges_no_leases`](crdb-internal.html#ranges_no_leases) table.

Nit: stricter-than-default seems a bit awkward to me. Any chance you considered "restricted replica settings"?


v21.2/data-domiciling-with-placement-restricted.md, line 29 at r1 (raw file):

In the following example, you will go through the process of configuring the [MovR](movr.html) data set using [multi-region SQL statements](multiregion-overview.html). Then, as part of implementing a data domiciling strategy, you will apply stricter-than-default replica placement settings using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. Finally, you will verify that the resulting replica placements are as expected using a combination of [replication reports](query-replication-reports.html), the [`SHOW RANGE FOR ROW`](show-range-for-row.html) statement, and queries against the [`crdb_internal.ranges_no_leases`](crdb-internal.html#ranges_no_leases) table.

For the purposes of this example, the data domiciling requirement is to configure a multi-region deployment of the [MovR database](movr.html) such that data for EMEA-based users, vehicles, etc. is being stored on CockroachDB nodes running in EMEA localities.

Nit: EMEA, or EU?


v21.2/data-domiciling-with-placement-restricted.md, line 33 at r1 (raw file):

Previously, RichardJCai (Richard Cai) wrote…

nit: cockroach demo --global --nodes 9 --no-example-database --insecure may be slow right now due to cockroachdb/cockroach#76305, should we also include the --multitenant=false flag?

I doubt we'll go out with multi-tenant demo as the default. We'll probably revert it pretty soon.


v21.2/data-domiciling-with-placement-restricted.md, line 76 at r1 (raw file):

Next, run a [replication report](query-replication-reports.html) to see which ranges are still not in compliance with your desired domiciling: that data on EMEA-based entities (users, etc.) does not leave EMEA-based nodes.

With the default settings, you should expect some replicas in the cluster to be violating this constraint. This is because [non-voting replicas](architecture/replication-layer.html#non-voting-replicas) are enabled by default in [multi-region clusters](multiregion-overview.html) to enable stale reads of data in [regional tables](regional-tables.html) from outside those tables' [home regions](set-locality.html#crdb_region). For many use cases, this ispreferred, but it keeps you from meeting the domiciling requirements for this example.

Why do non-voters prevent us from satisfying the constraints?

Nit: is preferred


v21.2/data-domiciling-with-placement-restricted.md, line 91 at r1 (raw file):

~~~

Based on this output, you can see that plenty of replicas are out of compliance (see the `violating_ranges` column) for the reason described above: the presence of non-voting replicas in other regions to enable fast stale reads from those regions.

Is this just a temporary statement because the rebalancer hasn't completed its work yet?


v21.2/data-domiciling-with-placement-restricted.md, line 141 at r1 (raw file):

~~~

This output shows that the `movr` database has ranges out of compliance, which you saw previously. Unfortunately, this output does not contain the table or index names due to a limitation of the replication reports: non-voting replicas are not associated with any tables or indexes by the reports. For more information, see [cockroachdb/cockroach#75821](https://github.com/cockroachdb/cockroach/issues/75821).

Nit: I don't necessarily have a solution for this problem, but referencing issues in our documentation seems a bit messy to me. Is there some way around us airing our dirty laundry here? Maybe just remove this paragraph?


v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Digging down to the location of a single row seems like an unnecessarily complex journey. It seems like the main flow of the document should be to identify the database with constraint violations and then apply PLACEMENT RESTRICTED.

Finding the location of a single row may be an interesting exercise (to me it's interesting mainly to see how poor the visibility is for all of this - verifying that the violating replicas are non-voting is especially opaque), but is this something users really want to learn how to do? It seems like it should be something like an appendix if it's included at all.

+1


v21.2/data-domiciling-with-separate-databases.md, line 54 at r1 (raw file):

You will need to make sure that user data associated with EU users is only added to the `eu_users` database.

How exactly you will accomplish that is beyond the scope of this document, but you will likely need to add some logic to your application and/or to your load balancing infrastructure to make sure that when your application code is [inserting](insert.html) or [updating](update.html) EU user data, the data only ever hits the `eu_users` database. For example, you can set the target database in your [connection string](connection-parameters.html). For example:

Nit: You say that accomplishing this is beyond the scope of this document, and then go onto describe some of the steps that are required 😄

Maybe here you should say here "The complete steps to ensure this are beyond the scope of this document..."


v21.2/data-domiciling-with-separate-databases.md, line 77 at r1 (raw file):

{% include {{page.version.version}}/sql/data-domiciling-limitations.md %}

- Cross-region writes are slower than intra-region writes. This may be an issue depending on your application's performance needs, since following the advice above would result in having different databases' data stored in different regions.

Hmm, is this really an artifact of the separate databases approach? Or is it relevant with any MR setup?

I'd also think that we may want to talk about the cross database restrictions (like having to enable cross-db FKs and views manually)

Copy link
Contributor

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took me so long to get to this. Looks good overall. Some minor comments.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)

Copy link
Contributor Author

@rmloveland rmloveland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reviews Ben, Richard, and Adam! I have revised stuff and also replied to some comments in a few areas. PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):
Updated to reference columns being indexed, and correct the claims about what is distributed across the cluster, and to remove the recommendation.

Data domiciling does not work on columns that are indexed, since metadata about the index is distributed across the nodes in the cluster, including a sample of values from the index.

The recommendation in particular was based on notes from a convo I had with Daniel Holt about domiciling stuff, but I may have misheard / misunderstood him, and also that would probably not make it through the legal review that is coming after this one :-)


_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Is this really true? At least for RBR tables, we partition indexes in the same way as we partition the base table (and apply the same zone configs). The end result is that the data should be domiciled properly (provided PLACEMENT RESTRICTED is used).

(I assume your question here is for Ben)


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):
Agree it's a bit redundant but I would argue that some repetition could be helpful since an average reader who's not a CRDB engineer might not deduce all of these consequences from the statement above. Also it lets us link to some specific features/docs that folks who care about domiciling should be familiar with

I removed the sentence about table names because I don't fully understand what it was trying to say either - and it may be superseded by the note about specifying where log files are stored, since that was something else Daniel Holt mentioned in a conversation that I think (?) takes care of any concerns in this approximate area

Resulting edit is

  • Additional metadata about the objects in a schema may be stored in system ranges, system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the multi-region SQL statements, or the low-level zone configs mechanism.

_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: Instead of "that may reveal information about their contents", does it make more sense to say "that may reveal sensitive information" or "that may reveal personally identifiable information".

Maybe, but if we stick with the removal of this sentence altogether based on Ben's feedback above, do you agree with what I wrote above that this is basically covered by storing one's log files carefully?


_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

I'm not certain of this, but I worry that this may be an overstatement. I thought that if you specify hard constraints, that if a zone config can't be satisfied, the system won't place a replica that violates the constraints.

I definitely don't know the answer to this, but it's basically what @knz wrote in DOC-1242 so I added it. From a docs POV it is quite useful as context so the reader knows these features were def not built ground-up with these use cases in mind, so there is more work to do here / this should only be part of the solution


_includes/v21.2/sql/data-domiciling-limitations.md, line 7 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

The this in "this approach" is a bit ambiguous. I think you mean "data domiciling based on the inferred node placement", or something like that.

Thank you! that is a superior formulation, went with it!


_includes/v21.2/sql/multiregion-movr-global.md, line 8 at r1 (raw file):
Agreed - this is copypasta from the original docs broken out into this new "include" file, so I didn't edit it

Updated to the following here and in v21.1 docs - what do you think?

Next, alter the user_promo_codes table to have a foreign key into the global promo_codes table. This will enable fast reads of the promo_codes.code column from any region in the cluster.


_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 1 at r1 (raw file):
Aha! this is also copypasta from the original doc and we should update it

Updated to the below here and in v21.1 docs

All of the tables except promo_codes contain rows which are partitioned by region ...


_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 21 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: for bonus points you could make this column hidden. Same comment for the rest of the tables.

I will defer since I don't wanna re-test all this right now - filed https://cockroachlabs.atlassian.net/browse/DOC-2660

(unfortunately we don't have automatic testing of the SQL in our docs - see #7067)


v21.2/data-domiciling.md, line 18 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: This one seems out of place to me. Is there a reason you added install in here?

IDK, probably some neurons misfiring - removed it!


v21.2/data-domiciling-with-placement-restricted.md, line 13 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Isn't it the two REGIONAL table types which control data placement? I'd think that with GLOBAL tables, data placement is not controlled.

Indeed - removed the mention of global tables from this bullet


v21.2/data-domiciling-with-placement-restricted.md, line 14 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: should this be "partition or table" instead of "row or table"?

Thanks @RichardJCai , filed https://cockroachlabs.atlassian.net/browse/DOC-2661 to address this shortcoming. It's sort of scattered around on different pages and could be way more explicit

Adam - thanks, updated to "partition or table"


v21.2/data-domiciling-with-placement-restricted.md, line 27 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: stricter-than-default seems a bit awkward to me. Any chance you considered "restricted replica settings"?

Ya that is awkward - updated to use your suggested term, thanks


v21.2/data-domiciling-with-placement-restricted.md, line 29 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: EMEA, or EU?

I went with EMEA because when I spoke to Daniel Holt he said EMEA a lot? So I kinda had that in my notes. But yeah I think EU specifically is what is meant here - went ahead and updated all uses to EU


v21.2/data-domiciling-with-placement-restricted.md, line 33 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

I doubt we'll go out with multi-tenant demo as the default. We'll probably revert it pretty soon.

Thanks for the clarification. UX-wise I would not expect as a normie to have to care about what CRDB is doing internally w.r.t multi-tenancy when I'm just doing muh cockroach demo etc so I'm glad that default will change.

But I will admit to some sadness that I will ever have to care about that flag at all (not as an employee who wants/needs to support our m-t. strategy but as in this instance an avg user of the binary who is not doing a fancy multi-tenant deployment). But I will save my moaning for https://cockroachlabs.atlassian.net/browse/DOC-1181 :-)


v21.2/data-domiciling-with-placement-restricted.md, line 76 at r1 (raw file):
Fixed the nit thanks!

Why do non-voters prevent us from satisfying the constraints?

Since above we say

For the purposes of this example, the data domiciling requirement is to configure a multi-region deployment of the MovR database such that data for EU-based users, vehicles, etc. is being stored on CockroachDB nodes running in EU localities.

my understanding is that the non-voters are stored on non-EU nodes to enable stale reads even though from my user POV I have said "store these rows in the EU" via REGIONAL BY ROW, which is causing me to not meet that requirement

... all of which will make me sad and/or cost me $ potentially?

Sorry if I'm misunderstanding though!

Or are you saying we should rephrase this to use a different term than "satisfying the constraints"? we could also re-use the term from the goal statement and say "prevents us from meeting the requirement"

Or are you saying we should explicate more by adding "... since the non-voters are stored on non-EU nodes to enable fast stale reads" or something similar?


v21.2/data-domiciling-with-placement-restricted.md, line 91 at r1 (raw file):
Apparently not: I tested this by running everything up to this point and then going to lunch. When I came back I ran this. Output is ~the same. I really think it's the non-voters (which seems to be supported by the replica_localities in the queries further below)

> SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;

  zone_id | subzone_id |    type    |         config         | report_id |        violation_start        | violating_ranges
----------+------------+------------+------------------------+-----------+-------------------------------+-------------------
       52 |          0 | constraint | +region=europe-west1:1 |         1 | 2022-02-16 17:42:20.343827+00 |               16
       52 |          0 | constraint | +region=us-east1:1     |         1 | 2022-02-16 17:40:20.338341+00 |               72
       52 |          0 | constraint | +region=us-west1:1     |         1 | 2022-02-16 17:40:20.338341+00 |               72
(3 rows)

v21.2/data-domiciling-with-placement-restricted.md, line 141 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: I don't necessarily have a solution for this problem, but referencing issues in our documentation seems a bit messy to me. Is there some way around us airing our dirty laundry here? Maybe just remove this paragraph?

I don't really like it either but I think our docs need to at least address some potential user surprise by explaining even at a high level why some of these features are not super tightly woven together yet (despite our best efforts intentions etc). I initially added the issue link because we sometimes add issue links to things that are "known limitations" a la http://www.cockroachlabs.com/docs/dev/known-limitations.html

During testing/writing, I was def struck by the fact that this query returned NULLs here, so I think we need to at least say something about it and thus hesitant to remove the paragraph. I removed the issue link though! What do you think?


v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

+1

I agree it's complex but I really believe we need to show how to verify that the DB is doing what we've just asked it to do, otherwise I think we're only taking folks on half the journey. Especially since the replication reports are still experimental and haven't been updated with all the non-voting replica stuff (for example see cockroachdb/cockroach#69335 and cockroachdb/cockroach#75821)

So if I'm on the line at my job for implementing domiciling and ensuring compliance and thus possibly $$$ / legal ramifications I don't really fully trust just those reports

and the first thing I wanted to do after typing PLACEMENT RESTRICTED and checking the reports was to find out "how do I additionally verify this thing is actually putting the stuff where it's supposed to, given that these reports are experimental and maybe not getting updated etc?"

esp since some laws talk about fines per violation (IANAL etc) - to me that means I really do need to know about how to check any given row at any time, whether for some kind of finer-grained reporting, or just in case someone asks "how do you really know where this data is?"

that's why I went to SHOW RANGE FOR ROW - but that's also experimental

then when I was asking in #kv about how to verify the replica placement beyond the experimental reports and SHOW RANGE] @aayushshah15 recommended doing the crdb_internal query in addition to the above

This feels pretty belt and suspenders and overall not great but in my head the thinking was "as a user, I'm on the line for this, these reports and the SHOW RANGE are experimental and not necessarily updated/super supported so I need to really make sure about this replica placement using whatever I can find" - and these were the tools that I could find


v21.2/data-domiciling-with-separate-databases.md, line 2 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is this page worth keeping? When would someone use this instead of PLACEMENT RESTRICTED?

That's a good question. Maybe not? I don't know the answer to why someone would use this instead. Was hesitant to remove since it is an option that exists and is already documented and was vetted by Legal team, so I figured it might be worth keeping around without changing it in case it's needed

I think removing it is also fine tho - @vy-ton do you have an opinion about whether we should keep/remove this? I erred on side of keeping since we already had it as an option and it was reviewed etc. It may be the "bad" option but it's an option


v21.2/data-domiciling-with-separate-databases.md, line 54 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: You say that accomplishing this is beyond the scope of this document, and then go onto describe some of the steps that are required 😄

Maybe here you should say here "The complete steps to ensure this are beyond the scope of this document..."

LOL - updated to use your suggested phrase


v21.2/data-domiciling-with-separate-databases.md, line 77 at r1 (raw file):

Hmm, is this really an artifact of the separate databases approach? Or is it relevant with any MR setup?

If the latter I can move it back to the general limitations. Though I think maybe it makes sense to remove this since as you say it's just kind of generally true about MR things. I went with removal since it doesn't seem super domiciling-specific - any objection?

I'd also think that we may want to talk about the cross database restrictions (like having to enable cross-db FKs and views manually)

Ideally but I think I'd rather not add much to this text since it's already had legal vetting and/or we may remove it now (see above convo)

FWIW we do have that info in our FK docs for if/when folks hit the cross-db issue (again not ideal but I'm trying not to touch this one hardly at all)

Copy link
Contributor

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments. I think I'm mostly in a good place right now. Just a couple of remaining comments open.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

(I assume your question here is for Ben)

Sorry, my question was written before Ben's comments, but only committed after his. I don't think we need to keep sensitive data outside of indexes, as they should be domiciled correctly, at least when using RBR tables.


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Maybe, but if we stick with the removal of this sentence altogether based on Ben's feedback above, do you agree with what I wrote above that this is basically covered by storing one's log files carefully?

Removing the sentence works for me.

Where do you talk about storing the logs carefully? I can't seem to locate that part.


_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

I definitely don't know the answer to this, but it's basically what @knz wrote in DOC-1242 so I added it. From a docs POV it is quite useful as context so the reader knows these features were def not built ground-up with these use cases in mind, so there is more work to do here / this should only be part of the solution

I think saying that "the zone config settings may not be obeyed at all times" is still an overstatement. It may be better to say that "zone config settings may leave some decision making capacity to the system, and it may not place replicas according to your domiciling needs" (or something like that). Then, we can link to this for more detail: https://www.cockroachlabs.com/docs/v21.2/configure-replication-zones#types-of-constraints.


_includes/v21.2/sql/multiregion-movr-regional-by-row.md, line 21 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

I will defer since I don't wanna re-test all this right now - filed https://cockroachlabs.atlassian.net/browse/DOC-2660

(unfortunately we don't have automatic testing of the SQL in our docs - see #7067)

OK, that's fair.


v21.2/data-domiciling.md, line 18 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

IDK, probably some neurons misfiring - removed it!

:-)


v21.2/data-domiciling-with-placement-restricted.md, line 29 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

I went with EMEA because when I spoke to Daniel Holt he said EMEA a lot? So I kinda had that in my notes. But yeah I think EU specifically is what is meant here - went ahead and updated all uses to EU

SGTM. I know that the terms are used interchangeably a lot, but if we're dealing with only EU regions, better to be specific. Thanks!


v21.2/data-domiciling-with-placement-restricted.md, line 33 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Thanks for the clarification. UX-wise I would not expect as a normie to have to care about what CRDB is doing internally w.r.t multi-tenancy when I'm just doing muh cockroach demo etc so I'm glad that default will change.

But I will admit to some sadness that I will ever have to care about that flag at all (not as an employee who wants/needs to support our m-t. strategy but as in this instance an avg user of the binary who is not doing a fancy multi-tenant deployment). But I will save my moaning for https://cockroachlabs.atlassian.net/browse/DOC-1181 :-)

The average users shouldn't have to worry about it at all. We'll revert the default to false shortly, and then it'll be of no interest to most customers. Eventually, we'll change the default to true, and the hope is that customers will never have to disable it.


v21.2/data-domiciling-with-placement-restricted.md, line 76 at r1 (raw file):
Sorry, wasn't clear. I was saying this:

Or are you saying we should rephrase this to use a different term than "satisfying the constraints"? we could also re-use the term from the goal statement and say "prevents us from meeting the requirement"

I see it's been addressed above. Thanks.


v21.2/data-domiciling-with-placement-restricted.md, line 91 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Apparently not: I tested this by running everything up to this point and then going to lunch. When I came back I ran this. Output is ~the same. I really think it's the non-voters (which seems to be supported by the replica_localities in the queries further below)

> SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;

  zone_id | subzone_id |    type    |         config         | report_id |        violation_start        | violating_ranges
----------+------------+------------+------------------------+-----------+-------------------------------+-------------------
       52 |          0 | constraint | +region=europe-west1:1 |         1 | 2022-02-16 17:42:20.343827+00 |               16
       52 |          0 | constraint | +region=us-east1:1     |         1 | 2022-02-16 17:40:20.338341+00 |               72
       52 |          0 | constraint | +region=us-west1:1     |         1 | 2022-02-16 17:40:20.338341+00 |               72
(3 rows)

It strikes me as odd that the non-voters would cause violations. We might need someone from KV to confirm that this is expected. Maybe @aayushshah15?


v21.2/data-domiciling-with-placement-restricted.md, line 141 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

I don't really like it either but I think our docs need to at least address some potential user surprise by explaining even at a high level why some of these features are not super tightly woven together yet (despite our best efforts intentions etc). I initially added the issue link because we sometimes add issue links to things that are "known limitations" a la http://www.cockroachlabs.com/docs/dev/known-limitations.html

During testing/writing, I was def struck by the fact that this query returned NULLs here, so I think we need to at least say something about it and thus hesitant to remove the paragraph. I removed the issue link though! What do you think?

Sure, this looks better. Might want to change to "due to a current limitation of the replication reports".


v21.2/data-domiciling-with-separate-databases.md, line 54 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

LOL - updated to use your suggested phrase

LGTM.


v21.2/data-domiciling-with-separate-databases.md, line 77 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Hmm, is this really an artifact of the separate databases approach? Or is it relevant with any MR setup?

If the latter I can move it back to the general limitations. Though I think maybe it makes sense to remove this since as you say it's just kind of generally true about MR things. I went with removal since it doesn't seem super domiciling-specific - any objection?

I'd also think that we may want to talk about the cross database restrictions (like having to enable cross-db FKs and views manually)

Ideally but I think I'd rather not add much to this text since it's already had legal vetting and/or we may remove it now (see above convo)

FWIW we do have that info in our FK docs for if/when folks hit the cross-db issue (again not ideal but I'm trying not to touch this one hardly at all)

For simplicity, I'm fine removing and leaning on the existing documentation for the limitations.

@ajstorm ajstorm self-requested a review February 22, 2022 15:58
Copy link
Contributor Author

@rmloveland rmloveland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again Adam! I updated everything I could based on your latest feedback

Think we're still waiting on the one question about the non-voters' status w.r.t. replication reports (FWIW it makes sense to me as a user that they would be considered violating within that context, but I defer to y'all on the technical matters/design)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 3 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Sorry, my question was written before Ben's comments, but only committed after his. I don't think we need to keep sensitive data outside of indexes, as they should be domiciled correctly, at least when using RBR tables.

OK thanks for clarifying - I've removed that bullet


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Removing the sentence works for me.

Where do you talk about storing the logs carefully? I can't seem to locate that part.

Cool!

Re: storing logs, a couple bullets down we say "make sure your log files are stored according to the legal requirements of ..."


_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

I think saying that "the zone config settings may not be obeyed at all times" is still an overstatement. It may be better to say that "zone config settings may leave some decision making capacity to the system, and it may not place replicas according to your domiciling needs" (or something like that). Then, we can link to this for more detail: https://www.cockroachlabs.com/docs/v21.2/configure-replication-zones#types-of-constraints.

Thanks for clarifying. I have updated to use that phrasing, and added the link to the zone config docs


v21.2/data-domiciling-with-placement-restricted.md, line 29 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

SGTM. I know that the terms are used interchangeably a lot, but if we're dealing with only EU regions, better to be specific. Thanks!

Thanks for pointing it out - I don't have a lot of expertise in such terms, so agree re: specificity


v21.2/data-domiciling-with-placement-restricted.md, line 33 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

The average users shouldn't have to worry about it at all. We'll revert the default to false shortly, and then it'll be of no interest to most customers. Eventually, we'll change the default to true, and the hope is that customers will never have to disable it.

👍


v21.2/data-domiciling-with-placement-restricted.md, line 76 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Sorry, wasn't clear. I was saying this:

Or are you saying we should rephrase this to use a different term than "satisfying the constraints"? we could also re-use the term from the goal statement and say "prevents us from meeting the requirement"

I see it's been addressed above. Thanks.

Oh ok - thanks for clarifying!


v21.2/data-domiciling-with-placement-restricted.md, line 141 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Sure, this looks better. Might want to change to "due to a current limitation of the replication reports".

Updated to say "current limitation"


v21.2/data-domiciling-with-separate-databases.md, line 77 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

For simplicity, I'm fine removing and leaning on the existing documentation for the limitations.

👍

Copy link
Contributor

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there :-)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Cool!

Re: storing logs, a couple bullets down we say "make sure your log files are stored according to the legal requirements of ..."

I think it might still be a good idea to have a parenthetical in the first bullet where we give some examples of what attributes of a schema are stored in metadata (table and index names, PK info, etc). Aside from that, I'm good here.


_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Thanks for clarifying. I have updated to use that phrasing, and added the link to the zone config docs

LGTM. One nit: I'd change "...so the zone config setting may leave" to "...and the zone config setting may leave". The fact that ZCs are heuristic based isn't the reason that the domiciling requirements might not be met, it's because ZCs can either be fully restricted or partially restricted. The latter leaves the door open for the system to make some decisions on its own.

@ajstorm ajstorm self-requested a review February 25, 2022 14:53
Copy link
Contributor Author

@rmloveland rmloveland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, made the couple of requested tweaks :-)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

I think it might still be a good idea to have a parenthetical in the first bullet where we give some examples of what attributes of a schema are stored in metadata (table and index names, PK info, etc). Aside from that, I'm good here.

Added back "For example, this metadata may include table and index names and primary key information"


_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

LGTM. One nit: I'd change "...so the zone config setting may leave" to "...and the zone config setting may leave". The fact that ZCs are heuristic based isn't the reason that the domiciling requirements might not be met, it's because ZCs can either be fully restricted or partially restricted. The latter leaves the door open for the system to make some decisions on its own.

Updated!

Copy link
Contributor

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Still one partially unresolved issue about how to present the details in step 3. You might need to get a follow-up from @bdarnell on his thoughts there.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


v21.2/data-domiciling-with-placement-restricted.md, line 97 at r4 (raw file):

{{site.data.alerts.end}}

Next, run the query suggested in [the Replication Reports documentation](query-replication-reports.html#find-out-which-of-your-tables-have-a-constraint-violation) that should show which database and table names contain the `violating_ranges`.

In regards to Ben's comment below, I'm wondering if for the sake of this example, we want to this part (and everything that follows) in step 3 out to a separate document? That way we'd have the details there if, like you say, customers are required to validate replica placement, but it's not cluttering up the example for those that aren't as paranoid.

@ajstorm ajstorm self-requested a review March 2, 2022 14:01
Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Agree it's a bit redundant but I would argue that some repetition could be helpful since an average reader who's not a CRDB engineer might not deduce all of these consequences from the statement above. Also it lets us link to some specific features/docs that folks who care about domiciling should be familiar with

I removed the sentence about table names because I don't fully understand what it was trying to say either - and it may be superseded by the note about specifying where log files are stored, since that was something else Daniel Holt mentioned in a conversation that I think (?) takes care of any concerns in this approximate area

Resulting edit is

  • Additional metadata about the objects in a schema may be stored in system ranges, system tables, etc. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the multi-region SQL statements, or the low-level zone configs mechanism.

In r4 I still see "this metadata may include table and index names".

I suspect that all the talk about table names is about some confusion that we have internally about the difference between personally-identifiable information and the broader category of "user information". For example, if Movr is a CockroachCloud customer and they're planning to launch a new food delivery app, they'll start to have tables named things like "restaurants". That's sensitive confidential information that could be revealed by a simple listing of table names, so for our CockroachCloud operations we must treat such metadata as private, hide it from our SREs to the extent possible, etc. But the table names alone are not PII, and they wouldn't generally trigger data domiciling laws like GDPR (nor does most metadata)

What's important for domiciling purposes is the data itself, not the metadata, and this is insufficiently clear in your "resulting edit". We need to be clear that a subset (or a sample) of data from indexed columns may appear in meta ranges or other system tables.


_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r4 (raw file):

- Additional metadata about the objects in a schema may be stored in [system ranges](architecture/distribution-layer.html#meta-ranges), system tables, etc. For example, this metadata may include table and index names, and primary key information. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism.
- [Zone configs](configure-replication-zones.html) can be used for data placement but these features were historically built for performance, not for domiciling. Further, the way CockroachDB performs replica placement is currently heuristic-based, and the zone config settings may leave some decision making capacity to the system, and it may not place replicas according to your domiciling needs. For more information, see [Configure Replication Zones](configure-replication-zones.html#types-of-constraints).
- Make sure your [log files](logging-overview.html) are stored according to the legal requirements of the jurisdictions you're operating in. For example, it may be easiest to store your log files for a multi-region cluster in the locality with the most restrictive subset of regulations.

If there is a "most restrictive subset", you'd probably just put all your data there, not just logs. The premise of this domiciling functionality is that the various domiciling laws are mutually exclusive - european users' data must be in europe, users from region X must be in that region, etc.

Instead of centralizing logs in one region, you're more likely to want to keep logs in the region where they were generated. The is some cross-region leakage just like with system tables, but the vast majority of user data that makes it into the logs is going to be homed in that region. If that's not strong enough for you, you can use the log redaction functionality to strip all raw data from the logs, or limit your log retention entirely.


v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):

So if I'm on the line at my job for implementing domiciling and ensuring compliance and thus possibly $$$ / legal ramifications I don't really fully trust just those reports

But is this really better? You pick one arbitrary row and show where it is. But how does this give me confidence that all my data is in the right place? If I don't trust a report that's designed to report constraint violations, how am I going to duct tape a bunch of queries together to get something I trust more?

My suggestion is to move all the stuff about finding the location of a single row out of this page and to the architecture section about the distribution layer. That seems like a better place for this kind of under-the-hood optional detail.


v21.2/data-domiciling-with-separate-databases.md, line 2 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

That's a good question. Maybe not? I don't know the answer to why someone would use this instead. Was hesitant to remove since it is an option that exists and is already documented and was vetted by Legal team, so I figured it might be worth keeping around without changing it in case it's needed

I think removing it is also fine tho - @vy-ton do you have an opinion about whether we should keep/remove this? I erred on side of keeping since we already had it as an option and it was reviewed etc. It may be the "bad" option but it's an option

I would at least remove it from the main sidebar (or push it down to an "archive" section at the bottom so it's not as prominent as the new page). I think we could remove it completely as long as the page is still there in the 21.1 version of the docs.

Copy link
Contributor Author

@rmloveland rmloveland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Adam and Ben. Made more updates based on your comments below. In particular removed a bunch of the "show range for a single row" stuff, and moved all back to a single page so we no longer show the weaker "separate databases" approach in v21.2+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, and @vy-ton)


_includes/v21.2/sql/data-domiciling-limitations.md, line 4 at r1 (raw file):

In r4 I still see "this metadata may include table and index names".

Sorry - should be fixed now to say what it has in the quote

I suspect that all the talk about table names is about some confusion ...

Yes, definitely confusion on my end - thanks for explaining this

What's important for domiciling purposes is the data itself, not the metadata, and this is insufficiently clear in your "resulting edit". We need to be clear that a subset (or a sample) of data from indexed columns may appear in meta ranges or other system tables.

Updated to the below. Does this capture what you are saying properly?

"When columns are indexed, a subset (or sample) of data from the indexed columns may appear in meta ranges or other system tables. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the multi-region SQL statements, or the low-level zone configs mechanism."


_includes/v21.2/sql/data-domiciling-limitations.md, line 5 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If there is a "most restrictive subset", you'd probably just put all your data there, not just logs. The premise of this domiciling functionality is that the various domiciling laws are mutually exclusive - european users' data must be in europe, users from region X must be in that region, etc.

Instead of centralizing logs in one region, you're more likely to want to keep logs in the region where they were generated. The is some cross-region leakage just like with system tables, but the vast majority of user data that makes it into the logs is going to be homed in that region. If that's not strong enough for you, you can use the log redaction functionality to strip all raw data from the logs, or limit your log retention entirely.

Thanks, updated the logging bullet to be basically your comment here. What do you think?

"If your log files are kept in the region where they were generated, there is some cross-region leakage (as with the system tables described previously), but the majority of user data that makes it into the logs is going to be homed in that region. If that's not strong enough, you can use the log redaction functionality to strip all raw data from the logs. You can also limit your log retention entirely."


v21.2/data-domiciling-with-placement-restricted.md, line 188 at r1 (raw file):

If I don't trust a report that's designed to report constraint violations, how am I going to duct tape a bunch of queries together to get something I trust more?

Good point - guess I was down the rabbit hole a bit there. I have removed the stuff about finding a single row and just left in the replication reports. Please let me know what you think of this. It's definitely shorter now and hopefully clearer too

My suggestion is to move all the stuff about finding the location of a single row out of this page and to the architecture section about the distribution layer. That seems like a better place for this kind of under-the-hood optional detail.

Makes sense - removed all the stuff about finding the location of a single row and filed https://cockroachlabs.atlassian.net/browse/DOC-2797 to capture maybe adding more docs on this. We already have docs on SHOW RANGE FOR ROW so it's not clear if we will want to bother adding the info about querying crdb_internal.ranges_no_leases or not. According to the crdb_internal docs that table is not supported for prod use


v21.2/data-domiciling-with-placement-restricted.md, line 97 at r4 (raw file):

Previously, ajstorm (Adam Storm) wrote…

In regards to Ben's comment below, I'm wondering if for the sake of this example, we want to this part (and everything that follows) in step 3 out to a separate document? That way we'd have the details there if, like you say, customers are required to validate replica placement, but it's not cluttering up the example for those that aren't as paranoid.

Based on Ben's comment, I removed all the stuff about finding the replica for a single row. Do you think that's an acceptable middle ground? I think it def makes things much shorter and clearer while still giving a user the tools to check stuff with replication reports


v21.2/data-domiciling-with-separate-databases.md, line 2 at r1 (raw file):

I think we could remove it completely as long as the page is still there in the 21.1 version of the docs.

That works for me. I deleted it from the v21.2 docs (it's still there in 21.1) and moved all this info back to the original single 'Data Domiciling with CockroachDB' page.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


v21.2/data-domiciling.md, line 255 at r5 (raw file):

Using CockroachDB as part of your approach to data domiciling has several limitations:

- When columns are [indexed](indexes.html), a subset (or sample) of data from the indexed columns may appear in [meta ranges](architecture/distribution-layer.html#meta-ranges) or other system tables. CockroachDB synchronizes these system ranges and system tables across nodes. This synchronization does not respect any multi-region settings applied via either the [multi-region SQL statements](multiregion-overview.html), or the low-level [zone configs](configure-replication-zones.html) mechanism.

Pick one of "subset" or "sample"; i didn't mean for you to take that part verbatim :)

@ajstorm ajstorm requested a review from bdarnell March 3, 2022 13:43
Copy link
Contributor

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


v21.2/data-domiciling-with-placement-restricted.md, line 97 at r4 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Based on Ben's comment, I removed all the stuff about finding the replica for a single row. Do you think that's an acceptable middle ground? I think it def makes things much shorter and clearer while still giving a user the tools to check stuff with replication reports

Yeah, that SGTM too, and is similar to what I was suggesting we remove.

@ajstorm ajstorm self-requested a review March 3, 2022 13:43
Copy link
Contributor

@awoods187 awoods187 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 11 files at r1, 2 of 7 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


v21.2/data-domiciling.md, line 3 at r5 (raw file):

---
title: Data Domiciling with CockroachDB
summary: Learn how to implement data domiciling using CockroachDB's multi-region SQL capabilities and the ALTER DATABASE ... PLACEMENT RESTRICTED statement.

Do we want to be even more explicit and say "a part of your dd approach"


v21.2/data-domiciling.md, line 29 at r5 (raw file):

This page assumes you are already familiar with:

- CockroachDB's [multi-region SQL abstractions](multiregion-overview.html).

Do we need to say something like, if you arent using the MR abstractions this approach won't help?


v21.2/data-domiciling.md, line 81 at r5 (raw file):

### Step 3. View noncompliant replicas

Next, run a [replication report](query-replication-reports.html) to see which ranges are still not in compliance with your desired domiciling: that data on EU-based entities (users, etc.) does not leave EU-based nodes.

Do we need to explain that it takes time to move leaseholders?


v21.2/data-domiciling.md, line 152 at r5 (raw file):

### Step 4. Apply stricter replica placement settings

To ensure that data on EU-based users, vehicles, etc. from [`REGIONAL BY ROW` tables](regional-tables.html#regional-by-row-tables) is stored only on EU-based nodes in the cluster, you must disable the use of [non-voting replicas](architecture/replication-layer.html#non-voting-replicas) on all of these tables. You can do this using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement.

Is it worth mentioning that this cluster setting will do this for all regions?

Copy link
Contributor Author

@rmloveland rmloveland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben! Made that last small update.

Thanks for looking Andy! Made some updates based on your comments - PTAL

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @awoods187, @bdarnell, @knz, @RichardJCai, and @vy-ton)


v21.2/data-domiciling.md, line 3 at r5 (raw file):

Previously, awoods187 (Andy Woods) wrote…

Do we want to be even more explicit and say "a part of your dd approach"

Makes sense! How about "Learn how to use CockroachDB's multi-region SQL capabilities and the ALTER DATABASE ... PLACEMENT RESTRICTED statement as part of your data domiciling approach"


v21.2/data-domiciling.md, line 29 at r5 (raw file):

Previously, awoods187 (Andy Woods) wrote…

Do we need to say something like, if you arent using the MR abstractions this approach won't help?

I added a followup sentence "If you are not using them, the instructions on this page will not apply." What do you think?


v21.2/data-domiciling.md, line 81 at r5 (raw file):

Previously, awoods187 (Andy Woods) wrote…

Do we need to explain that it takes time to move leaseholders?

I added a little paragraph with some words about this and a link to the 'Performance considerations' section of Migrate to Multi-region SQL

let me know what you think!


v21.2/data-domiciling.md, line 152 at r5 (raw file):

Previously, awoods187 (Andy Woods) wrote…

Is it worth mentioning that this cluster setting will do this for all regions?

Do you mean in the sense that it will turn off non-voting replicas for all regional tables in the entire database?

If so I tweaked the end of the sentence slightly to say "you must disable ... non-voting replicas on all of the regional tables in this database"

Does that capture what you mean?


v21.2/data-domiciling.md, line 255 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Pick one of "subset" or "sample"; i didn't mean for you to take that part verbatim :)

Ha ok :-) went with "subset"

Copy link
Contributor

@awoods187 awoods187 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


v21.2/data-domiciling.md, line 3 at r5 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Makes sense! How about "Learn how to use CockroachDB's multi-region SQL capabilities and the ALTER DATABASE ... PLACEMENT RESTRICTED statement as part of your data domiciling approach"

LGTM


v21.2/data-domiciling.md, line 152 at r5 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Do you mean in the sense that it will turn off non-voting replicas for all regional tables in the entire database?

If so I tweaked the end of the sentence slightly to say "you must disable ... non-voting replicas on all of the regional tables in this database"

Does that capture what you mean?

yes, exactly

@rmloveland rmloveland force-pushed the 20220120-data-domiciling-with-placement-restricted branch from 5929010 to 87dd1a1 Compare March 23, 2022 18:17
@rmloveland
Copy link
Contributor Author

@ericharmeling this has been through technical and legal review and is RFAL from a (your) docs team POV - let me know if questions!

Copy link
Contributor

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I have a couple of non-blocking suggestions. Otherwise, LGTM.

Also - don't forget to copy over to 22.1 too.

Reviewed 2 of 7 files at r2, 1 of 7 files at r5, 7 of 7 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @bdarnell, @knz, @RichardJCai, @rmloveland, and @vy-ton)


v21.2/data-domiciling.md, line 20 at r7 (raw file):

This page has instructions for data domiciling in [multi-region clusters](multiregion-overview.html) using the [`ALTER DATABASE ... PLACEMENT RESTRICTED`](placement-restricted.html) statement. At a high level, this process involves:

1. Controlling the placement of specific row or table data using regional tables - specifically, the [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables), [`REGIONAL BY TABLE`](multiregion-overview.html#regional-tables).

`

Suggestion:

Controlling the placement of specific row or table data using regional tables with the [`REGIONAL BY ROW`](multiregion-overview.html#regional-by-row-tables) and [`REGIONAL BY TABLE`](multiregion-overview.html#regional-tables) clauses.

v21.2/data-domiciling.md, line 150 at r7 (raw file):

~~~

This output shows that the `movr` database has ranges out of compliance, which you saw previously. Unfortunately, this output does not contain the table or index names due to a current limitation of the replication reports: non-voting replicas are not associated with any tables or indexes by the reports.

Should this replication report limitation be listed in the Limitation section of this page? If so, I suggest linking to that section, and adding the limitation in a new bullet point.

If you think the limitation should instead be listed on a different page (the replication layer architecture page, for example), then perhaps add the limitation to that page and have the ref link to there?

Suggestion:

[limitation](#limitations)

v21.2/data-domiciling.md, line 261 at r7 (raw file):

- If your [log files](logging-overview.html) are kept in the region where they were generated, there is some cross-region leakage (like the system tables described previously), but the majority of user data that makes it into the logs is going to be homed in that region. If that's not strong enough, you can use the [log redaction functionality](configure-logs.html#redact-logs) to strip all raw data from the logs. You can also limit your log retention entirely.
- If you start a node with a [`--locality`](cockroach-start.html#locality) flag that says the node is in region _A_, but the node is actually running in some region _B_, data domiciling based on the inferred node placement will not work. A CockroachDB node only knows its locality based on the text supplied to the `--locality` flag; it can not ensure that it is actually running in that physical location.

See my comment above re: the replication reports limitation.

Copy link
Contributor Author

@rmloveland rmloveland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review Eric!

I made that one edit you suggested, and I would like to punt the replication reports thing to be part of addressing DOC-2864 for reasons mentioned below

(And yes, will def copy over to v22.1- thanks for the reminder! have that on the ol' todo list - didn't want to add until after all was approved so folks don't have to scroll through)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @ajstorm, @andreimatei, @awoods187, @bdarnell, @ericharmeling, @knz, @RichardJCai, and @vy-ton)


v21.2/data-domiciling.md, line 152 at r5 (raw file):

Previously, awoods187 (Andy Woods) wrote…

yes, exactly

Done.


v21.2/data-domiciling.md, line 20 at r7 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

`

✔️


v21.2/data-domiciling.md, line 150 at r7 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

Should this replication report limitation be listed in the Limitation section of this page? If so, I suggest linking to that section, and adding the limitation in a new bullet point.

If you think the limitation should instead be listed on a different page (the replication layer architecture page, for example), then perhaps add the limitation to that page and have the ref link to there?

Good point. We have an issue https://cockroachlabs.atlassian.net/browse/DOC-2864 to update the replication report docs to note this limitation

I'd rather not list the replication report limitation as a limitation on this page since the limitations section on this page is meant to refer to limitations of the approach to data domiciling described here, not limitations of that one sub-tool (the replication reports) we are using to check our work

... but I'm totes into cross-linking this paragraph with the doc update that will fix DOC-2864 by mentioning that limitation of the replication reports, as soon as that gets in

let me know if that makes sense or if you strongly disagree


v21.2/data-domiciling.md, line 261 at r7 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

See my comment above re: the replication reports limitation.

responded above!

@rmloveland rmloveland force-pushed the 20220120-data-domiciling-with-placement-restricted branch from 87dd1a1 to 41630ff Compare March 24, 2022 14:30
Fixes DOC-1178
Fixes DOC-1242

Summary of changes:

- Update the 'Data Domiciling' page with info that explains how to
  accomplish domiciling with the 'ALTER DATABASE ... PLACEMENT
  RESTRICTED' feature, and how to check that replicas are where they are
  supposed to be using the Replication Reports.

- As a bonus, break out some info from the multi-region latency demo
  tutorial into includes, since we are now also using them from an
  example on the updated domiciling page.

- Apply these changes to both v21.2 and v22.1.
@rmloveland rmloveland force-pushed the 20220120-data-domiciling-with-placement-restricted branch from 41630ff to 8a1fa31 Compare March 25, 2022 14:48
@rmloveland rmloveland merged commit 6b8c24f into master Mar 25, 2022
@rmloveland rmloveland deleted the 20220120-data-domiciling-with-placement-restricted branch March 25, 2022 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants