Add more detail about data security (#512)
* integrate custom docs with new UI

* more edits

* use website wording for intro

* fix numbering in table

* rename and some edits

* rename manage_repo file, per Bo

* Merge.

* formatting edits

* updates per Keyur's feedback

* Fix typos

* fix nav order

* fix link to API key request form

* update form link

* update key request form and output dir env var

* Revert to gerund

Though the style guide says to just use imperatives, "get started" just sounds weird. Also this is more consistent with "troubleshooting"

* new troubleshooting entry

* fix typo

* new data container procedures

* more work

* more work

* complete data draft

* more changes

* more changes

* more revisions

* update troubleshooting doc etc.

* new version of diagrams

* remove data loading problems troubleshooting entry; can't reproduce

* revert title change

* add example for not mixing entity types

* changes from Keyur

* add screenshots for GCP, and related changes

* fixed one image

* added screenshots for Cloud Run service

* resize images

* more changes from Keyur

* fix a tiny error

* delete unused images

* fix missing dash

* update build file

* adjust build command

* Revert "adjust build command"

This reverts commit 4ce0fb9.

* update docker file

* more fixes

* one last fix

* make links to Cloud Console open in a new page

* fixes to quickstart suggested by Prem

* one more change

* change from Keyur

* revise procedure

* merge

* add brief explanation of data model to quickstart

* slight wording tweak

* incorporate feedback from Keyur

* remove erroneous edit

* correct missing text

* more work on tasks for finding stuff

* merge

* update to use env.sample

* typo

* typo

* get file back in head shape

* fix file name

* add more detail about data security

* fix typo

* corrections from Keyur

* fix other mention of SQL queries
kmoscoe authored Sep 18, 2024
1 parent f5cee9a commit 5a670c6
Showing 4 changed files with 17 additions and 10 deletions.
2 changes: 1 addition & 1 deletion custom_dc/data_cloud.md
@@ -157,7 +157,7 @@ Before you proceed, ensure you have completed steps 1 to 3 of the [One-time setu
### Step 1: Set environment variables

To run a local instance of the services container, you need to set all the environment variables in the `custom_dc/env.list` file. See [above](#set-vars) for the details, with the following differences:
-- For the `INPUT_DIR`, specify the full local path where your CSV and JSON files are stored, as described in the [Getting started](/custom_dc/).
+- For the `INPUT_DIR`, specify the full local path where your CSV and JSON files are stored, as described in the [Quickstart](/custom_dc/quickstart.html#env-vars).
- Set `GOOGLE_CLOUD_PROJECT` to your GCP project name.
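
For illustration only, the two settings called out above might look like this in `custom_dc/env.list` (the path and project name are placeholders, not real values):

```shell
# Hypothetical values -- substitute your own local path and GCP project name.
INPUT_DIR=/home/alice/custom_dc/input
GOOGLE_CLOUD_PROJECT=my-gcp-project
```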

### Step 2: Generate credentials for Google Cloud authentication {#gen-creds}
7 changes: 5 additions & 2 deletions custom_dc/faq.md
@@ -19,9 +19,12 @@ Yes; there are many options for doing so. If you want an entirely private site w

Note that you cannot apply fine-grained access restrictions, such as access to specific data or pages. Access is either all or nothing. If you want to be able to partition off data, you would need to create additional custom instances.

-### Will my custom data end up in the base Data Commons?
+### Will my data or queries end up in base Data Commons? {#data-security}

-No. Even if a query needs data from the base Data Commons, gleaned from the (incomplete) results from your custom data, it will never transfer any custom data to the base data store.
+Your user queries, observations data, and property values are never transferred to base Data Commons. The NL model built from your custom data lives solely in your custom instance. The custom Data Commons instance does make API calls to the base Data Commons instance (as depicted in [this diagram](/custom_dc/index.html#system-overview)), but only in the following cases:
+- At data load time, API calls are made from the custom instance to the base instance to resolve entity names to [DCIDs](/glossary.html#dcid); for example, if your data refers to a particular country name, the custom instance will send an API request to look up its DCID.
+- At run time, when a user enters an NL query, the custom instance uses its local NL model to identify the relevant statistical variables. The custom instance then issues two requests for statistical variable observations: a SQL query to your custom SQL database and an API call to the base Data Commons database. These requests include only DCIDs and contain no information about the original query or context of the user request. The data is joined by entity DCIDs.
+- At run time, when the website frontend renders a data visualization, it also makes the same two requests to get observations data.
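
The join-by-DCID behavior described above can be pictured with a short sketch. This is illustrative only — the function and field names are hypothetical, not Data Commons code — but it shows how results from the two sources can be merged using nothing beyond entity DCIDs:

```python
# Illustrative sketch (not the actual Data Commons implementation) of joining
# observations fetched from the custom SQL database and from the base
# Data Commons API purely by entity DCID. Note that neither request carries
# the user's original query text.

def join_observations(custom_rows, base_rows):
    """Merge observation rows from the two sources, keyed by (dcid, variable).

    Custom data is applied last, so it takes precedence for the same key.
    """
    merged = {}
    for row in base_rows + custom_rows:
        merged[(row["dcid"], row["variable"])] = row["value"]
    return merged

# Hypothetical responses: only DCIDs, variables, and values -- no query context.
custom_rows = [{"dcid": "country/USA", "variable": "MyCustomVar", "value": 42}]
base_rows = [{"dcid": "country/USA", "variable": "Count_Person", "value": 331000000}]

print(join_observations(custom_rows, base_rows))
```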

## Natural language processing

14 changes: 9 additions & 5 deletions custom_dc/index.md
@@ -28,8 +28,6 @@ If you have the resources to develop and maintain a custom Data Commons instance
- You want to add your own data to Data Commons but want to customize the UI of the site.
- You want to add your own private data to Data Commons, and restrict access to it.

-Also, if you want to add all of your data to the base Data Commons and test how it will work with the exploration tools and natural language queries, you will need to at least host a local development site for testing purposes.

For the following use cases, a custom Data Commons instance is not necessary:

- You only want to make your own data available to the base public Data Commons site and don't need to test it. In this case, see the procedures in [Data imports](/import_dataset/index.html).
@@ -53,16 +51,22 @@ For the following use cases, a custom Data Commons instance is not necessary:
1. You cannot set access controls on specific data, only the entire custom site.

## System overview
+{: #system-overview}

Essentially, a custom Data Commons instance is a mirror of the public Data Commons that runs in [Docker](http://docker.com) containers hosted in the cloud. In the browsing tools, the custom data appears alongside the base data in the list of variables. When a query is sent to the custom website, a Data Commons server fetches both the custom and base data to provide multiple visualizations. At a high level, here is a conceptual view of a custom Data Commons instance:

![setup1](/assets/images/custom_dc/customdc_setup1.png){: height="450" }

A custom Data Commons instance uses custom data that you provide as raw CSV files. An importer script converts the CSV data into the Data Commons format and stores this in a SQL database. For local development, we provide a lightweight, open-source [SQLite](http://sqlite.org) database; for production, we recommend that you use [Google Cloud SQL](https://cloud.google.com/sql/){: target="_blank"}.
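
The storage model can be pictured with a minimal SQLite sketch. The table name and columns below are hypothetical — they are not the schema the Data Commons importer actually produces — but they show the general shape of CSV rows landing in a local SQL database:

```python
import sqlite3

# Illustrative only: a hypothetical observations table, not the importer's
# actual schema. An in-memory database stands in for the local SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations (dcid TEXT, variable TEXT, date TEXT, value REAL)"
)

# A row converted from a hypothetical custom CSV file.
rows = [("country/USA", "MyVar", "2024", 1.5)]
conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
print(count)  # prints 1
```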

-In addition to the data, a custom Data Commons instance consists of two Docker containers: one with the core services that serve the data and website; and one with utilities for managing and loading custom data and embeddings used for natural-language processing.
-
-Details about the components that make up the containers are provided in the [Getting started](/custom_dc/quickstart.html) guide.
+> **Note**: You have full control and ownership of your data, which will live in SQL data stores that you own and manage. Your data is never transferred to the base Data Commons data stores managed by Google; see full details in this [FAQ](/custom_dc/faq.html#data-security).
+In addition to the data, a custom Data Commons instance consists of two Docker containers:
+- A "data management" container, with utilities for managing and loading custom data and embeddings used for natural-language processing
+- A "services" container, with the core services that serve the data and website
+
+Details about the components that make up the containers are provided in the [Quickstart](/custom_dc/quickstart.html) guide.

## Requirements and cost

@@ -83,7 +87,7 @@ The cost of running a site on Google Cloud Platform depends on the size of your
{: #workflow}
## Recommended workflow

-1. Work through the [Getting started](/custom_dc/quickstart.html) page to learn how to run a local Data Commons instance and load some sample data.
+1. Work through the [Quickstart](/custom_dc/quickstart.html) page to learn how to run a local Data Commons instance and load some sample data.
1. Prepare your real-world data and load it in the local custom instance. Data Commons requires your data to be in a specific format. See [Prepare and load your own data](/custom_dc/custom_data.html) for details.
> Note: This section is very important! If your data is not in the schema Data Commons expects, it won't load.
1. If you want to customize the look and feel of the site, see [Customize the site](/custom_dc/custom_ui.html).
4 changes: 2 additions & 2 deletions custom_dc/quickstart.md
@@ -1,12 +1,12 @@
---
layout: default
-title: Getting started
+title: Quickstart
nav_order: 2
parent: Build your own Data Commons
---

{:.no_toc}
-# Getting started
+# Quickstart

This page shows you how to run a local custom Data Commons instance inside Docker containers and load sample custom data from a local SQLite database. A custom Data Commons instance uses code from the public open-source repo, available at [https://github.com/datacommonsorg/](https://github.com/datacommonsorg/){: target="_blank"}.

