diff --git a/custom_dc/data_cloud.md b/custom_dc/data_cloud.md index 6404c5c98..9614b21a6 100644 --- a/custom_dc/data_cloud.md +++ b/custom_dc/data_cloud.md @@ -157,7 +157,7 @@ Before you proceed, ensure you have completed steps 1 to 3 of the [One-time setu ### Step 1: Set environment variables To run a local instance of the services container, you need to set all the environment variables in the `custom_dc/env.list` file. See [above](#set-vars) for the details, with the following differences: -- For the `INPUT_DIR`, specify the full local path where your CSV and JSON files are stored, as described in the [Getting started](/custom_dc/). +- For the `INPUT_DIR`, specify the full local path where your CSV and JSON files are stored, as described in the [Quickstart](/custom_dc/quickstart.html#env-vars). - Set `GOOGLE_CLOUD_PROJECT` to your GCP project name. ### Step 2: Generate credentials for Google Cloud authentication {#gen-creds} diff --git a/custom_dc/faq.md b/custom_dc/faq.md index 4d5866b8a..7bb9d6298 100644 --- a/custom_dc/faq.md +++ b/custom_dc/faq.md @@ -19,9 +19,12 @@ Yes; there are many options for doing so. If you want an entirely private site w Note that you cannot apply fine-grained access restrictions, such as access to specific data or pages. Access is either all or nothing. If you want to be able to partition off data, you would need to create additional custom instances. -### Will my custom data end up in the base Data Commons? +### Will my data or queries end up in base Data Commons? {#data-security} -No. Even if a query needs data from the base Data Commons, gleaned from the (incomplete) results from your custom data, it will never transfer any custom data to the base data store. +Your user queries, observations data, or property values are never transferred to base Data Commons. The NL model built from your custom data lives solely in your custom instance. 
The custom Data Commons instance does make API calls to the base Data Commons instance (as depicted in [this diagram](/custom_dc/index.html#system-overview)) only in the following cases: +- At data load time, API calls are made from the custom instance to the base instance to resolve entity names to [DCIDs](/glossary.html#dcid); for example, if your data refers to a particular country name, the custom instance will send an API request to look up its DCID. +- At run time, when a user enters an NL query, the custom instance uses its local NL model to identify the relevant statistical variables. The custom instance then issues two requests for statistical variable observations: a SQL query to your custom SQL database and an API call to the base Data Commons database. These requests include only DCIDs and contain no information about the original query or context of the user request. The data is joined by entity DCIDs. +- At run time, when the website frontend renders a data visualization, it will also make the same two requests to get observations data. ## Natural language processing diff --git a/custom_dc/index.md b/custom_dc/index.md index eb60a338e..0cf775651 100644 --- a/custom_dc/index.md +++ b/custom_dc/index.md @@ -28,8 +28,6 @@ If you have the resources to develop and maintain a custom Data Commons instance - You want to add your own data to Data Commons but want to customize the UI of the site. - You want to add your own private data to Data Commons, and restrict access to it. -Also, if you want to add all of your data to the base Data Commons and test how it will work with the exploration tools and natural language queries, you will need to at least host a local development site for testing purposes. - For the following use cases, a custom Data Commons instance is not necessary: - You only want to make your own data available to the base public Data Commons site and don't need to test it.
In this case, see the procedures in [Data imports](/import_dataset/index.html). @@ -53,6 +51,7 @@ For the following use cases, a custom Data Commons instance is not necessary: 1. You cannot set access controls on specific data, only the entire custom site. ## System overview +{: #system-overview} Essentially, a custom Data Commons instance is a mirror of the public Data Commons that runs in [Docker](http://docker.com) containers hosted in the cloud. In the browsing tools, the custom data appears alongside the base data in the list of variables. When a query is sent to the custom website, a Data Commons server fetches both the custom and base data to provide multiple visualizations. At a high level, here is a conceptual view of a custom Data Commons instance: @@ -60,9 +59,14 @@ Essentially, a custom Data Commo A custom Data Commons instance uses custom data that you provide as raw CSV files. An importer script converts the CSV data into the Data Commons format and stores this in a SQL database. For local development, we provide a lightweight, open-source [SQLite](http://sqlite.org) database; for production, we recommend that you use [Google Cloud SQL](https://cloud.google.com/sql/){: target="_blank"}. -In addition to the data, a custom Data Commons instance consists of two Docker containers: one with the core services that serve the data and website; and one with utilities for managing and loading custom data and embeddings used for natural-language processing. -Details about the components that make up the containers are provided in the [Getting started](/custom_dc/quickstart.html) guide. +> **Note**: You have full control and ownership of your data, which will live in SQL data stores that you own and manage. Your data is never transferred to the base Data Commons data stores managed by Google; see full details in this [FAQ](/custom_dc/faq.html#data-security).
+ +In addition to the data, a custom Data Commons instance consists of two Docker containers: +- A "data management" container, with utilities for managing and loading custom data and embeddings used for natural-language processing +- A "services" container, with the core services that serve the data and website + +Details about the components that make up the containers are provided in the [Quickstart](/custom_dc/quickstart.html) guide. ## Requirements and cost @@ -83,7 +87,7 @@ The cost of running a site on Google Cloud Platform depends on the size of your {: #workflow} ## Recommended workflow -1. Work through the [Quickstart](/custom_dc/quickstart.html) page to learn how to run a local Data Commons instance and load some sample data. 1. Prepare your real-world data and load it into the local custom instance. Data Commons requires your data to be in a specific format. See [Prepare and load your own data](/custom_dc/custom_data.html) for details. > Note: This section is very important! If your data is not in the schema Data Commons expects, it won't load. 1. If you want to customize the look and feel of the site, see [Customize the site](/custom_dc/custom_ui.html). diff --git a/custom_dc/quickstart.md b/custom_dc/quickstart.md index cbe29697f..d54253b1d 100644 --- a/custom_dc/quickstart.md +++ b/custom_dc/quickstart.md @@ -1,12 +1,12 @@ --- layout: default -title: Getting started +title: Quickstart nav_order: 2 parent: Build your own Data Commons --- {:.no_toc} -# Quickstart This page shows you how to run a local custom Data Commons instance inside Docker containers and load sample custom data from a local SQLite database.
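As a concrete illustration of the environment setup mentioned in these pages, a hypothetical `custom_dc/env.list` for a local run might look like the sketch below. The variable names `INPUT_DIR` and `GOOGLE_CLOUD_PROJECT` come from the steps above; the path and project name are placeholders, and any other variables your setup requires are omitted:

```shell
# Hypothetical custom_dc/env.list -- all values are placeholders.
# Full local path to the directory containing your CSV and JSON files:
INPUT_DIR=/home/alice/custom_dc/input
# Your GCP project name (used when running against Google Cloud):
GOOGLE_CLOUD_PROJECT=my-gcp-project
```

Because Docker's `--env-file` neither expands shell variables nor strips inline comments, keep each value on its own line with any comments on the lines above it.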
A custom Data Commons instance uses code from the public open-source repo, available at [https://github.com/datacommonsorg/](https://github.com/datacommonsorg/){: target="_blank"}.
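With an environment file in place, a services container of this kind is typically started with a single `docker run`. The sketch below only assembles the command as a string so its shape is visible; the image name and tag (`gcr.io/datcom-ci/datacommons-services:stable`) and the host port are assumptions, so consult the setup steps in this guide for the exact invocation:

```shell
# Hypothetical services-container invocation; IMAGE and PORT are
# assumptions, not values confirmed by this guide.
IMAGE=gcr.io/datcom-ci/datacommons-services:stable
PORT=8080
CMD="docker run -it --env-file ./custom_dc/env.list -p $PORT:8080 $IMAGE"
echo "$CMD"
```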