Document data docker schema update mode #527

88 changes: 45 additions & 43 deletions custom_dc/custom_data.md
```
San Francisco,2023,300,300,200,50
San Jose,2023,400,400,300,0
```

The _ENTITY_ is an existing property in the Data Commons knowledge graph that is used to describe an entity, most commonly a place. The best way to think of the entity type is as a key that could be used to join to other data sets. The column heading can be expressed as any existing place-related property; see [Place types](/place_types.html) for a full list. It may also be any of the special DCID prefixes listed in [Special place names](#special-names).

> **Note:** The type of the entities in a single file should be unique; do not mix multiple entity types in the same CSV file. For example, if you have observations for cities and counties, put all the city data in one CSV file and all the county data in another one.

The `config.json` file specifies how the CSV contents should be mapped and resolved.
Here is the general spec for the JSON file:

<pre>
{
  "inputFiles": {
    "<var>FILE_NAME1</var>": {
      "entityType": "<var>ENTITY_PROPERTY</var>",
      "ignoreColumns": ["<var>COLUMN1</var>", "<var>COLUMN2</var>", ...],
      "provenance": "<var>NAME</var>",
      "observationProperties": {
        "unit": "<var>MEASUREMENT_UNIT</var>",
        "observationPeriod": "<var>OBSERVATION_PERIOD</var>",
        "scalingFactor": "<var>DENOMINATOR_VALUE</var>",
        "measurementMethod": "<var>METHOD</var>"
      }
    },
    "<var>FILE_NAME2</var>": {
      ...
    },
    ...
  },
  "variables": {
    "<var>VARIABLE1</var>": {"group": "<var>GROUP_NAME1</var>"},
    "<var>VARIABLE2</var>": {"group": "<var>GROUP_NAME1</var>"},
    "<var>VARIABLE3</var>": {
      "name": "<var>DISPLAY_NAME</var>",
      "description": "<var>DESCRIPTION</var>",
      "searchDescriptions": ["<var>SENTENCE1</var>", "<var>SENTENCE2</var>", ...],
      "group": "<var>GROUP_NAME2</var>",
      "properties": {
        "<var>PROPERTY_NAME1</var>": "<var>VALUE</var>",
        "<var>PROPERTY_NAME2</var>": "<var>VALUE</var>"
      }
    }
  },
  "sources": {
    "<var>SOURCE_NAME1</var>": {
      "url": "<var>URL</var>",
      "provenances": {
        "<var>PROVENANCE_NAME1</var>": "<var>URL</var>",
        "<var>PROVENANCE_NAME2</var>": "<var>URL</var>",
        ...
      }
    }
  }
}
</pre>

Each section contains some required and optional fields, which are described in detail below.
You must specify the provenance details under `sources`.`provenances`; this field...
- [`unit`](/glossary.html#unit): The unit of measurement used in the observations. This is a string representing a currency, area, weight, volume, etc. For example, `SquareFoot`, `USD`, `Barrel`, etc.
- [`measurementPeriod`](/glossary.html#observation-period): The period of time in which the observations were recorded. This must be in ISO duration format, namely `P[0-9][Y|M|D|h|m|s]`. For example, `P1Y` is 1 year, `P3M` is 3 months, `P3h` is 3 hours.
- [`measurementMethod`](/glossary.html#measurement-method): The method used to gather the observations. This can be a random string or an existing DCID of [`MeasurementMethodEnum`](https://datacommons.org/browser/MeasurementMethodEnum){: target="_blank"} type; for example, `EDA_Estimate` or `WorldBankEstimate`.
- [`scalingFactor`](/glossary.html#scaling-factor): An integer representing the denominator used in measurements involving ratios or percentages. For example, for percentages, the denominator would be `100`.

Note that you cannot mix different property values in a single CSV file. If you have observations using different properties, you must put them in separate CSV files.
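
For illustration, a minimal `inputFiles` entry using these properties might look like the following sketch (the file name, provenance name, and property values are hypothetical):

```
{
  "inputFiles": {
    "country_wages.csv": {
      "entityType": "Country",
      "provenance": "OECD Wage Statistics",
      "observationProperties": {
        "unit": "USD",
        "observationPeriod": "P1Y"
      }
    }
  }
}
```

Observations with a different unit or period would go in a second CSV file, with its own `inputFiles` entry.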

The `variables` section is optional. You can use it to override names and associate additional properties with the variables in your CSV files.

`name`

: The display name of the variable, which will show up in the site's exploration tools. If not specified, the column name is used as the display name.
The name should be concise and precise; that is, the shortest possible name that allows humans to uniquely identify a given variable. The name is used to generate NL embeddings.

`description`
Each property is specified as a key:value pair. Here are some examples:

You can have a multi-level group hierarchy by using `/` as a separator between each group.

`searchDescriptions`

: An array of descriptions to be used for creating more NL embeddings for the variable. This is only needed if the variable `name` is not sufficient for generating embeddings.

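As a sketch, a `variables` entry combining these fields might look like the following (the display name, description, group, and search descriptions are illustrative; the variable name matches the sample data shown elsewhere on this page):

```
"variables": {
  "average_annual_wage": {
    "name": "Average annual wage",
    "description": "Average annual wage per employee, in US dollars",
    "group": "Economy/Wages",
    "searchDescriptions": [
      "Average yearly salary per worker",
      "Mean annual earnings"
    ]
  }
}
```
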
The `sources` section is optional. It encodes the sources and provenances associated with your data.

The following procedures show you how to load and serve your custom data locally.

To load data in Google Cloud, see instead [Load data in Google Cloud](/custom_dc/data_cloud.html) for procedures.

### Configure environment variables

Edit the `env.list` file you created [previously](/custom_dc/quickstart.html#env-vars) as follows:
- Set the `INPUT_DIR` variable to the directory where your input files are stored.
- Set the `OUTPUT_DIR` variable to the directory where you would like the output files to be stored. This can be the same or different from the input directory. When you rerun the Docker data management container, it will create a `datacommons` subdirectory under this directory.
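
For example, the relevant lines in `env.list` might look like the following (the paths are placeholders; leave the other variables from the Quickstart unchanged):

```
# Local directories for custom data (placeholder paths)
INPUT_DIR=/home/me/custom_dc/input
OUTPUT_DIR=/home/me/custom_dc/output
```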

### Start the Docker containers with local custom data {#docker-data}
If you need to troubleshoot custom data, it is helpful to inspect the contents of the generated SQLite database.

To do so, from a terminal window, open the database:

<pre>
sqlite3 <var>OUTPUT_DIRECTORY</var>/datacommons/datacommons.db
</pre>
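
Inside the shell you can list tables and run ordinary SQL. For example, a query along these lines shows a few imported observations (the `observations` table name is an assumption based on the sample output below):

```
.tables
SELECT * FROM observations LIMIT 5;
```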

country/BEL|average_annual_wage|2005|55662.21541|c/p/1

To exit the sqlite shell, press Ctrl-D.

### Database schema updates

28 changes: 14 additions & 14 deletions custom_dc/data_cloud.md
This page shows you how to store your custom data in Google Cloud and create the required cloud resources.

## Overview

Once you have tested locally, the next step is to get your data into the Google Cloud Platform. You upload your CSV and JSON files to [Google Cloud Storage](https://cloud.google.com/storage){: target="_blank"}, and run the Data Commons data management Docker container as a Cloud Run job. The job will transform and store the data in a [Google Cloud SQL](https://cloud.google.com/sql){: target="_blank"} database, and generate NL embeddings stored in Cloud Storage.

![data management setup](/assets/images/custom_dc/customdc_setup3.png)

Alternatively, if you have a very large data set, you may find it faster to store your input files and run the data management container locally, and output the data to Google Cloud Storage. If you would like to use this approach, follow steps 1 to 3 of the one-time setup steps below and then skip to [Run the data management container locally](#run-local).

## Prerequisites

This stores the CSV and JSON files that you will upload whenever your data changes.
1. For the **Location type**, choose the same regional options as for Cloud SQL above.
1. When you have finished setting all the configuration options, click **Create**.
1. In the **Bucket Details** page, click **Create Folder** to create a new folder to hold your data and name it as desired.
1. Optionally, create separate folders to hold input and output files, or just use the same one as for the input.

**Note:** If you plan to run the data management container locally, you only need to create a single folder to hold the output files.
1. Record the folder path(s) as <code>gs://<var>BUCKET_NAME</var>/<var>FOLDER_PATH</var></code> for setting the `INPUT_DIR` and `OUTPUT_DIR` environment variables below.
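
If you prefer the gcloud CLI, a bucket can also be created with a command along these lines (the bucket name and location are placeholders):

```
gcloud storage buckets create gs://my-datacommons-bucket --location=us-central1
```

Folders in Cloud Storage are created implicitly when you upload objects to a path, so the CLI route does not require a separate folder-creation step.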

### Step 3: Create a Google Cloud SQL instance

This stores the data that will be served at run time. The Data Commons data management job will load the transformed data into this database.
1. Select **Databases**.
1. Click **Create Database**.
1. Choose a name for the database or use the default, `datacommons`.
1. Click **Create**.
1. In the **Overview** page for the new instance, record the **Connection name** to set in environment variables in the next step.
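
Equivalently, once the instance exists you can create the database and look up the connection name from the CLI; a sketch, assuming an instance named `datacommons-sql`:

```
gcloud sql databases create datacommons --instance=datacommons-sql
gcloud sql instances describe datacommons-sql --format="value(connectionName)"
```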

### Step 4: Create a Google Cloud Run job
Now set environment variables:
1. Add names and values for the following environment variables:
- `USE_CLOUDSQL`: Set to `true`.
- `DC_API_KEY`: Set to your API key.
- `INPUT_DIR`: Set to the Cloud Storage bucket and input folder that you created in step 2 above.
- `OUTPUT_DIR`: Set to the Cloud Storage bucket (and, optionally, output folder) that you created in step 2 above. If you didn't create a separate folder for output, specify the same folder as the `INPUT_DIR`.
- `CLOUDSQL_INSTANCE`: Set to the full connection name of the instance you created in step 3 above.
- `DB_USER`: Set to a user you configured when you created the instance in step 3, or to `root` if you didn't create a new user.
- `DB_PASS`: Set to the user's or root password you configured when you created the instance in step 3.
- `DB_NAME`: Only set this if you configured the database name to something other than `datacommons`.
1. When you have finished, click **Done**.

![Cloud Run job](/assets/images/custom_dc/gcp_screenshot3.png){: width="450" }
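
If you would rather script this step, the same variables can be set on an existing job with the gcloud CLI; a sketch, assuming the job is named `datacommons-data` and runs in `us-central1` (avoid passing secrets such as `DB_PASS` or `DC_API_KEY` on the command line where possible):

```
gcloud run jobs update datacommons-data \
  --region=us-central1 \
  --update-env-vars=USE_CLOUDSQL=true,INPUT_DIR=gs://my-bucket/input,OUTPUT_DIR=gs://my-bucket/output,CLOUDSQL_INSTANCE=my-project:us-central1:datacommons-sql,DB_USER=root
```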


> **Note:** Do not upload the local `datacommons` subdirectory or its files.

As you are iterating on changes to the source CSV and JSON files, you can re-upload them at any time, either overwriting existing files or creating new folders. To load them into Cloud SQL, you run the Cloud Run job you created above.
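
For example, a re-upload from the command line might look like this (the file and bucket names are placeholders):

```
gcloud storage cp my_data.csv config.json gs://my-bucket/input/
```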

### Step 2: Start the data management Cloud Run job {#run-job}

To run the job:

1. Go to [https://console.cloud.google.com/run/jobs](https://console.cloud.google.com/run/jobs){: target="_blank"} for your project.
1. From the list of jobs, click the link of the "datacommons-data" job you created above.
1. Click **Execute**. It will take several minutes for the job to run. You can click the **Logs** tab to view the progress.
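
You can also start the job from the command line; a sketch, assuming the job name and region used earlier:

```
gcloud run jobs execute datacommons-data --region=us-central1
```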

When it completes, to verify that the data has been loaded correctly, see the next step.

Before you proceed, ensure you have completed steps 1 to 3 of the one-time setup above.
### Step 1: Set environment variables

To run a local instance of the services container, you need to set all the environment variables in the `custom_dc/env.list` file. See [above](#set-vars) for the details, with the following differences:
- For the `INPUT_DIR`, specify the full local path where your CSV and JSON files are stored, as described in the [Quickstart](/custom_dc/quickstart.html#env-vars).
- Set `GOOGLE_CLOUD_PROJECT` to your GCP project name.
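
Putting it together, an `env.list` for this workflow might look like the following sketch (all values are placeholders):

```
INPUT_DIR=/home/me/custom_dc/input
OUTPUT_DIR=gs://my-bucket/output
USE_CLOUDSQL=true
DC_API_KEY=your-api-key
CLOUDSQL_INSTANCE=my-project:us-central1:datacommons-sql
DB_USER=root
DB_PASS=your-password
GOOGLE_CLOUD_PROJECT=my-project
```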

### Step 2: Generate credentials for Google Cloud authentication {#gen-creds}
Open a terminal window and run the following command:

```
gcloud auth application-default login
```

This opens a browser window that prompts you to enter credentials, sign in to Google Auth Library and allow Google Auth Library to access your account. Accept the prompts. When it has completed, a credential JSON file is created in
`$HOME/.config/gcloud/application_default_credentials.json`. Use this in the command below to authenticate from the docker container.

The first time you run it, you may be prompted to specify a quota project for billing that will be used in the credentials file. If so, run this command:

<pre>
gcloud auth application-default set-quota-project <var>PROJECT_ID</var>
</pre>

If you are prompted to install the Cloud Resource Manager API, press `y` to accept.
See the section [above](#gen-creds) for procedures.

From the root directory of your repo, run the following command, assuming you are using a locally built image:

<pre>
docker run -it \
--env-file $PWD/custom_dc/env.list \
-p 8080:8080 \
  ...
</pre>
64 changes: 64 additions & 0 deletions custom_dc/database_update.md
---
layout: default
title: Update your database schema
nav_order: 9
parent: Build your own Data Commons
---

{:.no_toc}
# Update your database schema

While starting Data Commons services, you may see an error that starts with `SQL schema check failed`. This means your database schema must be updated for compatibility with the latest Data Commons services.

You can update your database by running a data management job with the environment variable `SCHEMA_UPDATE_ONLY` set to `true`. This will alter your database without modifying already-imported data.
> **Review comment (Contributor):** May need to be updated based on what we decide in https://github.com/datacommonsorg/website/pull/4686/files#r1815646141
>
> **Reply (Author):** Updated.


Running a data management job in the default mode will also update the database schema, but may take longer since it fully re-imports your custom data.

Once your database is updated, starting Data Commons services should succeed.

This page contains detailed instructions for passing `SCHEMA_UPDATE_ONLY` to the data management container using various workflows.

* TOC
{:toc}

## Local data management job with local SQLite database

Add `-e SCHEMA_UPDATE_ONLY=true` to the Docker run command for the data management container (the first command in [this doc section](/custom_dc/custom_data.html#docker-data){: target="_blank"}):

<pre>
docker run \
--env-file $PWD/custom_dc/env.list \
-v <var>INPUT_DIRECTORY</var>:<var>INPUT_DIRECTORY</var> \
-v <var>OUTPUT_DIRECTORY</var>:<var>OUTPUT_DIRECTORY</var> \
<b>-e SCHEMA_UPDATE_ONLY=true</b> \
gcr.io/datcom-ci/datacommons-data:stable
</pre>

## Cloud Run data management job

Run your existing Cloud Run job with an environment variable override.

1. Go to [https://console.cloud.google.com/run/jobs](https://console.cloud.google.com/run/jobs){: target="_blank"} for your project.
1. From the list of jobs, click the link of the "datacommons-data" job. This should be a job that uses the `stable` or `latest` tag of the image hosted at `gcr.io/datcom-ci/datacommons-data`.
1. Next to Execute, use the dropdown to find the option to **Execute with overrides**.
1. Use the **Add variable** button to set a variable with name `SCHEMA_UPDATE_ONLY` and value `true`.
1. Click **Execute**.
1. It should only take a few minutes for the job to run. You can click the **Logs** tab to view the progress.
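
If you prefer the gcloud CLI to the console, one way to achieve the same effect is to set the variable on the job, execute it, and then remove the variable afterwards; a sketch, assuming the job name and region used earlier:

```
gcloud run jobs update datacommons-data --region=us-central1 --update-env-vars=SCHEMA_UPDATE_ONLY=true
gcloud run jobs execute datacommons-data --region=us-central1
gcloud run jobs update datacommons-data --region=us-central1 --remove-env-vars=SCHEMA_UPDATE_ONLY
```

Unlike the console's **Execute with overrides**, this changes the job definition, so remember the final command to remove the variable once the schema update has completed.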


## (Advanced) Local data management job with Cloud SQL

If you followed [these instructions](/custom_dc/data_cloud.html#run-local){: target="_blank"} to load data from your local machine into a Cloud SQL database, add `-e SCHEMA_UPDATE_ONLY=true` to the Docker run command from the final step:

<pre>
docker run \
--env-file $PWD/custom_dc/env.list \
-v <var>INPUT_DIRECTORY</var>:<var>INPUT_DIRECTORY</var> \
-v <var>OUTPUT_DIRECTORY</var>:<var>OUTPUT_DIRECTORY</var> \
-e GOOGLE_APPLICATION_CREDENTIALS=/gcp/creds.json \
-v $HOME/.config/gcloud/application_default_credentials.json:/gcp/creds.json:ro \
<b>-e SCHEMA_UPDATE_ONLY=true</b> \
gcr.io/datcom-ci/datacommons-data:<var>VERSION</var>
</pre>

Substitute the `VERSION` that matches the services container image which failed with a schema check error (typically either `stable` or `latest`).