Address feedback from one.org et al. (#532)
* integrate custom docs with new UI

* more edits

* use website wording for intro

* fix numbering in table

* rename and some edits

* rename manage_repo file, per Bo

* Merge.

* formatting edits

* updates per Keyur's feedback

* Fix typos

* fix nav order

* fix link to API key request form

* update form link

* update key request form and output dir env var

* Revert to gerund

Though the style guide says to just use imperatives, "get started" just sounds weird. Also this is more consistent with "troubleshooting"

* new troubleshooting entry

* fix typo

* new data container procedures

* more work

* more work

* complete data draft

* more changes

* more changes

* more revisions

* update troubleshooting doc etc.

* new version of diagrams

* remove data loading problems troubleshooting entry; can't reproduce

* revert title change

* add example for not mixing entity types

* changes from Keyur

* add screenshots for GCP, and related changes

* fixed one image

* added screenshots for Cloud Run service

* resize images

* more changes from Keyur

* fix a tiny error

* delete unused images

* fix missing dash

* update build file

* adjust build command

* Revert "adjust build command"

This reverts commit 4ce0fb9.

* update docker file

* more fixes

* one last fix

* make links to Cloud Console open in a new page

* fixes to quickstart suggested by Prem

* one more change

* change from Keyur

* revise procedure

* merge

* add brief explanation of data model to quickstart

* slight wording tweak

* incorporate feedback from Keyur

* remove erroneous edit

* correct missing text

* more work on tasks for finding stuff

* merge

* update to use env.sample

* typo

* typo

* get file back in head shape

* fix file name

* add more detail about data security

* fix typo

* corrections from Keyur

* fix other mention of SQL queries

* add both data directories to docker run commands

* remove extra slash

* update feedback links

* tiny tweaks

* fixes from Hannah

* fix grammar

* remove redundant text

* add link for data requests

* second try

* fix link

* add template parameter back

* add link to issue tracker docs

* feedback from Keyur

* fix template parameter

* add doc for observation properties

* more edits

* corrections from Keyur

* one more change from Keyur

* Add update schema option.

* wording fixes

* add CLI procedures

* fix troubleshooting doc

* fix custom data

* fix troubleshooting again

* added procedures for using Secret Manager

* small fixes
kmoscoe authored Nov 5, 2024
1 parent 82becc0 commit 2ad5474
Showing 5 changed files with 68 additions and 13 deletions.
Binary file modified assets/images/custom_dc/gcp_screenshot3.png
Binary file modified assets/images/custom_dc/gcp_screenshot7.png
62 changes: 54 additions & 8 deletions custom_dc/data_cloud.md
@@ -67,7 +67,18 @@ This stores the data that will be served at run time.
1. Click **Create**.
1. In the **Overview** page for the new instance, record the **Connection name** to set in environment variables in the next step.

### Step 4: Create a Google Cloud Run job
### Step 4 (optional but recommended): Add secrets to the Google Cloud Secret Manager

Although this is not strictly required, we recommend that you store secrets, including your API keys and DB passwords, in [Google Cloud Secret Manager](https://cloud.google.com/security/products/secret-manager){: target="_blank"}, where they are encrypted in transit and at rest, rather than stored and transmitted in plain text. See also the [Secret Manager](https://cloud.google.com/run/docs/create-jobs){: target="_blank"} documentation for additional options.

1. Go to [https://console.cloud.google.com/security/secret-manager](https://console.cloud.google.com/security/secret-manager){: target="_blank"} for your project.
1. Click **Create secret**.
1. Enter a name that indicates the purpose of the secret; for example, for the Data Commons API key, name it something like `dc-api-key`.
1. In the **Secret value** field, enter the value.
1. Click **Create secret**.
1. Repeat the same procedure for the Maps API key and any passwords you created for your Cloud SQL database in step 3.
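
If you prefer to work from the command line, the following is a minimal sketch of the same procedure using the gcloud CLI; the secret names (`dc-api-key`, `maps-api-key`, `dc-db-pass`) are only examples:

<pre>
# Enable the Secret Manager API if it isn't already enabled.
gcloud services enable secretmanager.googleapis.com

# Create each secret and set its value from stdin (avoids writing secrets to disk).
echo -n "<var>YOUR_DC_API_KEY</var>" | gcloud secrets create dc-api-key --data-file=-
echo -n "<var>YOUR_MAPS_API_KEY</var>" | gcloud secrets create maps-api-key --data-file=-
echo -n "<var>YOUR_DB_PASSWORD</var>" | gcloud secrets create dc-db-pass --data-file=-
</pre>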

### Step 5: Create a Google Cloud Run job

Since you won't need to customize the data management container, you can simply run an instance of the released container provided by the Data Commons team, at [https://console.cloud.google.com/gcr/images/datcom-ci/global/datacommons-data](https://console.cloud.google.com/gcr/images/datcom-ci/global/datacommons-data){: target="_blank"}.

@@ -92,14 +103,13 @@ Now set environment variables:
1. Click **Add variable**.
1. Add names and values for the following environment variables:
- `USE_CLOUDSQL`: Set to `true`.
- `DC_API_KEY`: Set to your API key.
- `INPUT_DIR`: Set to the Cloud Storage bucket and input folder that you created in step 2 above.
- `OUTPUT_DIR`: Set to the Cloud Storage bucket (and, optionally, output folder) that you created in step 2 above. If you didn't create a separate folder for output, specify the same folder as the `INPUT_DIR`.
- `CLOUDSQL_INSTANCE`: Set to the full connection name of the instance you created in step 3 above.
- `DB_USER`: Set to a user you configured when you created the instance in step 3, or to `root` if you didn't create a new user.
- `DB_PASS`: Set to the user's or root password you configured when you created the instance in step 3.
- `DB_NAME`: Only set this if you configured the database name to something other than `datacommons`.
1. When you finished, click **Done**.
1. If you are not storing API keys and passwords in Google Secret Manager, add variables for `DC_API_KEY` and `DB_PASS`. Otherwise, click **Reference a secret**; in the **Name** field, enter `DC_API_KEY`, and from the **Secret** drop-down field, select the relevant secret you created in step 4. Repeat for `DB_PASS`.
1. When you are finished, click **Done**.

![Cloud Run job](/assets/images/custom_dc/gcp_screenshot3.png){: width="450" }
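
If you would rather configure the same settings from the command line, the following is a rough sketch using the gcloud CLI. It assumes the job is named `datacommons-data` and uses the example secret names from step 4; the `gs://` paths and connection name stand in for the resources you created in steps 2 and 3:

<pre>
gcloud run jobs update datacommons-data \
  --region=<var>REGION</var> \
  --set-env-vars=USE_CLOUDSQL=true,INPUT_DIR=gs://<var>BUCKET_NAME</var>/<var>INPUT_FOLDER</var>,OUTPUT_DIR=gs://<var>BUCKET_NAME</var>/<var>OUTPUT_FOLDER</var>,CLOUDSQL_INSTANCE=<var>CONNECTION_NAME</var>,DB_USER=<var>DB_USER</var> \
  --set-secrets=DC_API_KEY=dc-api-key:latest,DB_PASS=dc-db-pass:latest
</pre>

If the job does not exist yet, `gcloud run jobs create` accepts the same flags plus `--image`. Note that the job's runtime service account must be able to read the referenced secrets (the Secret Manager Secret Accessor role).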

@@ -110,13 +120,25 @@ Now set environment variables:

### Step 1: Upload data files to Google Cloud Storage

As you iterate on changes to the source CSV and JSON files, you can re-upload them at any time, either overwriting existing files or creating new folders. If you want versioned snapshots, we recommend that you create a new subfolder and store the latest version of the files there. If you prefer to update incrementally, you can simply overwrite files in a pre-existing folder. Creating new subfolders is slower but safer; overwriting files is faster but riskier.
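
For example, a versioned upload from the command line might look like the following sketch; the dated subfolder name is just an illustration:

<pre>
gcloud storage cp config.json *.csv gs://<var>BUCKET_NAME</var>/<var>FOLDER_PATH</var>/2024-11-05/
</pre>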

To upload data using the Cloud Console:

1. Go to [https://console.cloud.google.com/storage/browse](https://console.cloud.google.com/storage/browse){: target="_blank"} and select your custom Data Commons bucket.
1. Navigate to the folder you created in the earlier step.
1. Click **Upload Files**, and select all your CSV files and `config.json`.

To upload data using the command line:

1. Navigate to your local "input" directory where your source files are located.
1. Run the following command:
<pre>
gcloud storage cp config.json *.csv gs://<var>BUCKET_NAME</var>/<var>FOLDER_PATH</var>
</pre>

> **Note:** Do not upload the local `datacommons` subdirectory or its files.
As you are iterating on changes to the source CSV and JSON files, you can re-upload them at any time, either overwriting existing files or creating new folders. To load them into Cloud SQL, you run the Cloud Run job you created above.
Once you have uploaded the new data, you must rerun the data management Cloud Run job.

### Step 2: Run the data management Cloud Run job {#run-job}

@@ -128,21 +150,42 @@ To run the job using the Cloud Console:

1. Go to [https://console.cloud.google.com/run/jobs](https://console.cloud.google.com/run/jobs){: target="_blank"} for your project.
1. From the list of jobs, click the link of the "datacommons-data" job you created above.
1. Optionally, if you have received a `SQL check failed` error when previously trying to start the container, and would like to minimize startup time, click **Execute with overrides** and click **Add variable** to set a new variable with name `DATA_RUN_MODE` and value `schemaupdate`.
1. Click **Execute**. It will take several minutes for the job to run. You can click the **Logs** tab to view the progress.

When it completes, to verify that the data has been loaded correctly, see the next step.
To run the job using the command line:

1. From any local directory, run the following command:
<pre>
gcloud run jobs execute <var>JOB_NAME</var>
</pre>
1. To view the progress of the job, run the following command:
<pre>
gcloud beta run jobs logs tail <var>JOB_NAME</var>
</pre>

When it completes, to verify that the data has been loaded correctly, see [Inspect the Cloud SQL database](#inspect-sql).

#### Run the data management Cloud Run job in schema update mode {#schema-update-mode}
#### Optional: Run the data management Cloud Run job in schema update mode {#schema-update-mode}

If you have tried to start a container and received a `SQL check failed` error, a database schema update is needed. Restart the data management container with the optional flag `DATA_RUN_MODE=schemaupdate`. This mode updates the database schema without re-importing data or rebuilding natural language embeddings, and is the quickest way to resolve a `SQL check failed` error during services container startup.

To run the job using the Cloud Console:

1. Go to [https://console.cloud.google.com/run/jobs](https://console.cloud.google.com/run/jobs){: target="_blank"} for your project.
1. From the list of jobs, click the link of the "datacommons-data" job you created above.
1. Optionally, select **Execute** > **Execute with overrides** and click **Add variable** to set a new variable with name `DATA_RUN_MODE` and value `schemaupdate`.
1. Click **Execute**. It will take several minutes for the job to run. You can click the **Logs** tab to view the progress.

To run the job using the command line:

1. From any local directory, run the following command:
<pre>
gcloud run jobs execute <var>JOB_NAME</var> --update-env-vars DATA_RUN_MODE=schemaupdate
</pre>
1. To view the progress of the job, run the following command:
<pre>
gcloud beta run jobs logs tail <var>JOB_NAME</var>
</pre>

### Inspect the Cloud SQL database {#inspect-sql}

To view information about the created tables:
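
If you prefer a quick check from the command line instead of the Cloud Console, the following is a minimal sketch using the gcloud CLI. It assumes the default `datacommons` database name and the standard custom Data Commons tables (such as `observations`); adjust the names if yours differ:

<pre>
# Requires the mysql client locally; gcloud temporarily allowlists your IP.
gcloud sql connect <var>INSTANCE_NAME</var> --user=<var>DB_USER</var>
</pre>

Then, at the MySQL prompt:

<pre>
USE datacommons;
SHOW TABLES;
SELECT COUNT(*) FROM observations;
</pre>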
@@ -203,11 +246,14 @@ docker run \
-v <var>OUTPUT_DIRECTORY</var>:<var>OUTPUT_DIRECTORY</var> \
-e GOOGLE_APPLICATION_CREDENTIALS=/gcp/creds.json \
-v $HOME/.config/gcloud/application_default_credentials.json:/gcp/creds.json:ro \
[-e DATA_RUN_MODE=schemaupdate \]
gcr.io/datcom-ci/datacommons-data:<var>VERSION</var>
</pre>

The version is `latest` or `stable`.

> Note: The `DATA_RUN_MODE=schemaupdate` flag is optional; set it only if you have previously received a `SQL check failed` error and want to speed up container startup.

To verify that the data is correctly created in your Cloud SQL database, use the procedure in [Inspect the Cloud SQL database](#inspect-sql) above.

#### Run the data management Docker container in schema update mode
17 changes: 13 additions & 4 deletions custom_dc/deploy_cloud.md
@@ -104,8 +104,7 @@ See also [Deploying to Cloud Run](https://cloud.google.com/run/docs/deploying)
1. Expand the **Variables and secrets** tab.
1. Click the **Variables and Secrets** tab.
1. Click **Add variable**.
1. Add the same environment variables, with the same names and values as you did when you created the [data management run job](/custom_dc/data_cloud.html#env-vars) You can omit the `INPUT_DIR` variable.
1. Add a variable for the `MAPS_API_KEY` and set it to your Maps API key.
1. Add the same environment variables and secrets, with the same names and values as you did when you created the [data management run job](/custom_dc/data_cloud.html#env-vars). You can omit the `INPUT_DIR` variable. Add a variable or reference a secret for `MAPS_API_KEY`.
1. When you are finished, click **Done**.

![Cloud Run service](/assets/images/custom_dc/gcp_screenshot7.png){: width="450"}
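
If you prefer the command line, the following is a minimal sketch of attaching the Maps API key as a secret to an existing service; it assumes the example secret name `maps-api-key` from the data management setup:

<pre>
gcloud run services update <var>SERVICE_NAME</var> \
  --region=<var>REGION</var> \
  --set-secrets=MAPS_API_KEY=maps-api-key:latest
</pre>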
@@ -122,7 +121,17 @@ Click **Create** to kick off the deployment.

## Manage the service

Every time you make changes to the code and release a new Docker artifact, or rerun the [data management job](/custom_dc/data_cloud.html#run-job), you need to restart the service as well. To do so:
Every time you make changes to the code and release a new Docker artifact, or rerun the [data management job](/custom_dc/data_cloud.html#run-job), you need to restart the service as well.

To restart the service using the Cloud Console:

1. Go to the [https://console.cloud.google.com/run/](https://console.cloud.google.com/run/){: target="_blank"} page, click on the service you created above, and click **Edit & Deploy Revision**.
1. Select a new container and click **Deploy**.
1. Select a new container image and click **Deploy**.

To restart the service using the command line:

From any local directory, run the following command:

<pre>
gcloud run deploy <var>SERVICE_NAME</var> --image <var>CONTAINER_IMAGE_URL</var>
</pre>
2 changes: 1 addition & 1 deletion custom_dc/index.md
@@ -82,7 +82,7 @@ If you already have an account with another cloud provider, we can provide a con

In terms of development time and effort, to launch a site with custom data in a compatible format and no UI customization, you can expect it to take less than three weeks. If you need substantial UI customization, it may take up to four months.

The cost of running a site on Google Cloud Platform depends on the size of your data, the traffic you expect to receive, and the amount of geographical replication you want. For a small dataset, we have found the cost comes out to roughly $100 per year. You can get more precise information and cost estimation tools at [Google Cloud pricing](https://cloud.google.com/pricing){: target="_blank"}.
The cost of running a site on Google Cloud Platform depends on the size of your data, the traffic you expect to receive, and the amount of geographical replication you want. You can get precise information and cost estimation tools at [Google Cloud pricing](https://cloud.google.com/pricing){: target="_blank"}.

{: #workflow}
## Recommended workflow
