
Comparing changes


base repository: datacommonsorg/docsite
base: 87d87d1136955c0c973394ec17a3492481eb212b
head repository: datacommonsorg/docsite
compare: 53e07ad6402c2b48331afd336447738dbd715131
4 changes: 3 additions & 1 deletion Gemfile
@@ -8,7 +8,7 @@ source "https://rubygems.org"
#
# To upgrade, run `bundle update github-pages`.
gem "github-pages", group: :jekyll_plugins
gem 'jekyll-redirect-from'


group :jekyll_plugins do
gem "jekyll-feed", "~> 0.6"
@@ -17,6 +17,8 @@ end
group :jekyll_plugins do
gem "jekyll-tabs"
gem "jekyll-relative-links"
gem 'jekyll-redirect-from'
gem "jekyll-last-modified-at"
end

gem "webrick", "~> 1.8"
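
As a quick sanity check after editing, you can confirm the new plugin line is present before running `bundle install`. The sketch below is illustrative only: it writes a sample `:jekyll_plugins` group like the one in the diff above to a temp file and greps it.

```shell
# Illustrative only: a sample :jekyll_plugins group like the one in the
# Gemfile diff above, written to a temp file and checked with grep.
sample=$(mktemp)
cat > "$sample" <<'EOF'
group :jekyll_plugins do
  gem "jekyll-tabs"
  gem "jekyll-relative-links"
  gem 'jekyll-redirect-from'
  gem "jekyll-last-modified-at"
end
EOF
grep -c 'jekyll-last-modified-at' "$sample"   # prints 1 when declared once
```

Against the real repo, `grep -c 'jekyll-last-modified-at' Gemfile` should likewise print `1` once the change is applied.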
3 changes: 3 additions & 0 deletions Gemfile.lock
@@ -135,6 +135,8 @@ GEM
octokit (~> 4.0, != 4.4.0)
jekyll-include-cache (0.2.1)
jekyll (>= 3.7, < 5.0)
jekyll-last-modified-at (1.3.2)
jekyll (>= 3.7, < 5.0)
jekyll-mentions (1.6.0)
html-pipeline (~> 2.3)
jekyll (>= 3.7, < 5.0)
@@ -276,6 +278,7 @@ PLATFORMS
DEPENDENCIES
github-pages
jekyll-feed (~> 0.6)
jekyll-last-modified-at
jekyll-redirect-from
jekyll-relative-links
jekyll-tabs
4 changes: 4 additions & 0 deletions _config.yml
@@ -26,6 +26,10 @@ plugins:
- jekyll-redirect-from
- jekyll-tabs
- jekyll-relative-links
- jekyll-last-modified-at

last-modified-at:
date-format: '%B %d, %Y'

sass:
style: compressed
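
The `date-format` value above uses standard `strftime` codes: `%B` is the full month name, `%d` the zero-padded day, `%Y` the four-digit year. A quick sketch of how it renders, using GNU `date` (which accepts the same format codes as Ruby's `strftime` used by the plugin; any fixed date works):

```shell
# GNU date shares strftime codes with the plugin's date-format setting.
# LC_ALL=C pins the month name to English regardless of system locale.
LC_ALL=C date -u -d '2024-11-06' '+%B %d, %Y'   # -> November 06, 2024
```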
6 changes: 3 additions & 3 deletions _layouts/default.html
@@ -99,9 +99,9 @@
{% endunless %}
{{ content }}
</div>
<div style="text-align:center" id="feedback-form">
<p>
<a target="_blank" href="https://docs.google.com/forms/d/e/1FAIpQLSf23mC17idzIpzg6v4frCh8iWTl9dxeb4iSTVgo0WiBvnv5ZA/viewform?usp=pp_url&entry.871991796={{ page.url }}">
<div style="text-align:center" id="metadata">
<p>
Page last updated: {% last_modified_at %} &#8226; <a target="_blank" href="https://docs.google.com/forms/d/e/1FAIpQLSf23mC17idzIpzg6v4frCh8iWTl9dxeb4iSTVgo0WiBvnv5ZA/viewform?usp=pp_url&entry.871991796={{ page.url }}">
Send feedback about this page
</a>
</p>
Binary file modified assets/images/custom_dc/gcp_screenshot3.png
Binary file modified assets/images/custom_dc/gcp_screenshot7.png
69 changes: 27 additions & 42 deletions custom_dc/build_image.md
@@ -41,84 +41,68 @@ If you want to pick up the latest prebuilt version, do the following:
```

## Build a local image {#build-repo}

You will need to build a local image in any of the following cases:
- You are making substantive changes to the website UI
- You are ready to deploy your custom site to GCP

Rather than building from "head", that is, the very latest changes in Github, which may not have been tested, we recommend that you use the tested "release" equivalent of the stable Docker image. This release uses the tag `customdc_stable`, and is available at [https://github.com/datacommonsorg/website/releases/tag/customdc_stable](https://github.com/datacommonsorg/website/releases/tag/customdc_stable){: target="_blank"}.

> **Note:** If you are working on a large-scale customization, we recommend that you use a version control system to manage your code. We provide procedures for Github, and assume the following:
- You have a Github account and project.
- You have created a fork off the base Data Commons `website` repo (https://github.com/datacommonsorg/website){: target="_blank"} and a remote that points to it, and that you will push to that fork.


### Sync a local workspace to the stable release
Building from the master branch includes the very latest changes in Github, that may not have been tested. Instead, we recommend that you use the tested "stable" branch equivalent of the stable Docker image. This branch is `customdc_stable`, and is available at [https://github.com/datacommonsorg/website/tree/customdc_stable](https://github.com/datacommonsorg/website/tree/customdc_stable){: target="_blank"}.

If you are using a version control system other than Github, you can download a ZIP or TAR file from [https://github.com/datacommonsorg/website/releases/tag/customdc_stable](https://github.com/datacommonsorg/website/releases/tag/customdc_stable){: target="_blank"}.
> **Note:** If you are working on a large-scale customization, we recommend that you use a version control system to manage your code. We provide some procedures for Github.
In Github, use the following procedure.
### Clone the stable branch only

1. If you want to reuse the root directory you previously created and cloned, skip to step 3.
If you want to create a new source directory and start from scratch, clone the repo up to the stable release tag:
Use this procedure if you are not using Github, or if you are using Github and want to create a new source directory and start from scratch.

1. Run the following command:
<pre>
git clone https://github.com/datacommonsorg/website --branch customdc_stable --single-branch [<var>DIRECTORY</var>]
</pre>
1. Change to the root directory:

This creates a new local branch called `customdc_stable` set to track the Data Commons repo branch.
1. To verify, run:
<pre>
cd website  # or: cd <var>DIRECTORY</var>, if you cloned into a custom directory
git branch -vv
</pre>
You should see output like the following:

```
* customdc_stable 83732891 [origin/customdc_stable] 2024-11-06 Custom DC stable release (#4710)
```
Rather than developing on this default branch, we recommend that you create another branch.

1. Create a new branch synced to the stable release:
### Sync code to the stable branch

The following procedure uses Github. If you are using another version control system, use the appropriate methods for updating submodules and syncing.

1. Switch to the directory where you have cloned the Data Commons code:
<pre>
git checkout -b <var>BRANCH_NAME</var> customdc_stable
cd website  # or: cd <var>DIRECTORY</var>, if you cloned into a custom directory
</pre>

1. To verify that your local repo is at the same version of the code, run the following command:

1. Update files:
```
git log --oneline --graph
git pull origin customdc_stable
```
You should see output similar to the following:
Note that `origin` here refers to the source `datacommonsorg/website` repo. You may be using another remote name to point to that repo.

You should see output like the following:
```
* 52635c8 (grafted, HEAD -> branch1, tag: customdc_stable) ...
...
From https://github.com/datacommonsorg/website
* branch customdc_stable -> FETCH_HEAD
Already up to date.
```

Verify that the last commit in the output matches that listed in https://github.com/datacommonsorg/website/releases/tag/customdc_stable.

1. Press `q` to exit the output log.

1. Create and update the necessary submodules:

```
git submodule foreach git pull origin customdc_stable
git submodule update --init --recursive
```
You should see output like the following:

```
Submodule 'import' (https://github.com/datacommonsorg/import.git) registered for path 'import'
Submodule 'mixer' (https://github.com/datacommonsorg/mixer.git) registered for path 'mixer'
Submodule path 'import': checked out '7d197583b6ad0dfe0568532f919482527c004a8e'
Submodule path 'mixer': checked out '478cd499d4841a14efaf96ccf71bd36b74604486'
```
1. Update all other files:

```
git pull origin customdc_stable
```
You will likely see the following output:

```
From https://github.com/datacommonsorg/website
* tag customdc_stable -> FETCH_HEAD
Already up to date.
```
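
Taken together, the clone, branch, and sync steps above form the flow sketched below. So that it runs offline, the sketch uses a throwaway local repo standing in for `datacommonsorg/website`, and the branch name `my-changes` is just an example; in practice you would clone `https://github.com/datacommonsorg/website` instead.

```shell
# End-to-end sketch of the clone-and-sync flow, using a local stand-in
# repo so it runs offline; substitute the real GitHub URL in practice.
set -e
work=$(mktemp -d)

# Stand-in for the upstream repo with a customdc_stable branch:
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/customdc_stable
git -C "$work/upstream" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "Custom DC stable release"

# 1. Clone only the stable branch:
git clone -q --branch customdc_stable --single-branch "$work/upstream" "$work/website"
cd "$work/website"

# 2. Develop on your own branch rather than on customdc_stable:
git checkout -q -b my-changes customdc_stable

# 3. Later, sync to the latest stable release:
git pull -q origin customdc_stable
git branch --show-current   # -> my-changes
```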

### Build the repo locally

@@ -147,6 +131,7 @@ docker run -it \
--env-file $PWD/custom_dc/env.list \
-p 8080:8080 \
-e DEBUG=true \
-v <var>INPUT_DIRECTORY</var>:<var>INPUT_DIRECTORY</var> \
-v <var>OUTPUT_DIRECTORY</var>:<var>OUTPUT_DIRECTORY</var> \
[-v $PWD/server/templates/custom_dc/custom:/workspace/server/templates/custom_dc/custom \]
[-v $PWD/static/custom_dc/custom:/workspace/static/custom_dc/custom \]
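
The two new `-v` flags above mount your host input and output directories into the container at identical paths. A small sketch of how they expand, with purely hypothetical directories:

```shell
# Hypothetical paths; substitute your own input and output directories.
INPUT_DIR=/home/alice/dc-data/input
OUTPUT_DIR=/home/alice/dc-data/output
echo "-v $INPUT_DIR:$INPUT_DIR -v $OUTPUT_DIR:$OUTPUT_DIR"
```

This prints the pair of `-v host:container` mappings, one per directory, that the container expects.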
20 changes: 20 additions & 0 deletions custom_dc/custom_data.md
@@ -273,6 +273,8 @@ Edit the `env.list` file you created [previously](/custom_dc/quickstart.html#env

Once you have configured everything, use the following commands to run the data management container and restart the services container, mapping your input and output directories to the same paths in Docker.

#### Step 1: Start the data management container

In one terminal window, from the root directory, run the following command to start the data management container:

<pre>
@@ -283,6 +285,24 @@ docker run \
gcr.io/datcom-ci/datacommons-data:stable
</pre>

##### (Optional) Start the data management container in schema update mode {#schema-update-mode}

If you have tried to start a container and received a `SQL check failed` error, a database schema update is needed. Restart the data management container with the optional flag `DATA_RUN_MODE=schemaupdate`. This mode updates the database schema without re-importing data or rebuilding natural language embeddings, and is the quickest way to resolve a `SQL check failed` error during services container startup.

To do so, add the following line to the above command:

```
docker run \
...
-e DATA_RUN_MODE=schemaupdate \
...
gcr.io/datcom-ci/datacommons-data:stable
```

Once the job has run, go to step 2 below.

#### Step 2: Start the services container

In another terminal window, from the root directory, run the following command to start the services container:

<pre>
90 changes: 78 additions & 12 deletions custom_dc/data_cloud.md
@@ -67,7 +67,18 @@ This stores the data that will be served at run time. The Data Commons data mana
1. Click **Create**.
1. In the **Overview** page for the new instance, record the **Connection name** to set in environment variables in the next step.

### Step 4: Create a Google Cloud Run job
### Step 4 (optional but recommended): Add secrets to the Google Cloud Secret Manager

Although this is not strictly required, we recommend that you store secrets, including your API keys and DB passwords, in [Google Cloud Secret Manager](https://cloud.google.com/security/products/secret-manager){: target="_blank"}, where they are encrypted in transit and at rest, rather than stored and transmitted in plain text. See also the [Secret Manager](https://cloud.google.com/run/docs/create-jobs){: target="_blank"} documentation for additional options.

1. Go to [https://console.cloud.google.com/security/secret-manager](https://console.cloud.google.com/security/secret-manager){: target="_blank"} for your project.
1. Click **Create secret**.
1. Enter a name that indicates the purpose of the secret; for example, for the Data Commons API key, name it something like `dc-api-key`.
1. In the **Secret value** field, enter the value.
1. Click **Create secret**.
1. Repeat the same procedure for the Maps API key and any passwords you created for your Cloud SQL database in step 3.

### Step 5: Create a Google Cloud Run job

Since you won't need to customize the data management container, you can simply run an instance of the released container provided by Data Commons team, at [https://console.cloud.google.com/gcr/images/datcom-ci/global/datacommons-data](https://console.cloud.google.com/gcr/images/datcom-ci/global/datacommons-data){: target="_blank"}.

@@ -92,14 +103,13 @@ Now set environment variables:
1. Click **Add variable**.
1. Add names and values for the following environment variables:
- `USE_CLOUDSQL`: Set to `true`.
- `DC_API_KEY`: Set to your API key.
- `INPUT_DIR`: Set to the Cloud Storage bucket and input folder that you created in step 2 above.
- `OUTPUT_DIR`: Set to the Cloud Storage bucket (and, optionally, output folder) that you created in step 2 above. If you didn't create a separate folder for output, specify the same folder as the `INPUT_DIR`.
- `CLOUDSQL_INSTANCE`: Set to the full connection name of the instance you created in step 3 above.
- `DB_USER`: Set to a user you configured when you created the instance in step 3, or to `root` if you didn't create a new user.
- `DB_PASS`: Set to the user's or root password you configured when you created the instance in step 3.
- `DB_NAME`: Only set this if you configured the database name to something other than `datacommons`.
1. When you finished, click **Done**.
1. If you are not storing API keys and passwords in Google Secret Manager, add variables for `DC_API_KEY` and `DB_PASS`. Otherwise, click **Reference a secret**; in the **Name** field, enter `DC_API_KEY`; and from the **Secret** drop-down field, select the secret you created in step 4. Repeat for `DB_PASS`.
1. When you are finished, click **Done**.

![Cloud Run job](/assets/images/custom_dc/gcp_screenshot3.png){: width="450" }
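
The same variables can be written as a hypothetical env-file-style fragment (all values below are placeholders; prefer Secret Manager references for `DC_API_KEY` and `DB_PASS` as described above rather than literal values):

```shell
# Placeholder values only -- substitute your own. Prefer Secret Manager
# references for DC_API_KEY and DB_PASS instead of literal values here.
USE_CLOUDSQL=true
INPUT_DIR=gs://my-dc-bucket/input
OUTPUT_DIR=gs://my-dc-bucket/output
CLOUDSQL_INSTANCE=my-project:us-central1:dc-instance
DB_USER=root
# DB_NAME=...   # set only if you named the database something other than "datacommons"
```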

@@ -110,27 +120,71 @@ Now set environment variables:

### Step 1: Upload data files to Google Cloud Storage

As you iterate on changes to the source CSV and JSON files, you can re-upload them at any time, either overwriting existing files or creating new folders. If you want versioned snapshots, we recommend that you create a new subfolder and store the latest version of the files there. If you prefer to update incrementally, you can simply overwrite files in a pre-existing folder. Creating new subfolders is slower but safer; overwriting files is faster but riskier.

To upload data using the Cloud Console:

1. Go to [https://console.cloud.google.com/storage/browse](https://console.cloud.google.com/storage/browse){: target="_blank"} and select your custom Data Commons bucket.
1. Navigate to the folder you created in the earlier step.
1. Click **Upload Files**, and select all your CSV files and `config.json`.

To upload data using the command line:

1. Navigate to your local "input" directory where your source files are located.
1. Run the following command:
<pre>
gcloud storage cp config.json *.csv gs://<var>BUCKET_NAME</var>/<var>FOLDER_PATH</var>
</pre>

> **Note:** Do not upload the local `datacommons` subdirectory or its files.
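
The destination argument in the `gcloud storage cp` command above is simply the bucket name plus folder path. With hypothetical names it expands as follows:

```shell
# Hypothetical bucket and folder names; substitute your own.
BUCKET_NAME=my-dc-bucket
FOLDER_PATH=input/2024-11
echo "gcloud storage cp config.json *.csv gs://$BUCKET_NAME/$FOLDER_PATH"
```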
As you are iterating on changes to the source CSV and JSON files, you can re-upload them at any time, either overwriting existing files or creating new folders. Once you have uploaded new data, you must rerun the data management Cloud Run job you created above to load it into Cloud SQL.

### Step 2: Start the data management Cloud Run job {#run-job}
### Step 2: Run the data management Cloud Run job {#run-job}

Now that everything is configured, and you have uploaded your data in Google Cloud Storage, you simply have to start the Cloud Run data management job to convert the CSV data into tables in the Cloud SQL database and generate the embeddings (in a `datacommons/nl` subfolder).

Every time you upload new input CSV or JSON files to Google Cloud Storage, you will need to rerun the job.

To run the job:
To run the job using the Cloud Console:

1. Go to [https://console.cloud.google.com/run/jobs](https://console.cloud.google.com/run/jobs){: target="_blank"} for your project.
1. From the list of jobs, click the link of the "datacommons-data" job you created above.
1. Optionally, if you have received a `SQL check failed` error when previously trying to start the container, and would like to minimize startup time, click **Execute with overrides** and click **Add variable** to set a new variable with name `DATA_RUN_MODE` and value `schemaupdate`.
1. Click **Execute**. It will take several minutes for the job to run. You can click the **Logs** tab to view the progress.

When it completes, to verify that the data has been loaded correctly, see the next step.
To run the job using the command line:

1. From any local directory, run the following command:
<pre>
gcloud run jobs execute <var>JOB_NAME</var>
</pre>
1. To view the progress of the job, run the following command:
<pre>
gcloud beta run jobs logs tail <var>JOB_NAME</var>
</pre>

When it completes, to verify that the data has been loaded correctly, see [Inspect the Cloud SQL database](#inspect-sql).

#### (Optional) Run the data management Cloud Run job in schema update mode {#schema-update-mode}

If you have tried to start a container and received a `SQL check failed` error, a database schema update is needed. Rerun the data management job with the optional flag `DATA_RUN_MODE=schemaupdate`. This mode updates the database schema without re-importing data or rebuilding natural language embeddings, and is the quickest way to resolve a `SQL check failed` error during services container startup.

To run the job using the Cloud Console:
1. Go to [https://console.cloud.google.com/run/jobs](https://console.cloud.google.com/run/jobs){: target="_blank"} for your project.
1. From the list of jobs, click the link of the "datacommons-data" job you created above.
1. Select **Execute** > **Execute with overrides** and click **Add variable** to set a new variable with name `DATA_RUN_MODE` and value `schemaupdate`.
1. Click **Execute**. It will take several minutes for the job to run. You can click the **Logs** tab to view the progress.

To run the job using the command line:
1. From any local directory, run the following command:
<pre>
gcloud run jobs execute <var>JOB_NAME</var> --update-env-vars DATA_RUN_MODE=schemaupdate
</pre>
1. To view the progress of the job, run the following command:
<pre>
gcloud beta run jobs logs tail <var>JOB_NAME</var>
</pre>

### Inspect the Cloud SQL database {#inspect-sql}

@@ -181,7 +235,7 @@ gcloud auth application-default set-quota-project <var>PROJECT_ID</var>

If you are prompted to install the Cloud Resource Manager API, press `y` to accept.

### Step 3: Run the Docker container
### Step 3: Run the data management Docker container

From your project root directory, run:

@@ -199,6 +253,20 @@ The version is `latest` or `stable`.

To verify that the data is correctly created in your Cloud SQL database, use the procedure in [Inspect the Cloud SQL database](#inspect-sql) above.

#### (Optional) Run the data management Docker container in schema update mode

If you have tried to start a container and received a `SQL check failed` error, a database schema update is needed. Restart the data management container with the optional flag `DATA_RUN_MODE=schemaupdate` to minimize the startup time.

To do so, add the following line to the above command:

```
docker run \
...
-e DATA_RUN_MODE=schemaupdate \
...
gcr.io/datcom-ci/datacommons-data:stable
```

## Advanced setup (optional): Access Cloud data from a local services container

For testing purposes, if you wish to run the services Docker container locally but access the data in Google Cloud, use the following procedures.
@@ -211,7 +279,7 @@ To run a local instance of the services container, you will need to set all the

See the section [above](#gen-creds) for procedures.

### Step 3: Run the Docker container
### Step 3: Run the services Docker container

From the root directory of your repo, run the following command, assuming you are using a locally built image:

@@ -230,5 +298,3 @@ docker run -it \
</pre>



