2024-11-06 Custom DC stable release (#4710)
Highlights:
- Pins the version of `transformers` in nl_requirements.txt
- Adds support for a schema update mode in the data management
container, documented
[here](https://docs.datacommons.org/custom_dc/troubleshooting.html#schema-check-failed).
Schema check errors should also now have a direct link to the
troubleshooting page.
- Updates services container to exit as soon as any of the mixer, NL, or
website servers fails to start up
hqpho authored Nov 6, 2024
2 parents ae0c287 + 454b21f commit 8373289
Showing 101 changed files with 3,037 additions and 1,190 deletions.
34 changes: 34 additions & 0 deletions .github/workflows/all-commits-in-master.yml
@@ -0,0 +1,34 @@
name: all-commits-in-master

on:
  pull_request:
    branches: [ "customdc_stable" ]

jobs:
  check_commits:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          # Fetch all history for accurate comparison
          fetch-depth: 0
          # Check out the PR branch
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}

      - name: Check if commits exist in master
        run: |
          git remote add dc https://github.com/datacommonsorg/website.git
          git fetch dc
          MASTER_BRANCH="dc/master"
          # Get the list of commits in the source branch that are not in the master branch
          MISSING_COMMITS=$(git log --pretty="%H - %s" $MASTER_BRANCH..HEAD --)
          if [[ -n "$MISSING_COMMITS" ]]; then
            echo "ERROR: The following commits are not present in $MASTER_BRANCH:"
            echo "$MISSING_COMMITS"
            exit 1
          fi
          echo "All commits are present in $MASTER_BRANCH"
29 changes: 22 additions & 7 deletions build/cdc_data/run.sh
@@ -32,6 +32,16 @@ if [[ $OUTPUT_DIR == "" ]]; then
exit 1
fi

if [[ $DATA_RUN_MODE != "" ]]; then
if [[ $DATA_RUN_MODE != "schemaupdate" ]]; then
echo "DATA_RUN_MODE must be either empty or 'schemaupdate'"
exit 1
fi
echo "DATA_RUN_MODE=$DATA_RUN_MODE"
else
DATA_RUN_MODE="customdc"
fi

echo "INPUT_DIR=$INPUT_DIR"
echo "OUTPUT_DIR=$OUTPUT_DIR"

@@ -51,7 +61,7 @@ ADDITIONAL_CATALOG_PATH=$DC_NL_EMBEDDINGS_DIR/custom_catalog.yaml
CUSTOM_EMBEDDINGS_INDEX=user_all_minilm_mem

# Set IS_CUSTOM_DC var to true.
# This is used by the embeddings builder to set up a custom dc env.
# This is used by the embeddings builder to set up a custom dc env.
export IS_CUSTOM_DC=true

if [[ $USE_SQLITE == "true" ]]; then
@@ -67,15 +77,20 @@ cd $WORKSPACE_DIR/import/simple
# Run importer.
python3 -m stats.main \
--input_dir=$INPUT_DIR \
--output_dir=$DC_OUTPUT_DIR
--output_dir=$DC_OUTPUT_DIR \
--mode=$DATA_RUN_MODE

# cd back to workspace dir to run the embeddings builder.
cd $WORKSPACE_DIR

# Run embeddings builder.
python3 -m tools.nl.embeddings.build_embeddings \
--embeddings_name=$CUSTOM_EMBEDDINGS_INDEX \
if [[ $DATA_RUN_MODE == "schemaupdate" ]]; then
echo "Skipping embeddings builder because run mode is 'schemaupdate'."
echo "Schema update complete."
else
# Run embeddings builder.
python3 -m tools.nl.embeddings.build_embeddings \
--embeddings_name=$CUSTOM_EMBEDDINGS_INDEX \
--output_dir=$DC_NL_EMBEDDINGS_DIR \
--additional_catalog_path=$ADDITIONAL_CATALOG_PATH

echo "Data loading completed."
echo "Data loading complete."
fi
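
The run-mode handling in this file can be sketched as a standalone function. This is a minimal illustration rather than the script itself: the function name `resolve_run_mode` is hypothetical, and it returns a non-zero status instead of calling `exit` so it can be tried in an interactive shell. As in the diff, an empty value falls back to `customdc`, and any non-empty value other than `schemaupdate` is rejected (including an explicit `customdc`).

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the DATA_RUN_MODE check in build/cdc_data/run.sh.
# Prints the effective mode on stdout; returns 1 for an unsupported value.
resolve_run_mode() {
  local mode="${1:-}"
  if [[ -z "$mode" ]]; then
    # Unset or empty: use the default mode, as the script does.
    echo "customdc"
  elif [[ "$mode" == "schemaupdate" ]]; then
    echo "schemaupdate"
  else
    echo "DATA_RUN_MODE must be either empty or 'schemaupdate'" >&2
    return 1
  fi
}

resolve_run_mode ""              # prints: customdc
resolve_run_mode "schemaupdate"  # prints: schemaupdate
```

The resolved value is what the importer receives through `--mode`, and the embeddings builder is skipped when it is `schemaupdate`.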
4 changes: 2 additions & 2 deletions build/cdc_services/run.sh
@@ -88,7 +88,7 @@ else
fi

# Wait for any process to exit
wait
wait -n

# Exit with status of process that exited first
exit $?
exit $?
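
The switch from `wait` to `wait -n` is what lets the services container exit on the first failure: `wait` with no arguments blocks until all background jobs have finished, while `wait -n` (bash 4.3 or newer) returns as soon as any one job exits, with that job's status. A minimal sketch, using `sleep` and a subshell as stand-ins for the real servers; the function name `first_exit_status` is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: two background jobs stand in for the mixer/NL/website servers.
first_exit_status() {
  sleep 30 &                   # a "server" that keeps running
  local healthy=$!
  ( sleep 1; exit 3 ) &        # a "server" that fails shortly after starting
  wait -n                      # returns when the first job exits
  local status=$?
  kill "$healthy" 2>/dev/null  # clean up the job that is still running
  wait "$healthy" 2>/dev/null
  return "$status"
}

# The function returns the failing job's status (3 here), so the || branch runs.
first_exit_status || echo "first job exited with status $?"
```

With a plain `wait` in place of `wait -n`, the same function would block for the full 30 seconds before returning.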
15 changes: 14 additions & 1 deletion build/ci/cloudbuild.webdriver.yaml
@@ -30,13 +30,26 @@ steps:
# ./run_test.sh -b will build client packages.
# These js files generated will be necessary for the flask_webdriver_test task.
./run_test.sh -b
# Download the files needed for nl server to run. Do the download here because
# webdriver runs on multiple processes & we only want to do the download once.
- id: download_nl_files
name: python:3.11.3
entrypoint: /bin/sh
waitFor:
- package_js
args:
- -c
- |
cd tools/nl/download_nl_files
./run.sh
# Run the webdriver tests
- id: flask_webdriver_test
name: gcr.io/datcom-ci/webdriver-chrome:2024-06-05
entrypoint: /bin/sh
waitFor:
- package_js
- download_nl_files
args:
- -c
- |
8 changes: 4 additions & 4 deletions build/nl_server/Dockerfile
@@ -34,10 +34,10 @@ COPY shared/. /workspace/shared/

# Download nl files from gcs
COPY deploy/nl/catalog.yaml .
COPY build/nl_server/requirements.txt /workspace/build/nl_server/requirements.txt
COPY build/nl_server/download_nl_files.py .
RUN pip3 install -r /workspace/build/nl_server/requirements.txt
RUN python3 download_nl_files.py
COPY tools/nl/download_nl_files/requirements.txt /workspace/download_nl_files/requirements.txt
COPY tools/nl/download_nl_files/download_nl_files.py .
RUN pip3 install -r /workspace/download_nl_files/requirements.txt
RUN python3 download_nl_files.py --is_docker_mode=True

# Run server
WORKDIR /workspace
6 changes: 3 additions & 3 deletions deploy/terraform-custom-datacommons/README.md
@@ -50,14 +50,14 @@ dc_api_key = "your-api-key"

- **project_id**: The Google Cloud project ID where the resources will be created.
- **namespace**: A unique namespace to differentiate multiple instances of custom Data Commons within the same project.
- **dc_api_key**: Data Commons API key. [Request an API key](https://docs.google.com/forms/d/e/1FAIpQLSeVCR95YOZ56ABsPwdH1tPAjjIeVDtisLF-8oDYlOxYmNZ7LQ/viewform?resourcekey=0-yJ9nT9ST-TfoKNtmGIws-g)
- **dc_api_key**: Data Commons API key. [Request an API key](https://apikeys.datacommons.org)

#### Optional Configuration Variables

- **region**: The [GCP region](https://cloud.google.com/about/locations) where resources will be deployed.
- **enable_redis**: Set to true to enable redis caching (default: false)
- **dc_web_service_image**: Docker image to use for the services container. Default: `gcr.io/datcom-ci/datacommons-services:stable`
- **dc_data_job_image**: Docker image to use for the data loading job. Default: `gcr.io/datcom-ci/datacommons-data:stable`
- **dc_web_service_image**: Docker image to use for the services container. Default: `gcr.io/datcom-ci/datacommons-services:stable`. Set to `gcr.io/datcom-ci/datacommons-services:latest` to use the latest web service image.
- **dc_data_job_image**: Docker image to use for the data loading job. Default: `gcr.io/datcom-ci/datacommons-data:stable`. Set to `gcr.io/datcom-ci/datacommons-data:latest` to use the latest data job image.
- **make_dc_web_service_public**: By default, the Data Commons web service is publicly accessible. Set this to `false` if your GCP account has restrictions on public access. [Reference](https://cloud.google.com/run/docs/authenticating/public).
- **google_analytics_tag_id**: Set to your [Google Analytics Tag ID](https://support.google.com/analytics/answer/9539598) to enable Google Analytics tracking.

5 changes: 4 additions & 1 deletion deploy/terraform-custom-datacommons/modules/main.tf
@@ -169,6 +169,7 @@ resource "google_secret_manager_secret_version" "dc_api_key_version" {
resource "google_cloud_run_v2_service" "dc_web_service" {
name = "${var.namespace}-datacommons-web-service"
location = var.region
deletion_protection = false

template {
containers {
@@ -320,6 +321,8 @@ resource "google_cloud_run_service_iam_member" "dc_web_service_invoker" {
resource "google_cloud_run_v2_job" "dc_data_job" {
name = "${var.namespace}-datacommons-data-job"
location = var.region
deletion_protection = false

template {
template {
containers {
@@ -375,4 +378,4 @@ resource "google_cloud_run_v2_job" "dc_data_job" {
google_secret_manager_secret_version.dc_api_key_version,
google_secret_manager_secret_version.maps_api_key_version
]
}
}
2 changes: 1 addition & 1 deletion import
Submodule import updated 52 files
+9 −9 run_test.sh
+1 −1 simple/run_stats.sh
+1 −0 simple/stats/constants.py
+15 −5 simple/stats/data.py
+45 −18 simple/stats/db.py
+69 −0 simple/stats/nl.py
+65 −41 simple/stats/runner.py
+5 −0 simple/stats/schema_constants.py
+31 −5 simple/stats/variable_per_row_importer.py
+143 −22 simple/tests/stats/db_test.py
+2 −2 simple/tests/stats/entities_importer_test.py
+2 −2 simple/tests/stats/events_importer_test.py
+2 −2 simple/tests/stats/mcf_importer_test.py
+34 −4 simple/tests/stats/nl_test.py
+2 −2 simple/tests/stats/observations_importer_test.py
+40 −6 simple/tests/stats/runner_test.py
+2 −2 simple/tests/stats/schema_test.py
+40 −0 simple/tests/stats/test_data/db/input/sqlite_current_schema_populated.sql
+36 −0 simple/tests/stats/test_data/db/input/sqlite_old_schema_populated.sql
+65 −0 simple/tests/stats/test_data/nl/expected/topic_triples/custom_dc_topic_cache.json
+12 −1 simple/tests/stats/test_data/nl/input/topic_triples.csv
+2 −0 ...tests/stats/test_data/runner/expected/input_dir_driven_with_existing_old_schema_data/key_value_store.db.csv
+5 −0 simple/tests/stats/test_data/runner/expected/input_dir_driven_with_existing_old_schema_data/nl/sentences.csv
+31 −0 ...le/tests/stats/test_data/runner/expected/input_dir_driven_with_existing_old_schema_data/observations.db.csv
+109 −0 simple/tests/stats/test_data/runner/expected/input_dir_driven_with_existing_old_schema_data/triples.db.csv
+2 −0 simple/tests/stats/test_data/runner/expected/schema_update_only/key_value_store.db.csv
+4 −0 simple/tests/stats/test_data/runner/expected/schema_update_only/observations.db.csv
+6 −0 simple/tests/stats/test_data/runner/expected/schema_update_only/triples.db.csv
+51 −0 simple/tests/stats/test_data/runner/expected/topic_nl_sentences/nl/custom_dc_topic_cache.json
+7 −0 simple/tests/stats/test_data/runner/expected/topic_nl_sentences/triples.db.csv
+4 −0 simple/tests/stats/test_data/runner/input/input_dir_driven_with_existing_old_schema_data/article_entities.csv
+4 −0 simple/tests/stats/test_data/runner/input/input_dir_driven_with_existing_old_schema_data/author_entities.csv
+44 −0 simple/tests/stats/test_data/runner/input/input_dir_driven_with_existing_old_schema_data/config.json
+15 −0 simple/tests/stats/test_data/runner/input/input_dir_driven_with_existing_old_schema_data/countries.csv
+36 −0 ...stats/test_data/runner/input/input_dir_driven_with_existing_old_schema_data/sqlite_old_schema_populated.sql
+5 −0 simple/tests/stats/test_data/runner/input/input_dir_driven_with_existing_old_schema_data/variable_per_row.csv
+12 −0 simple/tests/stats/test_data/runner/input/input_dir_driven_with_existing_old_schema_data/variables.mcf
+3 −0 simple/tests/stats/test_data/runner/input/input_dir_driven_with_existing_old_schema_data/wikidataids.csv
+4 −0 simple/tests/stats/test_data/runner/input/schema_update_only/article_entities.csv
+4 −0 simple/tests/stats/test_data/runner/input/schema_update_only/author_entities.csv
+44 −0 simple/tests/stats/test_data/runner/input/schema_update_only/config.json
+15 −0 simple/tests/stats/test_data/runner/input/schema_update_only/countries.csv
+36 −0 simple/tests/stats/test_data/runner/input/schema_update_only/sqlite_old_schema_populated.sql
+5 −0 simple/tests/stats/test_data/runner/input/schema_update_only/variable_per_row.csv
+12 −0 simple/tests/stats/test_data/runner/input/schema_update_only/variables.mcf
+3 −0 simple/tests/stats/test_data/runner/input/schema_update_only/wikidataids.csv
+8 −0 simple/tests/stats/test_data/runner/input/topic_nl_sentences/schema.mcf
+7 −0 simple/tests/stats/test_data/variable_per_row_importer/expected/row_obs_props/observations.db.csv
+15 −0 simple/tests/stats/test_data/variable_per_row_importer/input/row_obs_props/config.json
+7 −0 simple/tests/stats/test_data/variable_per_row_importer/input/row_obs_props/input.csv
+32 −0 simple/tests/stats/test_util.py
+5 −2 simple/tests/stats/variable_per_row_importer_test.py
2 changes: 1 addition & 1 deletion mixer
Submodule mixer updated 94 files
+7 −7 deploy/storage/base_bigtable_info.yaml
+1 −1 deploy/storage/bigquery.version
+104 −104 internal/server/place/golden/get_locations_ranking/country.json
+2 −1 internal/server/place/golden/get_place_stat_date_within_place/CA_County.json
+2 −1 internal/server/place/golden/get_place_stat_date_within_place/USA_State.json
+3 −3 internal/server/place/golden/get_related_locations/county.json
+4 −2 internal/server/stat/golden/get_stat_all/branch.json
+74 −59 internal/server/stat/golden/get_stat_all/result.json
+2 −1 internal/server/stat/golden/get_stat_date_within_place/CA_County.json
+37 −24 internal/server/stat/golden/get_stat_date_within_place/USA_State.json
+9 −5 internal/server/stat/golden/get_stats/census_pep.json
+3 −0 internal/server/statvar/formula/formula.go
+4 −0 internal/server/statvar/formula/formula_test.go
+13 −13 internal/server/translator/golden/query/statvar-obs.json
+105 −0 internal/server/v0/placestatvar/golden/get_place_stat_vars/california.json
+251 −2,010 internal/server/v0/placestatvar/golden/get_place_stat_vars/santa_clara.json
+1 −1 internal/server/v0/triple/golden/get_triples/limit1.json
+0 −41 internal/server/v0/triple/golden/get_triples/observation.json
+1 −1 internal/server/v1/info/golden/bulk_variable_group_info/sqlite.json
+384 −50 internal/server/v1/info/golden/bulk_variable_info/bulk_bt_and_sql.json
+392 −58 internal/server/v1/info/golden/bulk_variable_info/bulk_result.json
+4 −4 internal/server/v1/info/golden/variable_group_info/demographics.json
+2 −2 internal/server/v1/info/golden/variable_group_info/demographics_gbr.json
+6 −6 internal/server/v1/info/golden/variable_group_info/root.json
+2 −2 internal/server/v1/info/golden/variable_info/total_crimes.json
+9 −0 internal/server/v1/observationdates/golden/observation_dates_linked/CA_County.json
+146 −84 internal/server/v1/observationdates/golden/observation_dates_linked/USA_State.json
+9 −8 internal/server/v1/observations/golden/bulk_point/all_2010.json
+24 −23 internal/server/v1/observations/golden/bulk_point/all_latest.json
+12 −11 internal/server/v1/observations/golden/bulk_point/preferred_latest.json
+116 −116 internal/server/v1/observations/golden/bulk_point_linked/all_CA_County.json
+204 −218 internal/server/v1/observations/golden/bulk_point_linked/all_Country.json
+20 −19 internal/server/v1/observations/golden/bulk_point_linked/all_FRA_AA2_2016.json
+584 −584 internal/server/v1/observations/golden/bulk_point_linked/all_US_State.json
+116 −116 internal/server/v1/observations/golden/bulk_point_linked/preferred_CA_County.json
+5 −9 internal/server/v1/observations/golden/bulk_point_linked/preferred_Country.json
+14 −13 internal/server/v1/observations/golden/bulk_point_linked/preferred_FRA_AA2_2016.json
+306 −306 internal/server/v1/observations/golden/bulk_point_linked/preferred_US_State.json
+195 −98 internal/server/v1/observations/golden/bulk_series/all_result.json
+21 −5 internal/server/v1/observations/golden/bulk_series/preferred_result.json
+513 −247 internal/server/v1/observations/golden/bulk_series_linked/all_FRA_AA2_2016.json
+422 −209 internal/server/v1/observations/golden/bulk_series_linked/preferred_FRA_AA2_2016.json
+6 −2 internal/server/v1/observations/golden/derived_series/case1.json
+5 −4 internal/server/v1/page/golden/place_page/asm.Crime.json
+527 −433 internal/server/v1/page/golden/place_page/asm.Demographics.json
+7 −5 internal/server/v1/page/golden/place_page/asm.Economics.json
+5 −4 internal/server/v1/page/golden/place_page/asm.Education.json
+5 −4 internal/server/v1/page/golden/place_page/asm.Energy.json
+5 −4 internal/server/v1/page/golden/place_page/asm.Environment.json
+5 −4 internal/server/v1/page/golden/place_page/asm.Equity.json
+16 −11 internal/server/v1/page/golden/place_page/asm.Health.json
+5 −4 internal/server/v1/page/golden/place_page/asm.Housing.json
+35 −16 internal/server/v1/page/golden/place_page/asm.Overview.json
+102 −99 internal/server/v1/page/golden/place_page/ca.Crime.json
+196 −101 internal/server/v1/page/golden/place_page/ca.Demographics.json
+118 −107 internal/server/v1/page/golden/place_page/ca.Economics.json
+102 −99 internal/server/v1/page/golden/place_page/ca.Education.json
+102 −99 internal/server/v1/page/golden/place_page/ca.Energy.json
+102 −99 internal/server/v1/page/golden/place_page/ca.Environment.json
+102 −99 internal/server/v1/page/golden/place_page/ca.Equity.json
+112 −105 internal/server/v1/page/golden/place_page/ca.Health.json
+102 −99 internal/server/v1/page/golden/place_page/ca.Housing.json
+142 −115 internal/server/v1/page/golden/place_page/ca.Overview.json
+71 −69 internal/server/v1/page/golden/place_page/county.Overview.json
+132 −0 internal/server/v1/variables/golden/bulk_variables/california.json
+328 −2,010 internal/server/v1/variables/golden/bulk_variables/california_and_santa_clara_union.json
+1 −1 internal/server/v2/facet/golden/contained_in_facet/CA_County_all.json
+6 −13 internal/server/v2/facet/golden/contained_in_facet/country.json
+10 −10 internal/server/v2/facet/golden/series_facet/series_facet.json
+9 −2 internal/server/v2/observation/calculation.go
+122 −26 internal/server/v2/observation/calculation_util.go
+101 −43 internal/server/v2/observation/calculation_util_test.go
+1 −1 internal/server/v2/observation/golden/calculation/custom_data.json
+35 −20 internal/server/v2/observation/golden/contained_in_2015/FRA_AA2.json
+405 −173 internal/server/v2/observation/golden/contained_in_all/CA_County.json
+563 −294 internal/server/v2/observation/golden/contained_in_all/FRA_AA2.json
+116 −116 internal/server/v2/observation/golden/contained_in_latest/CA_County.json
+204 −226 internal/server/v2/observation/golden/contained_in_latest/Country.json
+378 −378 internal/server/v2/observation/golden/contained_in_latest/US_State.json
+8 −4 internal/server/v2/observation/golden/derived_series/case1.json
+9 −8 internal/server/v2/observation/golden/direct/2010.json
+9 −8 internal/server/v2/observation/golden/direct/2015.json
+216 −119 internal/server/v2/observation/golden/direct/all.json
+4 −4 internal/server/v2/observation/golden/direct/facet_id_filter.json
+4 −4 internal/server/v2/observation/golden/direct/filter.json
+42 −41 internal/server/v2/observation/golden/direct/latest.json
+4 −4 internal/server/v2/observation/golden/direct/multi_facet_id_filter.json
+4 −4 internal/server/v2/observation/golden/direct/multi_filter.json
+75 −0 internal/server/v2/observation/golden/variable/result.json
+6 −3 internal/sqldb/createtables.go
+4 −1 internal/store/files/recon_name_to_types.json
+ test/datacommons.db
+10 −2 test/statvar_ranking/missing_Earth_country_rankings.json
+1 −1 test/triples.csv
3 changes: 2 additions & 1 deletion nl_requirements.txt
@@ -12,4 +12,5 @@ torchvision==0.17.2
# TODO: this is pinned because latest huggingface_hub is not compatible with
# sentence-transformers v2.2.2. Look into upgrading sentence-transformers to
# v2.3.0 or newer
huggingface_hub==0.25.2
huggingface_hub==0.25.2
transformers==4.45.2
Original file line number Diff line number Diff line change
@@ -827,6 +827,35 @@
"denom": "Count_Person",
"title": "People With Only Private Health Insurance in Counties of California"
},
{
"columns": [
{
"tiles": [
{
"statVarKey": [
"Count_Death"
],
"title": "Number of Deaths in California",
"type": "LINE"
}
]
},
{
"tiles": [
{
"description": "Number of Deaths in California",
"statVarKey": [
"Count_Death"
],
"title": "Number of Deaths in California",
"type": "HIGHLIGHT"
}
]
}
],
"denom": "Count_Person",
"title": "Number of Deaths"
},
{
"columns": [
{