Make nextstrain build scripts consistent #868

Merged
jgadling merged 6 commits into trunk from jgadling/consistent-nextstrain-builds on Dec 15, 2021

Conversation

jgadling (Contributor):

Summary:

  • What: Use more consistent S3 file paths for nextstrain outputs and logs
  • Ticket: sc<fill_in_issue_number>
  • Env: <rdev link>

Demos:

We were writing debug information to different S3 destinations for ondemand vs scheduled runs, and also using a different set of variables to generate these paths in two scripts that are (mostly) identical. This updates the scripts to use the same S3 paths and variables wherever possible, so that they're more alike than different.

I've also updated some of our local dev configuration to make more of these scripts work properly (and be more testable) in the local dev environment.

Notes:

Checklist:

  • I merged latest <base branch>
  • I manually verified the change
  • I added labels to my PR
  • I tested in multiple browsers
  • I added relevant unit tests
  • I have notified others of changes they need to make locally (migrations, jobs, package updates, etc)

@jgadling requested review from danrlu and lvreynoso on December 15, 2021 19:18
@@ -217,6 +217,8 @@ services:
genepinet:
aliases:
- nextstrain.genepinet.localdev
volumes:
- ./src/backend:/usr/src/app
jgadling (Contributor Author):

This means we can run a nextstrain container locally with our local code changes in it.
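
As a usage sketch (the service name "nextstrain" here is an assumption for illustration, not something this diff confirms): with the bind mount in place, a one-off container sees the live ./src/backend checkout rather than whatever was baked into the image.

# Hypothetical example -- the service name is assumed; the point is that edits
# under ./src/backend show up in the container without rebuilding the image.
docker compose run --rm nextstrain ls /usr/src/app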

@@ -39,6 +39,8 @@ RUN mkdir /ncov && \
git remote add origin https://github.com/nextstrain/ncov.git && \
git fetch origin master && \
git reset --hard FETCH_HEAD
RUN mkdir -p /ncov/auspice
RUN mkdir -p /ncov/logs
jgadling (Contributor Author):

Make sure these directories always exist when we do a phylo run, so we don't get failures when trying to copy debug logs to s3

export aws="aws --endpoint-url ${BOTO_ENDPOINT_URL}"
else
export aws="aws"
fi
jgadling (Contributor Author):

This makes it so we can run this script in local-dev (which needs the --endpoint-url flag) more easily.
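
For reference, a minimal sketch of the whole pattern (the matching if-condition appears in the scheduled-run diff further down; the variable names are the ones used in this PR):

# Point the AWS CLI at the localstack endpoint when BOTO_ENDPOINT_URL is set
# (local dev), otherwise use the plain CLI.
if [ ! -z "${BOTO_ENDPOINT_URL}" ]; then
    export aws="aws --endpoint-url ${BOTO_ENDPOINT_URL}"
else
    export aws="aws"
fi

# Later calls all go through the wrapper, e.g.:
$aws s3 ls "s3://${aspen_s3_db_bucket}/"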

# fetch aspen config
-genepi_config="$(aws secretsmanager get-secret-value --secret-id $GENEPI_CONFIG_SECRET_NAME --query SecretString --output text)"
+genepi_config="$($aws secretsmanager get-secret-value --secret-id $GENEPI_CONFIG_SECRET_NAME --query SecretString --output text)"
jgadling (Contributor Author):

We're using the $aws variable we set above with optional --endpoint-url flags to run aws cli commands.

aspen_s3_db_bucket="$(jq -r .S3_db_bucket <<< "$genepi_config")"
key_prefix="phylo_run/${S3_FILESTEM}/${WORKFLOW_ID}"
s3_prefix="s3://${aspen_s3_db_bucket}/${key_prefix}"
jgadling (Contributor Author):

Set our prefixes once so we can't accidentally get them wrong later in this script.
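
For illustration, with hypothetical values borrowed from the discussion below (S3_FILESTEM="Tuolomne Contextual", WORKFLOW_ID=12345), the variables resolve to:

# key_prefix -> phylo_run/Tuolomne Contextual/12345
# s3_prefix  -> s3://${aspen_s3_db_bucket}/phylo_run/Tuolomne Contextual/12345
# and the tree JSON is uploaded to ${s3_prefix}/ncov_aspen.json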


aws configure set region $AWS_REGION

if [ ! -z "${BOTO_ENDPOINT_URL}" ]; then
jgadling (Contributor Author):

The changes to this file are ~identical to the changes to the ondemand script; they just make the two scripts more similar.

@@ -34,7 +34,8 @@ ${local_aws} secretsmanager update-secret --secret-id genepi-config --secret-str
"DB_rw_username": "user_rw",
"DB_rw_password": "password_rw",
"DB_address": "database.genepinet.localdev",
"S3_external_auspice_bucket": "genepi-external-auspice-data"
"S3_external_auspice_bucket": "genepi-external-auspice-data",
"S3_db_bucket": "genepi-db-data"
jgadling (Contributor Author):

Add some config info in local dev that we can use in the nextstrain run scripts
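
For context, a sketch of how the run scripts consume this secret (the lookup lines are lifted from the build-script changes above; it's assumed that GENEPI_CONFIG_SECRET_NAME resolves to "genepi-config" in local dev):

# Fetch the secret and pull individual values out of it with jq.
genepi_config="$($aws secretsmanager get-secret-value --secret-id $GENEPI_CONFIG_SECRET_NAME --query SecretString --output text)"
aspen_s3_db_bucket="$(jq -r .S3_db_bucket <<< "$genepi_config")"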

@@ -94,6 +95,8 @@ ${local_aws} ssm put-parameter --name /genepi/local/localstack/pangolin-ondemand

echo "Creating s3 buckets"
${local_aws} s3api head-bucket --bucket genepi-external-auspice-data || ${local_aws} s3 mb s3://genepi-external-auspice-data
${local_aws} s3api head-bucket --bucket genepi-db-data || ${local_aws} s3 mb s3://genepi-db-data
${local_aws} s3api head-bucket --bucket genepi-gisaid-data || ${local_aws} s3 mb s3://genepi-gisaid-data
jgadling (Contributor Author):

Create the buckets we'll need so we can run builds in local dev.
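
A quick sanity check against the local stack, assuming ${local_aws} already wraps the right endpoint:

# All three genepi-* buckets should be listed after the setup script runs.
${local_aws} s3 ls | grep genepi-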

s3_resource.Bucket(gisaid_s3_bucket).Object(processed_sequences_s3_key).put(Body="")
s3_resource.Bucket(gisaid_s3_bucket).Object(processed_metadata_s3_key).put(Body="")
s3_resource.Bucket(gisaid_s3_bucket).Object(aligned_sequences_s3_key).put(Body="")
s3_resource.Bucket(gisaid_s3_bucket).Object(aligned_metadata_s3_key).put(Body="")
jgadling (Contributor Author):

Actually write some gisaid files to s3 so our nextstrain scripts can download them in local dev.
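
A quick way to confirm the seeded objects in local dev (assuming the local gisaid bucket is the genepi-gisaid-data bucket created above):

# List everything the setup script wrote into the local gisaid bucket.
$aws s3 ls s3://genepi-gisaid-data/ --recursive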

@jgadling changed the title from "Jgadling/consistent nextstrain builds" to "Make nextstrain build scripts consistent" on Dec 15, 2021
@jgadling mentioned this pull request on Dec 15, 2021
lvreynoso (Contributor) left a comment:

This looks great! I love that you added support for local testing, awesome job!

jgadling (Contributor Author), replying to lvreynoso:

Yeah, this doesn't get us to 100% local-testability yet, but it's a first step... 🤷‍♀️

@jgadling merged commit a820acd into trunk on Dec 15, 2021
@jgadling deleted the jgadling/consistent-nextstrain-builds branch on December 15, 2021 19:35

# upload the tree to S3
-key="phylo_run/${build_date}/${S3_FILESTEM}/${WORKFLOW_ID}/ncov.json"
-aws s3 cp /ncov/auspice/ncov_aspen.json "s3://${aspen_s3_db_bucket}/${key}"
+key="${key_prefix}/ncov_aspen.json"
Collaborator:

This key is used in the save.py code below; I'm not sure what it's for. Just checking.

jgadling (Contributor Author):

It tells save.py which S3 path to write to the DB, so we can find the tree later and make it available to auspice.
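
In other words (paths per the diff above; the save.py behavior is as described here, not verified from its source):

# The tree is uploaded to:
#   s3://${aspen_s3_db_bucket}/${key_prefix}/ncov_aspen.json
# and save.py records that bucket/key in the DB row for the run, so the
# tree can be located and served to auspice later.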

key="phylo_run/${build_date}/${S3_FILESTEM}/${WORKFLOW_ID}/ncov.json"
aws s3 cp /ncov/auspice/ncov_aspen.json "s3://${aspen_s3_db_bucket}/${key}"
key="${key_prefix}/ncov_aspen.json"
$aws s3 cp /ncov/auspice/ncov_aspen.json "s3://${aspen_s3_db_bucket}/${key}"
danrlu (Collaborator), Dec 15, 2021:

Sometimes we get more than one JSON in one folder, so in the older code the JSONs are renamed in this step (and builds.yaml needs it too). In the current setup, will each tree run have its own folder so this won't happen anymore?

[screenshot attached]

jgadling (Contributor Author), Dec 15, 2021:

The new path will create a folder prefix for every phylo tree:
https://github.com/chanzuckerberg/aspen/pull/868/files#diff-510b082ad0018f47841b06cb3ebcc9910c1e3487976975fcb68ff759f89b0dbbR29

So the files will be like /phylo_run/Tuolomne Contextual/12345/ncov_aspen.json

Collaborator:

i'll add this to my tree wiki!

jgadling (Contributor Author):

We can also change the way we format these paths -- if there's anything that you think is easier to understand, let me know!

danrlu (Collaborator), Dec 15, 2021:

It's great as you have it here! This is going to make tree debugging a lot easier. I spend a lot of time clicking through the 1234 folders every time TnT

if [ ! -e /ncov/data/sequences_aspen.fasta ]; then
cp /ncov/data/references_sequences.fasta /ncov/data/sequences_aspen.fasta;
cp /ncov/data/references_metadata.tsv /ncov/data/metadata_aspen.tsv;
fi;
Collaborator:

👍

danrlu (Collaborator) commented on Dec 15, 2021:

Hahaha, I was reviewing a merged PR? Shhhhh, nobody tell anyone.
