Make nextstrain build scripts consistent #868

Merged
jgadling merged 6 commits into trunk from jgadling/consistent-nextstrain-builds on Dec 15, 2021

Conversation

jgadling (Contributor):

Summary:

  • What: Use more consistent S3 file paths for nextstrain outputs and logs
  • Ticket: sc<fill_in_issue_number>
  • Env: <rdev link>

Demos:

We were writing debug information to different S3 destinations for ondemand vs scheduled runs, and also using a different set of variables to generate these paths in two scripts that are (mostly) identical. This updates the scripts to use the same S3 paths and variables wherever possible, so that they're more alike than different.

I've also updated some of our local dev configuration to make more of these scripts work properly (and be more testable) in the local dev environment.

Notes:

Checklist:

  • I merged latest <base branch>
  • I manually verified the change
  • I added labels to my PR
  • I tested in multiple browsers
  • I added relevant unit tests
  • I have notified others of changes they need to make locally (migrations, jobs, package updates, etc)

@jgadling requested review from danrlu and lvreynoso on December 15, 2021 19:18
@@ -217,6 +217,8 @@ services:
genepinet:
aliases:
- nextstrain.genepinet.localdev
volumes:
- ./src/backend:/usr/src/app
jgadling (Contributor Author):

This means we can run a nextstrain container locally with our local code changes in it.
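
As a usage sketch (the service name "nextstrain" here is an assumption for illustration, not something this diff confirms): with the bind mount in place, a one-off container sees the live ./src/backend checkout rather than whatever was baked into the image.

# Hypothetical example -- the service name is assumed; the point is that edits
# under ./src/backend show up in the container without rebuilding the image.
docker compose run --rm nextstrain ls /usr/src/app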

@@ -39,6 +39,8 @@ RUN mkdir /ncov && \
git remote add origin https://github.com/nextstrain/ncov.git && \
git fetch origin master && \
git reset --hard FETCH_HEAD
RUN mkdir -p /ncov/auspice
RUN mkdir -p /ncov/logs
jgadling (Contributor Author):

Make sure these directories always exist when we do a phylo run, so we don't get failures when trying to copy debug logs to s3

export aws="aws --endpoint-url ${BOTO_ENDPOINT_URL}"
else
export aws="aws"
fi
jgadling (Contributor Author):

This makes it so we can run this script in local-dev (which needs the --endpoint-url flag) more easily.
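
For reference, a minimal sketch of the whole pattern (the matching if-condition appears in the scheduled-run diff further down; the variable names are the ones used in this PR):

# Point the AWS CLI at the localstack endpoint when BOTO_ENDPOINT_URL is set
# (local dev), otherwise use the plain CLI.
if [ ! -z "${BOTO_ENDPOINT_URL}" ]; then
    export aws="aws --endpoint-url ${BOTO_ENDPOINT_URL}"
else
    export aws="aws"
fi

# Later calls all go through the wrapper, e.g.:
$aws s3 ls "s3://${aspen_s3_db_bucket}/"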

# fetch aspen config
-genepi_config="$(aws secretsmanager get-secret-value --secret-id $GENEPI_CONFIG_SECRET_NAME --query SecretString --output text)"
+genepi_config="$($aws secretsmanager get-secret-value --secret-id $GENEPI_CONFIG_SECRET_NAME --query SecretString --output text)"
jgadling (Contributor Author):

We're using the $aws variable we set above with optional --endpoint-url flags to run aws cli commands.

aspen_s3_db_bucket="$(jq -r .S3_db_bucket <<< "$genepi_config")"
key_prefix="phylo_run/${S3_FILESTEM}/${WORKFLOW_ID}"
s3_prefix="s3://${aspen_s3_db_bucket}/${key_prefix}"
jgadling (Contributor Author):

Set our prefixes once so we can't accidentally get them wrong later in this script.
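
For illustration, with hypothetical values borrowed from the discussion below (S3_FILESTEM="Tuolomne Contextual", WORKFLOW_ID=12345), the variables resolve to:

# key_prefix -> phylo_run/Tuolomne Contextual/12345
# s3_prefix  -> s3://${aspen_s3_db_bucket}/phylo_run/Tuolomne Contextual/12345
# and the tree JSON is uploaded to ${s3_prefix}/ncov_aspen.json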


aws configure set region $AWS_REGION

if [ ! -z "${BOTO_ENDPOINT_URL}" ]; then
jgadling (Contributor Author):

The changes to this file are ~identical to the changes to the ondemand script; they just make the two scripts more similar.

@@ -34,7 +34,8 @@ ${local_aws} secretsmanager update-secret --secret-id genepi-config --secret-str
"DB_rw_username": "user_rw",
"DB_rw_password": "password_rw",
"DB_address": "database.genepinet.localdev",
"S3_external_auspice_bucket": "genepi-external-auspice-data"
"S3_external_auspice_bucket": "genepi-external-auspice-data",
"S3_db_bucket": "genepi-db-data"
jgadling (Contributor Author):

Add some config info in local dev that we can use in the nextstrain run scripts
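
For context, a sketch of how the run scripts consume this secret (the lookup lines are lifted from the build-script changes above; it's assumed that GENEPI_CONFIG_SECRET_NAME resolves to "genepi-config" in local dev):

# Fetch the secret and pull individual values out of it with jq.
genepi_config="$($aws secretsmanager get-secret-value --secret-id $GENEPI_CONFIG_SECRET_NAME --query SecretString --output text)"
aspen_s3_db_bucket="$(jq -r .S3_db_bucket <<< "$genepi_config")"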

@@ -94,6 +95,8 @@ ${local_aws} ssm put-parameter --name /genepi/local/localstack/pangolin-ondemand

echo "Creating s3 buckets"
${local_aws} s3api head-bucket --bucket genepi-external-auspice-data || ${local_aws} s3 mb s3://genepi-external-auspice-data
${local_aws} s3api head-bucket --bucket genepi-db-data || ${local_aws} s3 mb s3://genepi-db-data
${local_aws} s3api head-bucket --bucket genepi-gisaid-data || ${local_aws} s3 mb s3://genepi-gisaid-data
jgadling (Contributor Author):

Create the buckets we'll need so we can run builds in local dev.
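
A quick sanity check against the local stack, assuming ${local_aws} already wraps the right endpoint:

# All three genepi-* buckets should be listed after the setup script runs.
${local_aws} s3 ls | grep genepi-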

s3_resource.Bucket(gisaid_s3_bucket).Object(processed_sequences_s3_key).put(Body="")
s3_resource.Bucket(gisaid_s3_bucket).Object(processed_metadata_s3_key).put(Body="")
s3_resource.Bucket(gisaid_s3_bucket).Object(aligned_sequences_s3_key).put(Body="")
s3_resource.Bucket(gisaid_s3_bucket).Object(aligned_metadata_s3_key).put(Body="")
jgadling (Contributor Author):

Actually write some gisaid files to s3 so our nextstrain scripts can download them in local dev.
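
A quick way to confirm the seeded objects in local dev (assuming the local gisaid bucket is the genepi-gisaid-data bucket created above):

# List everything the setup script wrote into the local gisaid bucket.
$aws s3 ls s3://genepi-gisaid-data/ --recursive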

@jgadling changed the title from "Jgadling/consistent nextstrain builds" to "Make nextstrain build scripts consistent" on Dec 15, 2021
@jgadling mentioned this pull request on Dec 15, 2021
lvreynoso (Contributor) left a comment:

This looks great! I love that you added support for local testing, awesome job!

jgadling (Contributor Author), replying to lvreynoso:

Yeah, this doesn't get us to 100% local-testability yet, but it's a first step... 🤷‍♀️

@jgadling merged commit a820acd into trunk on Dec 15, 2021
@jgadling deleted the jgadling/consistent-nextstrain-builds branch on December 15, 2021 19:35

# upload the tree to S3
-key="phylo_run/${build_date}/${S3_FILESTEM}/${WORKFLOW_ID}/ncov.json"
-aws s3 cp /ncov/auspice/ncov_aspen.json "s3://${aspen_s3_db_bucket}/${key}"
+key="${key_prefix}/ncov_aspen.json"
Collaborator:

This key is used in the save.py code below; I'm not sure what it's for. Just checking.

jgadling (Contributor Author):

It tells save.py which S3 path to write to the DB, so we can find the tree later and make it available to auspice.
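
In other words (paths per the diff above; the save.py behavior is as described here, not verified from its source):

# The tree is uploaded to:
#   s3://${aspen_s3_db_bucket}/${key_prefix}/ncov_aspen.json
# and save.py records that bucket/key in the DB row for the run, so the
# tree can be located and served to auspice later.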

key="phylo_run/${build_date}/${S3_FILESTEM}/${WORKFLOW_ID}/ncov.json"
aws s3 cp /ncov/auspice/ncov_aspen.json "s3://${aspen_s3_db_bucket}/${key}"
key="${key_prefix}/ncov_aspen.json"
$aws s3 cp /ncov/auspice/ncov_aspen.json "s3://${aspen_s3_db_bucket}/${key}"
danrlu (Collaborator), Dec 15, 2021:

Sometimes we get more than one JSON in one folder, so in the older code the JSONs are renamed in this step (and builds.yaml needs it too). In the current setup, will each tree run have its own folder so this won't happen anymore?

[screenshot attached]

jgadling (Contributor Author), Dec 15, 2021:

The new path will create a folder prefix for every phylo tree:
https://github.com/chanzuckerberg/aspen/pull/868/files#diff-510b082ad0018f47841b06cb3ebcc9910c1e3487976975fcb68ff759f89b0dbbR29

So the files will be like /phylo_run/Tuolomne Contextual/12345/ncov_aspen.json

Collaborator:

i'll add this to my tree wiki!

jgadling (Contributor Author):

We can also change the way we format these paths -- if there's anything that you think is easier to understand, let me know!

danrlu (Collaborator), Dec 15, 2021:

It's great as you have it here! This is going to make tree debugging a lot easier. I spend a lot of time clicking through the 1234 folders every time TnT

if [ ! -e /ncov/data/sequences_aspen.fasta ]; then
cp /ncov/data/references_sequences.fasta /ncov/data/sequences_aspen.fasta;
cp /ncov/data/references_metadata.tsv /ncov/data/metadata_aspen.tsv;
fi;
Collaborator:

👍

danrlu (Collaborator) commented on Dec 15, 2021:

Hahaha, I was reviewing a merged PR? Shhhhh, nobody tell anyone.
