-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor workflow #76
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for putting this together, @rneher! This is already a major improvement over the current workflow both in the separation of concerns into separate rules and the abstraction of the build configs into separate config files. I like that we can use the powerful array builds approach when we want, but that the simpler flat config approach works just as well.
My biggest structural questions have to do with who we think will use this workflow:
- Do we want to support this as a canonical seasonal flu workflow that other public health organizations can run on their own data like we do for SARS-CoV-2?
- Can we support definition of specific input sequences and metadata in the config, when fauna inputs are not an option (both for our own internal uses and for external users).
In reviewing this and what a user's experience would be like working from GISAID, I noticed that we can now download ~20,000 sequences at once from GISAID. This means most people could run some powerful analyses from a couple of simple GISAID downloads. For an H3N2 HA and NA analysis, for example, this would produce two FASTA files like gisaid_epiflu_sequence_2015-2018_ha.fasta
and gisaid_epiflu_sequence_2015-2018_na.fasta
that would need to run through augur parse
to get the separate sequences and metadata. Can we support this simple use case in this workflow even for our own work?
It would also be nice to eventually move all of the configuration bits out of the Snakefiles and into config files. For example, clock rates, gene names, etc. are all candidates to move into config files where they are more visible, easier for non-Snakemake experts to modify, etc.
Snakefile
Outdated
"data" | ||
shell: | ||
"rm -rfv {params}" | ||
[f"auspice/{b}_ha.json" for b in config[build_dir + ""]] + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn’t understand why we needed this additional syntax/functionality for composing build directories. Is this the approach to creating "WHO" and "live" builds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I this should be rethought. this was mostly my quick way of hacking together various targets for testing. But being able to specify an auspice dir has been useful to run different analysis and profile in the same directory and not struggle with clean removing stuff you want to keep.
@@ -0,0 +1,86 @@ | |||
data_source: fauna |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that fauna is abstracted away in this implementation.
Do we want to allow users (including us) to define their own input sequences/metadata as we do in ncov? This would provide a path for public health partners to use our workflow from GISAID downloads without needing access to fauna. The bigger question is: do we want to support this as a public Nextstrain workflow for flu?
--nthreads {threads} | ||
""" | ||
|
||
rule sanitize_trees: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we don't pair strains by segment earlier than this in the workflow? It seems like the metadata merge would be the earliest place to ensure we have the same strains. Then we wouldn't need this additional rule to sanitize the trees...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that the new script for joining metadata (scripts/join_metadata.py
) supports inner joins of the data, so we could choose to limit analyses to strains with all requested segments at the beginning of the workflow. The steps that follow that join and work from the joined metadata should always pull the same strains for tree building...
workflow/snakemake_rules/export.smk
Outdated
if config.get('titer-models',False): | ||
inputs.append(rules.titers_sub.output.titers_model) | ||
inputs.append(rules.titers_tree.output.titers_model) | ||
if config.get('glycosilation', False) and wildcards.segment in ['ha', 'na']: | ||
inputs.append(rules.glyc.output.glyc) | ||
if config.get('lbi', False) and wildcards.segment in ['ha', 'na']: | ||
inputs.append(rules.lbi.output.lbi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These top-level boolean keys like titer-models
don't allow us to limit which segments to apply the models to from the config, so we'll run titer models on NA, PB1, etc. Could we move these types of parameters into the build definitions (like the vaccines
option below) instead so we have more control over which builds use which annotations?
Then we can eliminate all of these additional segment tests that happen for glycosylation, LBI, etc.
def get_subsample_input(w): | ||
files = {"metadata": f"data/{config['builds'][w.build_name]['lineage']}/metadata.tsv"} | ||
if config['builds'][w.build_name]['subsamples'][w.subsample].get('priorities', '')=='titers': | ||
files['titers']=build_dir + f"/{w.build_name}/titer_priorities.tsv" | ||
return files | ||
|
||
rule subsample: | ||
input: | ||
unpack(get_subsample_input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of unpacking output from a wildcard function like this, what if we allow users to define the “titer priorities” file as a subsample input in the config file (e.g., priorities_file: "{build_name}/titer_priorities.tsv"
)? This would allow Snakemake to build the titer priorities on the fly, as needed, and also allow users to define their own priorities file. This would decouple the titers bit from the priorities bit.
and the meta data subset. | ||
|
||
input: | ||
- "data/lineage/metadata.tsv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our discussion today, we want to support users who download separate segment FASTAs from GISAID. We may need to support files like data/lineage/metadata_segment.tsv
@joverlee521 made the point that we also want to support example data from open data with an example Auspice JSON so users can confirm that the basic workflow will work for them as a tutorial. We also need to move the distance files, fitness components, etc. still. I will try running the workflow with standard GISAID downloads and determine what minimal changes need to happen for this to work. |
|
||
localrules: clades, sanitize_trees | ||
|
||
build_dir = config.get("build_dir", "builds") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this workflow more, I was surprised by this directory as the place to look for my output. The rest of our workflows use results/
with the contents of that directory depending on the pathogen workflow (zika
stores a single set of outputs top-level while ncov
nests build-specific outputs in subdirectories). Can we switch back to that here?
Gene maps represent configuration data and not workflow functionality. This change sets us up to move these data into a configuration file in the future.
The public builds rely on CDC ferret titers only which are not available yet in our S3 bucket, so we need to download from fauna for now.
Keeps 6-year and 12-year Yam trees for historical reasons.
Actually, we don't seem to be running these builds on the public site currently, so there is no need to keep updating the 6- and 12-year builds.
The public flu builds that don't use multiple forecasting models don't need to use the HI antigenic novelty model with a "cell_fra" suffix to distinguish it from other similar models.
Restores public Vic clade annotations.
Allows the list of regions to be a superset of the list of frequencies available per region, to handle the case when a given region has no data for a build and therefore has no frequencies. The new logic checks for the presence of each given region name (e.g., "Japan Korea") in a given frequencies JSON filename to find a list of available regions.
Fixes a data-dependent bug when the workflow tries to estimate region-specific frequencies for a region that has no aligned amino acid sequences. This fix includes two important steps: 1. augur filter run on amino acid sequence alignments to select data for a specific region is allowed to have empty output (printing only a warning, so we can track these events in logs), so the workflow can create an empty FASTA file for regions without data. These empty files represent that state of missing data. This implementation uses a new argument for augur filter provided in version 21.1.0. This commit updates the Conda environment accordingly. 2. The workflow conditionally builds a list of regional mutation frequencies to estimate for each build based on which amino acid sequence files are not empty. Because there can be multiple "genes" per segment, the workflow requires each region to have non-empty amino acid sequences for all genes in a segment. This implementation relies on converting the augur filter rule to a "checkpoint", so Snakemake knows to update the DAG after that rule runs. The aggregation of which region frequencies to estimate happens by calling an input function to the final global frequencies rule that checks the content of the filtered FASTA files and builds the list of region frequency JSONs to create. This last step works because the global frequency estimation script accepts a list of all possible regions and a list of available frequency JSONs.
Sorts rows in the measurements panel by the minimum y-axis position of the clade associated with a reference strain, clade, haplotype, or serum id in descending order. This sorting aligns the measurements panel entries more closely to the phylogenetic layout. Within each clade, rows are sorted alphabetically in ascending order, to facilitate lookups by name.
Adds vaccine strains recently selected for H1, corrects H3 vaccine strains, and adds recent egg-passaged vaccine strains to the JSONs, since these strains are included in our private nextflu builds.
Corrects the nucleotide positions of specific alleles expected per clade to reflect the new reference sequence coordinates for H3N2 and Vic.
Adds a `year_month` annotation through the existing `epiweeks` annotation that provides a coarser grain filtering by date. This kind of filtering is most useful in narrative writing, where we want to easily update a prior filter query string to reflect the latest data. Adding a new month to a list of year/months is much easier than adding all of the epiweeks.
Fix merge conflicts with `git merge -s ours master`, which effectively ignores changes in `master` when there are conflicts. Subsequent commits will add back changes that we _do_ want from master.
Corrects the HA gene coordinates for B/Vic such that HA1 and HA2 translations have the expected lengths (347 and 223, respectively) where previously the HA1 sequence was 1 amino acid longer at the end and the HA2 sequence was 1 shorter at the beginning. Documents the source of the new coordinates with the PDB record and corresponding publication.
Applies a standard UNIX sort to all outliers files, to simplify merging with outliers from `master` branch.
Updates outliers for H1N1pdm and H3N2 that were present in the master branch and not in this branch.
Uses UNIX sort -k 1,1 to sort strains ahead of merge with strain lists from master branch.
Updates lists of reference strains for H1N1pdm, H3N2, and Vic using strains in the master branch.
16065ea
to
b0f444f
Compare
This adds GH Action workflows for both the Nextstrain public builds and the private Nextflu builds. The `nextstrain build` commands are copied from the example commands in #76
This adds GH Action workflows for both the Nextstrain public builds and the private Nextflu builds. The `nextstrain build` commands are copied from the example commands in #76
Restore use of the `--description` option for `augur export v2`, which was lost during the major refactoring of the workflow in <#76> Allows for build specific descriptions, similar to the `auspice_config` based on array_builds.smk.¹ If a build specific description is not provided, then defaults to the workflow's `config/description.md`. ¹ <https://github.com/nextstrain/seasonal-flu/blob/b0ff43d8ccce7e28bf81350eaf8642c79908232f/workflow/snakemake_rules/array_builds.smk#L66-L67>
This PR refactors the old work flow that was spread out over files like
Snakefile_base
,Snakefile_WHO
, etc into a structure that follows the recommendations. In particular, it aims at splitting the workflows into substeps that have defined input and output files to minimize the inter-dependencies of different parts of the workflow. Complex input functions are avoided as much as possible to increase legibility.TODO
flu_seasonal_h3n2_2y_ha.json
from the refactored workflow needs to beflu_seasonal_h3n2_ha_2y.json
to match the live builds). Some options include:flu_seasonal_{lineage}_6m
but the auspice_name could beflu_seasonal_{lineage}_{{segment}}_6m
. There is some precedent for this approach in the auspice_json_prefix option in the ncov workflow.cp
) like we do for dated JSONs in ncov.auspice_name
build parameter. To get all of the properly named public builds we just need to run:snakemake --configfile profiles/nextstrain-public.yaml -- all_public
. Thisall_public
target will generate the expected files in theauspice_renamed/
directory.Running public builds (40 trees)
nextstrain build \ --aws-batch \ --detach \ --cpus 36 \ --memory 72gib \ . \ all_public \ -p \ --configfile profiles/nextstrain-public.yaml
Output is in
auspice_renamed/
.Running private.nextflu.org builds (80 trees)
nextstrain build \ --aws-batch \ --detach \ --cpus 36 \ --memory 72gib \ . \ all_who \ -p \ --configfile profiles/private.nextflu.org.yaml
Output is in
auspice-who/
.Future TODOs