Generalize Ingest #6

j23414 · 2022-11-17T23:17:24Z

Description of proposed changes

Ingest data from GenBank to generate:

data/sequences_all.fasta
data/metadata_all.tsv

for a dengue build.

~~Instead of separately pulling denv1 to denv4, all types are combined in one file with an annotated column:~~ (2023-03-25, to avoid confusion keep serotypes separate)

data/sequences_denv1.fasta
data/metadata_denv1.tsv
data/sequences_denv2.fasta
data/metadata_denv2.tsv
data/sequences_denv3.fasta
data/metadata_denv3.tsv
data/sequences_denv4.fasta
data/metadata_denv4.tsv

Unordered list of remaining tasks that may change:

Update docs
~~Pull and merge cached datasets to avoid recompute~~
~~Pull title from PubMed instead of Description line~~

Related issue(s)

Testing

Checks pass

Local Test

Can test this locally by running

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout new_ingest
cd ingest
nextstrain build .

ls -ltr data
wc -l data/metadata_all.tsv

~~May need to install tidyverse~~ (No longer need R since the script was refactored into python)

R
install.packages("tidyverse")

tsibley · 2022-11-23T00:20:05Z

I haven't done a detailed review, but a couple high-level comments to start off:

1. I don't think just-in-time downloading of programs from monkeypox is the way we want to reuse stuff. It seems to me that it will be fragile and hard to maintain over time.
2. While I personally enjoy multi-lingual projects and am sympathetic to the best tools being the ones you know (and do like tidyverse), I think we should replace the lone R program here with something in Python for consistency with the rest of the repo/ecosystem.

The R program also won't work in our Docker runtime (e.g. try it with nextstrain build --docker ingest/) and if it works in our Conda and ambient runtimes for someone, it's only by chance. We could add R support to our runtimes, but I think that's a bigger scope of work. I also noted that the R program loads the whole metadata file into memory when it could instead do a streaming transform (e.g. as csv-to-ndjson does).
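
As an illustration of the streaming approach mentioned above, here is a minimal sketch in plain Python. It is not the code from this PR or from csv-to-ndjson; the function names and the `accession` column are hypothetical, chosen only to show that each row can be transformed and written without holding the whole metadata file in memory.

```python
import csv


def stream_transform(infile, outfile, transform):
    """Read TSV rows one at a time, transform each, and write it out immediately.

    Only a single row is held in memory at any point, regardless of file size.
    """
    reader = csv.DictReader(infile, delimiter="\t")
    writer = None
    for row in reader:
        new_row = transform(row)
        if writer is None:
            # Create the writer lazily so the header reflects the transformed columns.
            writer = csv.DictWriter(
                outfile, fieldnames=list(new_row), delimiter="\t", lineterminator="\n"
            )
            writer.writeheader()
        writer.writerow(new_row)


def normalize_accession(row):
    """Hypothetical per-row transform: trim and uppercase the 'accession' column."""
    row = dict(row)
    row["accession"] = row["accession"].strip().upper()
    return row
```

Called with open file handles (or `sys.stdin` and `sys.stdout`), this composes with other streaming steps in a shell pipeline. Note that an empty input produces no header in this sketch.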

huddlej

This is great, @j23414! It's a big lift to convert tidyverse logic to pandas logic, especially since pandas has a huge learning curve and remains confusing in many ways even to people who have used it for years. Most of the comments below touch on Python conventions used throughout the other Nextstrain Python codebases, but there is an important note about the mapping of NCBI lineage ids to serotype names that's more of a data formatting consideration.

ingest/bin/post_process_metadata.py

j23414 · 2022-12-06T18:47:58Z

> I haven't done a detailed review, but a couple high-level comments to start off:
>
> 1. I don't think just-in-time downloading of programs from monkeypox is the way we want to reuse stuff. It seems to me that it will be fragile and hard to maintain over time.
> 2. While I personally enjoy multi-lingual projects and am sympathetic to the best tools being the ones you know (and do like tidyverse), I think we should replace the lone R program here with something in Python for consistency with the rest of the repo/ecosystem.
>
> The R program also won't work in our Docker runtime (e.g. try it with nextstrain build --docker ingest/) and if it works in our Conda and ambient runtimes for someone, it's only by chance. We could add R support to our runtimes, but I think that's a bigger scope of work. I also noted that the R program loads the whole metadata file into memory when it could instead do a streaming transform (e.g. as csv-to-ndjson does).

Point 2 is addressed with 2c8553a

Re: Point 1, I still disagree...mostly because I'm currently trying to propagate changes across zika and ebola ingests which is already tedious when copying changes across their snakefiles. One might argue to only polish ingest scripts on one repo (dengue), however in my experience that results in over-specialized scripts that become really difficult to generalize later.

To meet the final end-goal of Point 1 (if not the immediate developmental path of Point 1), I'm happy to make final copies in a future PR after I'm sure the pipeline works for:

~~Otherwise my dev path has already branched into:~~

~~https://github.com/nextstrain/dengue/tree/new_ingest_uniqmerge - to ensure cached information is not lost in a new ingest~~
~~https://github.com/nextstrain/dengue/tree/new_ingest_build - to add the modified build rules similar to nextstrain/zika@d9600f7~~

tsibley · 2022-12-12T23:27:54Z

@j23414 Nod. I don't want to get in the way of your active development.

As a final state, though, I still think we should not rely on this sort of dynamic downloading. It's got all the problems of a package management system (dependencies, versioning, updates, etc), without any of the solutions a mature package management system provides.

There's also the brittle, close coupling it introduces. Consider that someone changing files in the monkeypox workflow would (rightfully so, I'd argue) not think that modifying one of the programs in scripts/ could break the unrelated dengue workflow.

tsibley

I normally review PRs commit-by-commit, but this was not feasible here given the commit structure and sequence. Instead, I reviewed this PR as one large diff after excluding the initial verbatim copy of the ingest machinery from monkeypox ("add ingest from monkeypox repo" (645acb2)).

# HEAD was "fix: switch to augur curate normalize-strings" (b41fb45)
git diff --compact-summary --patch 645acb2..@

This exclusion helped reduce the amount under review, since it's been reviewed previously, and highlight what had to change from monkeypox to here.

A couple general top-level comments, before specifics:

* It'd be good to incorporate changes from "Parameterize ncbi_id in fetch_sequences" (mpox#146) here, one way or another. I'm sure that's your plan, but I'm calling it out so we don't forget.
* Running nextstrain build ingest produces files in ingest/.snakemake/ and ingest/logs/ that should be listed in a .gitignore file.

README.md

Snakefile

bin/set_final_strain_name.py

ingest/config/optional.yaml

ingest/workflow/snakemake_rules/fetch_sequences.smk

ingest/workflow/snakemake_rules/transform.smk

ingest/config/config.yaml

j23414 · 2023-04-12T00:00:35Z

Working on rebasing this! Do not review yet! Thanks @victorlin!!!

The dengue genome is approximately 10k to 11k bases long. Therefore, we can filter out any sequences that are shorter than 5k or longer than 15k. A list of added GenBank filters is below:

* Pull sequences longer than 5k but shorter than 15k
* Only pull VRL (viral) datasets (no PAT, i.e. patent, datasets)
* Pull the UpdateDate_dt entry to potentially only pull "recent data sets" in case the dataset gets too large

Co-authored-by: Jover Lee
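
The length and division filters described above can be sketched as a simple predicate. This is only an illustration: in the PR the filters are applied in the GenBank query itself, and the record keys `length` and `division` used here are hypothetical names, not fields confirmed by the source.

```python
MIN_LENGTH = 5_000   # sequences must be longer than this
MAX_LENGTH = 15_000  # and shorter than this


def passes_genbank_filters(record):
    """Return True if a record is a viral (VRL) entry of plausible dengue genome length.

    `record` is assumed to be a dict with hypothetical keys:
      - "division": GenBank division code such as "VRL" or "PAT"
      - "length": sequence length in bases
    """
    if record["division"] != "VRL":
        return False  # drops patent (PAT) and other non-viral divisions
    return MIN_LENGTH < record["length"] < MAX_LENGTH
```

For example, a 10,700-base VRL record passes, while a 10,700-base PAT record or a 3,000-base VRL fragment does not.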

Since the strain name may be in Isolate_s or Strain_s, we need to check both columns for a reasonable strain name. Dengue virus types denv1 to denv4 can be derived if their NCBI taxon IDs are listed in ViralLineage_IDs.

* derive strain name from Strain_s if Isolate_s is blank
* derive denv1 to denv4 depending on ViralLineage_IDs
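
A rough sketch of the two derivations described above follows. It is not the script from this PR. It assumes that ViralLineage_IDs is a comma-separated string of taxon IDs, and the taxon IDs in the mapping are what I believe NCBI Taxonomy lists for the four serotypes; both assumptions should be verified before reuse.

```python
# Assumed NCBI Taxonomy IDs for the four dengue serotypes (verify against NCBI).
SEROTYPE_BY_TAXON_ID = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def derive_strain(record):
    """Use Isolate_s when it is filled in, otherwise fall back to Strain_s."""
    isolate = (record.get("Isolate_s") or "").strip()
    if isolate:
        return isolate
    return (record.get("Strain_s") or "").strip()


def derive_serotype(record):
    """Return denv1..denv4 if a matching taxon ID is listed, else an empty string.

    Assumes ViralLineage_IDs is a comma-separated list of taxon IDs.
    """
    lineage_ids = (record.get("ViralLineage_IDs") or "").split(",")
    for taxon_id in lineage_ids:
        serotype = SEROTYPE_BY_TAXON_ID.get(taxon_id.strip())
        if serotype:
            return serotype
    return ""
```

Keeping the mapping in one dictionary makes the ID-to-name correspondence easy to review and to move into a config file later.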

* update help statement
* make --outfile required
* simplify reordering output columns
* nuanced viruslineage_ids processing
* when multiple paper urls, pick one
* 'strain' and 'strain_s' were populated by 'Isolate_s' and 'Strain_s' pulled from genbank_url

The following was added after discussion with trs:

Check for the non-"happy path" cases first and then return early (or error early, as the case may be). This leaves the "happy path" (or "expected path") as the remainder of the function.

* return early if publications is empty

Co-authored-by: Thomas Sibley
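
The "return early" convention described above might look like the following sketch. The function name and the assumption that multiple paper URLs arrive as one comma-separated string are hypothetical; the source only states that one URL is picked when several exist and that the function returns early when publications is empty.

```python
def pick_paper_url(paper_urls):
    """Return a single paper URL from a comma-separated string of candidates.

    Non-happy-path cases are handled first and return early, so the expected
    case is the only thing left at the bottom of the function.
    """
    if not paper_urls:
        return ""  # nothing published, nothing to pick

    candidates = [url.strip() for url in paper_urls.split(",") if url.strip()]
    if not candidates:
        return ""  # input contained only separators or whitespace

    # Happy path: at least one candidate; this sketch simply takes the first.
    return candidates[0]
```

Which URL to prefer when several are present is a design choice; taking the first keeps the output deterministic for a given input.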

To simplify the workflow, incorporate post processing directly into the transform step, instead of cleaning up strain names and setting the dengue serotype based on virus lineage ID in a separate step after the transform. This step was moved above any manual annotations. This also simplified the code so that we no longer have two code blocks determining the final metadata columns, which could have become inconsistent.

j23414 · 2024-05-23T22:14:51Z

This PR was superseded by merged PRs:

huddlej requested changes Nov 30, 2022

huddlej mentioned this pull request Dec 14, 2022

Ingest nextstrain/ebola#6

Draft

1 task

j23414 changed the title ~~WIP: New ingest [do not merge yet]~~ Generalize Ingest Jan 4, 2023

j23414 mentioned this pull request Mar 24, 2023

docs: drop obsolete fauna instructions #7

Merged

1 task

tsibley requested a review from a team March 24, 2023 23:37

j23414 force-pushed the new_ingest branch 2 times, most recently from 040c503 to 0ca3400 Compare March 25, 2023 10:18

tsibley self-requested a review March 27, 2023 22:30

tsibley requested changes Mar 27, 2023

j23414 force-pushed the new_ingest branch 2 times, most recently from b9865c9 to c932c00 Compare March 28, 2023 18:34

j23414 mentioned this pull request Mar 28, 2023

fix: fix and rewrite the help description nextstrain/mpox#147

Merged

1 task

j23414 force-pushed the new_ingest branch 10 times, most recently from 099600c to 430ff78 Compare April 5, 2023 02:40

j23414 requested a review from a team April 5, 2023 04:20

j23414 force-pushed the new_ingest branch from e74fc6b to 2b23284 Compare April 11, 2023 23:58

j23414 marked this pull request as draft April 11, 2023 23:59

j23414 force-pushed the new_ingest branch from 2b23284 to d44873f Compare April 12, 2023 00:21

j23414 and others added 22 commits August 18, 2023 21:42

[ingest] Simplify finding strain name

03c0819

Search for valid strain name in the following order: 'strain', 'strain_s', 'accession'. Move the order into configs instead of hardcoding it in the post_process_metadata.py script.
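
The configurable search order described in this commit can be sketched as below. The config key `strain_fields` and the function name are hypothetical; the source only states that the order 'strain', 'strain_s', 'accession' moved from the script into the configs.

```python
# Hypothetical config fragment, as it might look once loaded from YAML.
config = {"strain_fields": ["strain", "strain_s", "accession"]}


def find_strain_name(record, field_order):
    """Return the first non-blank value among `field_order`, or None if all are blank."""
    for field in field_order:
        value = (record.get(field) or "").strip()
        if value:
            return value
    return None
```

With this shape, changing the priority of the columns is a config edit rather than a code change.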

zstd compress output files

931f5ae

fix: makes the compress rule more generic

205165a

Co-authored-by: Thomas Sibley

Build: Index by genbank accession instead of duplicate strain names

5d7aa6d

Since some strains (or isolates) may be resequenced resulting in duplicate strain names in the dengue dataset, index entries by GenBank Accession IDs.

fix: remove entries where accession is not found

7c2da29

Could not find genbank accession from GenBank or prior sequences.fasta.zst files.
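
To illustrate why the two commits above switch the index to accession and drop entries without one, here is a small sketch. The record layout (dicts with `accession` and `strain` keys) is an assumption for illustration and not the PR's actual data structures.

```python
def index_by_accession(records):
    """Build a dict keyed by accession, skipping records that have none.

    Keying by strain name would silently overwrite resequenced strains that
    share a name; accessions are unique per GenBank entry.
    """
    indexed = {}
    for record in records:
        accession = (record.get("accession") or "").strip()
        if not accession:
            continue  # mirrors removing entries where the accession is not found
        indexed[accession] = record
    return indexed
```

Two records that share a strain name but have different accessions both survive in this index, whereas a strain-keyed dict would keep only the last one.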

Ingest: Compromise by duplicating scripts

269837e

Compromise by duplicating scripts from monkeypox until a generalized pathogen repository exists or these scripts get folded into augur subcommands

Ingest: Replace monkeypox text and parameters with dengue in scripts

cc8731c

Ingest: Compromise by allowing redundant data pull by serotype

0a9a2a5

[wip] attempt at limiting concurrent deploys

641c020

Since fetch_from_genbank can query NCBI up to 5 times for each of the serotypes, try to limit concurrent queries to under 3. Using 2 to be cautious. Following the format shown at: nextstrain/ncov#1045
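
The workflow itself limits concurrency through Snakemake, following nextstrain/ncov#1045. As a language-level illustration of the same idea only, and not the mechanism used in this PR, the sketch below caps concurrent fetches at 2 with a thread pool. The `fetch` callable and the dataset names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_QUERIES = 2  # stay under 3 simultaneous NCBI queries


def fetch_all(fetch, datasets, limit=MAX_CONCURRENT_QUERIES):
    """Run `fetch(dataset)` for every dataset, with at most `limit` running at once.

    Returns a dict mapping each dataset name to its fetch result.
    """
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # pool.map preserves input order, so zip pairs names with their results.
        return dict(zip(datasets, pool.map(fetch, datasets)))
```

Because the pool never has more than `limit` workers, the remaining datasets simply wait in the queue until a worker is free.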

Build: parameterize threads in align rule

0ec11a9

Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3, denv4), reverted this rule to the original code of using 1 thread. However, added a threads parameter in the align rule so that this is easy to modify.

docs: Add documentation on running ingest

dec2ec3

fix: wildcards paired with optional.yaml

e80947d

cleanup some unused metadata columns

d25a3e7

mark temp intermediate files

a309e5f

Switch to augur curate format-dates

5c6baf4

Switch to augur curate titlecase

f82298f

Use "accession" column as ID column directly

ac34243

fixup! Use "accession" column as ID column directly

9dbccc4

j23414 force-pushed the new_ingest branch from a6df20f to 9dbccc4 Compare August 19, 2023 01:43

This was referenced Oct 11, 2023

Use "accession" column as ID column #12

Merged

Copy ingest #13

Merged

Ignore snakemake state dir for current and subfolders #14

Merged

j23414 mentioned this pull request Dec 12, 2023

Nextclade assignment #16

Merged

2 tasks

j23414 closed this May 23, 2024

j23414 deleted the new_ingest branch May 23, 2024 22:35
