Ingest #6

j23414 · 2022-12-12T18:34:45Z

Description of proposed changes

See commit messages.

Related issue(s)

Testing

Checks pass

If reviewer wants to run a local check as documented in ingest/README.md and README.md:

# Checkout branch
git clone https://github.com/nextstrain/ebola.git
cd ebola
git checkout ingest

# Test Ingest
cd ingest
nextstrain build .
ls -ltr data

# Test build
cd ..
nextstrain build .
nextstrain view auspice

huddlej

Thank you for putting this ingest PR together so quickly, @j23414. I'm sorry it took me so long to review this more closely. Let me know if you want to chat synchronously about any of the comments. Even though there are some bigger topics that fall outside the scope of this PR (e.g., How to best reuse ingest logic across pathogen repos?), there seem to be simple local solutions we can make in the short term.

ingest/workflow/snakemake_rules/transform.smk

Snakefile

ingest/workflow/snakemake_rules/transform.smk

ingest/README.md

huddlej · 2022-12-14T19:45:30Z

ingest/README.md

+nextstrain build . --configfiles config/config.yaml config/optional.yaml
+```
+
+### Adding new sequences not from GenBank


I tested this approach using sequences we have in fauna.

# From `master` of ebola repo: snakemake -j 1 -p data/ebola.fasta # From ingest directory of ingest branch: seqkit sample -n 50 ../data/ebola.fasta > local_ebola.fasta wget https://raw.githubusercontent.com/nextstrain/monkeypox/master/ingest/bin/fasta-to-ndjson -O bin/fasta-to-ndjson chmod 755 bin/fasta-to-ndjson ./bin/fasta-to-ndjson \ --fasta local_ebola.fasta \ --fields strain virus accession collection_date region country division location source locus authors url title journal puburl \ --separator "|" > data/local_ebola.ndjson

These steps all ran as expected. When I tried to dry-run the ingest after modifying the config file to include this new source, I got the following error:

$ nextstrain build . -np Building DAG of jobs... Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths. MissingInputException in line 49 of /Users/jlhudd/projects/nextstrain/ebola/ingest/workflow/snakemake_rules/fetch_sequences.smk: Missing input files for rule fetch_all_sequences: data/local_ebola_all.ndjson

I tried renaming the NDJSON file I created above to data/local_ebola_all.ndjson and ran the workflow again. It ran as expected. In cases where the strain names were different between fauna and GenBank for the same accession (e.g., Ebolavirus/H_sapiens_wt/SLE/2014/Makona_J0005 and J0005 for accession KP759628), both records appeared in the output metadata and sequences. While we can disambiguate the metadata records by strain name, the sequences use only the accession, so we end up with duplicate records in sequences.

These duplicate sequences cause augur filter to exit with an error (with a list of duplicate strains) in the top-level workflow. This outcome is probably close to what we would expect, since we want users to know duplicates exist and handle these accordingly. It might be nicer for users for the ingest to warn them about duplicates, but I don't think we have any tools to check for dups currently, do we?

Regarding the issue with the NDJSON filename mismatch above, we could probably address this problem by removing the serotype wildcard from the sources, since we plan to collect all Ebola serotypes into a single pair of metadata and sequence files and split by serotype in the Nextstrain workflow anyway.

Regarding the issue with the NDJSON filename mismatch above,

Done with commit: fc12467

"I don't think we have any tools to check for dups currently, do we?

Would be wonderful if we did...I tend to use smof uniq , uniq_merge.py, or write up a one-off perl scripts to remove duplicate sequences. There are many possible ways to remove sequence duplicates. Was there a de-dups rule you had in mind? Maybe from a nextstrain repo?

While we can disambiguate the metadata records by strain name

I've seen some gnarly strain names in my time (and don't get me started on resequenced Isolates, or sequenced segments from the same strain)... so highly recommend keeping with the Monkeypox method of IDs by Accession. I see you're checking old fauna datasets.... My method of checking fauna was writing a quick script to replace

> strain_name

with its associated:

>Accession_number

I can spec out a quick perl/python script to do the swap where it (1) reads in a metadata file to generate strain_name-to-accession_number hash and (2) read in sequence file to replace strain_name with Accession_number.

In cases where the strain names were different between fauna and GenBank for the same accession (e.g., Ebolavirus/H_sapiens_wt/SLE/2014/Makona_J0005 and J0005 for accession KP759628), both records appeared in the output metadata and sequences.

Ah this is exactly where I use uniq_merge.py to flag accessions with multiple strain names and create a fix-stain-name.pl script. I'll think about it more. Thank you for the feedback.

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)

Discussed in nextstrain/ebola#6 (comment)

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)

Discussed in nextstrain/ebola#6 (comment)

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)

Discussed in nextstrain/ebola#6 (comment)

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)

Discussed in nextstrain/ebola#6 (comment)

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)

Discussed in nextstrain/ebola#6 (comment)

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)

Discussed in nextstrain/ebola#6 (comment)

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)

Discussed in nextstrain/ebola#6 (comment)

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)

Discussed in nextstrain/ebola#6 (comment)

Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) Pick curl instead of wget as discussed in: nextstrain/ebola#6 (comment)

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) Pick curl instead of wget as discussed in: nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. https://www.oreilly.com/library/view/building-maintainable-software/9781491955987/ch04.html Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment) Co-authored-by: John Huddleston <[email protected]>

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. https://www.oreilly.com/library/view/building-maintainable-software/9781491955987/ch04.html Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment) Co-authored-by: John Huddleston <[email protected]>

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

Flags most of the intermediate representations of sequences and metadata as temporary files such that we only keep the final compressed outputs. Deleting the intermediate files reduced the data directory size from 221MB to 584 KB.

Since serotype is annotated as a column in metadata, simplify intermediate filenames like `data/sequences_{serotype}.fasta` and `data/metadata_{serotype}.tsv` to `data/sequences.fasta` and `data/metadata.tsv`.

In ingest/README.md, assume user is within ingest directory.

…ables

This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

huddlej reviewed Dec 14, 2022

View reviewed changes

j23414 force-pushed the ingest branch from 6da615f to bb9421b Compare December 30, 2022 01:11

j23414 pushed a commit to nextstrain/dengue that referenced this pull request Dec 30, 2022

Pick curl or wget based on availability

26dabfd

Discussed in nextstrain/ebola#6 (comment)

j23414 pushed a commit to nextstrain/zika that referenced this pull request Dec 30, 2022

Pick curl or wget based on availability

760973d

Discussed in nextstrain/ebola#6 (comment)

j23414 pushed a commit to nextstrain/dengue that referenced this pull request Mar 25, 2023

Pick curl or wget based on availability

b9ed5f6

Discussed in nextstrain/ebola#6 (comment)

tsibley mentioned this pull request Mar 27, 2023

Generalize Ingest nextstrain/dengue#6

Closed

4 tasks

j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 11, 2023

Pick curl or wget based on availability

e1e6c44

Discussed in nextstrain/ebola#6 (comment)

j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 12, 2023

Pick curl or wget based on availability

3430f36

Discussed in nextstrain/ebola#6 (comment)

j23414 marked this pull request as draft April 12, 2023 18:05

j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 14, 2023

Pick curl or wget based on availability

e759014

Discussed in nextstrain/ebola#6 (comment)

j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023

Pick curl or wget based on availability

571083e

Discussed in nextstrain/ebola#6 (comment)

j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023

Pick curl or wget based on availability

075828e

Discussed in nextstrain/ebola#6 (comment)

j23414 added a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023

fix: Use curl for downloading files

bfd1cbe

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

j23414 added a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023

fix: Use curl for downloading files

501598f

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

j23414 added a commit to nextstrain/dengue that referenced this pull request Apr 18, 2023

fix: Use curl for downloading files

1e592bd

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

j23414 added a commit to nextstrain/dengue that referenced this pull request May 5, 2023

fix: Use curl for downloading files

fb8e749

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

j23414 added a commit to nextstrain/dengue that referenced this pull request May 5, 2023

fix: Use curl for downloading files

bf92771

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

j23414 added a commit to nextstrain/dengue that referenced this pull request Jun 6, 2023

fix: Use curl for downloading files

af25286

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

j23414 force-pushed the ingest branch from bb9421b to cc4316d Compare June 20, 2023 20:47

j23414 added 12 commits June 20, 2023 16:53

copy zika ingest

55e13da

adapt to ebola

0233699

update ebola build to pull new ingest data

5600aa7

refactor: flag intermediate files as temp

477979b

Flags most of the intermediate representations of sequences and metadata as temporary files such that we only keep the final compressed outputs. Deleting the intermediate files reduced the data directory size from 221MB to 584 KB.

refactor: serotypes

e80815f

Since serotype is annotated as a column in metadata, simplify intermediate filenames like `data/sequences_{serotype}.fasta` and `data/metadata_{serotype}.tsv` to `data/sequences.fasta` and `data/metadata.tsv`.

docs: ingest directory

a17b2f0

In ingest/README.md, assume user is within ingest directory.

edit: build accepts and decompresses zst inputs

fad2af9

docs: included a data provisioning section

5c060a0

refactor: used explicit paths instead of references to the rules vari…

a1f36e2

…ables

fixup: change dengue to ebola

5a2c5c4

Pick curl or wget based on availability

6ad1636

j23414 force-pushed the ingest branch from cc4316d to 6ad1636 Compare June 20, 2023 21:55

j23414 added a commit to nextstrain/dengue that referenced this pull request Jun 21, 2023

fix: Use curl for downloading files

2dcbe81

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

j23414 added a commit to nextstrain/dengue that referenced this pull request Aug 19, 2023

fix: Use curl for downloading files

05c181b

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

j23414 added a commit to nextstrain/dengue that referenced this pull request Sep 11, 2023

fix: Use curl for downloading files

55df94e

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ingest #6

Ingest #6

j23414 commented Dec 12, 2022 •

edited

Loading

huddlej left a comment

huddlej Dec 14, 2022

j23414 Jan 4, 2023

Ingest #6

Are you sure you want to change the base?

Ingest #6

Conversation

j23414 commented Dec 12, 2022 • edited Loading

Description of proposed changes

Related issue(s)

Testing

huddlej left a comment

Choose a reason for hiding this comment

huddlej Dec 14, 2022

Choose a reason for hiding this comment

j23414 Jan 4, 2023

Choose a reason for hiding this comment

j23414 commented Dec 12, 2022 •

edited

Loading