-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ingest #6
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for putting this ingest PR together so quickly, @j23414. I'm sorry it took me so long to review this more closely. Let me know if you want to chat synchronously about any of the comments. Even though there are some bigger topics that fall outside the scope of this PR (e.g., How to best reuse ingest logic across pathogen repos?), there seem to be simple local solutions we can make in the short term.
nextstrain build . --configfiles config/config.yaml config/optional.yaml | ||
``` | ||
|
||
### Adding new sequences not from GenBank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this approach using sequences we have in fauna.
# From `master` of ebola repo:
snakemake -j 1 -p data/ebola.fasta
# From ingest directory of ingest branch:
seqkit sample -n 50 ../data/ebola.fasta > local_ebola.fasta
wget https://raw.githubusercontent.com/nextstrain/monkeypox/master/ingest/bin/fasta-to-ndjson -O bin/fasta-to-ndjson
chmod 755 bin/fasta-to-ndjson
./bin/fasta-to-ndjson \
--fasta local_ebola.fasta \
--fields strain virus accession collection_date region country division location source locus authors url title journal puburl \
--separator "|" > data/local_ebola.ndjson
These steps all ran as expected. When I tried to dry-run the ingest after modifying the config file to include this new source, I got the following error:
$ nextstrain build . -np
Building DAG of jobs...
Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths.
MissingInputException in line 49 of /Users/jlhudd/projects/nextstrain/ebola/ingest/workflow/snakemake_rules/fetch_sequences.smk:
Missing input files for rule fetch_all_sequences:
data/local_ebola_all.ndjson
I tried renaming the NDJSON file I created above to data/local_ebola_all.ndjson
and ran the workflow again. It ran as expected. In cases where the strain names were different between fauna and GenBank for the same accession (e.g., Ebolavirus/H_sapiens_wt/SLE/2014/Makona_J0005
and J0005
for accession KP759628
), both records appeared in the output metadata and sequences. While we can disambiguate the metadata records by strain name, the sequences use only the accession, so we end up with duplicate records in sequences.
These duplicate sequences cause augur filter to exit with an error (with a list of duplicate strains) in the top-level workflow. This outcome is probably close to what we would expect, since we want users to know duplicates exist and handle these accordingly. It might be nicer for users for the ingest to warn them about duplicates, but I don't think we have any tools to check for dups currently, do we?
Regarding the issue with the NDJSON filename mismatch above, we could probably address this problem by removing the serotype wildcard from the sources, since we plan to collect all Ebola serotypes into a single pair of metadata and sequence files and split by serotype in the Nextstrain workflow anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the issue with the NDJSON filename mismatch above,
Done with commit: fc12467
"I don't think we have any tools to check for dups currently, do we?
Would be wonderful if we did...I tend to use smof uniq , uniq_merge.py, or write up a one-off perl scripts to remove duplicate sequences. There are many possible ways to remove sequence duplicates. Was there a de-dups rule you had in mind? Maybe from a nextstrain repo?
While we can disambiguate the metadata records by strain name
I've seen some gnarly strain names in my time (and don't get me started on resequenced Isolates, or sequenced segments from the same strain)... so highly recommend keeping with the Monkeypox method of IDs by Accession. I see you're checking old fauna datasets.... My method of checking fauna was writing a quick script to replace
> strain_name
with its associated:
>Accession_number
I can spec out a quick perl/python script to do the swap where it (1) reads in a metadata file to generate strain_name-to-accession_number hash and (2) read in sequence file to replace strain_name
with Accession_number
.
In cases where the strain names were different between fauna and GenBank for the same accession (e.g., Ebolavirus/H_sapiens_wt/SLE/2014/Makona_J0005 and J0005 for accession KP759628), both records appeared in the output metadata and sequences.
Ah this is exactly where I use uniq_merge.py to flag accessions with multiple strain names and create a fix-stain-name.pl script. I'll think about it more. Thank you for the feedback.
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment)
Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) Pick curl instead of wget as discussed in: nextstrain/ebola#6 (comment)
Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) Pick curl instead of wget as discussed in: nextstrain/ebola#6 (comment)
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) Pick curl instead of wget as discussed in: nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. https://www.oreilly.com/library/view/building-maintainable-software/9781491955987/ch04.html Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment) Co-authored-by: John Huddleston <[email protected]>
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. https://www.oreilly.com/library/view/building-maintainable-software/9781491955987/ch04.html Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment) Co-authored-by: John Huddleston <[email protected]>
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
Flags most of the intermediate representations of sequences and metadata as temporary files such that we only keep the final compressed outputs. Deleting the intermediate files reduced the data directory size from 221MB to 584 KB.
Since serotype is annotated as a column in metadata, simplify intermediate filenames like `data/sequences_{serotype}.fasta` and `data/metadata_{serotype}.tsv` to `data/sequences.fasta` and `data/metadata.tsv`.
In ingest/README.md, assume user is within ingest directory.
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
Description of proposed changes
See commit messages.
Related issue(s)
Testing
If reviewer wants to run a local check as documented in ingest/README.md and README.md: