Create and upload partitioned datasets by year-month, clade, continent #398
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
""" | ||
Creates partitioned datasets: | ||
- by year_month | ||
- by clade | ||
- by continent | ||
""" | ||
|
||
|
||
rule metadata_by_year_month: | ||
input: | ||
"data/{database}/metadata.tsv", | ||
output: | ||
"data/{database}/metadata_year-month_{year}-{month}.tsv", | ||
shell: | ||
""" | ||
tsv-filter -H --istr-in-fld "date:{wildcards.year}-{wildcards.month}" {input} > {output} | ||
""" | ||
|
||
|
||
rule metadata_by_clade: | ||
input: | ||
"data/{database}/metadata.tsv", | ||
output: | ||
"data/{database}/metadata_clade_{clade}.tsv", | ||
shell: | ||
""" | ||
tsv-filter -H --istr-in-fld "Nextstrain_clade:{wildcards.clade}" {input} > {output} | ||
""" | ||
|
||
|
||
rule metadata_by_continent: | ||
input: | ||
"data/{database}/metadata.tsv", | ||
output: | ||
"data/{database}/metadata_region_{continent}.tsv", | ||
shell: | ||
""" | ||
tsv-filter -H --istr-eq "region:{wildcards.continent}" {input} > {output} | ||
""" | ||
|
||
rule sequences_by_metadata: | ||
input: | ||
sequences="data/{database}/sequences.fasta", | ||
metadata="data/{database}/metadata_{partition}.tsv", | ||
output: | ||
sequences="data/{database}/sequences_{partition}.fasta", | ||
strains=temp("data/{database}/strains_{partition}.txt"), | ||
shell: | ||
""" | ||
tsv-select -H -f strain {input.metadata} > {output.strains} | ||
seqkit grep -f {output.strains} {input.sequences} > {output.sequences} | ||
Review comment: If it's at all possible
Reply: Good point
||
""" |
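For readers unfamiliar with the two command-line tools, the logic of the `sequences_by_metadata` rule can be sketched in plain Python. This is a minimal illustration, not the workflow's implementation: the function names and the sample data are hypothetical, and it assumes sequence IDs are the header text up to the first whitespace.

```python
import csv
import io


def strains_from_metadata(metadata_tsv: str) -> set[str]:
    """Collect the values of the `strain` column from TSV text."""
    reader = csv.DictReader(
        io.StringIO(metadata_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {row["strain"] for row in reader}


def filter_fasta(fasta: str, keep: set[str]) -> str:
    """Keep only FASTA records whose ID is in `keep`.

    Assumption: the ID is the header text up to the first whitespace.
    """
    kept_lines = []
    keeping = False
    for line in fasta.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            keeping = bool(header) and header[0] in keep
        if keeping:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")


# Hypothetical sample data, invented for illustration.
metadata = "strain\tdate\nA/1\t2021-01-05\nB/2\t2021-01-20\n"
fasta = ">A/1\nACGT\n>C/3\nGGGG\n>B/2 extra\nTTTT\n"
subset = filter_fasta(fasta, strains_from_metadata(metadata))
```

Streaming the FASTA line by line keeps memory use proportional to the strain list rather than to the sequence file, which is the same reason the rule writes the strain names to a temporary file first.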
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,8 @@ These output files are empty flag files to force Snakemake to run the upload rul | |
Note: we are doing parallel uploads of zstd compressed files to slowly make the transition to this format. | ||
""" | ||
|
||
import datetime | ||
|
||
def compute_files_to_upload(): | ||
""" | ||
Compute files to upload | ||
|
@@ -33,6 +35,30 @@ def compute_files_to_upload(): | |
"aligned.fasta.zst": f"data/{database}/aligned.fasta", | ||
"nextclade_21L.tsv.zst": f"data/{database}/nextclade_21L.tsv", | ||
} | ||
|
||
now = datetime.datetime.now() | ||
months_since_2020_01 = {f"{year}-{month:02d}" for year in range(2020, now.year+1) for month in range(1, 12+1) if year < now.year or month <= now.month} | ||
regions={"europe", "north-america", "south-america", "asia", "africa", "oceania"} | ||
|
||
max_per_year = {"19": "B", "20":"K", "21":"M", "22":"F","23":"A"} | ||
Review comment: This would need to get bumped after every new clade in the current year?
Reply: In the current implementation yes, but it could be automated based on clades.tsv; this is just a first attempt at showing how it could work.
||
clades = set() | ||
for year, max_letter in max_per_year.items(): | ||
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ": | ||
if letter > max_letter: | ||
break | ||
clades.add(f"{year}{letter}") | ||
|
||
for clade in clades: | ||
files_to_upload[f"metadata_{clade}.tsv.zst"] = f"data/{database}/metadata_clade_{clade}.tsv" | ||
files_to_upload[f"sequences_{clade}.fasta.zst"] = f"data/{database}/sequences_clade_{clade}.fasta" | ||
|
||
for region in regions: | ||
files_to_upload[f"metadata_{region}.tsv.zst"] = f"data/{database}/metadata_region_{region}.tsv" | ||
files_to_upload[f"sequences_{region}.fasta.zst"] = f"data/{database}/sequences_region_{region}.fasta" | ||
|
||
for year_month in months_since_2020_01: | ||
files_to_upload[f"metadata_{year_month}.tsv.zst"] = f"data/{database}/metadata_year-month_{year_month}.tsv" | ||
files_to_upload[f"sequences_{year_month}.fasta.zst"] = f"data/{database}/sequences_year-month_{year_month}.fasta" | ||
|
||
if database=="genbank": | ||
files_to_upload["biosample.tsv.gz"] = f"data/{database}/biosample.tsv" | ||
|
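The inline reply above suggests the hard-coded `max_per_year` mapping could be replaced by reading clades.tsv. A minimal sketch of that idea follows. It assumes the first column of the file holds the clade label and that labels start with a two-digit year plus a letter; the sample rows use placeholder values and are not taken from the real file.

```python
import re

# Hypothetical excerpt with placeholder gene/site/alt values.
# The real file's columns and clade labels may differ.
CLADES_TSV = (
    "clade\tgene\tsite\talt\n"
    "19A\tnuc\t1\tA\n"
    "19B\tnuc\t2\tC\n"
    "20A\tnuc\t3\tG\n"
    "21L (Omicron)\tnuc\t4\tT\n"
    "21L (Omicron)\tnuc\t5\tA\n"
)

# Two-digit year followed by one uppercase letter, e.g. "21L".
CLADE_NAME = re.compile(r"^(\d{2}[A-Z])\b")


def clades_from_definitions(tsv_text: str) -> set[str]:
    """Return the distinct year-letter clade names found in the first column."""
    clades = set()
    for line in tsv_text.splitlines()[1:]:  # skip the header row
        label = line.split("\t", 1)[0]
        match = CLADE_NAME.match(label)
        if match:
            clades.add(match.group(1))
    return clades


clades = clades_from_definitions(CLADES_TSV)
```

With a set like this, the nested loop over years and letters would no longer be needed, and a new clade would be picked up as soon as it appears in the definitions file.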
Review comment:
This splits into single clades, so I guess for the 21L-rooted builds we'd have to list as inputs 21L, 22A, 22B, and so on? And that will need continual updating, then, right?
If we end up using multiple inputs, we will start commonly running into long-standing issues with the very poor memory efficiency of the ncov workflow's "combine metadata" step. We avoid that now in our production runs because we only pass a single input.
Separately, a substring condition seems a bit imprecise and fragile.
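To make the concern about substring matching concrete, the sketch below compares a case-insensitive substring test with a prefix test on a few date strings. The helper names and all values are invented for illustration, including the malformed one; it does not claim that such values occur in the real metadata.

```python
def matches_substring(value: str, pattern: str) -> bool:
    """Case-insensitive 'pattern occurs anywhere in value'."""
    return pattern.lower() in value.lower()


def matches_prefix(value: str, pattern: str) -> bool:
    """Case-insensitive 'value starts with pattern'."""
    return value.lower().startswith(pattern.lower())


# Hypothetical date values; "12021-01-05" is a deliberately malformed example.
dates = ["2021-01-15", "2021-01-XX", "2020-12-31", "12021-01-05"]

by_substring = [d for d in dates if matches_substring(d, "2021-01")]
by_prefix = [d for d in dates if matches_prefix(d, "2021-01")]
```

Here the substring test also accepts the malformed value, while the prefix test accepts only values that begin with the requested year and month.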
Reply:
We can generate that list programmatically as well if desired, because clade definitions are now hierarchical.
We don't update clades that often, so it wouldn't be much of a headache to update the list explicitly every time.
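Generating the input list for a build rooted at a given clade could look like the sketch below. It assumes a child-to-parent mapping is available from the hierarchical clade definitions; the `PARENT_OF` mapping here is a hand-written, illustrative stand-in and should not be read as the authoritative clade tree.

```python
from collections import defaultdict

# Illustrative child -> parent relationships (an assumption, not the real definitions).
PARENT_OF = {
    "21K": "21M",
    "21L": "21M",
    "22A": "21L",
    "22B": "21L",
    "22C": "21L",
    "22E": "22B",
}


def descendants(root: str, parent_of: dict[str, str]) -> set[str]:
    """Return `root` plus every clade that descends from it."""
    children = defaultdict(set)
    for child, parent in parent_of.items():
        children[parent].add(child)

    found = set()
    stack = [root]
    while stack:
        clade = stack.pop()
        if clade not in found:
            found.add(clade)
            stack.extend(children[clade])
    return found


inputs_for_21L_build = sorted(descendants("21L", PARENT_OF))
```

Each resulting name would map to one partitioned metadata and sequence file, so the multiple-input memory concern raised in the review would still apply to the ncov workflow's "combine metadata" step.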