Create and upload partitioned datasets by year-month, clade, continent #398
base: master
Conversation
I'd think this would cause quite an increase in storage usage, both transiently in ncov-ingest workflow runs and permanently in S3.
shell:
    """
    tsv-select -H -f strain {input.metadata} > {output.strains}
    seqkit grep -f {output.strains} {input.sequences} > {output.sequences}
If it's at all possible `strain` will contain spaces (and I think it is), then you'll want `seqkit grep`'s `--by-name` (`-n`) option.
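A minimal Python sketch of the behavior being flagged here (this stands in for `seqkit grep` and is not its actual implementation): by default, FASTA record IDs are taken as everything up to the first whitespace, so a strain name containing a space will never match an ID-based lookup, while full-name matching (what `--by-name`/`-n` enables) will.

```python
def fasta_id(header: str) -> str:
    """Default ID parsing: everything up to the first whitespace."""
    return header.split()[0]

def matches(header: str, wanted: set[str], by_name: bool = False) -> bool:
    """Mimic ID-based vs. full-name matching of a FASTA header."""
    key = header if by_name else fasta_id(header)
    return key in wanted

# Hypothetical strain name containing a space:
strains = {"hCoV-19/USA/CA-1/2020 extra"}

# ID-based matching truncates at the space, so the lookup fails...
assert not matches("hCoV-19/USA/CA-1/2020 extra", strains)
# ...while full-name matching succeeds.
assert matches("hCoV-19/USA/CA-1/2020 extra", strains, by_name=True)
```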
Good point
"data/{database}/metadata_clade_{clade}.tsv",
shell:
    """
    tsv-filter -H --istr-in-fld "Nextstrain_clade:{wildcards.clade}" {input} > {output}
This splits into single clades, so I guess for the 21L-rooted builds we'd have to list as inputs 21L, 22A, 22B, and so on? And that will need continual updating, then, right?
If we end up using multiple inputs, we will start commonly running into long-standing issues with the very poor memory efficiency of the ncov workflow's "combine metadata" step. We avoid that now in our production runs because we only pass a single input.
Separately, a substring condition seems a bit imprecise and fragile.
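To illustrate the fragility concern: `tsv-filter --istr-in-fld` is a case-insensitive substring test, so it would also match any field value that merely *contains* the query. A small sketch (the clade value "121L" is hypothetical, used only to show the failure mode; an exact match on the short clade name would be stricter):

```python
def istr_in_fld(field_value: str, query: str) -> bool:
    """Mimic tsv-filter's --istr-in-fld: case-insensitive substring test."""
    return query.lower() in field_value.lower()

# The substring test matches the intended clade label...
assert istr_in_fld("21L (Omicron)", "21L")
# ...but also any value that merely contains the string,
# e.g. a hypothetical clade named "121L".
assert istr_in_fld("121L", "21L")

def clade_equals(field_value: str, query: str) -> bool:
    """Stricter alternative: exact match on the short clade name."""
    short = field_value.split()[0] if field_value else ""
    return short == query

assert clade_equals("21L (Omicron)", "21L")
assert not clade_equals("121L", "21L")
```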
We can generate that list programmatically as well if desired, because clade definitions are now hierarchical.
Since we don't update clades that often, it also wouldn't be much of a headache to update the list explicitly each time.
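A sketch of what generating the list programmatically could look like, assuming a parent-to-child clade mapping parsed from clades.tsv (the mapping below is hypothetical and abbreviated):

```python
# Hypothetical child -> parent clade mapping, e.g. parsed from clades.tsv.
PARENTS = {
    "22A": "21L",
    "22B": "21L",
    "22C": "21L",
    "22D": "22B",
}

def descendants(root: str) -> set[str]:
    """All clades descending from (and including) the given root clade."""
    out = {root}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENTS.items():
            if parent in out and child not in out:
                out.add(child)
                changed = True
    return out

# The inputs for a 21L-rooted build could then be derived automatically:
assert descendants("21L") == {"21L", "22A", "22B", "22C", "22D"}
```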
months_since_2020_01 = {f"{year}-{month:02d}" for year in range(2020, now.year+1) for month in range(1, 12+1) if year < now.year or month <= now.month}
regions = {"europe", "north-america", "south-america", "asia", "africa", "oceania"}

max_per_year = {"19": "B", "20": "K", "21": "M", "22": "F", "23": "A"}
This would need to get bumped after every new clade in the current year?
In the current implementation, yes, but it could be automated based on clades.tsv. This is just a first attempt at showing how it could work.
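For example, `max_per_year` could be derived from the list of clade names rather than maintained by hand. A sketch, assuming clade names follow the usual two-digit-year-plus-letter pattern (the clade list below is hypothetical and abbreviated):

```python
from collections import defaultdict

# Hypothetical clade names, e.g. parsed from clades.tsv.
CLADES = ["19A", "19B", "20A", "20K", "21L", "21M", "22A", "22F", "23A"]

def max_per_year(names: list[str]) -> dict[str, str]:
    """Highest clade letter seen for each two-digit year prefix."""
    best: dict[str, str] = defaultdict(str)
    for name in names:
        year, letter = name[:2], name[2:]
        best[year] = max(best[year], letter)
    return dict(best)

# Matches the hand-maintained dict in the diff above for these inputs,
# and would pick up new clades automatically.
assert max_per_year(CLADES) == {"19": "B", "20": "K", "21": "M", "22": "F", "23": "A"}
```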
Since partitions are equivalence classes, every partition in total is the same size as what is being partitioned. For S3 that would mean an extra ~1.5 GB for Open (sequences and metadata together) and ~3 GB for GISAID (sequences + metadata) per run. We could disable versioning if we're worried about S3 storage. For storage during the run: yeah, it's inefficient, but only because
Filter rules in the config are applied _after_ subsampling, which poses issues with reliably getting the desired number of sequences. As @trvrb wrote¹:

> In the workflow, the filter rule happens after the subsampling rules. This makes it so that if we ask for say 2560 in a sampling bucket, we'll lose >50% due to filtering out non-21L-descending clades.
>
> This could be solved by padding count targets to compensate, but this is hacky and the numbers will change as time goes on. Or the filter rule could be placed again before subsample, but we moved it afterwards for good reasons.

A few custom rules for the builds allow us to prefilter the full dataset before subsampling. Currently these rules are specific to our GISAID data source, but they could be easily expanded to our Open data sources too. In the future we might also provide clade-partitioned subsets from ncov-ingest², which we could use here instead with some adaptation of the build config.

¹ <#1029 (comment)>
² e.g. <nextstrain/ncov-ingest#398>
Description of proposed changes
We discussed partitioned sequences/metadata at our most recent Nextstrain call.
This PR shows how easy it would be to produce such partitions for metadata and sequences.
A subsample, e.g. 100k, would be similarly easy to create.
Testing
Test results are available for inspection via:
and likewise for the partitions
metadata_22F.tsv.zst
metadata_africa.tsv.zst
etc