Create and upload partitioned datasets by year-month, clade, continent #398
base: master
Conversation
I'd think this would cause quite an increase in storage usage, both transiently in ncov-ingest workflow runs and permanently in S3.
shell:
    """
    tsv-select -H -f strain {input.metadata} > {output.strains}
    seqkit grep -f {output.strains} {input.sequences} > {output.sequences}
If it's at all possible `strain` will contain spaces (and I think it is), then you'll want `seqkit grep`'s `--by-name` (`-n`) option.
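A minimal Python sketch of the behavior being flagged here (this stands in for `seqkit grep` and is not its actual implementation): by default, FASTA record IDs are taken as everything up to the first whitespace, so a strain name containing a space will never match an ID-based lookup, while full-name matching (what `--by-name`/`-n` enables) will.

```python
def fasta_id(header: str) -> str:
    """Default ID parsing: everything up to the first whitespace."""
    return header.split()[0]

def matches(header: str, wanted: set[str], by_name: bool = False) -> bool:
    """Mimic ID-based vs. full-name matching of a FASTA header."""
    key = header if by_name else fasta_id(header)
    return key in wanted

# Hypothetical strain name containing a space:
strains = {"hCoV-19/USA/CA-1/2020 extra"}

# ID-based matching truncates at the space, so the lookup fails...
assert not matches("hCoV-19/USA/CA-1/2020 extra", strains)
# ...while full-name matching succeeds.
assert matches("hCoV-19/USA/CA-1/2020 extra", strains, by_name=True)
```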
Good point
"data/{database}/metadata_clade_{clade}.tsv",
shell:
    """
    tsv-filter -H --istr-in-fld "Nextstrain_clade:{wildcards.clade}" {input} > {output}
This splits into single clades, so I guess for the 21L-rooted builds we'd have to list as inputs 21L, 22A, 22B, and so on? And that will need continual updating, then, right?
If we end up using multiple inputs, we will start commonly running into long-standing issues with the very poor memory efficiency of the ncov workflow's "combine metadata" step. We avoid that now in our production runs because we only pass a single input.
Separately, a substring condition seems a bit imprecise and fragile.
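To illustrate the fragility concern: `tsv-filter --istr-in-fld` is a case-insensitive substring test, so it would also match any field value that merely *contains* the query. A small sketch (the clade value "121L" is hypothetical, used only to show the failure mode; an exact match on the short clade name would be stricter):

```python
def istr_in_fld(field_value: str, query: str) -> bool:
    """Mimic tsv-filter's --istr-in-fld: case-insensitive substring test."""
    return query.lower() in field_value.lower()

# The substring test matches the intended clade label...
assert istr_in_fld("21L (Omicron)", "21L")
# ...but also any value that merely contains the string,
# e.g. a hypothetical clade named "121L".
assert istr_in_fld("121L", "21L")

def clade_equals(field_value: str, query: str) -> bool:
    """Stricter alternative: exact match on the short clade name."""
    short = field_value.split()[0] if field_value else ""
    return short == query

assert clade_equals("21L (Omicron)", "21L")
assert not clade_equals("121L", "21L")
```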
We can generate that list programmatically as well if desired, because clade definitions are now hierarchical.
Since we don't update clades that often, it also wouldn't be much of a headache to update the list explicitly each time.
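A sketch of what generating the list programmatically could look like, assuming a parent-to-child clade mapping parsed from clades.tsv (the mapping below is hypothetical and abbreviated):

```python
# Hypothetical child -> parent clade mapping, e.g. parsed from clades.tsv.
PARENTS = {
    "22A": "21L",
    "22B": "21L",
    "22C": "21L",
    "22D": "22B",
}

def descendants(root: str) -> set[str]:
    """All clades descending from (and including) the given root clade."""
    out = {root}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENTS.items():
            if parent in out and child not in out:
                out.add(child)
                changed = True
    return out

# The inputs for a 21L-rooted build could then be derived automatically:
assert descendants("21L") == {"21L", "22A", "22B", "22C", "22D"}
```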
months_since_2020_01 = {f"{year}-{month:02d}" for year in range(2020, now.year+1) for month in range(1, 12+1) if year < now.year or month <= now.month}
regions = {"europe", "north-america", "south-america", "asia", "africa", "oceania"}

max_per_year = {"19": "B", "20": "K", "21": "M", "22": "F", "23": "A"}
This would need to get bumped after every new clade in the current year?
In the current implementation, yes, but it could be automated based on clades.tsv. This is just a first attempt at showing how it could work.
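For example, `max_per_year` could be derived from the list of clade names rather than maintained by hand. A sketch, assuming clade names follow the usual two-digit-year-plus-letter pattern (the clade list below is hypothetical and abbreviated):

```python
from collections import defaultdict

# Hypothetical clade names, e.g. parsed from clades.tsv.
CLADES = ["19A", "19B", "20A", "20K", "21L", "21M", "22A", "22F", "23A"]

def max_per_year(names: list[str]) -> dict[str, str]:
    """Highest clade letter seen for each two-digit year prefix."""
    best: dict[str, str] = defaultdict(str)
    for name in names:
        year, letter = name[:2], name[2:]
        best[year] = max(best[year], letter)
    return dict(best)

# Matches the hand-maintained dict in the diff above for these inputs,
# and would pick up new clades automatically.
assert max_per_year(CLADES) == {"19": "B", "20": "K", "21": "M", "22": "F", "23": "A"}
```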
Since partitions are equivalence classes, every partition in total is the same size as what is being partitioned. For S3 that would mean an extra ~1.5 GB for Open (sequences and metadata together) and ~3 GB for GISAID (sequences + metadata) per run. We could disable versioning if we're worried about S3 storage. For storage during the run: yeah, it's inefficient, but only because
Filter rules in the config are applied _after_ subsampling, which poses issues with reliably getting the desired number of sequences. As @trvrb wrote¹:

> In the workflow, the filter rule happens after the subsampling rules. This makes it so that if we ask for say 2560 in a sampling bucket, we'll lose >50% due to filtering out non-21L-descending clades.
>
> This could be solved by padding count targets to compensate, but this is hacky and the numbers will change as time goes on. Or the filter rule could be placed again before subsample, but we moved it afterwards for good reasons.

A few custom rules for the builds allow us to prefilter the full dataset before subsampling. Currently these rules are specific to our GISAID data source, but they could be easily expanded to our Open data sources too. In the future we might also provide clade-partitioned subsets from ncov-ingest², which we could use here instead with some adaptation of the build config.

¹ <#1029 (comment)>
² e.g. <nextstrain/ncov-ingest#398>
Description of proposed changes
We discussed partitioned sequences/metadata at our most recent Nextstrain call.
This PR shows how easy it would be to produce such partitions for metadata and sequences.
A subsample, e.g. 100k, would be similarly easy to create.
Testing
Test results are available for inspection via:
and likewise for the partitions
metadata_22F.tsv.zst
metadata_africa.tsv.zst
etc