-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add augur curate
subcommands
#860
Comments
This is a great detailed overview, @huddlej! Thank you for the obvious care in putting it together. One thought for now after a first read: I imagine any given workflow will invoke many of these commands in some sort of combination, the same way our existing workflows combine several of these sorts of steps into bespoke programs. For efficiency, it would be good to ensure that the output of one command can be cleanly and easily piped into the input of another (e.g. Relatedly, it occurs to me that maybe I should give a tour of |
Tom beat me to this, where I had similar response but different possible solution... Thanks so much for the detailed write up. This really helped with my own understanding. I like the unix-y flavor of these small modular commands. However, I suspect in practice that I'd want to chain a bunch of bog-standard cleaning functions for my data from (say) ViPR. It will be somewhat annoying to write out a series of Snakemake rules that go from This feels a bit similar to the issue we have with Maybe you'd end up with a config-driven |
A config file with each step's args/params could absolutely be the "syntax TBD" I mentioned for multiple transforms "chained together in the same process". :-) It would be important that the fields for each step in the config file map without much change to the command line args for that step run separately, so there's not a new set of options to learn and document. One downside of config files is the hazard of embedding filenames into them, as this makes them brittle. For workflows with wildcards, for example, you couldn't use a static config file with filenames in it, but each step might need different filenames (so it might be hard to pass them separately from the config). (Even with config files, I do think actually piping between individual invocations is also important to allow (and not all that hard).) |
Thank you both for pointing this out! I imagined we could support piping between these commands, as a minimal way to support composing transforms. I just updated the issue to note this goal and include an example chained command. p.s. |
How would chaining work for transforming both sequence and metadata? Or is it expected that Also I'd imagine when dealing with very large metadata files, loading into memory repeated during chaining is not a great idea (though I don't know the internals of how An alternative I'd like to propose: allowing multiple transformations in a single command with a
Would be:
Note I used only a single flag to specify the columns to operate on ( One issue is the more complex argument parsing that needs to happen, though I think argparse can handle such cases. Also another issue is multiple actions on the same column, eg titlecase + normalise-strings. |
@ammaraziz Thanks for your comment! It's nice to have more people thinking about this. I think there's a number of downsides to the syntax you propose (some of which you note). My preference is for both multi-process chaining with pipes/files to be supported as well as one or more methods for single-process chaining. I think one of the methods for single-process chaining can be driven by a config file describing each subcommand's normal arguments (see #860 (comment)) instead of the Another method for single-process chaining would be syntax like the following (or similar):
This avoids having to genericize the arguments/options for each subcommand and should be familiar enough at a glance (albeit a bit magic-feeling maybe) to anyone used to the multi-process chaining with pipes. It does introduce potential for confusion around quoting, but that feels surmountable/avoidable in many cases (and there are always alternates; this wouldn't be the only invocation form allowed). |
Thank you @tsibley. I always worry I am intruding in these discussions! A few things to clarify/add. I like generic arguments, but I have noticed the overall style of I too like the idea of config file, excellent idea! However, as Emma mentioned in another threads, supporting more than 1 chaining could just add extra complexity. The piping style mentioned above as an example and by first post is really good. Finally, just to clarify as I may have missed it, will the transformation subcommand also act on the sequences or is it designed specifically to work on the metafile with a separate and possibly final option to modify the fasta file corresponding to the meta file? |
@ammaraziz Not intruding at all. :-) It's good to get feedback and input and questions from outside the core team. To answer your question about handling of sequences, our thinking has been that some of the transform subcommands will definitely have to handle sequences. For example, anything that's massaging strain name in metadata would want to make corresponding updates to the name used for the sequence. It's an interesting idea to batch these up and do them once at the end instead of doing them as the data streams through. I'm not sure how well that'd work with a more modular design of individual commands; it seems like it might make modularity harder. Would be interesting to think through some more though. |
FYI, I built the monkeypox/ingest transform rule as a prototype for this proposed subcommand. The shell pipeline connects small Python scripts that all read NDJSON records from stdin and then output the transformed records to stdout. The NDJSON records contain both metadata and sequences so the final step separates the NDJSON records to a metadata TSV file and a sequence FASTA file. |
Thanks @joverlee521 that example is super helpful :) Using ndjson as the format allows sequences & metadata to be piped together which allows us to change the strain name. Nice! I'm thinking about how we would support data that wasn't row-based (like sequences + metadata are). For instance, given the following dataset:
I've written scripts at various points in time to extract out:
Would this be considered in scope for |
@jameshadfield The examples you've listed seem like very general CSV/TSV manipulations that can be done with other tools (e.g. tsv-utils or csvtk). I don't think we want to reinvent the wheel here with |
On naming: my only thought is not |
Re: names: |
augur curate
subcommand
Reposting my comment from a Slack thread:
Along the same lines as that comment, our SURP intern recently needed to use the transform-date-fields tool from the monkeypox ingest to clean up dates from VIPR metadata. She wanted to run this command like:
But this didn't work, since the script only reads from stdin and only expect to receive NDJSON files. She then thought to use the csv-to-ndjson script to convert her metadata prior to piping into the date script, but her metadata was in TSV format (our standard output from augur parse and elsewhere in our workflows), so she got stuck needing to maunally convert TSV to CSV. Then the output of the transform date script is NDJSON, so she had to convert NDJSON back to TSV. In the end, the command looked something more like:
A couple of UI features appear important in these examples:
Putting those features together with the Slack comment above, I propose that we support explicit input and output flags for each
The use of
This approach of the format flag only works if we can know that each command will only have a single output stream. Maybe that's a reasonable assumption and The current proposed requirement of NDJSON as the only possible input and output format for curate commands would require something like:
The UX cost of the initial and final transforms to/from NDJSON decrease with the number of intermediate commands in the chain. The cost is highest for a single-use transform like the first example above, though. The question is whether we want to charge that UX cost to the user with the NDJSON-only curate commands or whether we want to charge the technical cost to ourselves to provide alternative I/O options for those commands. The HTSLIB (e.g., samtools/bcftools) ecosystem of commands that originally inspired the modular Augur design could be useful examples for this same kind of UI consideration. For example, it's possible to chain bcftools commands with an initial input in standard file format (BAM), intermediate data streamed to stdout in uncompressed BAM format (analogous here to our use of NDJSON), and final data written to an explicit output file in a standard format (VCF). |
I was really psyched about Other options that occurred to me last night included I like I like I'm also still ok with Trying this out with the example commands from my last comment:
|
Missing out on using @joverlee521's great work in other pathogen repos over naming issues is unfortunate. This feels like chicken and egg to me: I'd like to use these in practice to get a feel for what would be good naming, but I can't use them until a name is decided for implementation. If we can't decide on a name now, would it be possible to start with
|
+1 for @huddlej's thoughts on I/O. They match how I imagined the framework that individual commands plug into (regardless of whether Augur-provided or user-provided). Re: +1 for @victorlin's thoughts on naming and getting something out sooner than later. Re: the term curate and the semantic concerns @huddlej raised, I suggested curate in the meaning of data curation which is heavily focused on fixing, formatting, tidying, annotating, augmenting, etc. rather than museum curation or music curation which are indeed more heavily focused on selecting, filtering, etc. Wikipedia says this about data curation
which seems to fit well with what we're doing here. |
Good call on clarifying the "data curation" context ( If no one else is bothered by this, though, I'd say we adopt |
Thank you for the detailed examples @huddlej! I fully support the I/O options for individual commands especially to make one-off calls easier for users. With that in mind, the We would want to have a clean UI for these I/O options, which I think we can achieve with a separate shared base parser (a la https://stackoverflow.com/a/23296874). Re: naming, I'm happy with |
Dumping some thoughts/questions here after talking more about the I/O framework with @tsibley today. If a user provides both a metadata TSV and a FASTA file as inputs, I think we would only ever do an "inner" join that only keeps records that are present in both files. However, I do want to provide the users with an option to dictate how to handle unmatched records. This can be an
With the assumption that records in the two files should be 1:1, I would make Should We already handle this type of FASTA with If they already have metadata file and only want to keep the sequence id in the FASTA header, then they can also use something like It's unclear to me how often people work with this type of FASTA file, so I'm not sure if it would be a huge UX burden to ask users to preprocess FASTA files to only have the sequence id in the header. |
When you refer to I think the option to error, warn or silent is a good idea. I would like a helpful message when error/warn is produce such as:
Or write out which strains are missing metadata or fasta records. Personally, I care more about missing metadata than missing fasta entries, so a warning is really helpful.
Agreed this is solved by |
Thank you for the feedback @ammaraziz!
Yes, I mean records in both files. I agree that there definitely should be a distinction between unmatched metadata records and unmatched fasta records. |
There are cases when the user's input may include multiple sequence records for the same strain id. In the past, this has been an issue with GISAID downloads (caused by multiple submissions of the same strain name that would need to be deduplicated by accession id, if we had it). If we assume that curation happens after deduplication, the inner join assumption seems great. If we expect curation to precede deduplication, though, we need to allow for a left-join with metadata as the source of truth on the left. It is also possible that the user could have multiple entries in the metadata for the same strain id, although I can't remember encountering this myself with standard GISAID or NCBI data. It might be nice to distinguish in the warning/error message to users between missing and duplicate records, to handle this case. Then, we could include a reminder for users to deduplicate their data prior to running the curate commands. |
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. As part of the augur curate suite of commands, `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Should this issue be renamed? As it stands, it's completed per the tile (there's a curate subcommand) but the project may not be done. So maybe let's rename to |
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
The I'm closing this as complete. Any new subcommands or improvements can be requested through new issues. |
Context
Prior to the SARS-CoV-2 pandemic, we assumed that Nextstrain workflows would begin with pre-curated sequences and metadata sourced either from our internal database ("fauna") or custom scripts maintained by individual users. This expectation is reflected in the workflows for Zika, Ebola, and TB. These internal databases allowed us to curate a collection of high-quality sequences and metadata across multiple data sources (e.g., GISAID, ViPR, INSDC, etc.) and start workflows from this “source of truth”.
During development of the ncov workflow, we made an effort to support users outside of the Nextstrain team and accept input data from diverse sources. These input typically need to be standardized to a single format for sequences and metadata and duplicate records resolved prior to running the rest of the workflow. Standardization appears in several places in our codebase as "sanitize" or "transform" steps of ncov-ingest or ncov including:
Despite being written for the ncov workflow, these scripts perform common preprocessing tasks users will need for Nextstrain analyses of any pathogen. Much of the functionality in ncov-ingest scripts overlaps with that of the “sanitize” scripts in the ncov repository. This redundancy highlights the need for these common utilities even for a single pathogen with multiple valid input data sources (in this case, the GISAID API and the GISAID website search/downloads interface).
Some of the closest functionality of this kind lives in
augur parse
as functions to replace “forbidden characters” in strain names, “prettify” values of text fields, and fix date formatting. Asaugur parse
was often the first command in a Nextstrain workflow, these functions fulfilled the “sanitize” or “transform” needs we had even after pulling from a curated database. As Emma pointed out in a previous discussion, users (ourselves included) benefit from learning about data quality issues at the beginning of the workflow instead of later on.Similarly, latitude/longitude processing components of ncov-ingest originate from Emma’s own project to standardize locations and latitude/longitude values for use by Augur. This kind of functionality potentially applies to any pathogen with geographic information.
Description
Given the cross-pathogen nature of the standardization functions described above, this issue proposes a new
augur
subcommand to wrap sub-subcommands corresponding to the most useful sanitize/transform functions defined in the scripts above. The name of this subcommand should be generic enough to encompass the functions it performs but specific enough to clearly communicate what those functions are. Some possible candidates include:This command would implement the following functions from ncov-ingest and ncov scripts (and perhaps from
augur parse
and elsewhere):Some of these functions obviously have many different implementations, so we should determine the best generic implementation for an augur command. Other functions may be too specific to SARS-CoV-2 or even the GISAID API ingest to warrant implementation in augur.
Most of these functions operate on individual records in metadata or sequences, while the deduplication logic operates across the entire input files (adding columns could have arguably the same scope). We may want to implement a separate
augur deduplicate
command to distinguish the scope of these operations. One may need to clean/transform/sanitize input, for example, before one can resolve duplicates.Edit: As noted below, these subcommands should ideally support UNIX-style piping so we can compose/chain multiple subcommands in a single transformation operation.
Examples
A few examples of the functions above as commands (picking
transform
at random):With ideal chaining support, we might be able to do the following:
Proposed initial sub-subcommands
List of proposed initial sub-subcommands, will link individual issues for each sub-subcommand as we plan them out.
The following commands should be straight-forward to implement based on past scripts in ncov-ingest and monkeypox/ingest:
normalize-strings
(augur curate normalize-strings #998)titlecase
(augur curate titlecase #999)format-dates
(augur curate format-dates #1001)apply-geolocation-rules
(augur curate apply-geolocation-rules #1003)apply-record-annotations
The following commands will require more thinking and design. There are implementations in fauna/ncov-ingest/monkepox that can be improved:
format-strain-name
(augur curate format-strain-name #1204)deduplicate
(Subcommand for data de-duplication #919)The text was updated successfully, but these errors were encountered: