Gff annotations #354

Merged
merged 7 commits into from
Aug 15, 2019


jameshadfield commented Aug 15, 2019

Moves feature annotation to a subset of GFF syntax. Closes #187, as future extensions can be added by extending the GFF features we support as needed. This results in changes to the output of augur translate and augur export v2 (see below). The format of mutations is unchanged.

annotations

| where | start | end | strand | seqid | type |
| --- | --- | --- | --- | --- | --- |
| GFF input | 1-based | fully-closed | "+" or "-" | | |
| genbank input | 1-based | fully-closed | "+" or "-" | | |
| augur translate output | 1-based | fully-closed | "+" or "-" | included | included |
| augur export v1 output | 0-based | half-open | 1 or -1 | not included | not included |
| augur export v2 output | 1-based | fully-closed | "+" or "-" | included | included |
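
To make the two coordinate conventions concrete, here is a minimal sketch of converting between the GFF-like form (1-based, fully-closed, strand as "+" or "-") and the BED-like form used by `augur export v1` (0-based, half-open, strand as 1 or -1). The function names `gff_to_bed_like` and `bed_like_to_gff` are hypothetical and are not part of augur's API.

```python
def gff_to_bed_like(start, end, strand):
    """Convert 1-based fully-closed coordinates with "+"/"-" strand
    to 0-based half-open coordinates with 1/-1 strand.

    Hypothetical helper for illustration; not augur's implementation.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    # Only the start shifts: the inclusive 1-based end equals the
    # exclusive 0-based end.
    return start - 1, end, 1 if strand == "+" else -1


def bed_like_to_gff(start, end, strand):
    """Inverse of gff_to_bed_like (also a hypothetical helper)."""
    if strand not in (1, -1):
        raise ValueError(f"unexpected strand: {strand!r}")
    return start + 1, end, "+" if strand == 1 else "-"
```

Either way the feature length is preserved: `end - start + 1` in the fully-closed form equals `end - start` in the half-open form.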

mutations

nt & aa are both 1-based

rneher and others added 7 commits August 12, 2019 14:17
…ncluding specifying strand as +/- and going to 1-based locations.
…in translate. It might be more sensible to move this into export in case no translations are done.
Output from `augur translate` and `augur export v2` is GFF-like. `augur export v1` produces BED-like coordinates. See JSON schemas for details.
rneher commented Aug 15, 2019

With a few minor changes, we can also make augur use annotations like this:

KX369547.1		genome	1	10769			0
KX369547.1		gene	90	456		+	0	gene "CA";
KX369547.1		gene	456	735		+	0	gene "PRO";
KX369547.1		gene	735	960		+	0	gene "MP";
KX369547.1		gene	960	2472		+	0	gene "ENV";
KX369547.1		gene	2472	3528		+	0	gene "NS1";
KX369547.1		gene	3528	4206		+	0	gene "NS2A";
KX369547.1		gene	4206	4596		+	0	gene "NS2B";
KX369547.1		gene	4596	6447		+	0	gene "NS3";
KX369547.1		gene	6447	6828		+	0	gene "NS4A";
KX369547.1		gene	6828	6897		+	0	gene "2K";
KX369547.1		gene	6897	7650		+	0	gene "NS4B";
KX369547.1		gene	7650	10359		+	0	gene "NS5";

This already works for VCF, but it would be quite straightforward to also allow this for FASTA alignments. The reference sequence could then be supplied as FASTA or simply by name, and we wouldn't need to mess around with GenBank files anymore. This GFF/TSV is much easier to edit.
(It is also straightforward to parse, so we could dump BioGFF and just parse it ourselves.)
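
As a rough illustration of how little parsing a tidy file like the one above needs, here is a minimal sketch in plain Python. It assumes the nine standard tab-separated GFF columns, tolerates a missing trailing attributes column (as on the `genome` line), and reads attributes in the `key "value";` style used in the example. The names `parse_gff_line` and `parse_gff` are hypothetical, and nothing here is augur's actual parser; messier real-world GFF files would need more care.

```python
import re

# Standard GFF column order; the example above leaves source and score empty.
GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")

# Matches attributes written as: key "value";  (the style in the example)
ATTRIBUTE_PATTERN = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gff_line(line):
    """Parse one tab-separated line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so a missing attributes column becomes an empty string.
    fields += [""] * (len(GFF_COLUMNS) - len(fields))
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    record["attributes"] = dict(ATTRIBUTE_PATTERN.findall(record["attributes"]))
    return record


def parse_gff(text):
    """Parse all non-empty, non-comment lines of a GFF-like string."""
    return [
        parse_gff_line(line)
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
```

Called on the first two lines of the example, `parse_gff` yields one record of type `genome` with no attributes and one record of type `gene` whose attributes contain `gene` mapped to `CA`.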

@emmahodcroft
I'd def like to keep supporting GenBank though; at least for me, that's the fastest way to get annotations for something new. GenBank's GFF export doesn't seem to capture all features. But I agree that if you're working on something long-term (or sufficiently in need of editing), this would be an easier format to edit and maintain.

Definitely try out any new GFF parser on the TB GFF & others before dumping BioGFF for them - they can be much less tidy than your example! 😅
