A Japanese version of this document is available here.
[TOC]
- What is this?
- How accurate is the conversion?
- Initial setup
- Create DDBJ annotation from GFF3 and FASTA
- Under the Hood
- Customize the behavior
- Troubleshooting
- Known Issues
- Credit
Table of contents generated with markdown-toc
GFF3-to-DDBJ creates the annotation file for submission to DDBJ by taking GFF3 and FASTA files as input. It also works with FASTA alone.
Analogous programs are table2asn and GAG for submissions to NCBI, and EMBLmyGFF3 for submissions to EMBL.
Please take a look at our test directory for examples. Files ending with .ann are the DDBJ annotation files produced by this program.
A DDBJ annotation file must comply with many rules, but it is difficult to say what the correct GFF3 → DDBJ conversion is, and there are no fully functional examples of such a conversion either. We therefore treat the GFF3-GenBank correspondence in RefSeq as the "correct" examples. (Where instructions differ, we take the DDBJ side.) To evaluate GFF3-to-DDBJ, we take RefSeq data and compare the gff3-to-ddbj output with the DDBJ annotation that genbank-to-ddbj produces from the GenBank format. Please take a look at our evaluation document for the details as well as the current status. ([TODO] Add the page...)
Here genbank-to-ddbj is an executable included in this package. It shares its codebase with gff3-to-ddbj, but we believe this does not complicate our evaluation because its internals are much simpler.
We also use DDBJ's Parser to check the annotation files.
To install from bioconda:
# Create a conda environment named "ddbj", and install relevant packages from the bioconda channel
conda create -n ddbj -c bioconda -c conda-forge gff3toddbj
# Activate the environment "ddbj"
conda activate ddbj
To install with pip:
# Create a conda environment named "ddbj" and install pip
conda create -n ddbj pip
# Activate the environment "ddbj"
conda activate ddbj
# Need bgzip executable in samtools
conda install -c bioconda samtools
# Install from pip
pip install gff3toddbj
To install from the source:
# Download
wget https://github.com/yamaton/gff3_to_ddbj/archive/refs/heads/main.zip
# Extract, rename, and change directory
unzip main.zip && mv gff3toddbj-main gff3toddbj && cd gff3toddbj
# Create a conda environment named "ddbj"
conda create -n ddbj
# Activate the environment "ddbj"
conda activate ddbj
# Install dependencies to "ddbj"
conda install -c bioconda -c conda-forge biopython bcbio-gff toml pysam samtools pip build
# Install gff3-to-ddbj and extra tools
python -m build && pip install -e ./
Let's run the main program to get some ideas.
gff3-to-ddbj \
--gff3 myfile.gff3 \ # bare-minimum output if omitted
--fasta myfile.fa \ # <<REQUIRED>>
--metadata mymetadata.toml \ # example metadata used if omitted
--locus_tag_prefix MYOWNPREFIX_ \ # default is "LOCUSTAGPREFIX_"
--transl_table 1 \ # default is 1
--output myawesome_output.ann # standard output if omitted
Here are the options:
- --gff3 <FILE> takes a GFF3 file
- --fasta <FILE> takes a FASTA file
- --metadata <FILE> takes the metadata file in TOML
- --locus_tag_prefix <STRING> takes the prefix of the locus tag obtained from BioSample. You can skip this for now.
- --transl_table <INT>: Choose the appropriate one from The Genetic Codes. The default value is 1 ("standard").
- --output <FILE> sets the path of the annotation output.
Here is the list of operations gff3-to-ddbj will do:
- Re-compress FASTA with bgzip if the FASTA input is compressed with gzip
  - A bgzip file such as myfile_bgzip.fa.gz is created if absent.
  - This is for indexing and saving memory.
  - The bgzip file should be compatible with gzip.
- Rename features and qualifiers following the renaming rules defined here.
  - This is the core function of gff3-to-ddbj.
  - The rules are based on Sequence Ontology plus real-world examples.
  - For example, transcript in the 3rd column (called "type") is translated to the misc_RNA feature because SO:0000673 sets "INSDC_feature:misc_RNA".
- Search for assembly_gaps in FASTA, and add the feature.
- Add /transl_table to each CDS.
- Insert source information from the metadata file.
- Insert TOPOLOGY feature if GFF3 has Is_circular=true in an entry.
  - Also handle origin-spanning features.
- Join locations of features having the same parent with join notation.
  - CDS, exon, mat_peptide, V_segment, C_region, D-loop, and misc_feature may be joined.
  - exons are NOT joined if they have gene as the direct parent.
- Set the location of joined exons as their parent RNA's location, and discard the exons.
- Add partialness markup (< and >) to CDS locations if the start/stop codon is absent.
- Let CDS have a single /product value: set it to "hypothetical protein" if absent, and move the rest of the existing values to /note.
  - This is to conform to the instruction on /product:
    - Even if there are multiple general names for the same product, do not enter multiple names in 'product'. Do not use needless symbolic letters as delimiter for multiple names. If you would like to describe more than two names, please enter one of the most representative name in /product qualifier, and other(s) in /note qualifier.
    - If the name and function are not known, we recommend to describe as "hypothetical protein".
- If a gene feature has /gene and/or /gene_synonym, copy these qualifiers to its children.
- Make /gene have a single value, and put the rest in /gene_synonym.
  - Reference: Definition of Qualifier key: /gene.
- Remove duplicates in qualifier values.
- Sort lines in annotation
  - Sort is based on the key (start position, priority of features, end position)
  - The priorities are defined in gff3toddbj/gff3toddbj/transforms.py (line 763 in 1cea725).
  - source and TOPOLOGY go to the top.
- Filter features and qualifiers following the matrix.
  - gene feature will be discarded in this process.
  - Discarded features and discarded feature-qualifier pairs are reported on standard error at execution. They look like the following:
WARNING: [Discarded] feature -------> gene <------- (count: 49911)
WARNING: [Discarded] feature -------> cDNA_match <------- (count: 10692)
WARNING: [Discarded] feature -------> match <------- (count: 101)
WARNING: [Discarded] feature -------> sequence_conflict <------- (count: 81)
WARNING: [Discarded] (Feature, Qualifier) = (source, db_xref) (count: 687)
WARNING: [Discarded] (Feature, Qualifier) = (source, Name) (count: 687)
WARNING: [Discarded] (Feature, Qualifier) = (source, gbkey) (count: 687)
WARNING: [Discarded] (Feature, Qualifier) = (source, genome) (count: 685)
WARNING: [Discarded] (Feature, Qualifier) = (mRNA, Parent) (count: 57304)
WARNING: [Discarded] (Feature, Qualifier) = (mRNA, db_xref) (count: 114608)
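To make the assembly_gap search step in the list above more concrete, here is a minimal Python sketch that locates runs of N in a sequence. It is not the code gff3-to-ddbj uses; the function name `find_gaps` is hypothetical, and the sketch assumes that a gap is any run of one or more `N`/`n` characters.

```python
import re

# Assumption for this sketch: a gap is any run of one or more N/n characters.
GAP_PATTERN = re.compile(r"[Nn]+")


def find_gaps(sequence):
    """Return 1-based, inclusive (start, end) positions of runs of N."""
    return [(m.start() + 1, m.end()) for m in GAP_PATTERN.finditer(sequence)]


def gap_length(gap):
    """Number of N's in a gap given as a 1-based inclusive (start, end) pair."""
    start, end = gap
    return end - start + 1


gaps = find_gaps("ACGTNNNNACGTnnA")
print(gaps)
print([gap_length(g) for g in gaps])
```

Each (start, end) pair can be written as a location of the form start..end, and the length is the kind of value one would expect when the number of N's is counted.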
To enter information missing in GFF3 or FASTA, such as submitter names and certain qualifier values, you need to feed a metadata file in TOML, say mymetadata.toml. Take a look at an example matching the example annotation in the DDBJ page.
The file accommodates the following items, and they are all optional. That is, GFF3-to-DDBJ works even with an empty file.
- Basic features in the COMMON entry
  - ... such as SUBMITTER, REFERENCE, and COMMENT.
- "meta-description" in the COMMON entry
  - DDBJ annotation supports "meta" values with features under COMMON such that the items are inserted into each occurrence in the resulting flat file produced by DDBJ. Here is an example that inserts an assembly_gap feature under the COMMON entry:
    [COMMON.assembly_gap]
    estimated_length = "unknown"
    gap_type = "within scaffold"
    linkage_evidence = "paired-ends"
- Feature-qualifier items inserted to each occurrence
  - Here is an example. The only difference from the previous case is [assembly_gap] as opposed to [COMMON.assembly_gap].
    [assembly_gap]
    estimated_length = "unknown"  # Set it "<COMPUTE>" to count the number of N's
    gap_type = "within scaffold"
    linkage_evidence = "paired-ends"
  - While this should work effectively the same as the "meta-description" item above, use this notation if you want the values inserted repeatedly in the annotation file produced by GFF3-to-DDBJ.
  - Currently supporting [source] and [assembly_gap] only.
If a metadata file is not specified via the --metadata option, a tentative fallback configuration here is loaded.
For more examples, see annotation examples provided by DDBJ, such as WGS in COMMON and WGS, and the corresponding metadata files metadata_WGS_COMMON.toml and metadata_WGS.toml in this repository.
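The difference between the two notations is easy to see once the TOML is parsed. The following Python sketch only shows how the two snippets above look as parsed data; it says nothing about how gff3-to-ddbj handles them internally. It uses the standard-library `tomllib` (Python 3.11+), with `tomli` as an assumed fallback on older versions.

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumed fallback for older Pythons

COMMON_STYLE = """
[COMMON.assembly_gap]
estimated_length = "unknown"
gap_type = "within scaffold"
linkage_evidence = "paired-ends"
"""

PER_FEATURE_STYLE = """
[assembly_gap]
estimated_length = "unknown"  # Set it "<COMPUTE>" to count the number of N's
gap_type = "within scaffold"
linkage_evidence = "paired-ends"
"""

common_style = tomllib.loads(COMMON_STYLE)
per_feature_style = tomllib.loads(PER_FEATURE_STYLE)
empty = tomllib.loads("")  # an empty metadata file is still valid TOML

# [COMMON.assembly_gap] nests the table under "COMMON" ...
print(common_style["COMMON"]["assembly_gap"])
# ... while [assembly_gap] is a top-level table with the same values.
print(per_feature_style["assembly_gap"])
print(empty)
```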
GFF3 and DDBJ annotation have rough correspondence like:
- GFF3 column 3 "type" → DDBJ annotation column 2 as "Feature"
- GFF3 column 9 "attribute" → DDBJ annotation column 4 and 5 as "Qualifier key", and "Qualifier value"
but nomenclatures in GFF3 often do not conform to the annotations set by INSDC. Furthermore, DDBJ lists the feature-qualifier pairs it accepts, which are a subset of the INSDC definitions.
To meet these conventions and requirements, GFF3-to-DDBJ comes with a default configuration in TOML to rename (or even translate) feature keys and qualifier keys/values. Note that the Sequence Ontology is helpful in translating a type into an INSDC feature and qualifier(s).
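As a rough illustration of the column correspondence above, the following Python sketch maps one GFF3 line to five-column rows before any renaming or filtering. It is not part of gff3-to-ddbj. The function name `gff3_line_to_rows` is hypothetical, and the sketch assumes that column 3 of the annotation holds the location and that only the first row of a feature carries the entry, feature and location.

```python
def gff3_line_to_rows(line):
    """Map one GFF3 line to rows of
    (Entry, Feature, Location, Qualifier key, Qualifier value).

    No renaming or filtering is applied, so the result is only a rough
    correspondence, not a valid DDBJ annotation.
    """
    fields = line.rstrip("\n").split("\t")
    seqid, _source, type_, start, end, _score, strand, _phase, attributes = fields

    location = f"{start}..{end}"
    if strand == "-":
        location = f"complement({location})"

    # GFF3 column 9 is a ";"-separated list of key=value pairs
    pairs = [item.split("=", 1) for item in attributes.split(";") if "=" in item]
    if not pairs:
        return [[seqid, type_, location, "", ""]]

    rows = []
    for i, (key, value) in enumerate(pairs):
        head = [seqid, type_, location] if i == 0 else ["", "", ""]
        rows.append(head + [key, value])
    return rows


for row in gff3_line_to_rows("chr1\tsrc\tmRNA\t100\t200\t.\t-\t.\tID=foo;Name=bar"):
    print("\t".join(row))
```

The raw keys such as ID and Name are exactly what the renaming configuration described next has to deal with.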
Here is how to customize the renaming configuration.
The default setting renames the five_prime_UTR "type" in GFF3 into the 5'UTR "feature key" in the annotation. This transformation is expressed in TOML as follows:
[five_prime_UTR]
feature_key = "5'UTR"
This is about renaming attributes under arbitrary types. By default, the ID=foobar "attribute" in a GFF3 becomes the /note="ID:foobar" qualifier in the annotation. (Here I follow the convention of prefixing a slash (like /note) to denote a qualifier. But DDBJ annotation does NOT include the slash, hence no slash is used in any of the TOML files.)
Here is the TOML defining the transformation. __ANY__ is the special name representing arbitrary types. ID is the original attribute key. note is the name of the corresponding qualifier key. ID: is attached as the prefix of the qualifier value.
[__ANY__.ID]
qualifier_key = "note"
qualifier_value_prefix = "ID:" # optional
One can also set a qualifier key and a value together. For example, the /pseudo qualifier is discouraged by DDBJ regardless of the feature. We may enforce the replacement as follows:
# /pseudo is always replaced by /pseudogene="unknown"
[__ANY__.pseudo]
qualifier_key = "pseudogene"
qualifier_value = "unknown"
Sometimes we want to replace certain types with features WITH qualifiers. For example, snRNA is an invalid feature in INSDC/DDBJ, hence we replace it with the ncRNA feature with the /ncRNA_class="snRNA" qualifier. Such a transformation is written in TOML as follows.
[snRNA]
feature_key = "ncRNA"
qualifier_key = "ncRNA_class"
qualifier_value = "snRNA"
Here is a story behind the default renaming scheme: Some annotation programs produce a GFF3 line containing RNA as the type and biotype=misc_RNA as one of the attributes. But it should be treated as a misc_RNA feature in the DDBJ annotation. In such a case, we join (feature key, qualifier key, qualifier value) with a dot as the delimiter, and write as follows.
[RNA.biotype.misc_RNA]
feature_key = "misc_RNA"
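The renaming rules above can be read as a small lookup procedure. The following Python sketch applies the five example rules to a (type, attributes) pair. It is not the implementation in gff3-to-ddbj: the function `translate` and its record shapes are hypothetical, and dropping the attribute that triggers a three-part rule is an assumption of this sketch. It uses the standard-library `tomllib` (Python 3.11+), with `tomli` as an assumed fallback.

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumed fallback for older Pythons

RULES = tomllib.loads("""
[five_prime_UTR]
feature_key = "5'UTR"

[__ANY__.ID]
qualifier_key = "note"
qualifier_value_prefix = "ID:"

[__ANY__.pseudo]
qualifier_key = "pseudogene"
qualifier_value = "unknown"

[snRNA]
feature_key = "ncRNA"
qualifier_key = "ncRNA_class"
qualifier_value = "snRNA"

[RNA.biotype.misc_RNA]
feature_key = "misc_RNA"
""")


def translate(gff3_type, attributes):
    """Return (feature_key, [(qualifier_key, value), ...]) for one record."""
    type_rules = RULES.get(gff3_type, {})
    any_rules = RULES.get("__ANY__", {})
    feature_key = type_rules.get("feature_key", gff3_type)
    qualifiers = []

    # Type-level rule that also adds a qualifier, e.g. [snRNA]
    if "qualifier_key" in type_rules:
        qualifiers.append((type_rules["qualifier_key"],
                           type_rules.get("qualifier_value", "")))

    for key, value in attributes.items():
        # Three-part rule such as [RNA.biotype.misc_RNA]
        value_rules = type_rules.get(key)
        if isinstance(value_rules, dict) and value in value_rules:
            feature_key = value_rules[value].get("feature_key", feature_key)
            continue  # assumption: the triggering attribute is dropped

        rule = any_rules.get(key)
        if rule is None:
            qualifiers.append((key, value))  # no rule: keep as-is
        elif "qualifier_value" in rule:
            qualifiers.append((rule["qualifier_key"], rule["qualifier_value"]))
        else:
            prefix = rule.get("qualifier_value_prefix", "")
            qualifiers.append((rule["qualifier_key"], prefix + value))
    return feature_key, qualifiers


print(translate("five_prime_UTR", {"ID": "foobar"}))
print(translate("RNA", {"biotype": "misc_RNA"}))
```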
To feed a custom translation table, use the CLI option:
--config_rename <FILE>
And here is an example call:
gff3-to-ddbj \
--gff3 myfile.gff3 \
--fasta myfile.fa \
--metadata mymetadata.toml \
--locus_tag_prefix MYOWNPREFIX_ \
--transl_table 1 \
--config_rename my_translate_features_qualifiers.toml \ # Set your customized file here
--output myawesome_output.ann
DDBJ specifies a recommended Feature/Qualifier usage matrix. To conform to it, features and qualifiers appearing in the annotation output are filtered by default according to a filtering file in TOML, structured like this:
CDS = [
"EC_number",
"inference",
"locus_tag",
"note",
"product",
]
exon = [
"gene",
"locus_tag",
"note",
]
The left-hand side of the equal sign = represents an allowed feature key, and the right-hand side is a list of allowed qualifier keys. In this example, only CDS and exon features will show up in the annotation, and qualifiers are limited to the listed items. To customize this filtering function, edit the TOML file first and pass the file with the CLI option:
--config_filter <FILE>
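The effect of such a filtering file can be sketched in a few lines of Python. This is only an illustration of the idea, not the filtering code in gff3-to-ddbj: the function `filter_features` and the (feature key, qualifier dict) record shape are hypothetical.

```python
from collections import Counter

# Allowed feature keys, each with its allowed qualifier keys
# (same shape as the TOML filtering file shown above).
ALLOWED = {
    "CDS": ["EC_number", "inference", "locus_tag", "note", "product"],
    "exon": ["gene", "locus_tag", "note"],
}


def filter_features(features, allowed=ALLOWED):
    """Keep allowed features and qualifiers, and count what is discarded.

    `features` is a list of (feature_key, {qualifier_key: value}) pairs,
    a simplified stand-in for real annotation records.
    """
    kept = []
    discarded = Counter()
    for feature_key, qualifiers in features:
        if feature_key not in allowed:
            discarded[feature_key] += 1  # whole feature is dropped
            continue
        kept_qualifiers = {}
        for qualifier_key, value in qualifiers.items():
            if qualifier_key in allowed[feature_key]:
                kept_qualifiers[qualifier_key] = value
            else:
                discarded[(feature_key, qualifier_key)] += 1
        kept.append((feature_key, kept_qualifiers))
    return kept, discarded


kept, discarded = filter_features([
    ("gene", {"locus_tag": "X_1"}),
    ("CDS", {"product": "hypothetical protein", "Parent": "mrna1"}),
])
print(kept)
print(dict(discarded))
```

The counter plays the role of the "[Discarded]" warnings: one count per dropped feature key and one per dropped (feature, qualifier) pair.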
It might be a good practice to validate your GFF3 files. GFF3 online validator is useful though the file size is limited to 50MB.
GFF3-to-DDBJ does not work when the GFF3 file contains FASTA information inside with the ##FASTA directive. The attached tool split-fasta reads a GFF3 file and saves the GFF3 (without FASTA info) and the FASTA separately.
split-fasta path/to/myfile.gff3 --suffix "_splitted"
This creates two files, myfile_splitted.gff3 and myfile_splitted.fa.
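For readers curious about what such a split involves, here is a minimal Python sketch that separates GFF3 text at the ##FASTA directive. It is not the source of split-fasta; the function name `split_gff3_and_fasta` is hypothetical and the sketch works on an in-memory string rather than on files.

```python
def split_gff3_and_fasta(text):
    """Split GFF3 text at the ##FASTA directive.

    Returns (gff3_part, fasta_part). The directive line itself is dropped.
    If there is no ##FASTA directive, fasta_part is an empty string.
    """
    gff3_lines = []
    fasta_lines = []
    in_fasta = False
    for line in text.splitlines():
        if line.strip() == "##FASTA":
            in_fasta = True
            continue
        if in_fasta:
            fasta_lines.append(line)
        else:
            gff3_lines.append(line)
    gff3_part = "".join(line + "\n" for line in gff3_lines)
    fasta_part = "".join(line + "\n" for line in fasta_lines)
    return gff3_part, fasta_part


example = (
    "##gff-version 3\n"
    "chr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">chr1\n"
    "ACGT\n"
)
gff3_part, fasta_part = split_gff3_and_fasta(example)
print(gff3_part, end="")
print(fasta_part, end="")
```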
Characters like =|>" [] are not allowed in the 1st column ("Entry") of the DDBJ annotation. The attached program normalize-entry-names renames such entries. For example, it converts an ID like ERS324955|SC|contig000013 into ERS324955:SC:contig000013.
normalize-entry-names myannotation_output.txt
This command creates the file myannotation_output_renamed.txt if invalid characters are found. Otherwise, you'll see no output.
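The renaming itself can be pictured with a short Python sketch. It is not the code of normalize-entry-names: the function name `normalize_entry_name` is hypothetical, and replacing every disallowed character with a colon is an assumption generalized from the single `|` to `:` example above.

```python
import re

# Characters listed above as not allowed in the "Entry" column
INVALID_CHARS = re.compile(r'[=|>" \[\]]')


def normalize_entry_name(name, replacement=":"):
    """Replace each disallowed character with `replacement`.

    Assumption: all disallowed characters are treated the same way;
    only the "|" -> ":" case is confirmed by the example above.
    """
    return INVALID_CHARS.sub(replacement, name)


print(normalize_entry_name("ERS324955|SC|contig000013"))
```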
- Need to handle location correction and feature join() in the presence of /trans_splicing
- Need to handle location correction in the presence of /transl_except at the start/stop codon
- Needs /translation when /exception exists.
- GFF3 handling when the flatfile is supposed to have a "between-position" location like 123^124
- Currently the development focuses on accuracy; the software runs slowly using a single process.
GFF3-to-DDBJ's design is deeply indebted to EMBLmyGFF3, a versatile converter to the EMBL annotation format.