-
Notifications
You must be signed in to change notification settings - Fork 19
Preparing reference data sets
This document will briefly outline the steps necessary to set up a reference database for the Companion annotation pipeline.
Companion uses pre-compiled information about the reference organisms to drive certain reference-based steps, for example:
- contig-chromosome alignment to order and orientate contigs to form pseudochromosomes,
- protein-DNA alignment for improving ab initio gene finding accuracy and pseudogene annotation,
- transfer of highly conserved gene models from a single reference organism,
- transfer of functional annotation from orthologous features in a single reference organism to the newly annotated genome, and
- quick identification of core genes, missing core genes and singletons w.r.t. a set of reference organisms.
This means that for each reference organism, the pipeline requires gene structures, genome sequence, protein sequences for each coding gene, and an annotation of which sequences in the reference are chromosomes (as they could be intermingled with unassembled contigs in the input sequence file).
Compiling a reference database here means collecting all reference information needed, such as sequence, annotation (structural and functional), trained gene models for AUGUSTUS and/or SNAP, as well as orthologous groups across the reference species to compare to members in the new genome. The next sections will explain what data is needed.
For each reference organism, the input data are needed:
- a descriptive name of the organism
- a short abbreviation for the organism
- the genome sequence in a single file
- a structural gene annotation in GFF3 format (see below for details)
- functional GO annotation in GAF 1.0 format, on the gene level
- a pattern matching chromosome headers, describing how to extract chromosome numbers from them
- AUGUSTUS and (optional) SNAP models, trained on reference genes
Sequences must be in FASTA format, with one-word headers without special characters such as ?<";|\{}[]+=-
etc.
The width of the individual lines does not matter (i.e. whole sequences on one line are fine, as are sequences separated over many lines).
The features in the input GFF3 file should have the following minimal type structure (e.g. should be connected like this in via Parent
attributes):
gene
└───mRNA
└───CDS
pseudogene
└─── pseudogenic_transcript
└───pseudogenic_exon
polypeptide (derives_from mRNA)
gene
└───rRNA
gene
└───tRNA
gene
└───ncRNA
The file must be sorted and strictly valid GFF3! Make sure your file is validated, or when in doubt clean it up with GenomeTools:
gt gff3 -sort -retainids -tidy input.gff3 > input.clean.gff3
Then, continue working with the cleaned file input.clean.gff3
. The command above will also provide error messages if the file is broken in such a way that it cannot be automatically fixed.
Please note that the polypeptide
features should all have a 'product' attribute of the following form (example): product=term%3Dser%2Fthr protein phosphatase family protein
, i.e. the product description for that polypeptide, prefixed by 'term=' and GFF3-escaped.
The exports from GeneDB (ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3) all follow this format so they should be usable as reference input out of the box. For flat file tables from EuPathDB ('Gene Information Tables', e.g. from http://amoebadb.org/common/downloads/Current_Release/EhistolyticaHM1IMSS/txt), we also provide a Python library and script to convert them, EuPathTables.
A reference database is technically just a directory subtree with a predefined structure. Hence it is not a real 'database' in the strongest sense, but filesystem based. It is usually identified by the topmost directory in this subtree, in which each organism in turn has its own subdirectory.
For instance, for a reference directory references
with three reference organisms LmjF
, LbrM
and Tb927
you would have the following tree:
references
├───references-in.json
├───references.json
├───_all
| └─── all_orthomcl.out
├───LmjF
| └─── ...
├───LbrM
| └─── ...
└───Tb927
└─── ...
You might have noticed the files references-in.json
and references.json
. These contain metadata about the reference species and are read by the pipeline to access the data in the reference database.
While this structure might sound complicated, it in fact isn't. The next sections will give a short walkthrough of this process.
The following software is required to create a new reference directory:
- GenomeTools version 1.5.5 or newer
- BLAST or BLAST+
- MCL and OrthoMCL 1.4
To create a new reference set, the steps explained here should be sufficient. First, create a new directory for the set:
$ mkdir reference
$ cd reference
In this directory, create a new file references-in.json
with the following contents (example):
{
"species" : {
"LmjF" : {
"gff" : "../genomes/L_major.gff3",
"genome" : "../genomes/L_major.fasta",
"gaf" : "../genomes/L_major.gaf",
"name" : "Leishmania major Friedlin",
"augustus_model" : "../../data/augustus/species/leishmania_major_sampled/",
"snap_model" : "../../data/snap/LmjF.hmm",
"chromosome_pattern" : "LmjF.(%d+)"
},
"LbrM" : {
"gff" : "../genomes/L_braziliensis.gff3",
"genome" : "../genomes/L_braziliensis.fasta",
"gaf" : "../genomes/L_braziliensis.gaf",
"name" : "Leishmania braziliensis",
"augustus_model": "../../data/augustus/species/leishmania_braziliensis_sampled/",
"chromosome_pattern" : "LbrM.([0-9.]+)$"
},
"Tb927" : {
"gff" : "../genomes/T_brucei_927.gff3",
"genome" : "../genomes/T_brucei_927.fasta",
"gaf" : "../genomes/T_brucei_927.gaf",
"name" : "Trypanosoma brucei TREU927",
"augustus_model": "../../data/augustus/species/trypanosoma_brucei_927_sampled/",
"chromosome_pattern" : "Tb927_(%d+)_v%d"
},
},
"groups" : {
"Leishmania" : ["LmjF", "LbrM"],
"Trypanosoma" : ["Tb927"]
}
}
See the JSON format documentation to learn more about the syntax of this format. Each new reference species must have an entry in the species
hash, with its short abbreviation as the key. The chromosome_pattern
fields must contain a Lua pattern which has to match all the chromosome header in the input sequence set, and also capture a numeric string to be interpreted as the chromosome number. This is needed later for pseudochromosome contiguation using this species as a reference.
The groups
hash assigns each of the reference species to a more general group, e.g. a genus. This is useful in Companion's downstream analysis steps to determine the relevant orthologous genes for tree drawing and family comparison.
The script update_references.lua
in Companion's bin directory takes care of actually collecting the files referenced by the references-in.json
, preprocessing them and finally copying them to their expected positions in the directory tree. Just execute the script while in the directory containing the references-in.json
.
$ update_references.lua
Importing LbrM...
Importing LmjF...
Importing Tb927...
$
The sequence, annotations and models have now been imported from the locations specified in the references-in.json
file. If everything went smoothly, there should now be a references.json
file in the current working directory, alongside the referenced-in.json
. Check whether that file exists, is not empty, and looks lika a proper JSON file.
To use the reference set for annotation only, this is enough to get you going! Just use the directory containing the references.json
file as the value of the ref_dir
parameter in your Companion job parameter file.
For some of the downstream analyses to work, you will also need a set of precomputed orthologs. These are just the cluster output of OrthoMCL 1.4. So to generate them just do something alike to the following (from your reference root directory):
mkdir _all
cd _all
find .. -name 'proteins.fasta' | xargs cat > all_prots.fasta
find .. -name 'ggline.gg' | xargs cat > all_gg.txt
formatdb -i all_prots.fasta
blastall -p blastp -m 8 -e 1e-5 -d all_prots.fasta -i all_prots.fasta > b.out
orthomcl.pl --mode 3 --blast_file b.out --gg_file all_gg.txt
Then, move the all_orthomcl.out
file from the created orthomclXXXX
directory (where XXXXX is a number) to the _all
directory. That's it!
To do de novo prediction with a reference genome as the template, AUGUSTUS and SNAP must be trained on reference genes.
This HOWTO follows the general training approach introduced by: http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html
The basic set of commands is:
new_species.pl --species=my_species_code
gff2gbSmallDNA.pl reference.gff3 reference.fasta 1000 sampled.gb
etraining --species=my_species_code sampled.gb
This will create the necessary files in the corresponding species subdirectory in the AUGUSTUS_CONFIG_PATH
directory that can be referenced in the references-in.json
file.
Please note that it might improve the performance of the model if the optimize_augustus.pl
script is run on the resulting models. Please check the training HOWTO linked above for more details on this step.
We recommend that SNAP is not used (it is disable in the configuration included in the distribution), as it has not provided useful results in this pipeline.