Variant Annotations
The annotation tool SnpEff annotates variants and predicts their effects on genes by using an interval forest approach.
In grenepipe, SnpEff is used when the config.yaml key settings: snpeff is set to true.
For SnpEff to work, we need to select a reference genome (by name) that SnpEff understands, and set it in the params: snpeff section of the config.yaml.
First, we need to figure out what your species is called in the SnpEff database (its online repositories). If you already know that, jump to step 4 in the list below and use that name there. If not, follow steps 1-3 first: we install SnpEff and list all available online resources to find the one corresponding to your species.
1. First, we install SnpEff in a conda environment, so that we can work cleanly. Create the file env-snpeff.yaml:

    channels:
      - conda-forge
      - bioconda
    dependencies:
      - snpeff =4.3.1t

   Now, create and activate the conda environment:

    conda env create --name snpeff --file env-snpeff.yaml
    conda activate snpeff

2. Next, find the name of your reference genome in the SnpEff database. Here, we filter the database for some term that we expect to find. Of course, change "thaliana" to what you are looking for, or just omit the grep to get the whole list.

    snpEff databases | grep -i "thaliana"

3. The first column of the output is what we are looking for: Arabidopsis_thaliana is the name of the reference genome for A. thaliana in the SnpEff database.

4. Put this name into your grenepipe config.yaml file at the params: snpeff: name key.
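
The lookup in steps 2 and 3 can also be scripted. The following Python sketch picks the first column out of a database listing. The function name `matching_genomes` and the sample text are illustrative only, and the sketch assumes nothing more than what step 3 states: each row starts with the genome name, followed by whitespace.

```python
def matching_genomes(listing: str, term: str) -> list[str]:
    """Return the first column of every line that contains `term` (case-insensitive)."""
    names = []
    for line in listing.splitlines():
        if term.lower() in line.lower():
            fields = line.split()
            if fields:
                names.append(fields[0])
    return names


# Illustrative stand-in for the output of `snpEff databases`;
# the real listing has more columns and many more rows.
sample = (
    "Arabidopsis_thaliana    Arabidopsis_thaliana\n"
    "Zea_mays                Zea_mays\n"
)
print(matching_genomes(sample, "thaliana"))  # prints ['Arabidopsis_thaliana']
```

With SnpEff installed, the real listing could be captured with `subprocess.run(["snpEff", "databases"], capture_output=True, text=True).stdout` and passed in place of `sample`.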
With a correct name in place, and with settings: snpeff set to true, grenepipe will take care of downloading the respective SnpEff databases.
Alternatively, our config.yaml also has settings to specify a custom database, for instance if SnpEff does not have anything available for your species, or if you need some extra customization. See the SnpEff documentation for how to build a custom database.
SnpEff produces an annotated vcf file, as well as an html report, both located in the annotation directory of the pipeline run directory.
Note that SnpEff changed its output format at some point. The older format can be requested for backwards compatibility via the -formatEff option, see https://pcingola.github.io/SnpEff/se_commandline/ for details. That option can be set in your config.yaml file (along with any other options you need) under the key params: snpeff: extra.
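
To make the key layout concrete, here is a minimal Python sketch of the SnpEff-related configuration as a nested mapping, as it might look after loading config.yaml. The helper `snpeff_config_problems` is hypothetical and is not part of grenepipe. It only checks the two keys that the database-by-name setup described above relies on.

```python
def snpeff_config_problems(config: dict) -> list[str]:
    """Check the keys needed for the database-by-name setup described above."""
    problems = []
    if config.get("settings", {}).get("snpeff") is not True:
        problems.append("settings: snpeff is not set to true")
    params = config.get("params", {}).get("snpeff", {})
    if not params.get("name"):
        problems.append("params: snpeff: name is empty")
    return problems


# The keys discussed above, written out as a nested mapping.
config = {
    "settings": {"snpeff": True},
    "params": {
        "snpeff": {
            "name": "Arabidopsis_thaliana",
            "extra": "-formatEff",  # optional; requests the older output format
        }
    },
}
print(snpeff_config_problems(config))  # prints []
```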
The Ensembl Variant Effect Predictor VEP is another variant annotation tool.
In grenepipe, VEP is used when the config.yaml key settings: vep is set to true.
For VEP to work, we need to select a reference genome (by name) that VEP understands, and set it in the params: vep section of the config.yaml. This is a bit tricky and does not seem to be documented all too well on their web page. In particular, we need to find the species name, the database build name, and the release version to be able to automatically download the data, called the "cache" in VEP.
It is important to note that the download FTP URL might have to be set, and this can be hard to find on their website. Follow the links for FTP directories that you find on the vep_download and vep_cache pages, for example, and look for links of the form
http://ftp.ensembl.org/pub/current_variation/indexed_vep_cache/
ftp://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/variation/vep
If the link starts with http:// (as the first of the two links above does), simply replace that with ftp:// in the cache-url setting of the grenepipe config file, see below.
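
The scheme replacement is simple enough to do by hand, but as a small Python sketch it looks as follows. The function name `to_ftp_url` is made up for illustration and is not part of grenepipe or VEP.

```python
def to_ftp_url(url: str) -> str:
    """Replace a leading http:// with ftp://; leave any other URL unchanged."""
    prefix = "http://"
    if url.startswith(prefix):
        return "ftp://" + url[len(prefix):]
    return url


print(to_ftp_url("http://ftp.ensembl.org/pub/current_variation/indexed_vep_cache/"))
# prints ftp://ftp.ensembl.org/pub/current_variation/indexed_vep_cache/
```

A link that already starts with ftp://, such as the second one above, is returned unchanged.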
For example, the following can be set in our grenepipe config file:
params:
  vep:
    species: "homo_sapiens"
    build: "GRCh38"
    release: 98
    cache-url: "" # The VEP default works for Homo sapiens
or
params:
  vep:
    species: "arabidopsis_thaliana"
    build: "TAIR10"
    release: 104
    cache-url: "ftp://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/variation/vep"
In the latter example, Arabidopsis thaliana is a plant species, and hence not found in the default metazoan list that VEP uses. Hence, we have to set the cache URL accordingly.
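
As a quick sanity check before starting a run, the VEP settings can be inspected with a few lines of Python. This is only a sketch: the helper `vep_param_problems` is hypothetical, and the rule that a non-empty cache-url should start with ftp:// is an assumption drawn from the advice above, not a documented grenepipe check.

```python
def vep_param_problems(params: dict) -> list[str]:
    """Report missing VEP settings and cache URLs that do not use ftp://."""
    problems = []
    for key in ("species", "build", "release"):
        if not params.get(key):
            problems.append(f"params: vep: {key} is missing")
    # An empty cache-url means "use the VEP default", so only check non-empty values.
    cache_url = params.get("cache-url", "")
    if cache_url and not cache_url.startswith("ftp://"):
        problems.append("params: vep: cache-url should start with ftp://")
    return problems


human = {"species": "homo_sapiens", "build": "GRCh38", "release": 98, "cache-url": ""}
plant = {
    "species": "arabidopsis_thaliana",
    "build": "TAIR10",
    "release": 104,
    "cache-url": "ftp://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/variation/vep",
}
print(vep_param_problems(human))  # prints []
print(vep_param_problems(plant))  # prints []
```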
If you find a simpler way of finding the necessary settings, please let us know!
VEP produces an annotated vcf file, as well as an html report, both located in the annotation directory of the pipeline run directory.