Variant Annotations

Lucas Czech edited this page Jul 15, 2024 · 11 revisions

SnpEff

The annotation tool SnpEff annotates variants and predicts their effects on genes by using an interval forest approach.

In grenepipe, SnpEff is used when the config.yaml key settings: snpeff is set to true.
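
As a minimal sketch, the corresponding fragment of config.yaml might look as follows. The nesting is inferred from the key path above, and all other keys of the settings section are omitted here.

```yaml
# Fragment of config.yaml; other keys under `settings` are omitted.
settings:
  snpeff: true  # enable the SnpEff annotation step
```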

Reference genome

For SnpEff to work, we need to select a reference genome (by name) that SnpEff understands, and set it in the params: snpeff section of the config.yaml.

First, we need to find out what your species is called in the SnpEff database (its online repositories). If you already know the name, jump to step 4 in the list below and use it there. If not, follow steps 1-3 first: we install SnpEff and list all available online resources to find the one corresponding to your species.

  1. First, we install SnpEff in a conda environment, so that we can work cleanly.

    Create the file env-snpeff.yaml:

    channels:
      - conda-forge
      - bioconda
    dependencies:
      - snpeff =4.3.1t
    

    Now, create and activate the conda environment:

    conda env create --name snpeff --file env-snpeff.yaml
    conda activate snpeff
    
  2. Next, find the name of your reference genome in the SnpEff database. Here, we filter the database for some term that we expect to find. Of course, change "thaliana" to what you are looking for, or just omit the grep to get the whole list.

    snpEff databases | grep -i "thaliana"
    
  3. The first column of the output is what we are looking for: Arabidopsis_thaliana is the name of the reference genome for A. thaliana in the SnpEff database.

  4. Put this name into your grenepipe config.yaml file at the params: snpeff: name key.
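
For the A. thaliana example from step 3, the entry might look like this. It is only a sketch: it shows the name key alone, and the rest of the params: snpeff section is left out.

```yaml
# Fragment of config.yaml; other keys under `params: snpeff` are omitted.
params:
  snpeff:
    # First column of the `snpEff databases` output (see step 3)
    name: "Arabidopsis_thaliana"
```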

With a correct name in place, and with settings: snpeff set to true, grenepipe will take care of downloading the respective SnpEff databases.

Alternatively, our config.yaml also has settings to specify a custom database, for instance if SnpEff has nothing available for your species, or if you need some extra customization. See the SnpEff documentation for how to build a custom database.

Output format

SnpEff produces an annotated vcf file, as well as an html report, both located in the annotation directory of the pipeline run directory.

Note that SnpEff changed its output format at some point: newer versions write annotations to the ANN field of the vcf file instead of the older EFF field. For backwards compatibility, the old format can be restored via the -formatEff option; see https://pcingola.github.io/SnpEff/se_commandline/ for details. That option can be set in your config.yaml file (along with any other options you need) under the key params: snpeff: extra.
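
A minimal sketch of passing the option through the config, assuming that extra takes a plain string of command-line arguments that is handed to SnpEff unchanged:

```yaml
# Fragment of config.yaml; other keys under `params: snpeff` are omitted.
params:
  snpeff:
    # Assumption: `extra` is a string of additional SnpEff command-line arguments
    extra: "-formatEff"
```

If you need further SnpEff options, they would go into the same string, separated by spaces.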

VEP

The Ensembl Variant Effect Predictor (VEP) is another variant annotation tool.

In grenepipe, VEP is used when the config.yaml key settings: vep is set to true.
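
As a sketch, with the nesting inferred from the key path above and all other settings keys omitted:

```yaml
# Fragment of config.yaml; other keys under `settings` are omitted.
settings:
  vep: true  # enable the VEP annotation step
```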

Reference genome

For VEP to work, we need to select a reference genome (by name) that VEP understands, and set it in the params: vep section of the config.yaml. This is a bit tricky and does not seem to be documented all too well on their web page. In particular, we need to find the species name, the database build name, and the release version in order to automatically download the data, which VEP calls the "cache".

Note that the FTP URL for the download might have to be set, and it can be hard to find on their website. Follow the links to the FTP directories that you find on the vep_download and vep_cache pages, for example, and look for links of the form

http://ftp.ensembl.org/pub/current_variation/indexed_vep_cache/
ftp://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/variation/vep

If the link starts with http:// (as the first of the two links above does), replace that prefix with ftp:// in the cache-url setting of the grenepipe config file; see below.
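
For instance, applying that replacement to the first link above would give the following setting. This is a sketch derived only from the rule just stated; whether this URL suits your species and release is something to check against the FTP directory listing.

```yaml
# Fragment of config.yaml; the other `params: vep` keys are omitted.
params:
  vep:
    # The http:// link from above, with the scheme swapped to ftp://
    cache-url: "ftp://ftp.ensembl.org/pub/current_variation/indexed_vep_cache/"
```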

For example, the following can be set in our grenepipe config file:

params:
  vep:
    species: "homo_sapiens"
    build: "GRCh38"
    release: 98
    cache-url: "" # The VEP default works for Homo sapiens

or

params:
  vep:
    species: "arabidopsis_thaliana"
    build: "TAIR10"
    release: 104
    cache-url: "ftp://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/variation/vep"

In the latter example, Arabidopsis thaliana is a plant species, and hence not found in the default metazoan list that VEP uses. We therefore have to set the cache URL accordingly.

If you find a simpler way of finding the necessary settings, please let us know!

Output format

VEP produces an annotated vcf file, as well as an html report, both located in the annotation directory of the pipeline run directory.