The annotation support in Variant Transforms has two goals:
- First, we want to be able to properly parse annotation fields in input VCFs.
- The second goal is to be able to annotate VCF files during the import process, using open annotation tools/databases like Variant Effect Predictor (VEP).
The annotation field format that is currently supported is based on this spec developed in the SnpEff project. VEP uses the same format and here is a trimmed example from a file annotated by VEP (taken from the gnomAD dataset):
VCF Header:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|ALLELE_NUM|...>
Corresponding annotations for a sample variant record where `REF=AAC` and `ALT=TAC,A`:
CSQ=-|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000423562|unprocessed_pseudogene|||2,...,T|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000518655|transcribed_unprocessed_pseudogene|||1
For each variant record, multiple annotation lists are provided, separated by commas. The first field of each list refers to an alternate allele, or `ALT` (for details of the format see the spec). Note that the actual record of the above example has more fields, and each alternate allele can be repeated multiple times (with different annotations); here we have selected only a small subset, one for each `ALT`.
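The splitting described above can be sketched in a few lines of Python. This is an illustration only, not Variant Transforms' actual parser, and the field list is a truncated subset of the `Format:` string in the header:

```python
def parse_csq(info_value, format_fields):
    """Split a CSQ INFO value into one dict per annotation list.

    Annotation lists are separated by ',' and fields within a list by '|'.
    """
    return [dict(zip(format_fields, entry.split('|')))
            for entry in info_value.split(',')]

# Truncated field list from the CSQ header above.
fields = ['Allele', 'Consequence', 'IMPACT', 'SYMBOL']
csq = ('-|downstream_gene_variant|MODIFIER|WASH7P,'
       'T|upstream_gene_variant|MODIFIER|DDX11L1')
annotations = parse_csq(csq, fields)
# The first field of each annotation list identifies the alternate allele.
assert annotations[0]['Allele'] == '-'
assert annotations[1]['SYMBOL'] == 'DDX11L1'
```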
When the `--annotation_fields` flag is used, annotation lists are split and moved under their corresponding `alternate_bases` records. In other words, there will be a repeated record field `CSQ` under each `alternate_bases`:
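The resulting row shape can be illustrated as a Python dict. This is only a sketch of the nested structure; the real schema contains many more columns, and the CSQ subfields shown are the truncated subset from the example above:

```python
# Illustrative row shape after --annotation_fields: each alternate_bases
# record carries its own repeated CSQ record (subset of columns shown).
row = {
    'reference_name': 'chr1',
    'reference_bases': 'AAC',
    'alternate_bases': [
        {'alt': 'TAC',
         'CSQ': [{'Consequence': 'upstream_gene_variant',
                  'IMPACT': 'MODIFIER', 'SYMBOL': 'DDX11L1'}]},
        {'alt': 'A',
         'CSQ': [{'Consequence': 'downstream_gene_variant',
                  'IMPACT': 'MODIFIER', 'SYMBOL': 'WASH7P'}]},
    ],
}
assert row['alternate_bases'][0]['CSQ'][0]['SYMBOL'] == 'DDX11L1'
```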
As can be seen in the above example, the alternate alleles listed in the `CSQ` field differ from the `ALT` strings of the variant record (for example, the `AAC->A` variant is represented by a single `-`, which indicates a deletion). This is because the gnomAD dataset was run through VEP with the `--minimal` option. Variant Transforms supports matching alt fields in the minimal mode. There are three supported modes of `ALT` matching:
- If no `ALT` matching flag is set, the spec is implemented by default (support for cancer samples is not currently implemented and is tracked by this issue).
- With `--use_allele_num`, the value of the `ALLELE_NUM` annotation field is looked up, which is a feature of VEP. This is the case in the above example.
- Variant Transforms can also simulate the `--minimal` mode of VEP with `--minimal_vep_alt_matching`, but this is not recommended as it can result in ambiguous matching. The number of such ambiguous cases is counted and reported in the Dataflow dashboard. Also, with this flag, a new field is added to the output BigQuery table to mark ambiguous cases. This should be used only as a last resort.
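The two non-default matching modes can be sketched as follows. This is a simplified illustration of the idea, not Variant Transforms' actual implementation:

```python
def alt_by_allele_num(alts, allele_num):
    # --use_allele_num: ALLELE_NUM is a 1-based index into the ALT list.
    return alts[int(allele_num) - 1]

def minimal_representation(ref, alt):
    # --minimal_vep_alt_matching: reduce REF/ALT to the minimal form that
    # VEP reports, by trimming the shared suffix and then the shared prefix.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return alt if alt else '-'

# REF=AAC with ALT=TAC,A from the example above:
assert minimal_representation('AAC', 'TAC') == 'T'
assert minimal_representation('AAC', 'A') == '-'   # deletion becomes '-'
assert alt_by_allele_num(['TAC', 'A'], '2') == 'A'
```

Note that minimal matching is ambiguous whenever two different `ALT` strings reduce to the same minimal form, which is why this mode is discouraged.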
You can run VEP on input VCF files before importing them into BigQuery. The minimum set of flags to enable this feature is `--run_annotation_pipeline` and `--annotation_output_dir [GCS_PATH]`, where `[GCS_PATH]` is a path in a GCS bucket that your project owns.
Parsing annotation fields and breaking them into separate fields is a useful feature, but it still requires pre-annotated VCFs. To address this need, we have started an effort to provide options for annotating input files as part of the BigQuery import process, i.e., automatically done by Variant Transforms.
Currently, this option is only supported for VEP; it works by bringing up virtual machines (VMs) on Google Cloud Platform (GCP) and running VEP Docker images. This is slow, and improving this experience is on our roadmap.
Relevant flags for this feature are:
- `--run_annotation_pipeline` enables the automatic annotation feature.
- `--annotation_output_dir` a path on Google Cloud Storage (GCS) where VEP output files are copied. The file hierarchy of input files is replicated at this location with the same file names followed by `_vep_output.vcf`. Note that if this directory already exists, Variant Transforms fails. This is to prevent unintentional overwriting of old annotated VCFs.
- `--shard_variants` by default, the input files are sharded into smaller temporary VCF files before running VEP annotation. If the input files are small, i.e., each VCF file contains fewer than 50,000 variants, setting this flag can be computationally wasteful.
- `--number_of_variants_per_shard` the maximum number of variants written to each shard if `shard_variants` is true. The default value should work for most cases. You may change this flag to a smaller value if you have a dataset with many samples. Note that the pipeline may take longer to finish for smaller values of this flag.
- `--vep_image_uri` the Docker image for VEP created using the Dockerfile in the variant-annotation GitHub repo. By default `gcr.io/cloud-lifesciences/vep_91` is used, which is a public image that Google maintains (VEP version 91).
- `--vep_cache_path` the GCS location of the compressed VEP cache. This file can be created using the build_vep_cache.sh script. By default `gs://cloud-lifesciences/vep/vep_cache_homo_sapiens_GRCh38_91.tar.gz` is used, which is suitable for human genomes aligned to the GRCh38 reference sequence.
- `--vep_info_field` by default, an INFO field called `CSQ_VT` is added to hold the new annotations; use this flag to override the field name.
- `--vep_num_fork` the number of forks when running VEP; see the `--fork` option of VEP.
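Putting these together, an invocation might look like the following sketch. All names here are placeholders, and the base import flags (`--input_pattern`, `--output_table`, etc.) follow the usual Variant Transforms setup; adapt them to your own project and buckets:

```shell
# Sketch only: append the annotation flags to your usual vcf_to_bigquery
# invocation. Bucket, project, and table names are placeholders.
vcf_to_bigquery \
  --input_pattern "gs://my-bucket/vcfs/*.vcf" \
  --output_table my-project:my_dataset.my_table \
  --run_annotation_pipeline \
  --annotation_output_dir "gs://my-bucket/vep_output" \
  --vep_num_fork 4
```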
For other parameters, like how many VMs to use or where the VMs should be located, the same parameters as for Dataflow pipeline execution are reused, e.g., `--num_workers`. It is recommended to provide one worker per 50,000 variants. For instance, for a dataset with 5,000,000 variants, you may set `--num_workers` to 100.
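The sizing rule of thumb above is simple to compute (the helper name is ours, for illustration):

```python
import math

def recommended_num_workers(num_variants, variants_per_worker=50_000):
    # One Dataflow worker per 50,000 variants, rounded up.
    return math.ceil(num_variants / variants_per_worker)

assert recommended_num_workers(5_000_000) == 100
```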
In the `annotation_output_dir`, besides the output annotated VCFs, there is a `logs` directory which contains the logs from the virtual machines on which VEP was run. If VEP fails, this is a good place to look for causes (in addition to the usual Variant Transforms log files).
Finally, the `--check_ref` option of VEP is enabled for these runs, and the above log files contain a report of variants that do not match the reference. If you are using the right VEP cache, there should be no mismatches. Note that mismatched variants are dropped and hence are not present in the final BigQuery output table.
We imported the gnomAD genomes table using the features described in Parsing Annotation Fields.
The gnomAD sample query notebook has additional queries you can run to practice querying annotation data.
Here is another sample query to extract high impact variants in the HBB gene which encodes the beta chain of Haemoglobin:
```sql
#standardSQL
SELECT reference_name, start_position, reference_bases, ALT.alt, CSQ.*
FROM vcf_imports_external.gnomad_genomes_chr_hg19 AS T, T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE CSQ.SYMBOL = "HBB" AND CSQ.IMPACT = "HIGH"
ORDER BY start_position
```
and here is a snapshot from the BigQuery UI for this query:
Note that the gnomAD dataset does not have any calls or samples, but it is a fairly simple exercise to extend this query to pick samples that have these high impact variants.
You can also join with other databases that are available in BigQuery. For instance, there are several pathway databases that are available publicly in BigQuery. Here is a sample query to extract high impact variants that are in the double-strand break repair pathway as defined by the GO (Gene Ontology) biological process (GO:0006302).
```sql
#standardSQL
SELECT
  reference_name, start_position, reference_bases, ALT.alt,
  CSQ.Consequence, CSQ.Impact, CSQ.SYMBOL
FROM
  `vcf_imports_external.gnomad_genomes_chr_hg19` AS T,
  T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE
  # Note: Matching based on symbol is "best effort" as the names may not be
  # standardized across sources.
  CSQ.SYMBOL IN (SELECT DB_Object_Symbol
                 FROM `isb-cgc.genome_reference.GO_Annotations`
                 WHERE GO_ID = 'GO:0006302')
  AND CSQ.IMPACT = "HIGH"
ORDER BY
  start_position
```
As part of the ISB-CGC project, several more public databases are also made available in BigQuery including Reactome (table) and WikiPathways (table).