Skip to content

Individual level PrediXcan: introduction, tutorials and manual

Hae Kyung Im edited this page Jun 3, 2020 · 11 revisions

Introduction

In the following, we focus on the individual-level implementation of PrediXcan.

The method was originally implemented in this repository. However, maintenance and new features were hard to enact, so we re-implemented the method to integrate with latest MetaXcan features.

PrediXcan consists of two steps:

  • Predict gene expression (or whatever biology the models predict) in a cohort with available genotypes
  • Run associations to a trait measured in the cohort

The first step is implemented in Predict.py. The second step is implemented in PrediXcanAssociation.py. Both steps are wrapped in PrediXcan.py, although I find it more convenient to run each step at a time. Running one step at a time is easier to troubleshoot when running on a cohort for the first time.

The prediction models are trained and pre-compiled on specific data sets with their own human genome releases and variant definitions. We implemented a few rules to support variant matching from genotypes based on different variant definitions. In the following, mapping refers to the process of assigning a model variant to a genotype variant.

Originally, PrediXcan was applied to genes so we say "gene expression" a lot as it was the mechanism we initially studied. But conceptually, everything said here applies to any intermediate/molecular mechanism such as splicing or brain morphology. Whenever we say "gene", it generally could mean a splicing intron event, etc.

You can find a specification of command-line arguments below. We also show some examples of running predixcan.

Features

  • Handles multiple input genotype formats (vcf, bgen, text format)
  • For bgen, only those variants available to the model get loaded, saving a lot of time
  • Output can be stored in HDF5 for granular and efficient access.
  • Supports prediction models built upon different versions of the human genome.
    • If your genotypes are in (say) hg19 and use one of the latest models defined on hg38, genotype variants can be lifted over behind the scenes from hg19 to hg38
  • If you use non-standard variants, you can provide a table mapping from your genotype variants to PrediXcan models.
  • It loads one variant at a time, so that the memory consumption is bounded by the predicted expression matrix (roughly genotype sample size * number of genes * float size). When using HDF5 storeage, this matrix is stored in disk so that memory is negligible.

The software assumes that no genotypes are missing. In vcf's, if an allele is denoted is missing, it is counted as zero dosage of that allele.

Command line overview

In the following, mapping refers to implemented rules to match genotype variants to model variants.

Predict.py:

  • --model_db_path: path to a sqlite file containing prediction models
  • --model_db_snp_key: optional. If provided, will load variant ids from an alternative column in the db. By default, PrediXcan uses rsids, and this works with Elastic Net models. For the more sophisticated MASHR models, --model_db_snp_key varID must be specified with this argument.
  • --liftover: Optional. Path to liftover chain. If provided, genomic coordinates will be mapped with liftover.
  • --zero_based_positions Optional flag. Lets PrediXcan note that genomic coordinates start at 0 (genome browser starts at 1).
  • --skip_palindromic: Optional flag. Palindromic/ambiguous snv (such as C/G or A/T) from the genotype will be ignored.
  • --bgen_genotypes: pattern of files with genotypes in bgen format (e.g. data/ukb/chr*.bgen)
    • --bgen_use_rsid: Load rsid as variant id, otherwise load id. Rsids are convenient for some models where each variant is guaranteed to have an rsid. Not so much for other models.
    • --force_colon: Some files have mixed separators in variant ids such as chr1:100_A_T. This converts on-the-fly to chr1:100:A:T.
  • --vcf_genotypes: pattern of vcf genotype files.
  • --vcf_mode:
    • genotyped is meant for phased, genotyped vcfs that contain counts of each allele at each chromosome pair.
    • imputed will load DS field as dosage. This is meant to work with imputed vcfs as generated by the Michigan Imputation Server.
  • --force_mapped_metadata: Optional. Mapping variants is inherently heuristic; uses the mapped variants definitions (coordinates and alleles) instead of the genotyped properites. i.e. a genotype variant might be chr1 100 C T and end up mapped to chr1_100_T_C_b38; this option tells PrediXcan to use T C instead of C T, and swaps allele dosages and frequencies appropriately, before matching to the model. This option is mostly meant for development purposes, as the PrediXcan method will match alleles of genotype and model variants anyway.
  • --text_genotypes: use the text format as specified below
  • --text_sample_ids: a file specifying samples. If specified alongside bgen or vcf, will use the file's IDs instead of the ones in the vcf or bgen.
  • --generate_sample_ids: Optional. Provide a number here to generate ids programmatically based on sample size.
  • --prediction_output: specify output (and output type) of predicted expression matrix (e.g. --prediction_output results/vcf_1000G_hg37_en_c/Whole_Blood__predict.h5 HDF5)
  • --prediction_summary_output: Optional. A separate file that will contain some additional information on the predictions (such as number of snps in the gene's models, number of snps used, etc).
  • --variant_mapping: Optional. Specify a way to map from genotype variant to model variants. This argument takes three values:
    • A file with a table that states how to convert from genotype variants to model variants
    • name of column with genotype variant ids
    • name of column with model target variant ids
  • --on_the_fly_mapping: Optional. Specify a pattern to build a variant id from genotype variant properties. e.g. --on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" will take the genotype variant's chromosome, position, alleles to build a variant id like chr1_123_A_G_b38. This will use the genotype properties, or if liftover is specified, the lifted coordinates.
  • --only_entries: Optional. This convenience argument specifies a whitelist of genes/splicing/etc to be computed, anything else will be disregarded.

PrediXcanAssociation.py:

  • --hdf5_expression_file: Load predicted expression matrix from an HDF5 file (as generated by Predict.py).
  • --expression_file: Alternatively, load predicted expression matrix from a text file (as generated by Predict.py).
  • --input_phenos_file: path to (text, tab separated file) file specifying phenotype values for each individual.
  • --input_phenos_column: name of column to be used as phenotype.
  • --covariates_file: path to (text, tab-separated) file specifying covariates for each individual.
  • --covariates: names of columns from the above file to use as covariates.
  • --output: path were to save the results.
  • --mode: can be linear (default), or logistic for dichotomic phenotypes.

PrediXcan.py takes the same arguments from PrediXcanAssociation.py and Predict.py. If no prediction output is specified, the predicted transcriptome matrix will not be stored. This is disrecommended and the matrix contains useful information.

All scripts support a --throw option: in case of error, the full software stack trace will be printed instead of a user-friendly error message.

Text dosage format

The dosage format consists of gzipped, tab separated text files without header. Each line looks like:

chromosome variant_id position allele1 allele2 MAF id1 ..... idn with:

  • chromosome: the chromosome identifier (i.e. 1)
  • position: position of variant in the chromosome
  • variant_id: a unique string identifying the variant. It can be an rsid, it can be a string encoding properties (chr1_123_C_T_b38), or whatever.
  • allele1: The non-effect allele (sometimes called "ancestral", or "ref" in Thousand Genomes)
  • allele2: effect allele (the one which dosage will be used to predict expression/splicing/etc)
  • MAF: allele frequency of allele 2 (unused at the moment)
  • id1 .. idn: each entry is the dosage (count/probabilities of allele 2 across the chromosomes)

It generally expects a samples file, which is a headerless, space-separated file with columns: family id and individual id. Example:

HG00096 HG00096
HG00097 HG00097
HG00099 HG00099
HG00100 HG00100
...

Examples

You can try PrediXcan by downloading the following example data from Zenodo. The package contains minimalist example cases to show in the following tutorials. To unpack, run the following in a UNIX-like command line:

tar -xvpf predixcan_sample_data.tar

Prerequisites

In the following, we assume that you have downloaded the MetaXcan software software and all dependencies (such as Python 3.5 or later) as listed in MetaXcan's readme.

For convenience we assume that the path to MetaXcan is exposed to to your environment like so:

export $METAXCAN=/path/to/code/MetaXcan/software

Also, the folder with the sample data should be exposed like this:

export $DATA=/path/to/sample/data/untared

In the following, python3 stands for a valid python 3.5 (or greater) installation with all dependencies.

Example 1: VCF hg19-based genotype, on GTEx v8 Elastic Net Model

The easiest example is one where the genotypes and models use the same variants. Here we show how to run on Thousand Genomes genotypes from hg37 with the Elastic Net models:

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/vcf_1000G_hg37_en/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/vcf_1000G_hg37_en/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/vcf_1000G_hg37_en/Whole_Blood__predict.txt \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg37_en/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Option: use the PrediXcan wrapper

You can run both steps in one command like so:

python3 $METAXCAN/PrediXcan.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/vcf_1000G_hg37_en_b/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/vcf_1000G_hg37_en_b/Whole_Blood__summary.txt \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg37_en_b/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Option: HDF5 storage

The following uses HDF5 for predicted expression. It is generally slower, as it stores the prediction matrix in disk, but it consumes a dramatically smaller amount of memory. This was implemented to run on large cohorts such as the UK Biobank

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/vcf_1000G_hg37_en_c/Whole_Blood__predict.h5 HDF5 \
--prediction_summary_output $RESULTS/vcf_1000G_hg37_en_c/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--hdf5_expression_file $RESULTS/vcf_1000G_hg37_en_c/Whole_Blood__predict.h5 \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg37_en_c/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Example 2: bgen hg19-based genotype, on GTEx v8 Elastic Net Model

Then next example shows how to run with a bgen. The data in the example bgen is actually Thousand Genomes from hg37.

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--bgen_genotypes $DATA/1000G_hg37_bgen/chr22.bgen \
--bgen_use_rsid \
--prediction_output $RESULTS/bgen_1000G_hg37_en/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/bgen_1000G_hg37_en/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/bgen_1000G_hg37_en/Whole_Blood__predict.txt \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/bgen_1000G_hg37_en/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Example 3: VCF hg38-based genotype without rsids, on GTEx v8 Elastic Net Model

An interesting example consists of a genotype without variant ids whatsoever. However, using the genotype variants' coordinates, we can match to model variants.

#printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/1000G_hg38/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--vcf_mode genotyped \
--variant_mapping $DATA/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz id rsid \
--on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" \
--prediction_output $RESULTS/vcf_1000G_hg38_en/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/vcf_1000G_hg38_en/Whole_Blood__summary.txt \
--verbosity 9 \
--throw


printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/vcf_1000G_hg38_en/Whole_Blood__predict.txt \
--input_phenos_file $DATA/1000G_hg38/random_pheno_1000G_hg38.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg38_en/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Example 4: VCF hg19-based genotype, on GTEx v8 MASHR Model

The following example is trickier. It assumes a genotype based on hg19, and we want to apply GTEx v8 mashr models. This poses a problem because the MASHR models include many variants that lack an rsid, so a mapping rule must be used. Also, the MASHR models are defined on hg38 so that the genotype's coordinates do not match the models'. We therefore specify a lift over conversion and pattern to match from converted coordinates to model variants:

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_mashr/mashr_Whole_Blood.db \
--model_db_snp_key varID \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--liftover $DATA/hg19ToHg38.over.chain.gz \
--on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" \
--prediction_output $RESULTS/vcf_1000G_hg37_mashr/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/vcf_1000G_hg37_mashr/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/vcf_1000G_hg37_mashr/Whole_Blood__predict.txt \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg37_mashr/Whole_Blood__association.txt \
--verbosity 9 \
--throw