Command Line Reference

Command Line Tools

This repository contains several scripts for running MetaXcan's computations, and building support data. Most scripts are meant to be run from the command line, and expect command line arguments to control them.

MetaXcan supports a wide variety of input data. It can work with individual-level data (PrediXcan.py) or with GWAS summary statistics (SPrediXcan.py, through a precomputed compilation of reference LD).

If you are using precomputed GWAS statistics from Im's lab (such as the ones from the tutorial), you will only need to use SPrediXcan.py or the underlying implementations M03_betas.py and M04_zscores.py. If you have access to individual-level dosage data in the supported formats, you can build support data or LD references more appropriate for your application.

script	functionality
PrediXcan.py	The original, individual-level data method to compute gene-trait associations.
M00_prerequisites.py	Filter down individual dosage data, and convert to supported 'PrediXcan' format. Deprecated.
M01_covariances_correlations.py	Build SNP (related by transcriptome model) data covariance matrices. Correlation matrices optional. Deprecated.
M02_variances.py	Build SNP dosage variance. Optional, and unused by current MetaXcan pipeline. Deprecated.
M03_betas.py	Take GWAS summary statistics in different formats, filter them to match precomputed statistics, and output GWAS' betas in a specific format supported by MetaXcan
M04_zscores.py	Compute MetaXcan from betas (and support statistics)
SPrediXcan.py	Convenience wrapper for M03_betas.py and M04_zscores.py
MetaMany.py	Convenience wrapper for M03_betas.py and M04_zscores.py, that runs MetaXcan on a GWAS using different models and covariances sequentially.
PrediXcan.py	Minimalistic implementation of PrediXcan that takes predicted gene expression as input.
MulTiXcan.py	Multi-Tissue PrediXcan, takes multiple gene expression files as input.
SMulTiXcan.py	Summary-stats based Multi-Tissue PrediXcan.

There are some other undocumented scripts, but their implementation is too volatile and they should be considered private. Feel free to use them anyway if they turn out to be helpful.

Also, you might notice that most scripts support undocumented options. Same caveats apply.

PrediXcan.py

Explained separately in this page.

PrediXcan.py is a thin wrapper over Predict.py and PrediXcanAssociation.py. The output of Predict.py can be used for running MulTiXcan.py.

M00_prerequisites.py

This is no longer maintained and remains only for future reference.

This tool filters dosage data according to population and SNP criteria, so that later steps run on less data and are therefore faster. It can read from and write to files in two formats:

IMPUTE2
PrediXcan format, see below

PrediXcan format is preferred by most tools in this repository, although some support IMPUTE2.

Supported arguments are:

Argument	Description	Example Value [default]
`--verbosity`	Controls logging verbosity. The lower the value, the more logging you will get. In general, `5` logs at the level of individual SNP data, and `10` usually logs only critical/status information. So, this is more of a "log parsimony" setting.	`5`
`--dosage_folder`	Folder where sample data is to be read from.	`data/dosage`
`--snp_list`	GZipped text file containing whitelisted SNP's rsids, one per row. Every SNP not in this file will be excluded.	`data/whitelist.gz`
`--output_folder`	Folder where filtered data will be output	`samples`
`--file_pattern`	Optional. Regular expression to identify dosage files in input folder, separating chromosome number in file name. Only used when (IMPUTE input format, PrediXcan output format).	`1000GP_Phase3_(.*)`
`--population_group_filters`	Whitelisted values for group column value in sample file.	`EUR EAS`
`--individual_filters`	Optional. Regular expression for filtering individual's IDs	`HG* NA2*`
`--input_format`	Format that describes input data. Possible values are IMPUTE or PrediXcan.	`IMPUTE` [`PrediXcan`]
`--output_format`	Format that will be used for data output. Possible values are IMPUTE or PrediXcan.	`PrediXcan` [`IMPUTE`]

This script takes extensive genotype data sets, such as the Thousand Genomes haplotypes, allows for some filtering based on population or ID, and selects those entries present in HapMap2 biallelic SNP's.

It supports input and output in a custom format that we internally call "PrediXcan Format". It is composed of gzip-compressed text files, without headers, where each row contains:

chromosome SNP_id SNP_position reference_allele effect_allele allele_frequency ...
# "..." stands for allele dosage data, one value per sample individual, value in [0, 2]

It is assumed that a "samples file" is provided, describing each sample individual's metadata, which is a plain text file that looks like:

ID POP GROUP SEX
HG00096 GBR EUR male
HG00097 GBR EUR female
HG00099 GBR EUR female
HG00100 GBR EUR female
HG00101 GBR EUR male
HG00102 GBR EUR female
...
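As a rough illustration of the population filter, the sketch below selects the samples whose GROUP value is whitelisted, in the spirit of `--population_group_filters`. It is not the script's actual implementation: `select_samples` is a hypothetical helper, the samples file is assumed to be whitespace-separated with the header shown above, and the last two rows of the example data are made up.

```python
import io

def select_samples(lines, groups):
    """Return (row_index, sample_id) pairs whose GROUP value is in `groups`."""
    # Split every non-blank line into whitespace-separated fields.
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    id_col = header.index("ID")
    group_col = header.index("GROUP")
    return [
        (index, fields[id_col])
        for index, fields in enumerate(rows)
        if fields[group_col] in groups
    ]

# First two rows come from the example above; the last two are hypothetical.
samples_text = """ID POP GROUP SEX
HG00096 GBR EUR male
HG00097 GBR EUR female
S0003 CHB EAS male
S0004 YRI AFR female
"""

kept = select_samples(io.StringIO(samples_text), {"EUR", "EAS"})
print(kept)
```

The row index is returned alongside the ID because the dosage columns follow the same order as the samples file, so the index identifies which dosage column belongs to each kept individual.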

This script is hardly ever necessary. You might use it occasionally to reduce file sizes used at the covariance script step (M01_covariances_correlations.py).

M01_covariances_correlations.py

This is no longer maintained and remains only for future reference.

This script takes dosages in PrediXcan (see below) or IMPUTE2 formats, and figures out the covariance and/or correlation matrices between snp data, grouped according to transcriptome model. Output is a gzipped text file with each line holding a pairwise covariance/correlation value between two SNP's; they are labelled by gene to group them according to the transcriptome model.

Some of this script's interface should be treated as 'private' because of implementation volatility. The following table shows the public interface:

Argument	Description	Example Value [default]
`--verbosity`	Controls logging verbosity. The lower the value, the more logging you will get. In general, `5` logs at the level of individual SNP data, and `10` usually logs only critical/status information. So, this is more of a "log parsimony" setting.	`5`
`--weight_db`	Path to the sqlite file holding transcriptome model information	`data/DGN-WB_0.5.db`
`--input_folder`	Path to folder containing SNP dosage data	`temp/filtered`
`--correlation_output`	Path where the correlation matrix will be output. If not provided, correlations will not be output.	`temp/cor/my_cor.txt.gz` [None]
`--covariance_output`	Path where the covariance matrix will be written. Defaults to a name based on the transcriptome model name.	`temp/cov/my_cov.txt.gz`
`--input_format`	Format of input dosage data. Possible values are IMPUTE or PrediXcan.	`IMPUTE` [`PrediXcan`]
`--min_maf_filter`	Optional. Ignore SNP's with frequencies below this value	`0.05` [None]
`--max_maf_filter`	Optional. Ignore SNP's with frequencies above this value	`0.95` [None]

This script builds the covariance matrices needed by M04_zscores.py. You will run it once in a while, if ever. It takes as input the output of M00_prerequisites.py and a gene expression model database, such as the one in the example data.

It will build the correlation matrix between SNP's in the same gene's model, for each gene, and save them in a gzip-compressed text file.

M02_variances.py

Nothing to see here, move along.

Just kidding. This script was deprecated. It calculates SNP variances grouped by transcriptome model; so, it is sort of related to M01_covariances_correlations.py. Options are where to load dosage data from, which transcriptome model to use, and where to output the variances.

M03_betas.py

This script exists as an adapter to different GWAS file formats. It can handle text files (plain text or gzip compressed) with tables of SNP GWAS data. Most of the options are concerned with identifying which column holds which data.

Once columns are specified, it will read the files and:

Exclude any non-SNP data
Exclude any SNP's not in the transcriptome
Flip the beta or odds ratio value if the reference allele is the converse of what's identified by the transcriptome model
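The allele check in the last step can be sketched as follows. This is only an illustration under stated assumptions, not the code in M03_betas.py: `align_effect` is a hypothetical helper, a flipped beta is assumed to change sign, a flipped odds ratio is assumed to be inverted, and SNP's whose alleles match neither orientation are assumed to be dropped.

```python
def align_effect(value, gwas_effect, gwas_other, model_effect, model_other,
                 is_odds_ratio=False):
    """Express a GWAS effect relative to the model's effect allele.

    Returns the value unchanged when the alleles agree, the flipped value
    when effect and other alleles are swapped, and None when the alleles
    cannot be matched.
    """
    gwas = (gwas_effect.upper(), gwas_other.upper())
    model = (model_effect.upper(), model_other.upper())
    if gwas == model:
        return value
    if gwas == model[::-1]:
        # Swapped alleles: negate a beta, invert an odds ratio.
        return 1.0 / value if is_odds_ratio else -value
    return None

print(align_effect(0.12, "A", "G", "A", "G"))                      # same orientation
print(align_effect(0.12, "G", "A", "A", "G"))                      # swapped alleles
print(align_effect(2.0, "G", "A", "A", "G", is_odds_ratio=True))   # swapped, odds ratio
print(align_effect(0.12, "A", "C", "A", "G"))                      # alleles do not match
```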

It will try to use the following, in this order, depending on what is available from the command line arguments and the input GWAS file:

a z-score column;
a p-value column and either an effect, odds ratio or direction column;
effect size (or odds ratio) and standard error columns.
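The fallback order above can be sketched in a few lines. This is an illustration only, assuming that the goal is a per-SNP z-score, that p-values are two-sided, that an odds ratio is converted to an effect size with a natural logarithm, and that the direction column holds `+` or `-`. `derive_zscore` and its parameter names are hypothetical and do not reflect the script's internals.

```python
import math
from statistics import NormalDist

def derive_zscore(z=None, pvalue=None, beta=None, odds_ratio=None,
                  direction=None, se=None):
    """Return a z-score from whichever inputs are present, or None."""
    # 1. A z-score column wins if present.
    if z is not None:
        return z

    # An odds ratio is treated as an effect size on the log scale.
    effect = beta
    if effect is None and odds_ratio is not None:
        effect = math.log(odds_ratio)

    # 2. A p-value gives the magnitude; the effect or direction gives the sign.
    if pvalue is not None:
        if effect is not None:
            sign = math.copysign(1.0, effect)
        elif direction in ("+", "-"):
            sign = 1.0 if direction == "+" else -1.0
        else:
            sign = None
        if sign is not None:
            # Two-sided p-value; requires 0 < pvalue <= 1.
            return sign * abs(NormalDist().inv_cdf(pvalue / 2.0))

    # 3. Effect size over its standard error.
    if effect is not None and se is not None:
        return effect / se
    return None

print(derive_zscore(z=1.5, pvalue=0.9))        # z-score column takes precedence
print(derive_zscore(pvalue=0.05, beta=-0.3))   # magnitude from p, sign from beta
print(derive_zscore(beta=0.2, se=0.1))         # effect size / standard error
```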

The outputs are gzipped text files with the fixed format read by MetaXcan. See section below.

Note: z-score, p-value, effect size, effect size standard error are assumed to have no missing values (technically, every variant in the models should either have a value in the GWAS, or be absent).

Note: effect size and effect size standard error columns from the GWAS are to be preferred over p-value and z-score, when available.

SNP, Effect Allele, Other/Reference Allele are considered mandatory.

Argument	Description	Example Value [default]
`--verbosity`	Controls logging verbosity. The lower the value, the more logging you will get. In general, `5` logs at the level of individual SNP data, and `10` usually logs only critical/status information. So, this is more of a "log parsimony" setting.	`5`
`--model_db_path`	Path to the sqlite file holding transcriptome model information	`data/DGN-WB_0.5.db`
`--output_folder`	Path where the processed files will be output.	`temp/beta_f`
`--gwas_folder`	Path to the folder containing GWAS data	`data/GWAS`
`--gwas_file_pattern`	Optional. Regular expression used to filter file names in the GWAS folder (for example, if log files are present).	`".*gz"`
`--gwas_file`	Alternative to gwas folder and file pattern, to specify path to a single GWAS file.	`data/GWAS/mygwas.txt.gz`
`--separator`	Optional. Specify which character separates column data. If any whitespace is the actual separator, don't specify it.	`,` [Any whitespace]
`--snp_column`	Name of column holding SNP data.	`SNP_ID` [`SNP`]
`--non_effect_allele_column`	Name of column holding "other/non effect" allele data.	`REFERENCE` [`A2`]
`--effect_allele_column`	Name of column holding effect allele data.	`EFFECT` [`A1`]
`--or_column`	Name of column holding odds ratio data.	`OR`
`--beta_column`	Name of column holding beta (effect size) data.	`BETA`
`--beta_sign_column`	Name of column holding sign of beta.	`direction`
`--zscore_column`	Name of column holding zscore of beta.	`Z`
`--pvalue_column`	Name of column holding p-values data.	`P`
`--throw`	Option to throw exception on error, i.e. output a full stack trace in case of error.

M04_zscores.py

This script holds the actual MetaXcan computation. It expects as input the files built by M03_betas.py (see section below on formats).

Argument	Description	Example Value [default]
`--verbosity`	Controls logging verbosity. The lower the value, the more logging you will get. In general, `5` logs at the level of individual SNP data, and `10` usually logs only critical/status information. So, this is more of a "log parsimony" setting.	`5`
`--model_db_path`	Path to the sqlite file holding transcriptome model information	`data/DGN-WB_0.5.db`
`--covariance`	Path to file containing the covariance matrices for the SNP dosage and transcriptome. Expects output as built from M01_covariances_correlations.py	`temp/cov_eur/dgn.cov.txt.gz`
`--beta_folder`	Path to folder containing betas. Expects files as built by M03_betas.py	`temp/beta_f`
`--output_file`	Path where the MetaXcan results will be output.	`results/zscores.csv`
`--overwrite`	Overwrite the resulting file if it already exists.
`--throw`	Option to throw exception on error, i.e. output a full stack trace in case of error.
`--remove_ens_version`	Option to remove the Ensembl ID's version suffix from genes.

This script expects that the covariance file has information matching the genes and snps in the tissue transcriptome model. That is, if a tissue transcriptome model contains certain snps grouped under a gene, the covariance file should have a covariance matrix made of those same snps for that gene. Both parameters are independent, since you may have multiple covariances matching a transcriptome model, but calculated on different populations.
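A quick consistency check between a model and a covariance file might look like the sketch below. It is only an illustration: the model is represented here as a plain dictionary from gene to SNP ids (a hypothetical stand-in for what would be read from the sqlite file), covariance rows are tuples in the `GENE RSID1 RSID2 VALUE` layout, and `missing_pairs` is not part of MetaXcan. The SNP `rs0000001` is made up to show a failing case.

```python
from itertools import combinations_with_replacement

def missing_pairs(model_snps, covariance_rows):
    """List (gene, rsid1, rsid2) pairs the model needs but the covariance lacks."""
    available = set()
    for gene, rsid1, rsid2, _value in covariance_rows:
        # A frozenset makes the pair order-independent.
        available.add((gene, frozenset((rsid1, rsid2))))
    missing = []
    for gene, snps in model_snps.items():
        for a, b in combinations_with_replacement(sorted(snps), 2):
            if (gene, frozenset((a, b))) not in available:
                missing.append((gene, a, b))
    return missing

covariance_rows = [
    ("ENSG00000239789.1", "rs12718973", "rs12718973", 0.156645782674),
    ("ENSG00000239789.1", "rs12718973", "rs13232099", 0.156645782674),
    ("ENSG00000239789.1", "rs13232099", "rs13232099", 0.156645782674),
    ("ENSG00000183742.8", "rs3094989", "rs3094989", 0.22),
]

# Hypothetical model contents; rs0000001 is invented to trigger a mismatch.
model_snps = {
    "ENSG00000239789.1": ["rs12718973", "rs13232099"],
    "ENSG00000183742.8": ["rs3094989", "rs0000001"],
}

print(missing_pairs(model_snps, covariance_rows))
```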

SPrediXcan.py

This script is a convenience wrapper around M03_betas.py and M04_zscores.py. It takes GWAS input data and outputs MetaXcan's association results. It has the same user interface as those scripts.

Note: z-score, p-value, effect size, effect size standard error are assumed to have no missing values (technically, every variant in the models should either have a value in the GWAS, or be absent).

Note: effect size and effect size standard error columns from the GWAS are to be preferred over p-value and z-score, when available.

MetaMany.py

This is another wrapper around M03_betas.py and M04_zscores.py. It will serially perform multiple MetaXcan runs on a GWAS study using multiple tissues in a single command.

Differences from standard MetaXcan parameter set

MetaMany was written to simplify the execution of multiple MetaXcan runs where the user is interested in looking at multiple tissues. To make this as convenient as possible, MetaMany's argument set differs slightly from that of the regular program:

--model_db_path doesn't exist in MetaMany

Unlike MetaXcan proper, the user points to one or more weight databases directly on the command line, with no argument prefix. The program will iterate over each of these in serial fashion and perform the same analysis that would have been performed had the user run each one individually using MetaXcan. The matching covariance file will be inferred from the file name, and you can customize the relationship between covariance file name and model file name.

--covariance is now --covariance_directory in MetaMany

Rather than specifying a single covariance file, users must specify a single directory and MetaMany will look for a matching covariance file inside that directory. To find a matching covariance file, MetaMany strips the tissue database filename of the ".db" extension and replaces it with ".cov.txt.gz". If such a file is not found, the program will halt and an error will be reported.

--output_file is now --output_directory in MetaMany

Results are written to the directory specified by --output_directory, under a file name similar to that of the tissue database, where the ".db" extension is replaced by ".csv".
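The default file name mapping described in this section can be sketched as follows. `covariance_path` and `output_path` are hypothetical helpers written for illustration, not functions from MetaMany.py, and they cover only the default ".db" to ".cov.txt.gz" and ".db" to ".csv" conventions.

```python
import os

def _stem(db_path):
    """Return the database file name without its '.db' extension."""
    name = os.path.basename(db_path)
    if not name.endswith(".db"):
        raise ValueError("expected a '.db' file, got: " + name)
    return name[:-len(".db")]

def covariance_path(db_path, covariance_directory):
    """Covariance file expected for a tissue database."""
    return os.path.join(covariance_directory, _stem(db_path) + ".cov.txt.gz")

def output_path(db_path, output_directory):
    """Results file written for a tissue database."""
    return os.path.join(output_directory, _stem(db_path) + ".csv")

db = "GTEx-2016-02-29/CrossTissue_elasticNet0_0.5.db"
print(covariance_path(db, "GTEx-2016-02-29/covariances/0.5/"))
print(output_path(db, "results"))
```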

MetaMany Important Note

This script is based on SPrediXcan.py and runs the analysis over multiple tissues in serial fashion. For it to work, there is a major assumption about the file arrangement of the tissue databases and covariances:

Databases and covariances must be named identically except for extensions (as can be seen in the current version of the GTEX tissue databases). The script allows for separate directories for each of the two types of data, but they must be named identically up to a certain point. For instance, CrossTissue_elasticNet0_0.5.db has a corresponding covariance file named CrossTissue_elasticNet0_0.5.cov.txt.gz.

Example MetaMany Statement

The following command line would perform a typical MetaXcan analysis on the GWAS data in GWAS_Results, for each of the tissue databases starting with TW_Brain_ inside the GTEx-2016-02-29 directory. It will look inside the directory GTEx-2016-02-29/covariances/0.5/ for the appropriate covariance file for each database (by default, covariances are expected in the same folder as the databases, in which case the covariance argument is not needed). Results would be written to the directory results, with file names similar to those of the corresponding databases.

$ MetaMany.py  \
    --gwas_folder GWAS_Results/ \
    --beta_column beta \
    --pvalue_column p \
    --se_column se \
    --frequency_column maf \
    --snp_column markername \
    --effect_allele_column effect_allele \
    --non_effect_allele_column other_allele \
    --covariance_directory GTEx-2016-02-29/covariances/0.5/ \
    --output_directory results \
    GTEx-2016-02-29/TW_Brain_*.db

MulTiXcan.py

This script computes a gene-level association from predicted gene expression to a human trait, using multiple studies for each gene jointly. It supports adjusting for covariates. It takes as input predicted expression files as generated by Predict.py (see PrediXcan manual).

Argument	Description	Example Value [default]
--hdf5_expression_folder	Folder with predicted gene expressions (files in HDF5 format). Format of the files explained below.	`data/ukb_expression`
--expression_folder	Folder with predicted gene expressions (plain text file format). Format of the files explained below.	`data/tgf_expression`
--memory_efficient	If using plain text expression files, be memory efficient about it. Will be slower. Will read the expression text file in batches, so that several passes at the file will be performed.
--expression_pattern	Patterns to select expression files in the folder. Format of the files explained below.	`pred_TW_Brain_(.*)_0.5_hrc_hapmap.h5`
--input_phenos_file	Path to text file (or gzip-compressed) where one column will be used as phenotype. Format explained below.	`phenotype/my_trait.txt.gz`
--covariates	Specify which covariates in the file should be used	`PC PC2 Platform EatsPizzaOrNot`
--covariates_file	Path to text file (or gzip-compressed) with covariate data. If provided, will force OLS regression. Format explained below.
--input_phenos_column	Specify name of column from input file to be used as phenotype
--output	Specify where output will be saved
--verbosity	Log verbosity level. `1` logs everything, `10` logs only high-level messages, and values above `10` will hardly log anything. So, this is more of a "log parsimony" setting.	`10`
--throw	Option to throw exception on error, i.e. output a full stack trace in case of error.
--mode	Type of regression. Can be: `linear` or `logistic`	`linear`
--pc_condition_number	Principal components condition number	`30`
--pc_eigen_ratio	Principal components filter, cutoff at proportion to max eigenvalue (alternative to condition number)
--coefficient_output	Path to file where the tissues' coefficients (for each gene's regression) will be stored (optional)
--loadings_output	Path to file where the loadings (for each gene's regression) will be stored (optional)

Example:

./MulTiXcan.py \
--hdf5_expression_folder data/gene_expression/ \
--expression_pattern "pred_TW_(.*)_0.5_hrc_hapmap.h5" \
--input_phenos_file data/variables.txt.gz \
--covariates_file data/variables.txt.gz \
--covariates S1 S2 \
--input_phenos_column height \
--output results/mt_predixcan_c_covariates_cn_c.txt \
--mode linear \
--verbosity 6 \
--pc_condition_number 30 \
--throw

The results look like:

gene: a gene's id, as listed in the Tissue Transcriptome model. Ensembl ID for most gene model releases. Can also be an intron's id for splicing model releases.
pvalue: significance p-value of MultiXcan association
n_models: number of models (tissues) available for this gene
n_samples: number of individuals available to this gene-phenotype combination (i.e. inner join of phenotype and predictions)
p_i_best: best p-value of single-tissue S-PrediXcan association.
m_i_best: name of best single-tissue S-PrediXcan association.
p_i_worst: worst p-value of single-tissue S-PrediXcan association.
m_i_worst: name of worst single-tissue S-PrediXcan association.
status: If there was any error in the computation, it is stated here
n_used: number of independent components of variation kept among the tissues' predictions. (Synthetic independent tissues)
max_eigen: In the PCA decomposition of predicted expression, the maximum eigenvalue.
min_eigen: In the PCA decomposition of predicted expression, the minimum eigenvalue.
min_eigen_kept: In the PCA decomposition of predicted expression, the minimum eigenvalue kept (i.e. surviving SVD)

If you specify --loadings_output, you'll get a file specifying the loadings of the PC decomposition of predicted expressions for each gene:

gene: Ensembl ID (or intron id) being analyzed
pc: identifier of principal component
tissue: tissue being analyzed
weight: coefficient of loading from tissues to PC

If you specify --coefficient_output, you get a file with effect sizes for the tissues involved in each gene:

param: effect size of the PCA-regularized regression. (i.e. effect sizes of the PC components, converted to tissue-space)
variable: tissue being analyzed
gene: Ensembl ID (or intron id)

SMulTiXcan.py

This script takes S-PrediXcan's results and estimates MulTiXcan association. It also needs:

LD from a reference panel
either of the following options:
- a list of "cleared snps" to be used as the intersection of snps between the models and the GWAS,
- the original GWAS from which S-PrediXcan was computed, and its parsing parameters, and prediction models, so that the GWAS/Model intersection can be computed

Argument	Description	Example Value [default]
--models_folder	Path to folder with prediction models.	`data/gtex_v6p`
--models_name_filter	List of regular expressions to filter input models	`".Lung.db" ".Whole_Blood.db"`
--models_name_pattern	Regular expression to detect tissue name from model file names.	`TW_(.*)_0.5.db`
--gwas_folder	Name of folder containing GWAS data. All files in the folder are assumed to belong to a single study.	`data/my_gwas`
--gwas_file_pattern	Pattern to recognize GWAS files in folders (in case there are extra files and you don't want them selected).	`chr.*.assoc.txt.gz`
--gwas_file	Path to GWAS file; alternative to gwas_folder and gwas_file_pattern	`data/my_gwas.txt.gz`
. . .	the same GWAS parsing parameters as in `SPrediXcan.py`	`--zscore_column z --effect_allele_column EA --non_effect_allele_column NEA`
--cleared_snps	SNPS to analyze. This is an alternative to providing the GWAS and parsing	`data/hapmap_ceu.txt.gz`
--regularization	Add a regularization term to the matrix diagonal, to correct for expression covariance matrix singularity.	`0.01`
--cutoff_condition_number	Condition number of eigenvalues to use when truncating SVD components.	`30`
--cutoff_eigen_ratio	Ratio of eigenvalues to the max eigenvalue, as threshold to use when truncating SVD components.	`0.001`
--cutoff_threshold	Threshold of variance eigenvalues when truncating SVD	`0.4`
--cutoff_trace_ratio	Ratio of eigenvalues to trace, to use when truncating SVD	`0.01`
--metaxcan_folder	Path to folder with S-PrediXcan files	`data/metaxcan_results`
--metaxcan_filter	Regular expression to filter results files	`.*csv`
--metaxcan_file_name_parse_pattern	Optional regular expression to get phenotype name and model name from MetaXcan result files. Assumes that a first group will be matched to phenotype name, and the second to model name.	`spredixcan_(.*)_TW_(.*)_0.5.csv`
--snp_covariance	Path to LD reference/snp covariance. Same format as S-PrediXcan covariances.	`data/gtex_v6p_snp_covariance.txt.gz`
--trimmed_ensemble_id	Use Ensembl ids without version. Necessary if your S-PrediXcan results' gene ids lack the version.
--output	Path where output will be saved	`results/smultixcan.txt`
--verbosity	Log verbosity level. `1` logs everything, `10` logs only high-level messages, and values above `10` will hardly log anything. So, this is more of a "log parsimony" setting.	`10`
--throw	Option to throw exception on error, i.e. output a full stack trace in case of error.

Example:

./SMulTiXcan.py \
--models_folder data/models_v6p \
--models_name_pattern "gtex_v6p_(.*)_signif.db" \
--snp_covariance data/gtex_v6p_snp_covariance.txt.gz \
--metaxcan_folder results/sp_v6p \
--metaxcan_filter "spredixcan_gtexv6pqdir_ADIPOGen_Adiponectin__PM__(.*).csv" \
--metaxcan_file_name_parse_pattern "spredixcan_gtexv6pqdir_(.*)__PM__(.*).csv" \
--gwas_file data/SummaryResults/Production/ADIPOGen/Adipogen.txt \
--snp_column marker --non_effect_allele_column other_allele --effect_allele_column reference_allele --beta_column beta --pvalue_column pvalue --se_column se \
--cutoff_condition_number 30 \
--verbosity 7 \
--throw \
--output results/smt_v8qdir/smultixcan_gtexv8qdir_ADIPOGen_Adiponectin_ccn30.csv

The results look like:

gene    gene_name       pvalue  n       n_indep p_i_best        t_i_best        p_i_worst       t_i_worst       eigen_max       eigen_min       eigen_min_kept  z_min   z_max   z_mean  z_sd    tmi     status
ENSG00000175793 SFN     0.0450499793962 5       5       0.0156585962572 Brain_Caudate_basal_ganglia     0.898228858542  Skin_Not_Sun_Exposed_Suprapubic 1.6646390465    0.501908734482  0.501908734482  -2.4167772492   1.04549349016   -0.502751770349 1.24761302688   5.0     0
ENSG00000060642 PIGV    0.0590186246954 19      1       0.0295433994674 Lung    0.765611065903  Esophagus_Mucosa        14.0901681665   0.00296245803497        14.0901681665   0.298120670239  2.17615865235   1.61343826535   0.522964426062  1.0     0
...

where:

gene: a gene's id, as listed in the Tissue Transcriptome model. Ensembl ID for most gene model releases. Can also be an intron's id for splicing model releases.
gene_name: gene name as listed by the Transcriptome Model, typically HUGO for a gene. It can also be an intron's id.
pvalue: significance p-value of S-MultiXcan association
n: number of "tissues" available for this gene
n_indep: number of independent components of variation kept among the tissues' predictions. (Synthetic independent tissues)
p_i_best: best p-value of single-tissue S-PrediXcan association.
t_i_best: name of best single-tissue S-PrediXcan association.
p_i_worst: worst p-value of single-tissue S-PrediXcan association.
t_i_worst: name of worst single-tissue S-PrediXcan association.
eigen_max: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the top independent component
eigen_min: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the last independent component
eigen_min_kept: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the smallest independent component that was kept.
z_min: minimum z-score among single-tissue S-PrediXcan associations.
z_max: maximum z-score among single-tissue S-PrediXcan associations.
z_mean: mean z-score among single-tissue S-PrediXcan associations.
z_sd: standard deviation of the z-scores among single-tissue S-PrediXcan associations.
tmi: trace of T * T', where T is the correlation of predicted expression levels for different tissues multiplied by its SVD pseudo-inverse. It is an estimate of the number of independent components of variation in predicted expression across tissues (typically close to n_indep).
status: If there was any error in the computation, it is stated here
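To make the relationship between eigen_max, eigen_min, eigen_min_kept and n_indep concrete, here is a minimal sketch of truncation by condition number. It assumes one plausible reading of `--cutoff_condition_number`: a component is kept when its eigenvalue is at least the largest eigenvalue divided by the condition number. The actual rule in SMulTiXcan.py may differ; `truncate_eigenvalues` is a hypothetical helper and the eigenvalues are made up.

```python
def truncate_eigenvalues(eigenvalues, condition_number=30.0):
    """Keep eigenvalues no smaller than max(eigenvalues) / condition_number."""
    eigen_max = max(eigenvalues)
    threshold = eigen_max / condition_number
    return [value for value in eigenvalues if value >= threshold]

# Hypothetical eigenvalues of a predicted-expression correlation matrix.
eigenvalues = [14.09, 0.30, 0.05, 0.003]
kept = truncate_eigenvalues(eigenvalues, condition_number=30.0)

summary = {
    "eigen_max": max(eigenvalues),
    "eigen_min": min(eigenvalues),
    "eigen_min_kept": min(kept),
    "n_indep": len(kept),
}
print(summary)
```

With these numbers the threshold is roughly 0.47, so only the top component survives and eigen_min_kept equals eigen_max, which is the same pattern as the PIGV row in the example output.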

Supported File Formats

IMPUTE2

IMPUTE2 is supported by scripts that require allele dosage input data, such as M00_prerequisites.py and M01_covariances_correlations.py. This format specifies two files per chromosome: one for the marker information (such as its type) and another for the population dosage.

PrediXcan dosage format

This is a format for describing a population's allele dosages. Used by M00_prerequisites.py. It holds information in gzip-compressed text files without a header, and each line holds the following information:

#chr number snp_id   #position #ref_allele #eff_allele #allele average #individual dosage data
chr1        rs940550 88169     C           T           0.157057654076  [ ... ]

Seventh column and onward hold individual's allele dosage. These files should be accompanied by a samples text file describing individual's information such as:

ID  POP GROUP SEX
123 GBR EUR   male
...
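A minimal parser for one row of the dosage format is sketched below, assuming whitespace-separated columns in the order shown above. `parse_dosage_line` is a hypothetical helper, and the three dosage values in the example are made up, since the sample row above elides them.

```python
def parse_dosage_line(line, n_samples=None):
    """Parse one PrediXcan-format dosage row into a dictionary."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least 6 columns before the dosages")
    # Seventh column onward: one dosage per individual in the samples file.
    dosages = [float(value) for value in fields[6:]]
    if n_samples is not None and len(dosages) != n_samples:
        raise ValueError("dosage count does not match the samples file")
    if any(d < 0.0 or d > 2.0 for d in dosages):
        raise ValueError("dosages are expected to lie in [0, 2]")
    return {
        "chromosome": fields[0],
        "snp_id": fields[1],
        "position": int(fields[2]),
        "ref_allele": fields[3],
        "eff_allele": fields[4],
        "frequency": float(fields[5]),
        "dosages": dosages,
    }

# The trailing "0 1 0.5" dosages are hypothetical values for three individuals.
line = "chr1 rs940550 88169 C T 0.157057654076 0 1 0.5"
record = parse_dosage_line(line, n_samples=3)
print(record["snp_id"], record["position"], record["dosages"])
```

Passing the number of rows in the samples file as `n_samples` catches dosage files and samples files that have drifted out of sync.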

Individual-Level HDF5 Gene Expression File

Used for PrediXcan.py and MulTiXcan.py.

The file is an HDF5 file with three data sets:

pred_expr: a list where each entry is a list of predicted expression values for all individuals
genes: a list with the gene names, in the same order as they appear in pred_expr
samples: a list with the individuals' ids, in the same order as the values within each entry of pred_expr

For example, in IPython they would look like:

In [17]: k["genes"][0:4]
Out[17]: 
array(['ENSG00000000457.9', 'ENSG00000000971.11', 'ENSG00000001036.9',
       'ENSG00000001167.10'], dtype='|S30')

In [18]: k["pred_expr"][0]
Out[18]: 
array([-0.01055908, -0.01055908, -0.19805908, ..., -0.13175908,
       -0.39325908, -0.01055908], dtype=float32)

In [19]: k["samples"][0:4]
Out[19]: array(['2476612', '5595764', '5172041', '3487211'], dtype='|S25')

Individual-Level Gene Expression Text File

Used for PrediXcan.py and MulTiXcan.py.

This is a tab-separated text file (optionally gzip-compressed) where each column stands for predicted gene expression of a given gene, and each row is an individual. The files look like:

ENSG00000000457.9 ENSG00000000460.12 ENSG00000001036.9 ENSG00000001084.6 ...
0.111434276       -0.369352366       0.11707494573     0.0013880712      ...
-0.437815712      -0.1411258693      -0.036093907      -0.193494845      ...
...

There are tools in the official PrediXcan repository for generating these files.
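For quick inspection, a text expression file can be read into one list of values per gene with the standard library alone. This is a sketch under the assumptions stated in this section (tab-separated, header row of gene ids, one row per individual); `read_expression` is a hypothetical helper, and a gzip-compressed file could be opened with `gzip.open(path, "rt")` before passing the handle in.

```python
import csv
import io

def read_expression(handle):
    """Return {gene_id: [predicted expression per individual]}."""
    reader = csv.reader(handle, delimiter="\t")
    genes = next(reader)
    columns = {gene: [] for gene in genes}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        for gene, value in zip(genes, row):
            columns[gene].append(float(value))
    return columns

# Two genes and two individuals, taken from the example above.
text = (
    "ENSG00000000457.9\tENSG00000000460.12\n"
    "0.111434276\t-0.369352366\n"
    "-0.437815712\t-0.1411258693\n"
)
expression = read_expression(io.StringIO(text))
print(expression["ENSG00000000457.9"])
```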

Individual level covariates and phenotype files

Used for PrediXcan.py and MulTiXcan.py. Both covariates and phenotypes files have the same format, and you can actually pass the same file with both types of variables.

These are text files where each column stands for a feature such as a trait, principal component, etc; and each row is an individual. -999 and NA are supported as encoding missing values.

a                   b               c
NA                  0.318533581926  1
-0.9077357424220001 -2.90862133768  1
0.219812718984      0.237473143796  -999
-0.292915728007     -0.231308887004 -999
-0.306529717231     0.435623400034  0
...
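The missing-value convention can be handled as in the sketch below, which maps both `NA` and `-999` to NaN while reading. It assumes whitespace-separated columns with a header row, as in the example above; `read_variables` and `parse_value` are hypothetical helpers, not MetaXcan functions.

```python
import io
import math

def parse_value(raw):
    """Convert one cell to float, mapping NA and -999 to NaN."""
    if raw == "NA":
        return math.nan
    value = float(raw)
    return math.nan if value == -999 else value

def read_variables(handle):
    """Return {column_name: [values per individual]}."""
    rows = (line.split() for line in handle if line.strip())
    header = next(rows)
    table = {name: [] for name in header}
    for fields in rows:
        for name, raw in zip(header, fields):
            table[name].append(parse_value(raw))
    return table

text = """a b c
NA 0.318533581926 1
-0.9077357424220001 -2.90862133768 1
0.219812718984 0.237473143796 -999
"""
table = read_variables(io.StringIO(text))
print(table["c"])
```

Individuals with NaN in the phenotype or in any selected covariate would then need to be excluded before fitting a regression.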

MetaXcan's input files

M04_zscores.py takes gzip-compressed flat column table files as input. At its most basic level, it needs two columns:

snp id
zscore of beta

So that a minimal input file could look like:

rsid      beta_z
rs5746887 -0.91
rs5748664 -0.20
rs874836  -0.01
...

Covariance file

These files are gzip-compressed text files with the following format:

GENE RSID1 RSID2 VALUE
ENSG00000239789.1 rs12718973 rs12718973 0.156645782674
ENSG00000239789.1 rs12718973 rs13232099 0.156645782674
ENSG00000239789.1 rs13232099 rs13232099 0.156645782674
ENSG00000183742.8 rs3094989 rs3094989 0.22
...

For each gene, the entries in the upper triangular part of the covariance matrix are saved, one per line.
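Since only the upper triangle is stored, a reader has to mirror each entry to recover the full symmetric matrix. The sketch below does that with the standard library, assuming whitespace-separated columns and the header shown above. `load_covariances` is a hypothetical helper; the example compresses the sample rows in memory only to show that a gzip handle can be passed in.

```python
import gzip
import io
from collections import defaultdict

def load_covariances(handle):
    """Return {gene: {(rsid1, rsid2): covariance}} with both orderings filled in."""
    rows = (line.split() for line in handle if line.strip())
    next(rows)  # skip the "GENE RSID1 RSID2 VALUE" header
    matrices = defaultdict(dict)
    for gene, rsid1, rsid2, value in rows:
        covariance = float(value)
        matrices[gene][(rsid1, rsid2)] = covariance
        matrices[gene][(rsid2, rsid1)] = covariance  # mirror the upper triangle
    return dict(matrices)

text = (
    "GENE RSID1 RSID2 VALUE\n"
    "ENSG00000239789.1 rs12718973 rs12718973 0.156645782674\n"
    "ENSG00000239789.1 rs12718973 rs13232099 0.156645782674\n"
    "ENSG00000239789.1 rs13232099 rs13232099 0.156645782674\n"
    "ENSG00000183742.8 rs3094989 rs3094989 0.22\n"
)

# Round-trip through gzip in memory, standing in for a real .txt.gz file.
compressed = gzip.compress(text.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    matrices = load_covariances(handle)

print(sorted(matrices))
print(matrices["ENSG00000239789.1"][("rs13232099", "rs12718973")])
```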
