Command Line Reference
This repository contains several scripts for running MetaXcan's computations, and building support data. Most scripts are meant to be run from the command line, and expect command line arguments to control them.
MetaXcan supports a wide variety of input data. It can work with individual-level data (PrediXcan.py) or with GWAS summary statistics (SPrediXcan.py, through a precomputed compilation of reference LD).
If you are using precomputed GWAS statistics from Im's lab (such as the ones from the tutorial), you will only need to use SPrediXcan.py or the underlying implementations M03_betas.py and M04_zscores.py. If you have access to individual-level dosage data in the supported formats, you can build support data or LD references more appropriate for your application.
script | functionality |
---|---|
PrediXcan.py | The original, individual-level data method to compute gene-trait associations. |
M00_prerequisites.py | Filter down individual dosage data, and convert to supported 'PrediXcan' format. Deprecated. |
M01_covariances_correlations.py | Build SNP (related by transcriptome model) data covariance matrices. Correlation matrices optional. Deprecated. |
M02_variances.py | Build SNP dosage variance. Optional, and unused by current MetaXcan pipeline. Deprecated. |
M03_betas.py | Take GWAS summary statistics in different formats, filter them to match precomputed statistics, and output GWAS' betas in a specific format supported by MetaXcan |
M04_zscores.py | Compute MetaXcan from betas (and support statistics) |
SPrediXcan.py | Convenience wrapper for M03_betas.py and M04_zscores.py |
MetaMany.py | Convenience wrapper for M03_betas.py and M04_zscores.py, that runs MetaXcan on a GWAS using different models and covariances sequentially. |
PrediXcan.py | Minimalistic implementation of PrediXcan that takes predicted gene expression as input. |
MulTiXcan.py | Multi-Tissue PrediXcan, takes multiple gene expression files as input. |
SMulTiXcan.py | Summary-stats based Multi-Tissue PrediXcan. |
There are some other undocumented scripts, but their implementation is too volatile and should be considered private. Feel free to use them anyway if they turn out to be helpful.
Also, you might notice that most scripts support undocumented options. Same caveats apply.
Explained separately in this page.
PrediXcan.py is a thin wrapper over Predict.py and PrediXcanAssociation.py. The output of Predict.py can be used for running MulTiXcan.py.
This is no longer maintained and remains only for future reference.
This tool filters dosage data according to population and SNP criteria, so that later steps run on less data and therefore faster. It can read from and write to files in two formats:
- IMPUTE2
- PrediXcan format, see below
PrediXcan format is preferred by most tools in this repository, although some support IMPUTE.
Supported arguments are:
Argument | Description | Example Value [default] |
---|---|---|
--verbosity | Controls logging verbosity. The lower the value, the more logging you will get. In general, 5 is logging at individual SNP data level, and 10 is usually critical/status logic information. So, this is more of a -log parsimony- thing. | 5 |
--dosage_folder | Folder where sample data is to be read from. | data/dosage |
--snp_list | Gzipped text file containing whitelisted SNP's rsids, one per row. Every SNP not in this file will be excluded. | data/whitelist.gz |
--output_folder | Folder where filtered data will be output. | samples |
--file_pattern | Optional. Regular expression to identify dosage files in the input folder, capturing the chromosome number in the file name. Only used when the input format is IMPUTE and the output format is PrediXcan. | 1000GP_Phase3_(.*) |
--population_group_filters | Whitelisted values for the group column in the samples file. | EUR EAS |
--individual_filters | Optional. Regular expressions for filtering individuals' IDs. | HG* NA2* |
--input_format | Format that describes input data. Possible values are IMPUTE or PrediXcan. | IMPUTE [ PrediXcan ] |
--output_format | Format that will be used for data output. Possible values are IMPUTE or PrediXcan. | PrediXcan [ IMPUTE ] |
This script takes extensive genotype data sets such as the Thousand Genomes haplotypes, allows for some filtering based on population or ID, and selects those entries present in HapMap2 biallelic SNP's.
It supports input and output in a custom format that we internally call "PrediXcan Format". It is composed of gzip-compressed text files, without headers, where each row contains:
chromosome SNP_id SNP_position reference_allele effect_allele allele_frequency ...
# "..." stands for allele dosage data, one value per sample individual, value in [0, 2]
It is assumed that a "samples file" is provided, describing each sample individual's metadata, which is a plain text file that looks like:
ID POP GROUP SEX
HG00096 GBR EUR male
HG00097 GBR EUR female
HG00099 GBR EUR female
HG00100 GBR EUR female
HG00101 GBR EUR male
HG00102 GBR EUR female
...
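As a rough illustration of the two layouts above, here is a minimal Python sketch that parses one PrediXcan-format row and filters a samples file by its GROUP column. The function names `parse_dosage_line` and `read_samples` are hypothetical and are not part of this repository; the two dosage values in the example row are made up.

```python
import io


def parse_dosage_line(line):
    """Split one PrediXcan-format row into SNP metadata and per-sample dosages."""
    fields = line.split()
    chromosome, snp_id, position, ref_allele, eff_allele, frequency = fields[:6]
    return {
        "chromosome": chromosome,
        "snp_id": snp_id,
        "position": int(position),
        "ref_allele": ref_allele,
        "eff_allele": eff_allele,
        "frequency": float(frequency),
        # Seventh column onward: one dosage in [0, 2] per sample individual.
        "dosages": [float(x) for x in fields[6:]],
    }


def read_samples(handle, groups=None):
    """Read a samples file; optionally keep only rows with a whitelisted GROUP."""
    header = handle.readline().split()
    rows = [dict(zip(header, line.split())) for line in handle if line.strip()]
    if groups is not None:
        rows = [row for row in rows if row["GROUP"] in groups]
    return rows


# Illustrative input only.
samples_text = "ID POP GROUP SEX\nHG00096 GBR EUR male\nHG00097 GBR EUR female\n"
samples = read_samples(io.StringIO(samples_text), groups={"EUR"})
row = parse_dosage_line("chr1 rs940550 88169 C T 0.157057654076 0 1")
print(row["snp_id"], len(row["dosages"]), [s["ID"] for s in samples])
```

Real dosage files are gzip-compressed, so in practice you would open them with something like `gzip.open(path, "rt")` before passing lines to the parser.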
This script is hardly ever necessary. You might use it occasionally to reduce file sizes used at the covariance script step (M01_covariances_correlations.py).
This is no longer maintained and remains only for future reference.
This script takes dosages in PrediXcan (see below) or IMPUTE2 formats, and figures out the covariance and/or correlation matrices between snp data, grouped according to transcriptome model. Output is a gzipped text file with each line holding a pairwise covariance/correlation value between two SNP's; they are labelled by gene to group them according to the transcriptome model.
Some of this script's interface should be treated as 'private' because of implementation volatility. The following table shows the public interface:
Argument | Description | Example Value [default] |
---|---|---|
--verbosity | Controls logging verbosity. The lower the value, the more logging you will get. In general, 5 is logging at individual SNP data level, and 10 is usually critical/status logic information. So, this is more of a -log parsimony- thing. | 5 |
--weight_db | Path to the sqlite file holding transcriptome model information. | data/DGN-WB_0.5.db |
--input_folder | Path to folder containing SNP dosage data. | temp/filtered |
--correlation_output | Path where the correlation matrix will be output. If not provided, correlations will not be output. | temp/cor/my_cor.txt.gz [None] |
--covariance_output | Path where the covariance matrix will be written. Defaults to a name based on the transcriptome model name. | temp/cov/my_cov.txt.gz |
--input_format | Format of input dosage data. Possible values are IMPUTE or PrediXcan. | IMPUTE [PrediXcan] |
--min_maf_filter | Optional. Ignore SNP's with frequencies below this value. | 0.05 [None] |
--max_maf_filter | Optional. Ignore SNP's with frequencies above this value. | 0.95 [None] |
This script builds the covariance matrices needed by M04_zscores.py. You will run it once in a while, if ever. It takes as input the output of M00_prerequisites.py and a gene expression model database, such as the one in the example data.
It will build the correlation matrix between SNP's in the same gene's model, for each gene, and save them in a gzip-compressed text file.
Nothing to see here, move along.
Just kidding. This script was deprecated. It calculates SNP variances grouped by transcriptome model; so, it is sort of related to M01_covariances_correlations.py. Options are where to load dosage data from, which transcriptome model to use, and where to output the variances.
This script exists as an adapter to different GWAS file formats. It can handle text files (plain text or gzip compressed) with tables of SNP GWAS data. Most of the options are concerned with identifying which column holds which data.
Once columns are specified, it will read the files and:
- Exclude any non-SNP data
- Exclude any SNP's not in the transcriptome model
- Flip the beta or odds ratio value if the reference allele is the converse of what's identified by the transcriptome model
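The allele flip in the last step can be sketched as follows. This is only an illustration of the usual convention (negate beta, invert the odds ratio), not the script's actual code; the function name `align_effect` is hypothetical, and returning `None` for non-matching alleles is an assumption about how such SNP's might be handled.

```python
def align_effect(gwas_effect, gwas_other, model_ref, model_effect,
                 beta=None, odds_ratio=None):
    """Express a GWAS effect relative to the model's effect allele.

    Returns (beta, odds_ratio), or None when the allele pair does not match
    the model in either orientation.
    """
    if (gwas_effect, gwas_other) == (model_effect, model_ref):
        # Same orientation as the model: nothing to do.
        return beta, odds_ratio
    if (gwas_effect, gwas_other) == (model_ref, model_effect):
        # Opposite orientation: the effect changes sign, the odds ratio inverts.
        flipped_beta = -beta if beta is not None else None
        flipped_or = 1.0 / odds_ratio if odds_ratio is not None else None
        return flipped_beta, flipped_or
    return None


print(align_effect("G", "A", "G", "A", beta=0.2, odds_ratio=2.0))
```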
It will derive z-scores from the following sources, in this order of preference, depending on what the command line arguments and the input GWAS file provide:
- a z-score column;
- a p-value column together with an effect size, odds ratio or direction column;
- effect size (or odds ratio) and standard error columns.
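A minimal sketch of that fallback order, assuming a GWAS row has already been read into a dictionary. The key names (`zscore`, `pvalue`, `beta`, `odds_ratio`, `direction`, `se`) and the function name are hypothetical, the p-value is assumed to be two-sided, and the standard error of an odds ratio is assumed to be on the log-odds scale. It is not the implementation used by M03_betas.py.

```python
import math
from statistics import NormalDist


def _sign(row):
    """Sign of the effect, taken from beta, odds ratio or a +/- direction."""
    if row.get("beta") is not None:
        return math.copysign(1.0, float(row["beta"]))
    if row.get("odds_ratio") is not None:
        return 1.0 if float(row["odds_ratio"]) >= 1.0 else -1.0
    if row.get("direction") in ("+", "-"):
        return 1.0 if row["direction"] == "+" else -1.0
    return None


def zscore_from_row(row):
    """Pick a z-score source following the order listed above."""
    if row.get("zscore") is not None:
        return float(row["zscore"])
    if row.get("pvalue") is not None and _sign(row) is not None:
        # Two-sided p-value: |z| is the normal quantile at p/2.
        # A p-value of exactly 0 is outside the domain of inv_cdf.
        magnitude = abs(NormalDist().inv_cdf(float(row["pvalue"]) / 2.0))
        return _sign(row) * magnitude
    if row.get("se") is not None:
        if row.get("beta") is not None:
            return float(row["beta"]) / float(row["se"])
        if row.get("odds_ratio") is not None:
            return math.log(float(row["odds_ratio"])) / float(row["se"])
    return None


print(zscore_from_row({"pvalue": 0.05, "beta": -0.3}))
```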
The outputs are gzipped text files with the fixed format read by MetaXcan. See section below.
Note: z-score, p-value, effect size, effect size standard error are assumed to have no missing values (technically, every variant in the models should either have a value in the GWAS, or be absent).
Note: effect size and effect size standard error columns from the GWAS are to be preferred over p-value and zscore, when available.
SNP, Effect Allele, Other/Reference Allele are considered mandatory.
Argument | Description | Example Value [default] |
---|---|---|
--verbosity | Controls logging verbosity. The lower the value, the more logging you will get. In general, 5 is logging at individual SNP data level, and 10 is usually critical/status logic information. So, this is more of a -log parsimony- thing. | 5 |
--model_db_path | Path to the sqlite file holding transcriptome model information. | data/DGN-WB_0.5.db |
--output_folder | Path where the processed files will be output. | temp/beta_f |
--gwas_folder | Path to the folder containing GWAS data. | data/GWAS |
--gwas_file_pattern | Optional. Regular expression used to filter file names in the GWAS folder (for example, if log files are present). | ".*gz" |
--gwas_file | Alternative to gwas folder and file pattern, to specify path to a single GWAS file. | data/GWAS/mygwas.txt.gz |
--separator | Optional. Specify which character separates column data. If any whitespace is the actual separator, don't specify it. | , [Any whitespace] |
--snp_column | Name of column holding SNP data. | SNP_ID [ SNP ] |
--non_effect_allele_column | Name of column holding "other/non effect" allele data. | REFERENCE [ A2 ] |
--effect_allele_column | Name of column holding effect allele data. | EFFECT [ A1 ] |
--or_column | Name of column holding odds ratio data. | OR |
--beta_column | Name of column holding beta (effect size) data. | BETA |
--beta_sign_column | Name of column holding sign of beta. | direction |
--zscore_column | Name of column holding zscore of beta. | Z |
--pvalue_column | Name of column holding p-values data. | P |
--throw | Option to throw exception on error, i.e. output a full stack trace in case of error. | |
This script holds the actual MetaXcan computation. It expects as input the files built by M03_betas.py (see section below on formats).
Argument | Description | Example Value [default] |
---|---|---|
--verbosity | Controls logging verbosity. The lower the value, the more logging you will get. In general, 5 is logging at individual SNP data level, and 10 is usually critical/status logic information. So, this is more of a -log parsimony- thing. | 5 |
--model_db_path | Path to the sqlite file holding transcriptome model information. | data/DGN-WB_0.5.db |
--covariance | Path to file containing the covariance matrices for the SNP dosage and transcriptome. Expects output as built from M01_covariances_correlations.py | temp/cov_eur/dgn.cov.txt.gz |
--beta_folder | Path to folder containing betas. Expects files as built by M03_betas.py | temp/beta_f |
--output_file | Path where the MetaXcan results will be output. | results/zscores.csv |
--overwrite | Overwrite the resulting file if it already exists. | |
--throw | Option to throw exception on error, i.e. output a full stack trace in case of error. | |
--remove_ens_version | Option to remove the Ensembl ID's version suffix from genes. | |
This script expects that the covariance file has information matching the genes and snps in the tissue transcriptome model. That is, if a tissue transcriptome model contains certain snps grouped under a gene, the covariance file should have a covariance matrix made of those same snps for that gene. Both parameters are independent, since you may have multiple covariances matching a transcriptome model, but calculated on different populations.
This script is a convenience wrapper around M03_betas.py and M04_zscores.py. It takes GWAS input data and outputs MetaXcan's association results. It has the same user interface as those scripts.
Note: z-score, p-value, effect size, effect size standard error are assumed to have no missing values (technically, every variant in the models should either have a value in the GWAS, or be absent).
Note: effect size and effect size standard error columns from the GWAS are to be preferred over p-value and zscore, when available.
This is another wrapper around M03_betas.py and M04_zscores.py. It will serially perform multiple MetaXcan runs on a GWAS study using multiple tissues in a single command.
MetaMany was written to simplify the execution of multiple MetaXcan runs where the user is interested in looking at multiple tissues. In order to facilitate this in the most convenient manner, MetaMany's argument set is slightly different than those of the regular program:
Unlike MetaXcan proper, the user points to one or more weight databases directly on the command line, with no argument prefix. The program will iterate over each of these in serial fashion and perform the same analysis that would have been performed had the user run each individually using MetaXcan. The matching covariance file will be inferred from the file name, and you can customize the relationship between covariance file name and model file name.
Rather than specifying a single covariance file, users must specify a single directory and MetaMany will look for a matching covariance file inside that directory. To find a matching covariance file, MetaMany strips the tissue database filename of the ".db" extension and replaces it with ".cov.txt.gz". If such a file is not found, the program will halt and an error will be reported.
Results are written to the directory specified by --output_directory under the filename similar to the tissue database where the ".db" extension is replaced by ".csv".
This script is based on SPrediXcan.py, to run the analysis over multiple tissues in serial fashion. In order for this script to work, there is a major assumption about the file arrangement of the tissue databases and covariates:
Databases and covariances must be named identically except for extensions (as can be seen in the current version of the GTEX tissue databases). The script allows for separate directories for each of the two types of data, but they must be named identically up to a certain point. For instance, CrossTissue_elasticNet0_0.5.db has a corresponding covariance file named CrossTissue_elasticNet0_0.5.cov.txt.gz.
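The naming convention described above can be sketched in a few lines of Python. The helper names `covariance_path` and `output_path` are hypothetical and only mirror the rule stated in the text (replace ".db" with ".cov.txt.gz" for covariances and with ".csv" for results); they are not MetaMany's actual code.

```python
import os


def _stem(model_db_path):
    """File name of the model database without its '.db' extension."""
    name = os.path.basename(model_db_path)
    if not name.endswith(".db"):
        raise ValueError("expected a .db file: " + model_db_path)
    return name[: -len(".db")]


def covariance_path(model_db_path, covariance_directory=None):
    """Covariance file matching a model; defaults to the model's own folder."""
    folder = covariance_directory
    if folder is None:
        folder = os.path.dirname(model_db_path)
    return os.path.join(folder, _stem(model_db_path) + ".cov.txt.gz")


def output_path(model_db_path, output_directory):
    """Results file for a model: same stem, '.csv' extension."""
    return os.path.join(output_directory, _stem(model_db_path) + ".csv")


db = "GTEx-2016-02-29/CrossTissue_elasticNet0_0.5.db"
print(covariance_path(db, "GTEx-2016-02-29/covariances/0.5/"))
print(output_path(db, "results"))
```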
The following command line would perform a typical MetaXcan analysis on the GWAS data in GWAS_Results for each of the tissues starting with TW_Brain_ inside the GTEx-2016-02-29 directory. It will look inside the directory GTEx-2016-02-29/covariances/0.5/ for the appropriate covariance file for each of the databases (the default behavior is to expect the covariances in the same folder as the databases, in which case the covariance argument is not needed). Results would be written to the directory results, with file names similar to those of the corresponding databases.
$ MetaMany.py \
--gwas_folder GWAS_Results/ \
--beta_column beta \
--pvalue_column p \
--se_column se \
--frequency_column maf \
--snp_column markername \
--effect_allele_column effect_allele \
--non_effect_allele_column other_allele \
--covariance_directory GTEx-2016-02-29/covariances/0.5/ \
--output_directory results \
GTEx-2016-02-29/TW_Brain_*.db
This script computes a gene-level association from predicted gene expression to a human trait, using multiple studies for each gene jointly. It supports adjusting for covariates. It takes as input predicted expression files as generated by Predict.py (see PrediXcan manual).
Argument | Description |
Example Value [default] |
---|---|---|
--hdf5_expression_folder | Folder with predicted gene expressions (files in HDF5 format). Format of the files explained below. | data/ukb_expression |
--expression_folder | Folder with predicted gene expressions (plain text file format). Format of the files explained below. | data/tgf_expression |
--memory_efficient | If using plain text expression files, be memory efficient about it. Will be slower. Will read the expression text file in batches, so that several passes at the file will be performed. | |
--expression_pattern | Patterns to select expression files in the folder. Format of the files explained below. | pred_TW_Brain_(.*)_0.5_hrc_hapmap.h5 |
--input_phenos_file | Path to text file (or gzip-compressed) where one column will be used as phenotype. Format explained below. | phenotype/my_trait.txt.gz |
--covariates | Specify which covariates in the file should be used | PC PC2 Platform EatsPizzaOrNot |
--covariates_file | Path to text file (or gzip-compressed) with covariate data. If provided, will force OLS regression. Format explained below. | |
--input_phenos_column | Specify name of column from input file to be used as phenotype | |
--output | Specify where output will be saved | |
--verbosity | Log verbosity level. 1 is everything being logged. 10 is only high level messages, above 10 will hardly log anything. So, this is more of a -log parsimony- thing. | 10 |
--throw | Option to throw exception on error, i.e. output a full stack trace in case of error. | |
--mode | Type of regression. Can be: linear or logistic | linear |
--pc_condition_number | Principal components condition number | 30 |
--pc_eigen_ratio | Principal components filter, cutoff at proportion to max eigenvalue (alternative to condition number) | |
--coefficient_output | Path to file where the tissues' coefficients (for each gene's regression) will be stored (optional) | |
--loadings_output | Path to file where the loadings (for each gene's regression) will be stored (optional) | |
Example:
./MulTiXcan.py \
--hdf5_expression_folder data/gene_expression/ \
--expression_pattern "pred_TW_(.*)_0.5_hrc_hapmap.h5" \
--input_phenos_file data/variables.txt.gz \
--covariates_file data/variables.txt.gz \
--covariates S1 S2 \
--input_phenos_column height \
--output results/mt_predixcan_c_covariates_cn_c.txt \
--mode linear \
--verbosity 6 \
--pc_condition_number 30 \
--throw
The results look like:
- gene: a gene's id, as listed in the Tissue Transcriptome model. Ensembl ID for most gene model releases. Can also be an intron's id for splicing model releases.
- pvalue: significance p-value of the MulTiXcan association
- n_models: number of models (tissues) available for this gene
- n_samples: number of individuals available for this gene-phenotype combination (i.e. inner join of phenotype and predictions)
- p_i_best: best p-value of single-tissue S-PrediXcan association.
- m_i_best: name of best single-tissue S-PrediXcan association.
- p_i_worst: worst p-value of single-tissue S-PrediXcan association.
- m_i_worst: name of worst single-tissue S-PrediXcan association.
- status: If there was any error in the computation, it is stated here
- n_used: number of independent components of variation kept among the tissues' predictions. (Synthetic independent tissues)
- max_eigen: In the PCA decomposition of predicted expression, the maximum eigenvalue.
- min_eigen: In the PCA decomposition of predicted expression, the minimum eigenvalue.
- min_eigen_kept: In the PCA decomposition of predicted expression, the minimum eigenvalue kept (i.e. surviving SVD)
If you specify --loadings_output, you'll get a file specifying the loadings of the PC decomposition of predicted expressions for each gene:
- gene: Ensembl ID (or intron id) being analyzed
- pc: identifier of principal component
- tissue: tissue being analyzed
- weight: coefficient of loading from tissues to PC
If you specify --coefficient_output, you get a file with effect sizes for the tissues involved in each gene:
- param: effect size of the PCA-regularized regression (i.e. effect sizes of the PC components, converted to tissue-space)
- variable: tissue being analyzed
- gene: Ensembl ID (or intron id)
This script takes S-PrediXcan's results and estimates the MulTiXcan association. It also needs:
- LD from a reference panel
- either of the following options:
  - a list of "cleared snps" to be used as the intersection of snps between the models and the GWAS,
  - the original GWAS from which S-PrediXcan was computed, and its parsing parameters, and prediction models, so that the GWAS/Model intersection can be computed
Argument | Description |
Example Value [default] |
---|---|---|
--models_folder | Path to folder with prediction models. | data/gtex_v6p |
--models_name_filter | List of regular expressions to filter input models | ".*Lung.*db" ".*Whole_Blood.*db" |
--models_name_pattern | Regular expression to detect tissue name from model file names. | TW_(.*)_0.5.db |
--gwas_folder | Name of folder containing GWAS data. All files in the folder are assumed to belong to a single study. | data/my_gwas |
--gwas_file_pattern | Pattern to recognize GWAS files in folders (in case there are extra files and you don't want them selected). | chr.*.assoc.txt.gz |
--gwas_file | Path to GWAS file; alternative to gwas_folder and gwas_file_pattern | dat/my_gwas.txt.gz |
. . . | the same GWAS parsing parameters as in SPrediXcan.py | --zscore_column z --effect_allele_column EA --non_effect_allele_column NEA |
--cleared_snps | SNP's to analyze. This is an alternative to providing the GWAS and its parsing parameters | data/hapmap_ceu.txt.gz |
--regularization | Add a regularization term to the matrix diagonal, to correct for expression covariance matrix singularity. | 0.01 |
--cutoff_condition_number | Condition number of eigenvalues to use when truncating SVD components. | 30 |
--cutoff_eigen_ratio | Ratio of eigenvalues to the max eigenvalue, as threshold to use when truncating SVD components. | 0.001 |
--cutoff_threshold | Threshold of variance eigenvalues when truncating SVD | 0.4 |
--cutoff_trace_ratio | Ratio of eigenvalues to trace, to use when truncating SVD | 0.01 |
--metaxcan_folder | Path to folder with S-PrediXcan files | data/metaxcan_results |
--metaxcan_filter | Regular expression to filter results files | .*csv |
--metaxcan_file_name_parse_pattern | Optional regular expression to get phenotype name and model name from MetaXcan result files. Assumes that a first group will be matched to phenotype name, and the second to model name. | spredixcan_(.*)_TW_(.*)_0.5.csv |
--snp_covariance | Path to LD reference/snp covariance. Same format as S-PrediXcan covariances. | data/gtex_v6p_snp_covariance.txt.gz |
--trimmed_ensemble_id | Use Ensembl IDs without version. Necessary if your S-PrediXcan results' gene ids lack the version. | |
--output | Path where output will be saved | results/smultixcan.txt |
--verbosity | Log verbosity level. 1 is everything being logged. 10 is only high level messages, above 10 will hardly log anything. So, this is more of a -log parsimony- thing. | 10 |
--throw | Option to throw exception on error, i.e. output a full stack trace in case of error. | |
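To illustrate how a pattern such as the one given for --metaxcan_file_name_parse_pattern yields a phenotype name and a model name, here is a small sketch with Python's `re` module. The helper `parse_result_name` and the file names are hypothetical, and whether the script applies the pattern with `re.search` or `re.match` is an assumption.

```python
import re


def parse_result_name(file_name, pattern):
    """Return (phenotype, model) from the first two groups, or None if no match."""
    match = re.search(pattern, file_name)
    if match is None:
        return None
    return match.group(1), match.group(2)


pattern = r"spredixcan_(.*)_TW_(.*)_0.5.csv"
print(parse_result_name("spredixcan_height_TW_Whole_Blood_0.5.csv", pattern))
```

Because both groups are greedy, a phenotype name that itself contains "_TW_" would be split at the last occurrence, so it is worth checking the parsed names on a few of your own files.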
Example:
./SMulTiXcan.py \
--models_folder data/models_v6p \
--models_name_pattern "gtex_v6p_(.*)_signif.db" \
--snp_covariance data/gtex_v6p_snp_covariance.txt.gz \
--metaxcan_folder results/sp_v6p \
--metaxcan_filter "spredixcan_gtexv6pqdir_ADIPOGen_Adiponectin__PM__(.*).csv" \
--metaxcan_file_name_parse_pattern "spredixcan_gtexv6pqdir_(.*)__PM__(.*).csv" \
--gwas_file data/SummaryResults/Production/ADIPOGen/Adipogen.txt \
--snp_column marker --non_effect_allele_column other_allele --effect_allele_column reference_allele --beta_column beta --pvalue_column pvalue --se_column se \
--cutoff_condition_number 30 \
--verbosity 7 \
--throw \
--output results/smt_v8qdir/smultixcan_gtexv8qdir_ADIPOGen_Adiponectin_ccn30.csv
The results look like:
gene gene_name pvalue n n_indep p_i_best t_i_best p_i_worst t_i_worst eigen_max eigen_min eigen_min_kept z_min z_max z_mean z_sd tmi status
ENSG00000175793 SFN 0.0450499793962 5 5 0.0156585962572 Brain_Caudate_basal_ganglia 0.898228858542 Skin_Not_Sun_Exposed_Suprapubic 1.6646390465 0.501908734482 0.501908734482 -2.4167772492 1.04549349016 -0.502751770349 1.24761302688 5.0 0
ENSG00000060642 PIGV 0.0590186246954 19 1 0.0295433994674 Lung 0.765611065903 Esophagus_Mucosa 14.0901681665 0.00296245803497 14.0901681665 0.298120670239 2.17615865235 1.61343826535 0.522964426062 1.0 0
...
where:
- gene: a gene's id, as listed in the Tissue Transcriptome model. Ensembl ID for most gene model releases. Can also be an intron's id for splicing model releases.
- gene_name: gene name as listed by the Transcriptome Model, typically HUGO for a gene. It can also be an intron's id.
- pvalue: significance p-value of the S-MulTiXcan association
- n: number of "tissues" available for this gene
- n_indep: number of independent components of variation kept among the tissues' predictions. (Synthetic independent tissues)
- p_i_best: best p-value of single-tissue S-PrediXcan association.
- t_i_best: name of best single-tissue S-PrediXcan association.
- p_i_worst: worst p-value of single-tissue S-PrediXcan association.
- t_i_worst: name of worst single-tissue S-PrediXcan association.
- eigen_max: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the top independent component
- eigen_min: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the last independent component
- eigen_min_kept: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the smallest independent component that was kept.
- z_min: minimum z-score among single-tissue S-PrediXcan associations.
- z_max: maximum z-score among single-tissue S-PrediXcan associations.
- z_mean: mean z-score among single-tissue S-PrediXcan associations.
- z_sd: standard deviation of the mean z-score among single-tissue S-PrediXcan associations.
- tmi: trace of T * T', where T is the correlation of predicted expression levels for different tissues multiplied by its SVD pseudo-inverse. It is an estimate of the number of independent components of variation in predicted expression across tissues (typically close to n_indep)
- status: If there was any error in the computation, it is stated here
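The relation between eigen_max, eigen_min, eigen_min_kept and n_indep can be pictured with a small sketch of a condition-number cutoff such as the one controlled by --cutoff_condition_number. This assumes that a component is kept when its eigenvalue is at least eigen_max divided by the condition number; the exact comparison used by the script may differ, and the eigenvalues below are made up for illustration.

```python
def keep_components(eigenvalues, condition_number):
    """Keep eigenvalues that are at least max(eigenvalues) / condition_number."""
    threshold = max(eigenvalues) / condition_number
    return [value for value in eigenvalues if value >= threshold]


# Hypothetical eigenvalues of a tissue correlation matrix for one gene.
eigenvalues = [14.0, 0.4, 0.05, 0.003]
kept = keep_components(eigenvalues, 30)

summary = {
    "eigen_max": max(eigenvalues),
    "eigen_min": min(eigenvalues),
    "eigen_min_kept": min(kept),
    "n_indep": len(kept),
}
print(summary)
```

With a condition number of 30 the threshold here is 14.0 / 30, so only the top component survives and eigen_min_kept equals eigen_max, the same pattern as in the second example row above.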
IMPUTE2 is supported by scripts that require allele dosage input data, such as M00_prerequisites.py and M01_covariances_correlations.py. This format specifies two files per chromosome: one for the marker information (such as its type) and another for the population dosage.
This is a format for describing a population's allele dosages. Used by M00_prerequisites.py. It holds information in gzip-compressed text files without a header, and each line holds the following information:
#chr number snp_id #position #ref_allele #eff_allele #allele average #individual dosage data
chr1 rs940550 88169 C T 0.157057654076 [ ... ]
The seventh column onward holds each individual's allele dosage. These files should be accompanied by a samples text file describing each individual's information, such as:
ID POP GROUP SEX
123 GBR EUR male
...
Used for PrediXcan.py and MulTiXcan.py.
The file is an HDF5 file with three data sets:
- pred_expr: a list where each entry is a list of predicted expression values for all individuals
- genes: a list with the gene names, in the same order as they appear in pred_expr
- samples: a list with the individuals' ids, in the same order as all entries of pred_expr
For example, in IPython they would look like:
In [17]: k["genes"][0:4]
Out[17]:
array(['ENSG00000000457.9', 'ENSG00000000971.11', 'ENSG00000001036.9',
'ENSG00000001167.10'], dtype='|S30')
In [18]: k["pred_expr"][0]
Out[18]:
array([-0.01055908, -0.01055908, -0.19805908, ..., -0.13175908,
-0.39325908, -0.01055908], dtype=float32)
In [19]: k["samples"][0:4]
Out[19]: array(['2476612', '5595764', '5172041', '3487211'], dtype='|S25')
Used for PrediXcan.py and MulTiXcan.py.
This is a tab-separated text file (optionally gzip-compressed) where each column stands for predicted gene expression of a given gene, and each row is an individual. The files look like:
ENSG00000000457.9 ENSG00000000460.12 ENSG00000001036.9 ENSG00000001084.6 ...
0.111434276 -0.369352366 0.11707494573 0.0013880712 ...
-0.437815712 -0.1411258693 -0.036093907 -0.193494845 ...
...
There are tools in the official PrediXcan repository for generating these files.
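If you want to inspect such a file yourself, a minimal reader could look like the sketch below. It uses only the standard library; `read_expression` and `gene_column` are hypothetical helper names, and the two-gene input is a shortened version of the example above.

```python
import csv
import io


def read_expression(handle):
    """Return (gene_ids, rows); rows[i][j] is gene j's expression for individual i."""
    reader = csv.reader(handle, delimiter="\t")
    genes = next(reader)  # header row: one gene id per column
    rows = [[float(value) for value in record] for record in reader if record]
    return genes, rows


def gene_column(genes, rows, gene_id):
    """Predicted expression of one gene across all individuals."""
    index = genes.index(gene_id)
    return [row[index] for row in rows]


text = (
    "ENSG00000000457.9\tENSG00000000460.12\n"
    "0.111434276\t-0.369352366\n"
    "-0.437815712\t-0.1411258693\n"
)
genes, rows = read_expression(io.StringIO(text))
print(gene_column(genes, rows, "ENSG00000000460.12"))
```

For a gzip-compressed file, `gzip.open(path, "rt", newline="")` would provide a suitable handle.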
Used for PrediXcan.py and MulTiXcan.py. Both covariates and phenotypes files have the same format, and you can actually pass the same file with both types of variables.
These are text files where each column stands for a feature such as a trait, principal component, etc., and each row is an individual. -999 and NA are supported as encodings of missing values.
a b c
NA 0.318533581926 1
-0.9077357424220001 -2.90862133768 1
0.219812718984 0.237473143796 -999
-0.292915728007 -0.231308887004 -999
-0.306529717231 0.435623400034 0
...
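A minimal sketch of reading such a file while mapping the two missing-value encodings to `None`. The function name `read_variables` is hypothetical, and matching the literal strings "NA" and "-999" (rather than, say, "-999.0") is an assumption about how strictly the encodings are interpreted.

```python
import io

MISSING = {"NA", "-999"}


def read_variables(handle):
    """Read a whitespace-separated table into {column_name: [values]}.

    Missing entries (NA or -999) become None; everything else becomes a float.
    """
    header = handle.readline().split()
    columns = {name: [] for name in header}
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        for name, value in zip(header, fields):
            columns[name].append(None if value in MISSING else float(value))
    return columns


text = (
    "a b c\n"
    "NA 0.318533581926 1\n"
    "0.219812718984 0.237473143796 -999\n"
)
columns = read_variables(io.StringIO(text))
print(columns["a"], columns["c"])
```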
M04_zscores.py takes gzip-compressed flat column table files as input. At its most basic level, it needs two columns:
- snp id
- zscore of beta
So that a minimal input file could look like:
rsid beta_z
rs5746887 -0.91
rs5748664 -0.20
rs874836 -0.01
...
These files are gzip-compressed text files with the following format:
GENE RSID1 RSID2 VALUE
ENSG00000239789.1 rs12718973 rs12718973 0.156645782674
ENSG00000239789.1 rs12718973 rs13232099 0.156645782674
ENSG00000239789.1 rs13232099 rs13232099 0.156645782674
ENSG00000183742.8 rs3094989 rs3094989 0.22
...
For each gene, every entry in the upper triangular part of the covariance matrix (diagonal included) is saved on its own line.
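Since only the upper triangle is stored, a reader has to mirror each entry to recover the full symmetric matrix. The sketch below shows one way to do that with the standard library; `read_covariances` and `as_matrix` are hypothetical names, and the input reuses the example rows above.

```python
import io
from collections import defaultdict


def read_covariances(handle):
    """Return {gene: {(rsid1, rsid2): value}} with both orderings filled in."""
    handle.readline()  # skip the "GENE RSID1 RSID2 VALUE" header
    matrices = defaultdict(dict)
    for line in handle:
        fields = line.split()
        if len(fields) != 4:
            continue
        gene, rsid1, rsid2, value = fields
        matrices[gene][(rsid1, rsid2)] = float(value)
        matrices[gene][(rsid2, rsid1)] = float(value)  # mirror the entry
    return matrices


def as_matrix(entries, snps):
    """Dense list-of-lists covariance matrix for SNPs in the given order."""
    return [[entries[(a, b)] for b in snps] for a in snps]


text = (
    "GENE RSID1 RSID2 VALUE\n"
    "ENSG00000239789.1 rs12718973 rs12718973 0.156645782674\n"
    "ENSG00000239789.1 rs12718973 rs13232099 0.156645782674\n"
    "ENSG00000239789.1 rs13232099 rs13232099 0.156645782674\n"
    "ENSG00000183742.8 rs3094989 rs3094989 0.22\n"
)
matrices = read_covariances(io.StringIO(text))
print(as_matrix(matrices["ENSG00000239789.1"], ["rs12718973", "rs13232099"]))
```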