GEPSi: GWAS Epistatic Phenotype Simulator

GEPSi is a toolkit to simulate phenotypes for GWAS analysis, given input genotype data for a population.

Installation

System requirements

Python 3.6+

Build from Source

1. Clone repository

Latest released version

This clones the master branch of the repo, which contains the code for the latest released version and hot-fixes.

    git clone --recursive -b master https://github.com/clara-genomics/GEPSi.git

2. Install dependencies

Install the package and its associated dependencies from requirements.txt.

    pip install .

3. Tests

Run the unit tests to verify that the installation was successful.

```
python -m pytest tests/
```

Workflow

1. Formatting genotype data

Genotype data should be supplied in .raw format along with a .bim snplist file. GEPSi converts the genotype matrix and its associated annotations into the inputs required for phenotype simulation.

    gepsi genotype --data_path /GWAS/data/chr21/ --matrix_name genotype.raw --snplist_name full_snplist.bim

This creates an .h5 file containing a Person X SNP matrix with genotype values of 0, 1, or 2, and an annotated snplist .csv that is needed to run the phenotype simulation. The snplist has columns for Chromosome, Feature ID, Position, Allele 1, Allele 2, and Risk Allele.

The .raw and .bim files can be produced from other formats using PLINK. PLINK can also be used to filter SNPs within selected regions (exons, transcripts, or genes) as well as filter SNPs based on their allele frequencies.

For example, we used the following PLINK v1.9 command to filter and format genotype data for human chromosome 21:

    /plink \
      --gen gensim_chr21_100k.controls.gen.gz \
      --sample gensim_chr21_100k.sample \
      --maf 0.01 \
      --extract range <BED file containing exon positions for chr21> \
      --allow-no-sex \
      --snps-only \
      --recode A \
      --oxford-single-chr 21 \
      --out genotype

    /plink \
      --gen gensim_chr21_100k.controls.gen.gz \
      --sample gensim_chr21_100k.sample \
      --maf 0.01 \
      --extract range <BED file containing exon positions for chr21> \
      --allow-no-sex \
      --snps-only \
      --oxford-single-chr 21 \
      --make-just-bim \
      --out full_snplist

These commands create:

/GWAS/data/genotype.raw: a Person X SNP genotype matrix

/GWAS/data/full_snplist.bim: metadata for each SNP
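
To make the structure of the genotype matrix concrete, the following is a minimal sketch (not part of GEPSi) that parses a PLINK `--recode A` .raw file and computes the minor allele frequency of each SNP, the quantity that `--maf 0.01` filters on. It assumes the standard .raw layout of six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) followed by one additive 0/1/2 column per SNP, with `NA` for missing calls. The function names and the tiny inline dataset are hypothetical.

```python
import io


def read_plink_raw(handle):
    """Parse a PLINK .raw file into (sample_ids, snp_names, matrix)."""
    header = handle.readline().split()
    snp_names = header[6:]  # skip FID IID PAT MAT SEX PHENOTYPE
    sample_ids, matrix = [], []
    for line in handle:
        fields = line.split()  # handles both space- and tab-delimited files
        if not fields:
            continue
        sample_ids.append(fields[1])  # IID column
        # Missing genotype calls are written as "NA"; keep them as None.
        matrix.append([None if g == "NA" else int(g) for g in fields[6:]])
    return sample_ids, snp_names, matrix


def minor_allele_frequencies(matrix):
    """Return the MAF of each SNP column, ignoring missing calls."""
    mafs = []
    for column in zip(*matrix):
        called = [g for g in column if g is not None]
        if not called:
            mafs.append(None)
            continue
        # Each person carries two alleles, hence the factor of 2.
        freq = sum(called) / (2 * len(called))
        mafs.append(min(freq, 1 - freq))
    return mafs


# Hypothetical three-person, two-SNP example.
example = io.StringIO(
    "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G\n"
    "f1 i1 0 0 0 -9 0 2\n"
    "f2 i2 0 0 0 -9 1 2\n"
    "f3 i3 0 0 0 -9 1 NA\n"
)
sample_ids, snp_names, matrix = read_plink_raw(example)
mafs = minor_allele_frequencies(matrix)
```

For a real file, pass `open("genotype.raw")` instead of the inline example. Large matrices are better read in chunks, which is what the `--memory_cautious` option of `gepsi genotype` is for.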

2. Generating Phenotypes

Simulate phenotypes for the formatted genotype data using default parameter values.

    gepsi phenotype --data_path /GWAS/data/chr21/ --data_identifier chr21_100k --prefilter exon --phenotype_experiment_name example_name

This creates the following files:

    /GWAS/data/chr21/phenotype_chr21_100k_exon_example_name.pkl
    /GWAS/data/chr21/effect_size_chr21_100k_exon_example_name.pkl
    /GWAS/data/chr21/interactive_snps_chr21_100k_exon_example_name.pkl
    /GWAS/data/chr21/causal_snp_idx_chr21_100k_exon_example_name.pkl
    /GWAS/data/chr21/causal_genes_chr21_100k_exon_example_name.pkl

phenotype_chr21_100k_exon_example_name.pkl: a list of binary phenotypes, one for each person in the genotype matrix

effect_size_chr21_100k_exon_example_name.pkl: a dictionary mapping each SNP index to a list of effect sizes indexed by genotype

interactive_snps_chr21_100k_exon_example_name.pkl: a dictionary mapping each causal SNP index to a list of length 3: [Interactive SNP Index Pair, Interaction Coefficient, Partner Risk Allele]

causal_snp_idx_chr21_100k_exon_example_name.pkl: a dictionary mapping each causal SNP ID to the Gene Risk of its mapped gene

causal_genes_chr21_100k_exon_example_name.pkl: a dictionary mapping the causal Gene Feature IDs to Gene Risk Scores
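
The outputs listed above are ordinary pickle files, so they can be loaded with the Python standard library. The helper below is a minimal sketch, not part of GEPSi. The function name is hypothetical, and the file name pattern `{kind}_{data_identifier}_{prefilter}_{experiment_name}.pkl` is inferred from the example file names above rather than taken from the GEPSi source.

```python
import pickle
from pathlib import Path

# The five output kinds listed in this section.
OUTPUT_KINDS = (
    "phenotype",
    "effect_size",
    "interactive_snps",
    "causal_snp_idx",
    "causal_genes",
)


def load_simulation_outputs(data_path, data_identifier, prefilter, experiment_name):
    """Load all simulation outputs into a dict keyed by output kind."""
    # Assumed naming pattern, inferred from the example file names.
    suffix = f"{data_identifier}_{prefilter}_{experiment_name}.pkl"
    outputs = {}
    for kind in OUTPUT_KINDS:
        path = Path(data_path) / f"{kind}_{suffix}"
        with path.open("rb") as handle:
            # Only unpickle files you generated yourself or otherwise trust.
            outputs[kind] = pickle.load(handle)
    return outputs


# Example call, matching the command above:
# outputs = load_simulation_outputs(
#     "/GWAS/data/chr21/", "chr21_100k", "exon", "example_name"
# )
# phenotypes = outputs["phenotype"]
```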

Histograms of the sampling distributions are created and saved for every major statistical product.

Parameter Documentation

| Genotype Parameters | Default Value | Definition |
| --- | --- | --- |
| `-h`, `--help` | None | List all parameters |
| `-dp`, `--data_path` | `/GWAS/data/` | path to 1000 GP Data |
| `-data`, `--data_identifier` | `chr1_100k` | genotype file name identifier |
| `-ant`, `--annotation_name` | `gencode.v19.annotation.gtf` | Name of Annotations file for gene/exon mapping |
| `-f`, `--features` | `["gene", "transcript", "exon"]` | List of features for filtering |
| `-rr`, `--risk_rare` | False | Use the rare allele as the risk allele |
| `-sep`, `--separator` | `\t` | Genetic file separator |
| `-ign_map`, `--ignore_gene_map` | False | Skip Gene Mapping |
| `-low_mem`, `--memory_cautious` | False | Use batched reading of Matrix raw file |
| `-chunk`, `--matrix_chunk_size` | 1000 | Chunk size for low memory matrix read |
| `-mtx`, `--matrix_name` | `genotype.raw` | Genotype Matrix (0,1,2) |
| `-snplist`, `--snplist_name` | `genotype.snplist` | SNP meta data |

| Phenotype Parameters | Default Value | Definition |
| --- | --- | --- |
| `-h`, `--help` | None | List all parameters |
| `-dp`, `--data_path` | `/GWAS/data/` | path to data |
| `-hd`, `--heritability` | 1 | Heritability of phenotype |
| `-data`, `--data_identifier` | `chr1_100k` | genotype file name identifier |
| `-pname`, `--phenotype_experiment_name` | `""` | Name of phenotype simulation |
| `-cut`, `--interactive_cut` | 0.2 | Fraction of causal SNPs to experience epistatic effects |
| `-mask`, `--mask_rate` | 0.1 | Fraction of inter-SNP interactions that are masking |
| `-df`, `--dominance_frac` | 0.1 | Fraction of causal SNPs whose effects are dominant |
| `-rf`, `--recessive_frac` | 0.1 | Fraction of causal SNPs whose effects are recessive |
| `-mic`, `--max_interaction_coeff` | 2 | Upper bound for Interaction Coefficient between two SNPs |
| `-st`, `--stratify` | False | Stratify individuals in the population based on given groups |
| `-cf`, `--case_frac` | 0.5 | Fraction of individuals to be classified as cases. Set to 0 to output raw phenotype scores instead of case/control. |
| `--causal_snp_mode` | `"gene"` | Method to select causal SNPs {gene, random} |
| `-num_snps`, `--n_causal_snps` | 100 | Number of Causal SNPs required for random mode |
| `-cgc`, `--causal_gene_cut` | 0.05 | Fraction of Causal Genes required for gene mode |
| `-mgr`, `--max_gene_risk` | 5 | Upper bound for Gene Risk Coefficient required for gene mode |
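
As an illustration of what `--case_frac` means, the sketch below turns continuous phenotype scores into case/control labels by marking the highest-scoring fraction of individuals as cases. This is one plausible reading of the parameter description and is not taken from the GEPSi source; the actual thresholding may differ. The function name and the example scores are hypothetical.

```python
def label_cases(scores, case_frac):
    """Label the top `case_frac` of individuals by score as cases (1)."""
    if case_frac == 0:
        # Mirrors the documented behaviour: return raw scores unchanged.
        return list(scores)
    n_cases = round(case_frac * len(scores))
    # Indices of individuals, ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    case_indices = set(order[:n_cases])
    return [1 if i in case_indices else 0 for i in range(len(scores))]


# Hypothetical scores for four individuals.
scores = [0.2, 1.5, -0.3, 0.9]
labels = label_cases(scores, 0.5)
```

With `case_frac=0.5`, the two highest-scoring individuals are labelled as cases and the other two as controls.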

If --stratify is used, two additional files must be provided in --data_path. These are groups_{data_identifier}.csv and group_coefficients_{data_identifier}.csv. groups_{data_identifier}.csv should contain a group ID for each individual in the population, one per line, in the same order as individuals in the genotype matrix. group_coefficients_{data_identifier}.csv should be a comma-separated file with two columns, the first column listing the unique group IDs in groups_{data_identifier}.csv and the second giving a numeric coefficient to be added to the genetic risk score for all individuals with the given group ID.
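
The two stratification files described above can be written with a short script. The following is a minimal sketch, not part of GEPSi: the function name is hypothetical, and it assumes that neither file has a header row, which the description above does not state explicitly.

```python
import csv
from pathlib import Path


def write_stratification_files(data_path, data_identifier, group_ids, coefficients):
    """Write groups_{id}.csv and group_coefficients_{id}.csv into data_path.

    group_ids: one group ID per individual, in genotype-matrix order.
    coefficients: dict mapping each unique group ID to a numeric coefficient.
    """
    missing = set(group_ids) - set(coefficients)
    if missing:
        raise ValueError(f"no coefficient given for groups: {sorted(missing)}")

    data_path = Path(data_path)
    groups_file = data_path / f"groups_{data_identifier}.csv"
    coeff_file = data_path / f"group_coefficients_{data_identifier}.csv"

    # One group ID per line, same order as individuals in the genotype matrix.
    with groups_file.open("w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for group_id in group_ids:
            writer.writerow([group_id])

    # Two columns: group ID, coefficient added to the genetic risk score.
    with coeff_file.open("w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for group_id, coefficient in coefficients.items():
            writer.writerow([group_id, coefficient])

    return groups_file, coeff_file


# Example call with hypothetical groups:
# write_stratification_files(
#     "./sample_data/", "chr0_test", ["A", "B", "A"], {"A": 0.0, "B": 0.5}
# )
```

If GEPSi turns out to expect header rows, add a `writer.writerow([...])` call with the column names before each loop.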

Results

TODO Overview of paper and LINK

Simulation Playground

The exploratory notebook, Example_Simulation_Playground.ipynb, details the custom genotype data creation process for phenotype simulation.

Utilizing randomly generated SNPs, the notebook walks through how to form custom genotype datasets for phenotype simulation. Generated outputs are stored in the Chromosome 0 directory and are used to test the validity of the package.

The command below can be run inside the GEPSi directory to create sample data for testing purposes.

    gepsi phenotype -dp ./sample_data/ --data_identifier chr0_test --phenotype_experiment_name playground_example

External contributions

To contribute to GEPSi, please see NVIDIA_CLA_v1.0.1.docx.
