STOAT is a specialized tool developed for conducting Genome-Wide Association Studies (GWAS) with a unique focus on snarl structures within pangenome graphs. Unlike traditional GWAS tools that analyze linear genome variants, STOAT processes VCF files to extract and analyze snarl regions—complex structural variations that capture nested and overlapping variant patterns within a pangenome. This approach allows for a more nuanced understanding of genetic variations in diverse populations and complex traits.
STOAT supports both binary and quantitative phenotypes:
-
For binary phenotypes (e.g., case vs control studies), it utilizes chi-squared tests and Fisher’s exact test to evaluate associations between phenotype groups and snarl variants, providing robust statistical validation even in cases of sparse data.
-
For quantitative phenotypes (e.g., traits measured on a continuous scale), the tool employs linear regression models to assess the association between snarl structures and phenotype values, allowing for continuous trait mapping with greater precision.
git clone https://github.com/Plogeur/STOAT.git
cd STOAT
pip install .
- Python version 3.10
- cyvcf2
- numpy
- pandas
- bdsg
- ...
Required files :
- pg : Pangenome graph file, formats accepted: .pg or .xg.
- dist : Distance file generated with vg dist, format: .dist.
- vcf : Merged VCF file, created using bcftools merge, formats: .vcf or .vcf.gz. (ex :
bcftools merge -m none -Oz -o test
) - phenotype : phenotype file organise in three-column with FID (family/sample name), IID (sample name), and PHENO (integer/float). Format: .txt or .tsv (tab-separated).
Optional file :
- paths : Two-column file containing snarl names and the list of paths through the snarl's netgraph, separated by tabs. Format: .txt or .tsv.
- ref : VCF referenting chromosomes and positions in the pangenome graph generated with vg deconstruct, formats: .vcf (only).
Use stoat tool
if you want to launch the full tool at once, starting from snarl path identification (identifying the multiple paths that can be taken by a sample based on the pangenome graph) and ending with the results plots (Manhattan plot and QQ plot).
- Run full tool :
# binary trait
stoat -p <pg.pg> -d <dist.dist> -v <vcf.vcf.gz> -b <phenotype.txt> -o output
# quantative trait
stoat -p <pg.pg> -d <dist.dist> -v <vcf.vcf.gz> -q <phenotype.txt> -o output
Explanation of all options:
-p Path to the input pangenome `.pg` file (mutually exclusive, required if `-l` not provided).
-d Path to the input distance index `.dist` file (mutually exclusive, required if `-l` not provided).
-t Specifies the children threshold (optional).
-v Path to the merged VCF file (`.vcf` or `.vcf.gz`) (required).
-r Path to the VCF file referencing all snarl positions (`.vcf` or `.vcf.gz`) (mutually exclusive, required if `-p and -d` not provided).
-l, --listpath
Path to the list of paths file (optional).
-b, --binary
Path to the binary group file (`.txt` or `.tsv`) (mutually exclusive, required if `-q` not provided).
-q, --quantitative
Path to the quantitative phenotype file (`.txt` or `.tsv`) (mutually exclusive, required if `-b` not provided).
-c, --covariate (working progress)
Path to the covariate file (`.txt` or `.tsv`), LMM analysis (optional).
-g, --gaf
Prepare binary GWAS output for a GAF file and create a GAF file with the top 10 significant paths (optional).
-o, --output
Specifies the base path for the output directory (optional).
Alternatively, you can specify the script you want to launch, depending on your desired task:
-
list_snarl_paths.py
: Identifies snarl paths. (By default, it excludes snarls with more than 50 children or 2 second computation by snarl)
Input: Graph information files (.dist
and.pg
)
Output: A two-column string file (.tsv
) containing the snarl reference and the corresponding list of paths. -
children_distribution.py
: An optional script to help decide the thresholds forlist_snarl_paths.py
based on the children distribution in the netgraph (pangenome graph).
Input: Graph information files (.dist
and.pg
)
Output: None -
snarl_analyser.py
: The main script. It parses and analyzes all input files to return p-value associations.
Input:- Snarl list paths (
.tsv
or.txt
) - Binary or quantitative phenotype (
.tsv
or.txt
) - Merged VCF (
.vcf.gz
or.vcf
)
Output: P-value associations.
- Snarl list paths (
-
gaf_creator.py
: Creates a GAF file from a list of binary p-value associations (output ofsnarl_analyser.py
).
Input: Binary p-value association file.
Output: GAF file. -
p_value_analysis.py
: Creates a Manhattan plot and QQ plot (output ofsnarl_analyser.py
).
Input: Binary or Quantitative p-value association file.
Output: 2 PNG file plots. -
Example of specific script of tool :
# decompose pangenome
python3 stoat/list_snarl_paths.py -p <pg.pg> -d <dist.dist> -o <paths.txt>
# binary trait with list_path already computed and gaf creation
stoat -l <paths.txt> -v <vcf.vcf.gz> -r <ref.vcf.gz> -b <phenotype.txt> --gaf -o output.tsv
# quantitative trait with list_path already computed
stoat -l <paths.txt> -v <vcf.vcf.gz> -r <ref.vcf.gz> -q <phenotype.txt> -o output.tsv
Column Name | Description |
---|---|
CHR | Chromosome name where the variation occurs. |
POS | Position of the snarl within the chromosome. |
SNARL | Identifier for the variant, snarl name/id |
TYPE | Type of genetic variation (e.g., SNP, INS, DEL). |
REF | Sequence of the reference allele. |
ALT | Sequence of the alternate allele. |
P_FISHER | P-value calculated using Fisher's exact test (binary analysis). |
P_CHI2 | P-value calculated using the Chi-squared test (binary analysis). |
ALLELE_NUM | Total number of alleles that pass in this snarl. |
MIN_ROW_INDEX | Minimum group of samples that pass through one path of the snarl. (binary analysis). |
NUM_COLUM | Number of paths in the snarl. (binary analysis). |
INTER_GROUP | Sum of the minimum samples that pass through each path. (binary analysis). |
AVERAGE | Average number of total samples passing through this snarl, divided by the number of paths. (binary analysis). |
P | P-value calculated using linear regression (quantitative analysis). |
RSQUARED | R-squared value, proportion of variance explained by the model (quantitative analysis). |
SE | Mean Standard error, estimatation coefficients of all paths in a snarl (quantitative analysis). |
BETA | Mean Beta coefficients, estimatation effect sizes of the prediction of all paths in a snarl (quantitative analysis). |
Below is an example of the output for a binary phenotype analysis:
CHR POS SNARL TYPE REF ALT P_FISHER P_CHI2 ALLELE_NUM MIN_ROW_INDEX NUM_COLUM INTER_GROUP AVERAGE
1 12 5262721_5262719 SNP A T 0.4635 0.5182 286 2 137 46 143.0
1 15 5262719_5262717 INS A ATT 0.8062 0.8747 286 2 141 34 143.0
1 18 5262717_5262714 DEL AA T 0.2120 0.2363 286 2 134 32 143.0
Below is an example of the output for a quantitative phenotype analysis (-q option) :
CHR POS SNARL TYPE REF ALT RSQUARED BETA SE P
1 12 5262721_5262719 SNP A T 8.3697e-01 1.3878e+01 6.5108e+00 4.0376e-01
1 15 5262719_5262717 INS A ATT 4.4237e-01 1.3238e+01 6.5345e+00 4.6574e-01
1 18 5262717_5262714 DEL AA T 6.3237e-01 1.6458e+01 6.6453e+00 4.7484e-01
1 19 5262717_5262714 COMPLEX C NA 4.2342e-01 2.3242e+01 5.3251e+00 1.3245e-01
STOAT will generated a manhattan and a QQ plot for binary and quantitatif analysis.
Use gaf_creator.py
or stoat --gaf
to geneate a GAF file and sequenceTubeMap tool to visualize your gwas binary region results.
python3 stoat/gaf_creator.py -s <binary_gwas_stoat_output.tsv> -l <paths.tsv> -p <pg.pg>
Description : Color represente the different paths group (red : group 1 & blue : group 0) and opacity represente the number of samples in that paths (number of samples passing trought each paths % 60).
-
Implement LMM (Linear Mixed Models).
-
Correct GitHub Actions.