The package provides a thin wrapper around pysam and rsidx to parse VCF files containing GWAS summary statistics and trait metadata. See also gwasvcf an R package for parsing GWAS-VCF files.
Parses GWAS-VCF with version 1.0 of the specification
pip install pygwasvcf
Download over 10,000 GWAS-VCF files from https://gwas.mrcieu.ac.uk
Read GWAS trait/study metadata
import pygwasvcf
with pygwasvcf.GwasVcf("/path/to/gwas.vcf.gz") as g:
# print dictionary of GWAS metadata
print(g.get_metadata())
Query variant-trait association(s) by chromosome and position location
import pygwasvcf
with pygwasvcf.GwasVcf("/path/to/gwas.vcf.gz") as g:
# query by chromosome and position interval
for variant in g.query(contig="1", start=1, stop=1):
print(variant)
Query variant-trait association(s) by dbSNP rsID
import pygwasvcf
with pygwasvcf.GwasVcf("/path/to/gwas.vcf.gz") as g:
# index on dbSNP identifier
# based on [rsidx](https://github.com/bioforensics/rsidx)
# only need to do this once and then provide the index path to the constructor
# i.e. GwasVcf("/path/to/gwas.vcf.gz", rsidx_path="/path/to/gwas.vcf.gz.rsidx")
g.index_rsid()
# rapid query by rsID
for variant in g.query(variant_id="rs1245"):
print(variant)
Extract summary statistics from a variant object
import pygwasvcf
with pygwasvcf.GwasVcf("/path/to/gwas.vcf.gz") as g:
# query by chromosome and position interval
for variant in g.query(contig="1", start=1, stop=1):
# print variant-trait P value
print(pygwasvcf.VariantRecordGwasFuns.get_pval(variant, "trait_name"))
# print variant-trait SE
print(pygwasvcf.VariantRecordGwasFuns.get_se(variant, "trait_name"))
# print variant-trait beta
print(pygwasvcf.VariantRecordGwasFuns.get_beta(variant, "trait_name"))
# print variant-trait allele frequency
print(pygwasvcf.VariantRecordGwasFuns.get_af(variant, "trait_name"))
# print variant-trait ID
print(pygwasvcf.VariantRecordGwasFuns.get_id(variant, "trait_name"))
# create and print ID on-the-fly if missing
print(pygwasvcf.VariantRecordGwasFuns.get_id(variant, "trait_name", create_if_missing=True))
# print variant-trait sample size
print(pygwasvcf.VariantRecordGwasFuns.get_ss(variant, "trait_name"))
# print variant-trait total sample size from header if per-variant is missing
print(pygwasvcf.VariantRecordGwasFuns.get_ss(variant, "trait_name", g.get_metadata()))
# print variant-trait number of cases
print(pygwasvcf.VariantRecordGwasFuns.get_nc(variant, "trait_name"))
# print variant-trait total cases from header if per-variant is missing
print(pygwasvcf.VariantRecordGwasFuns.get_nc(variant, "trait_name", g.get_metadata()))
API documentation available from https://mrcieu.github.io/pygwasvcf
The variant call format provides efficient and robust storage of GWAS summary statistics. Lyon MS, Andrews SJ, Elsworth B, Gaunt TR, Hemani G, Marcora E. Genome Biol. 22, 32 (2021).