endreth/jVCFparser

jVCFparser

A command-line parser for VCF files designed for population genetics analyses.

This is a beta version of jVCFparser, a command-line tool for processing Variant Call Format (VCF) files. While reading a file, the tool keeps only memory-efficient summary data (allele and genotype frequencies) and then performs population genetics calculations on it. Because the full genotype matrix is not stored, large files can be processed. Reading a file may take some time, but the calculations themselves are fast. The tool has been tested on VCF versions 4.0 and 4.2.

VCF 4.2 description (08.23.2022): The manual can be accessed on the SAMtools site.

Date of last modification: 04.04.2023

Usage (example)

Get the JAR artifact HERE!

$ java -jar jVCFparser.jar -f ".\populations.snps.vcf" -mg 

| Flag | Long flag | Description |
|------|-----------|-------------|
| -mg | -missg | Missing genotype counts |
| -ra | -refa | REF allele counts |
| -aa | -alta | ALT allele counts |
| -gc | -gcounts | Genotype counts |
| -dgc | -diffgcounts | Different genotype counts |
| -da | -dacounts | Different allele counts |
| -ea | -eacounts | Effective allele counts |
| -het | -hetcounts | Heterozygote counts |
| -hom | -homcounts | Homozygote counts |
| -oh | -obshet | Average Observed Heterozygosity (Ho) |
| -eh | -exphet | Average Expected Heterozygosity (He) |
| -ueh | -uexphet | Average Unbiased Expected Heterozygosity (uHe) |
| -sh | -shann | Average Shannon's Information Index (H) |
| -si | -simp | Average Simpson's Diversity Index (D) |
| -fx | -fix | Average Fixation Index (F) |
| -ar | -arich | Average Allelic Richness (Ar) |

Requirements:
GNU/Linux, Microsoft Windows, or macOS
Java runtime environment (Java 11 or later)

Sample data used for testing:
~25K SNP (loci) and 180 sample matrix: Sessile oak SNP dataset; de novo assembly; File size: ~15MB
~50K SNP (loci) and ~20K sample matrix: SoySNP50K iSelect BeadChip, Wm82.a1; File size: ~3.5GB
~1.3M SNP (loci) and ~2K sample matrix: 1000 genomes project, Phase 3, Chromosome 21; File size: ~10GB

The following allele and genotype counts are collected while reading a file:

  • Number of reference allele '0'
  • Number of alternative allele '1'
  • Number of unique alleles
  • Number of homozygote genotypes (e.g. 0/0 or 1/1)
  • Number of heterozygote genotypes (e.g. 0/1 or 1/0)
  • Number of missing genotypes (e.g. ./.)
  • Number of unique genotypes
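
The counts listed above can be gathered from a single VCF data line without keeping the genotypes themselves. The following is a minimal sketch, not the tool's actual code: the class `LocusTally` and its fields are hypothetical names, calls are assumed to be diploid (anything else, including partially missing calls, is counted as missing), and `0/1` and `1/0` are assumed to count as the same genotype.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of jVCFparser): tallies one VCF data line. */
class LocusTally {
    int ref, alt, hom, het, missing;
    final Set<String> alleles = new HashSet<>();
    final Set<String> genotypes = new HashSet<>();

    /** Parses the sample columns (index 9 onward) of a tab-separated VCF data line. */
    static LocusTally fromVcfLine(String line) {
        String[] cols = line.split("\t");
        LocusTally t = new LocusTally();
        for (int s = 9; s < cols.length; s++) {
            // When GT is present it is the first colon-separated subfield
            String gt = cols[s].split(":", 2)[0];
            // Accept both unphased (/) and phased (|) separators
            String[] a = gt.split("[/|]");
            if (a.length != 2 || a[0].equals(".") || a[1].equals(".")) {
                t.missing++; // assumption: non-diploid or partially missing calls are missing
                continue;
            }
            for (String allele : a) {
                if (allele.equals("0")) t.ref++;
                else if (allele.equals("1")) t.alt++;
                t.alleles.add(allele);
            }
            if (a[0].equals(a[1])) t.hom++; else t.het++;
            // Assumption: 0/1 and 1/0 are the same genotype, so order the pair
            String key = a[0].compareTo(a[1]) <= 0 ? a[0] + "/" + a[1] : a[1] + "/" + a[0];
            t.genotypes.add(key);
        }
        return t;
    }
}
```

Only the seven counters survive per locus, which is why memory use in a scheme like this depends on the number of loci rather than on the number of samples.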

The currently implemented diversity-related descriptive statistics and counts are as follows:

  • Number of missing genotypes
  • Number of REF alleles
  • Number of ALT alleles
  • Number of genotypes
  • Number of heterozygotes
  • Number of homozygotes
  • Average number of different genotypes (Ng)
  • Average number of different alleles (Na)
  • Average number of effective alleles (Ne)
  • Average Observed Heterozygosity (Ho)
  • Average Expected Heterozygosity (He)
  • Average Unbiased Expected Heterozygosity (uHe)
  • Average Shannon's Information Index (H)
  • Average Simpson's Diversity Index (D)
  • Average Fixation Index (F)
  • Average Allelic Richness (Ar)

Calculation details

Formulas used in the calculations and their references.

Average number of different genotypes (Ng):

$$N_g = \frac{1}{n}\sum_{i=1}^{n} g_i$$

Ng is the mean number of distinct genotypes across n loci, where g_i is the number of distinct genotypes at locus i (i = 1, 2, ..., n). It is the sum of the per-locus counts divided by the total number of loci.

Average number of different alleles (Na):

$$N_a = \frac{1}{n}\sum_{i=1}^{n} a_i$$

Na is the mean number of different alleles across n loci, where a_i is the number of different alleles at locus i (i = 1, 2, ..., n). It is the sum of the per-locus counts divided by the total number of loci.
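
Ng and Na are both plain means of per-locus counts, so one helper covers them. A minimal sketch, assuming the per-locus counts are already available as an `int[]`; the class name `LocusMeans` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: mean of per-locus counts (usable for Ng and Na). */
class LocusMeans {
    /** Returns NaN for an empty array instead of dividing by zero. */
    static double mean(int[] perLocus) {
        return Arrays.stream(perLocus).average().orElse(Double.NaN);
    }
}
```

For example, four loci with 3, 2, 1 and 2 distinct genotypes give Ng = 8 / 4 = 2.0.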

Average number of effective alleles (Ne):

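$$N_e = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sum_{j} p_{ij}^{2}}$$

Ne is the mean number of effective alleles across n loci, where p_ij is the frequency of allele j at locus i (i = 1, 2, ..., n). For each locus it is the inverse of the sum of squared allele frequencies; the per-locus values are then summed and divided by the total number of loci. Based on Brown & Weir (1983).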
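
A minimal sketch of the effective number of alleles, assuming the usual definition 1 / Σp² per locus (the sum of squared allele frequencies) and allele frequencies that sum to 1 at each locus. The class name `EffectiveAlleles` is hypothetical.

```java
/** Hypothetical helper: effective number of alleles, assuming Ne = 1 / sum(p^2). */
class EffectiveAlleles {
    /** Effective number of alleles at one locus. */
    static double atLocus(double[] freqs) {
        double sumSq = 0.0;
        for (double p : freqs) sumSq += p * p;
        return 1.0 / sumSq;
    }

    /** Mean across loci; each row holds the allele frequencies of one locus. */
    static double mean(double[][] loci) {
        double total = 0.0;
        for (double[] f : loci) total += atLocus(f);
        return total / loci.length;
    }
}
```

Two equally frequent alleles give 1 / (0.25 + 0.25) = 2.0 effective alleles, while a fixed locus gives 1.0.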

Average Observed Heterozygosity (Ho):

$$H_o = \frac{1}{n}\sum_{i=1}^{n}\frac{h_i}{N}$$

Ho is the average proportion of heterozygous individuals across n loci. For each locus i = 1, 2, ..., n, the proportion is the number of heterozygotes h_i divided by the total number of individuals N. Observed heterozygosity is the mean of these proportions across all n loci. Based on Hartl & Clark (1997).
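
A minimal sketch of the observed heterozygosity calculation. It assumes that N is taken per locus as the number of individuals with a non-missing genotype, which the description above does not state; the class name `ObservedHet` is hypothetical.

```java
/** Hypothetical helper: mean observed heterozygosity across loci. */
class ObservedHet {
    /**
     * hetCounts[i] = heterozygotes at locus i,
     * genotyped[i] = individuals counted at locus i (assumed non-missing only).
     */
    static double mean(int[] hetCounts, int[] genotyped) {
        double total = 0.0;
        for (int i = 0; i < hetCounts.length; i++) {
            // Cast before dividing to avoid integer division
            total += (double) hetCounts[i] / genotyped[i];
        }
        return total / hetCounts.length;
    }
}
```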

Average Expected Heterozygosity (He):

$$H_e = \frac{1}{n}\sum_{i=1}^{n}\left(1 - p_i^{2} - q_i^{2}\right)$$

He is the mean probability, across n loci, that two randomly chosen alleles at a locus are different, where p_i and q_i are the allele frequencies at locus i (i = 1, 2, ..., n). It is the average of 1 minus the sum of squared allele frequencies. Based on the intra-locus gene diversity ($H = 1 - p^2 - q^2$) derived from the Hardy-Weinberg equilibrium.
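
A minimal sketch of expected heterozygosity for biallelic loci, computed directly from REF and ALT allele counts. The class name `ExpectedHet` is hypothetical, and every locus is assumed to have at least one counted allele.

```java
/** Hypothetical helper: expected heterozygosity for biallelic loci. */
class ExpectedHet {
    /** He at one locus: 1 - p^2 - q^2, with p and q derived from allele counts. */
    static double atLocus(int refCount, int altCount) {
        double p = (double) refCount / (refCount + altCount);
        double q = 1.0 - p;
        return 1.0 - p * p - q * q;
    }

    /** Mean across loci. */
    static double mean(int[] refCounts, int[] altCounts) {
        double total = 0.0;
        for (int i = 0; i < refCounts.length; i++) {
            total += atLocus(refCounts[i], altCounts[i]);
        }
        return total / refCounts.length;
    }
}
```

A locus with 4 REF and 4 ALT alleles has He = 0.5, the maximum for two alleles; a locus with only REF alleles has He = 0.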

Average Unbiased Expected Heterozygosity (uHe):

$$uH_e = \frac{1}{n}\sum_{i=1}^{n}\frac{2N}{2N-1}\left(1 - p_i^{2} - q_i^{2}\right)$$

uHe is the mean probability, across n loci, that two randomly chosen alleles at a locus are different, corrected for sample size bias. It is the average of 1 minus the sum of squared allele frequencies, multiplied by the correction factor 2N / (2N - 1), where N is the number of individuals. Based on Peakall & Smouse (2006).
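
A minimal sketch of the sample-size correction, assuming the GenAlEx form uHe = (2N / (2N - 1)) · He for N diploid individuals. The class name `UnbiasedHet` is hypothetical.

```java
/** Hypothetical helper: unbiased expected heterozygosity, assuming the 2N/(2N-1) correction. */
class UnbiasedHet {
    /** he = expected heterozygosity at the locus, n = diploid individuals sampled. */
    static double atLocus(double he, int n) {
        return (2.0 * n / (2.0 * n - 1.0)) * he;
    }
}
```

The correction matters mostly for small samples: with N = 4 the factor is 8/7, while with N = 1000 it is about 1.0005.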

Average Shannon's Information Index (H):

$$H = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j} p_{ij}\ln p_{ij}$$

H is the Average Shannon's Information Index: the average uncertainty, across n loci, in predicting the identity of a randomly chosen allele at a locus. For each locus it is the negative sum, over alleles, of the allele frequency p_ij multiplied by its natural logarithm; the per-locus values are then averaged across loci. Based on Brown & Weir (1983).
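
A minimal sketch of the Shannon index per locus and its mean across loci. Alleles with zero frequency are skipped, since p · ln p tends to 0 as p tends to 0; the class name `ShannonIndex` is hypothetical.

```java
/** Hypothetical helper: Shannon's information index from allele frequencies. */
class ShannonIndex {
    /** -sum(p * ln p) at one locus; zero-frequency alleles contribute nothing. */
    static double atLocus(double[] freqs) {
        double h = 0.0;
        for (double p : freqs) {
            if (p > 0.0) h -= p * Math.log(p);
        }
        return h;
    }

    /** Mean across loci; each row holds the allele frequencies of one locus. */
    static double mean(double[][] loci) {
        double total = 0.0;
        for (double[] f : loci) total += atLocus(f);
        return total / loci.length;
    }
}
```

For two alleles the index ranges from 0 (one allele fixed) to ln 2 ≈ 0.693 (both alleles at frequency 0.5).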

Average Simpson's Diversity Index (D):

$$D = 1 - \frac{1}{n}\sum_{i=1}^{n}\sum_{j} p_{ij}^{2}$$

D is the Average Simpson's Diversity Index: the probability, averaged across n loci, that two randomly chosen alleles at a locus are different. It is 1 minus the average, across loci, of the sum of squared allele frequencies p_ij. Based on Simpson (1949) and Morris et al. (2014).
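
As a worked example of the calculation as stated (1 minus the average of the sum of squared allele frequencies), take two hypothetical biallelic loci with allele frequencies (0.7, 0.3) and (0.5, 0.5):

$$
\begin{aligned}
\sum_j p_{1j}^2 &= 0.7^2 + 0.3^2 = 0.58 \\
\sum_j p_{2j}^2 &= 0.5^2 + 0.5^2 = 0.50 \\
D &= 1 - \frac{0.58 + 0.50}{2} = 0.46
\end{aligned}
$$

The same value is obtained by averaging the per-locus values 1 - 0.58 = 0.42 and 1 - 0.50 = 0.50.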

Average Fixation Index (F):

$$F = \frac{1}{n}\sum_{i=1}^{n}\frac{H_{e,i} - H_{o,i}}{H_{e,i}}$$

F is the Average Fixation Index across n loci. For each locus it is the difference between expected heterozygosity (He) and observed heterozygosity (Ho), normalized by expected heterozygosity; the per-locus values are then averaged across n loci. Based on Hartl & Clark (1997).
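
A minimal sketch of the fixation index, using the common sign convention F = (He - Ho) / He = 1 - Ho / He, so that a heterozygote deficit gives a positive F. Both the sign convention and the handling of monomorphic loci (He = 0, skipped here because F is undefined) are assumptions; the class name `FixationIndex` is hypothetical.

```java
/** Hypothetical helper: mean fixation index, assuming F = (He - Ho) / He per locus. */
class FixationIndex {
    /** ho[i] and he[i] are the observed and expected heterozygosity at locus i. */
    static double mean(double[] ho, double[] he) {
        double total = 0.0;
        int used = 0;
        for (int i = 0; i < he.length; i++) {
            if (he[i] == 0.0) continue; // monomorphic locus: F undefined, skipped (assumption)
            total += (he[i] - ho[i]) / he[i];
            used++;
        }
        return used == 0 ? Double.NaN : total / used;
    }
}
```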

Average Allelic Richness (Ar):

ar
Ar is the Average Allelic Richness, defined by analogy with Hurlbert's expected number of species as the expected number of distinct alleles in a sample of n genotypes selected at random from a collection containing N alleles ("genes") from S loci. It is calculated as the number of alleles observed in a sample of size Ni, normalized by the sample size Ni and averaged across S loci. Based on Hurlbert (1971) and El Mousadik & Petit (1996). NOTE: Not designed and not suitable for big data!
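
For orientation, the rarefaction formula of Hurlbert (1971), as applied to alleles by El Mousadik & Petit (1996), gives the expected number of distinct alleles in a subsample of g gene copies as Σ [1 - C(N - N_j, g) / C(N, g)], where N_j is the count of allele j and N the total count at the locus. The sketch below implements that formula for one locus; it is not necessarily what jVCFparser computes, the class name `AllelicRichness` is hypothetical, and g is assumed not to exceed N.

```java
/** Hypothetical helper: rarefied allelic richness at one locus (Hurlbert-style rarefaction). */
class AllelicRichness {
    /** Natural log of the binomial coefficient C(n, k), summed term by term to avoid overflow. */
    static double logChoose(int n, int k) {
        double s = 0.0;
        for (int j = 1; j <= k; j++) s += Math.log((double) (n - k + j) / j);
        return s;
    }

    /** Expected number of distinct alleles among g gene copies drawn without replacement. */
    static double rarefied(int[] alleleCounts, int g) {
        int total = 0;
        for (int c : alleleCounts) total += c;
        double expected = 0.0;
        for (int c : alleleCounts) {
            if (c == 0) continue; // allele not observed at this locus
            // P(allele absent from the subsample) = C(total - c, g) / C(total, g)
            double pAbsent = (total - c < g) ? 0.0
                    : Math.exp(logChoose(total - c, g) - logChoose(total, g));
            expected += 1.0 - pAbsent;
        }
        return expected;
    }
}
```

With allele counts (3, 1) and g = 2, three of the six possible pairs contain both alleles and three contain only the common one, so the expected richness is 1.5.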

References
Brown, A. H., & Weir, B. S. (1983). Measuring genetic variability in plant populations. Isozymes in plant genetics and breeding, part A, 219-239.

El Mousadik, A., & Petit, R. J. (1996). High level of genetic differentiation for allelic richness among populations of the argan tree [Argania spinosa (L.) Skeels] endemic to Morocco. Theoretical and applied genetics, 92, 832-839.

Hartl, D. L., & Clark, A. G. (1997). Principles of population genetics (Vol. 116). Sunderland: Sinauer associates.

Hurlbert, S. H. (1971). The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4), 577-586.

Morris, E. K., Caruso, T., Buscot, F., Fischer, M., Hancock, C., Maier, T. S., ... & Rillig, M. C. (2014). Choosing and using diversity indices: insights for ecological applications from the German Biodiversity Exploratories. Ecology and evolution, 4(18), 3514-3524.

Peakall, R., & Smouse, P. E. (2006). GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular ecology notes, 6(1), 288-295.

Simpson, E. H. (1949). Measurement of diversity. Nature, 163(4148), 688.