Skip to content

Allele frequency imputation through linkage-informed k-nearest neighbour-derived weighted mean allele frequency prediction

License

Notifications You must be signed in to change notification settings

jeffersonfparil/imputef

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

imputef

Impute allele frequencies to reduce sparsity of genotype data from polyploids, pooled individuals, and populations.

Build Status License
License: GPL v3

Overview

Installation

Pre-compiled binaries

  1. Download the appropriate executable binary compatible with your system
  1. Configure and execute
  • In GNU/Linux (macOS binary pending):
chmod +x imputef
./imputef
  • In Windows, open command prompt via: Win + R, type "cmd" and press enter. Navigate to your download folder and execute, e.g. imputef-x86_64-windows.exe -h.

Compile from source

  1. Clone the repository
git clone https://jeffersonfparil:<API_KEY>@github.com/jeffersonfparil/imputef.git main
  1. Load the Rust development environment via Conda (please see Conda installation instructions if you do not have Conda pre-installed)
cd imputef/
conda env create --file res/rustenv.yml
conda activate rustenv
  1. Compile and optionally create an alias or a symbolic link or add to $PATH
cargo build --release
target/release/imputef -h
# Option 1: alias
echo alias imputef="$(pwd)/target/release/imputef" >> ~/.bashrc
source ~/.bashrc
# Option 2: symlink
sudo ln -s $(pwd)/target/release/imputef /usr/bin/imputef
# Option 3: add to $PATH (Note that you need to place this in your ~/.bashrc or the appropriate shell initialisation file)
export PATH=${PATH}:$(pwd)/target/release
### Check
type -a imputef
cd ~; imputef -h; cd -

Usage

Quickstart

imputef -h
imputef -f tests/test.tsv   # allele frequency table as input (tab-delimited)
imputef -f tests/test.csv   # allele frequency table as input (comma-separated)
imputef -f tests/test.ssv   # allele frequency table as input (semi-colon-separated)
imputef -f tests/test.sync  # synchronised pileup file as input
imputef -f tests/test.vcf   # variant call format as input without missing data
imputef -f tests/test_2.vcf   # variant call format as input
imputef -f tests/test_2.vcf --method mean # use mean value imputation
imputef -f tests/test_2.vcf --min-loci-corr=0.75 --max-pool-dist=0.25 # define some minimum loci correlation and maximum genetic distance thresholds

Arguments

Argument or flag Description
--fname Filename of the genotype file to be imputed in uncompressed vcf, sync, or allele frequency table. Details on these genotype formats are available below.
--method Imputation method. Use "mean" for mean value imputation or "aldknni" for allele frequency LD-kNN imputation. [Default = "aldknni"]
--min-coverage Minimum coverage per locus, i.e. if at a locus, a pool falls below this value (does not skip missing data, i.e. missing locus has a depth of zero), then the whole locus is omitted. Set this to zero if the vcf has been filtered and contains missing values, i.e. ./. or `.
--min-allele-frequency Minimum allele frequency per locus, i.e. if at a locus, a pool has all its alleles below this value and/or above the additive complement of this value (skipping missing data), then the entire locus is omitted. [Default = 0.0001]
--max-missingness-rate-per-locus Maximum fraction of pools missing per locus, i.e. if at a locus, there were more pools missing than the coverage dictated by this threshold, then the locus is omitted. [Default = 1.00]
--pool-sizes Vector of pool sizes, i.e. the number of individuals included in each pool. Enter the pool sizes separated by commas ,. This can also be set to a single arbitrarily large value for example 100 for individual polyploids or if allele frequency estimates are expected to be accurate. [Default = 100.0]
--min-depth-below-which-are-missing Minimum depth at which loci with depth below this threshold are set to missing. Set to one if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an integer above zero. [Default = 1.00]
--max-depth-above-which-are-missing Maximum depth at which loci with depth above this threshold are set to missing. Set to some large arbitrarily large value (e.g. 1000000) if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an integer above zero. [Default = 1000000.0]
--frac-top-missing-pools Fraction of pools with the highest number of missing loci to be omitted. Set to zero if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to a decimal number between zero and one. [Default = 0.0]
--frac-top-missing-loci Fraction of loci with the highest number of pools with missing data to be omitted. Set to zero if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an decimal number between zero and one. [Default = 0.0]
--min-loci-corr Minimum correlation (Pearson's correlation) between the locus requiring imputation and other loci deemed to be in linkage with it. Ranges from 0.0 to 1.0, but use -1.0 or any negative number to perform per locus optimisations to find the best value minimising imputation. [Default = 0.9]
--max-pool-dist Maximum genetic distance (mean absolute difference in allele frequencies) between the pool or sample requiring imputation and pools or samples deemed to be the closest neighbours. Ranges from 0.0 to 1.0, but use -1.0 or any negative number to perform per locus optimisations to find the best value minimising imputation. [Default = 0.1]
--min-l-loci Minimum number of linked loci to be used in estimating genetic distances between the pool or sample requiring imputation and other pools or samples (minimum value of 1). This argument overrides --min-loci-corr, i.e. the minimum number of loci will be met regardless of the minimum loci correlation threshold. [Default = 20]
--min-k-neighbours Minimum number of k-nearest neighbours of the pool or sample requiring imputation (minimum value of 1). This argument overrides --max-pool-dist, i.e. the minimum number of k-nearest neighbours will be met regardless of the maximum genetic distance threshold. [Default = 5]
--restrict-linked-loci-per-chromosome Restrict the choice of linked loci to within the chromosome the locus requiring imputation belongs to? [default: false] [Default = false; i.e. no flag]
--n-reps Number of replications for the estimation of imputation accuracy in terms of mean absolute error (MAE). It is used to define the number of random non-missing samples to use as replicates for the estimation of MAE and optimisation (minimum value of 1). [Default = 10]
--n-threads Number of computing threads or processor cores to use in the computations. [Default = 2]
--fname-out-prefix Prefix of the output files including the imputed allele frequency table (<fname-out-prefix>-<time>-<random-id>-IMPUTED.tsv). [Default = ""; which corresponds to the name of the input genotype file]

Genotype data formats

Note

Header line and comments should be prepended by '#'.

Variant call format (vcf)

  • canonical variant calling or genotype data format for individual samples. This should include the AD field (allele depth), and may or may not have genotypes called (e.g. generated via bctools mpileup -a AD,DP ...). If the GT field is present but the AD field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool.
  • See VCFv4.2 and VCFv4.3 for details in the format specifications.
  • The allele depth information (AD; i.e. the unfiltered allele depth which includes the reads which did not pass the variant caller filters) is used to calculate allele frequencies.
  • If the GT field is present but the AD field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool. If both GT and AD fields are present, then the AD field takes priority.
  • See tests/test.vcf for an example.

Synchronised pileup (sync)

  • an extension of popoolation2's sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
  • tab-delimited
  • Header line: header line including the names of the samples or pools, i.e. #chr\tpos\tref\t<id_1>\t<id_2>\t<id_3>\t<id_4>\t<id_5>
  • Column 1: chromosome or scaffold name
  • Column 2: locus position
  • Column 3: reference allele, e.g. A, T, C, G
  • Column/s 4 to n: colon-delimited allele counts: A:T:C:G:DEL:N, where "DEL" refers to insertion/deletion, and "N" is unclassified. A pool or population or polyploid individual is represented by a single column of this colon-delimited allele counts.
  • See tests/test.sync for an example.

Allele frequency table (tsv)

  • tab-delimited
  • Header line: #chr\tpos\tallele\t<pool_name_1>\t...\t<pool_name_n>
  • each locus is represented by 1 or more rows, e.g. 1 for biallelic loci (representing the reference or alternative or minor allele), 2 for biallelic loci representing both alleles, and >2 for multi-allelic loci
  • This is the sole output format of the imputation process, regardless of the format of the input genotype file.
  • See tests/test.tsv for an example.

Details

Imputation of allele frequency from polyploids and pooled samples via (1) mean value imputation, and (2) linkage disequillibrium (LD) k-nearest neighbour (kNN)-based weighted allele frequency prediction. The latter is an extension of the LD-kNNi method of Money et al, 2015, i.e. LinkImpute, which was an extension of the kNN imputation of Troyanskaya et al, 2001. Similar to LD-kNNi, LD is estimated using Pearson's product moment correlation per pair of loci. Mean absolute difference in allele frequencies is used to define the genetic distance between samples, instead of taxicab or Manhattan distance used in LD-kNNi. Four parameters can be set by the user:

  1. minimum loci correlation threshold - dictates the minimum LD between the locus requiring imputation and other loci which will be used to estimate genetic distance between samples;
  2. maximum genetic distance threshold - sets the maximum genetic distance between the sample requiring imputation and the samples (i.e. nearest neighbours) to be used in weighted mean imputation of missing allele frequencies;
  3. minimum number of loci linked to the locus requiring imputation - overrides minimum loci correlation threshold if this minimum is not met; and
  4. minimum k-nearest neighbours - overrides maximum genetic distance threshold if this minimum is not met.

The first two parameters (minimum loci correlation and maximum genetic distance thresholds) can be optimised per locus by setting --min-loci-corr=-1.0 and/or --max-pool-dist=-1.0.

--method="mean": mean value imputation

This imputation uses the arithmetic mean of the observed allele frequencies across all samples where the locus was genotyped:

$$ \hat q_{r,j} = { {1 \over (n-m)} { \sum_{i \ne r}^{n} q_{i,j} } } $$

where:

  • $\hat q_{r,j}$ is the imputed allele frequency of sample $r$ at the $j^{\text {th}}$ locus,
  • $n$ is the total number of samples,
  • $m$ is the number of samples which are missing data at the $j^{\text {th}}$ locus, and
  • $q_{i,j}$ is the known allele frequency of the $i^{\text {th}}$ sample at the $j^{\text {th}}$ locus.

--method="aldknni": allele frequency linkage disequilibrium (LD)-based k-nearest neighbour imputation of genotype data

This is an extension of the LD-kNNi method of Money et al, 2015, i.e. LinkImpute, which was an extension of the kNN imputation of Troyanskaya et al, 2001. Similar to LD-kNNi, linkage disequilibrium (LD) is estimated using Pearson's product moment correlation per pair of loci, which is computed per chromosome by default, but can be computed across the entire genome. We use the mean absolute difference/error (MAE) between allele frequencies among linked loci as an estimate of genetic distance between samples. Fixed values for the minimum correlation threshold to identify loci used in distance estimation, and maximum genetic distance threshold to select the k-nearest neighbours can be defined. Additionally, minimum number of loci to include in distance estimation, and minimum number of nearest neighbours can be set. Moreover, the minimum correlation and maximum genetic distance can be optimised per locus by minimising the MAE between predicted and expected allele frequencies using simulated missing or masked data. The number of masked data is controlled by the number of replications (--n-reps) and the total number of samples non-missing at the locus requiring imputation.

The imputed allele frequency is computed as:

$$ \hat q_{r,j} = { \sum_{i \ne r}^{k} q_{i,j} (1 - \delta_{i,r}) } $$

with:

$$ \delta_{i,r} = { {1 \over \sum d_{i,r}} d_{i,r} } $$

and

$$ d_{i,r} = { {1 \over c} { \sum_{j=1}^{c} |q_{i,j} - q_{r,j}| } } $$

where:

  • $\hat q_{r,j}$ is the imputed allele frequency of sample $r$ at the $j^{\text {th}}$ locus,
  • $n$ is the total number of samples,
  • $m$ is the number of samples which are missing data at the $j^{\text {th}}$ locus,
  • $q_{i,j}$ is the known allele frequency of the $i^{\text {th}}$ sample at the $j^{\text {th}}$ locus,
  • $k$ is the number of nearest neighbours or the samples most closely related to the sample requiring imputation, i.e. sample $r$ at locus $j$, and
  • $\delta_{i,r}$ is scaled $d_{i,r}$ which is the genetic distance between the $i^{\text {th}}$ sample and sample $r$. This distance is the mean absolute difference in allele frequencies between the two samples across $c$ linked loci.

The variables $k$ and $c$ are proportional to the user inputs --max-pool-dist (default=0.1) and --min-loci-corr (default=0.9), respectively. The former defines the maximum distance of samples to be considered as one of the k-nearest neighbours, while the latter refers to the minimum correlation with the locus requiring imputation to be included in the estimation of the genetic distance.

LD estimation and imputation per se are multi-threaded, and the imputation output is written into disk as an allele frequency table. The structs, traits, methods, and functions defined in this tool are subsets of poolgen, and will eventually be merged.

Optimisation approaches

  1. Using the built-in grid-search optimisation for minimum loci correlation and maximum genetic distance thresholds per locus:
imputef -f tests/test_2.vcf --min-loci-corr=-1 --max-pool-dist=-1
  1. Grid-search optimisation for all four parameters which assumes a common set of optimal parameters across all loci
### Find the optimal l-loci to use for distance estimation, k-nearest neighbours, minimum loci correlation, and maximum distance
### i.e. the combination of these 4 parameters which minimises the mean absolute error between expected and predicted allele frequencies.
echo 'l,k,corr,dist,mae' > grid_search.csv
for l in 5 10 15
do
    for k in 1 2 3 4 5
    do
        for corr in 0.75 0.95 1.00
        do
            for dist in 0.0 0.10 0.25
            do
                echo "@@@@@@@@@@@@@@@@@@@@@@@@"
                echo ${l},${k},${corr},${dist}
                imputef -f tests/test_2.vcf \
                    --min-l-loci=${l} \
                    --min-k-neighbours=${k} \
                    --min-loci-corr=${corr} \
                    --max-pool-dist=${dist} \
                    --n-reps=3 > log.tmp
                fname_out=$(tail -n1 log.tmp | cut -d':' -f2 | tr -d ' ')
                mae=$(grep "Expected imputation accuracy in terms of mean absolute error:" log.tmp | cut -d':' -f2 | tr -d ' ')
                echo ${l},${k},${corr},${dist},${mae} >> grid_search.csv
                rm $fname_out log.tmp
            done
        done
    done
done
awk -F',' 'NR == 1 || $5 < min {min = $5; min_line = $0} END {print min_line}' grid_search.csv

Performance evaluation

Datasets:

  • autotetraploid Dactylis glomerata (Cocksfoot/Orchardgrass; 2n=4x=28; ~3.5 Gb genome since diploids are 1.77Gb see Huang et al., 2020; 51 samples x 50,281 biallelic loci; Huang et al., 2020 using biallelic SNPs filtered by minimum depth of 17X, maximum depth of 1000X, and minimum allele frequency of 0.05)
  • diploid Vitis vinifera (Grape; 2n=2x=38; 0.5 Gb genome; 77 samples x 8,506 loci biallelic; Money et al., 2015) with the 2.90% missing data prior to sparsity simulations
  • pools of diploid Glycine max (Soybean; 2n=2x=20; 1.15 Gb genome; 478 pools (each pool comprised of 42 individuals) x 39,636 biallelic loci; source: http://gong_lab.hzau.edu.cn/Plant_imputeDB/#!/download_soybean)

Performance metrics:

  • Concordance: $c = {{1 \over n} \Sigma_{i=1}^{n} p_i}$, where:

$$ p_i= \begin{cases} 0 \text{ if } \hat g \ne g_{true}\\ 1 \text{ if } \hat g = g_{true} \end{cases} $$

This is used for genotype classes, i.e., binned allele frequencies: $g = {{1 \over {ploidy}} round(q*ploidy)}$, here $q = P(allele)$. Note that there is alternative way of defining these genotype classes with strict boundaries, i.e., homozygotes have fixed allele frequencies.

  • Mean absolute error: $mae = {{1 \over n} \Sigma_{i=1}^{n}|\hat q - q_{true}|}$.
  • Coefficient of determination: $R^2 = { 1 - {{\Sigma_{}^{}(\hat q - q_{true})^2} \over {\Sigma_{}^{}(\hat q_{true} - \bar q_{true})^2}} }$



Autotetraploid (Cocksfoot/Orchardgrass) mean absolute error

mae_barplots

Pool (Soybean pools) mean absolute error

mae_barplots

Diploid (Grape) mean absolute error

mae_barplots




Autotetraploid (Cocksfoot/Orchardgrass) concordance of observed and imputed genotype classes

concordance_genotype_classes_barplots

Pool (Soybean pools) concordance of observed and imputed genotype classes

concordance_genotype_classes_barplots

Diploid (Grape) concordance of observed and imputed genotype classes

concordance_genotype_classes_barplots




Troubleshooting

  • Out-of-memory (OOM) error is likely due to the pairwise LD estimation across the genome. We have taken steps to reduce the likelihood of this happening by using u8 instead of f64 in calculating Pearson's correlation between pairs of loci. However, if you encountered this OOM error, please consider using the flag --restrict-linked-loci-per-chromosome to estimate pairwise LD per chromosome only. This assumes that you have a dense coverage of the genome, i.e., there are enough markers with non-missing data to accurately determine relationships between loci and samples to yield good imputation accuracies.

References

  • Money D, Gardner K, Migicovsky Z, Schwaninger H, Zhong GY, Myles S. LinkImpute: fast and accurate genotype imputation for nonmodel organisms. G3: Genes|Genomes|Genetics. 2015;5(11):2383–90. doi:10.1534/g3.115.021667.
  • Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T et al. , 2001 Missing value estimation methods for DNA microarrays. Bioinformatics 17: 520–525.
  • Schwender H, 2012 Imputing missing genotypes with weighted k nearest neighbors. J. Toxicol. Environ. Health A 75: 438–446.

Acknowledgements

This work was conceived and developed during my employment in Agriculture Victoria. The imputation algorithm in this repository was inspired by the algorithms presented in the 3 papers above and Luke Pembletton's tetraploid imputation algorithm written in R. The core data structures, traits, and methods are largely shared with my open-source (GPLv3) project poolgen.

About

Allele frequency imputation through linkage-informed k-nearest neighbour-derived weighted mean allele frequency prediction

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages