revised docs + ready for prelim tests

jeffersonfparil · Jan 15, 2024 · 37b41e1 · 37b41e1
1 parent e76c720
commit 37b41e1
Show file tree

Hide file tree

Showing 4 changed files with 131 additions and 126 deletions.
diff --git a/R/imputef.R b/R/imputef.R
@@ -15,7 +15,7 @@
 #'        n_threads=2,
 #'        fname_out_prefix="")
 #' @param fname
-#' filename of the uncompressed vcf file which has the AD (allele depth) field and may or may not have genotypes called (e.g. generated via `bctools mpileup -a AD,DP ...`). If the GT field is present but the AD field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool.
+#' name of the genotype file to be imputed in uncompressed vcf, sync or allele frequency table format. See genotype format details below.
 #' @param min_coverage
 #' minimum coverage per locus, i.e. if at a locus, a pool falls below this value (does not skip missing data, i.e. missing locus has a depth of zero), then the whole locus is omitted. Set this to zero if the vcf has been filtered and contains missing values, i.e. `./.` or `.|.`. [Default=0]
 #' @param min_allele_frequency
@@ -35,30 +35,30 @@
 #' @param n_threads
 #' number of computing threads or processor cores to use in the computations. [Default=2]
 #' @param fname_out_prefix
-#' prefix of the output files (sync and csv files). [Default="" which will use the entire name of the input vcf file as the prefix]
+#' prefix of the output files including the [imputed allele frequency table](#allele-frequency-table-csv) (`<fname_out_prefix>-<time>-<random_id>-IMPUTED.csv`). [Default="" which will use the name of the input file as the prefix including the path]
 #' @details
 #' This mean value imputation method simply uses the arithmetic mean of the allele frequencies across all non-missing samples to impute missing data.
 #' This function prints out the expected mean absolute error (MAE) of the imputation using 10% simulated missing data. Repeat the imputation manually to get a range of these MAEs for a better estimate of the expected MAE.
-#' This function does not import any genotype data into R. Most processes are multi-threaded and outputs are written into disk as text files, i.e. sync file, and csv file of imputed allele frequencies. 
-#' It converts vcf into sync with locus filtering based on minimum depth, minimum allele frequency, and maximum missingness rate with minimal memory footprint as large vcf files are split into chunks equivalent to the number of threads and processed line-by-line.
 #' The allele depth information (AD), i.e. the unfiltered allele depth which includes the reads which did not pass the variant caller filters are used to calculate allele frequencies. 
-#' The genotype calls (GT), if present are not used, hence the variant caller filtering is unimportant as only the allele frequencies are extracted from the the vcf file. 
-#' However, if the AD information is absent then the GT information will be used assuming that each sample is an individual diploid, i.e., neither a polyploid nor a pool.
-#' The entire sync file is then loaded into memory and imputed in parallel across windows. 
-#' The structs, traits, methods, and functions defined in this library are subsets of [poolgen](https://github.com/jeffersonfparil/poolgen), and will eventually be merged with it. 
-#' @returns sync file:
-#' - filename: <fname_out_prefix>-<time>.sync
-#' - an extension of [popoolation2's](https://academic.oup.com/bioinformatics/article/27/24/3435/306737) sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
-#'  + *Header line/s*:  optional header line/s including the names of the pools, e.g. `# chr pos ref pool1 pool2 pool3 pool4 pool5`
-#'  + *Column 1*:       chromosome or scaffold name
-#'  + *Column 2*:       locus position 
-#'  + *Column 3*:       reference allele, e.g. A, T, C, G 
-#'  + *Column/s 4 to n*:  colon-delimited allele counts: A:T:C:G:DEL:N, where "DEL" refers to insertion/deletion, and "N" is unclassified. A pool or population or polyploid individual is represented by a single column of this colon-delimited allele counts.
-#' @returns imputed allele frequencies:
-#' - filename: <fname_out_prefix>-<time>-<random_id>-IMPUTED.csv
-#' - comma-separated file
-#' - header: ` #chr,pos,allele,<pool_names>,...`
-#' - each locus is represented by 2 or more rows, i.e. 2 for biallelic loci, and >2 for multi-allelic loci
+#' Optional filtering steps based on minimum depth, minimum allele frequency, and maximum sparsity are available. 
+#' If the GT field is present but the AD field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool. 
+#' Genotype data are not imported into R, LD estimation and imputation per se are multi-threaded, and imputation output is written into disk as an allele frequency table. 
+#' The structs, traits, methods, and functions defined in this library are subsets of [poolgen](https://github.com/jeffersonfparil/poolgen), and will eventually be merged. 
+#' Genotype file formats:
+#' - vcf: canonical variant calling or genotype data format for individual samples. This should include the `AD` field (allele depth), and may or may not have genotypes called (e.g. generated via bctools mpileup -a AD,DP ...). If the `GT` field is present but the `AD` field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool. The [`vcf2sync`](#vcf2sync) utility is expected to work with vcf versions 4.2 and 4.3. See [VCFv4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf) and [VCFv4.3](https://samtools.github.io/hts-specs/VCFv4.3.pdf) for details in the format specifications.
+#' - sync: an extension of [popoolation2's](https://academic.oup.com/bioinformatics/article/27/24/3435/306737) sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
+#'    + tab-delimited
+#'    + *Header line/s*:  optional header line/s including the names of the pools, e.g. `# chr pos ref pool1 pool2 pool3 pool4 pool5`
+#'    + *Column 1*:       chromosome or scaffold name
+#'    + *Column 2*:       locus position 
+#'    + *Column 3*:       reference allele, e.g. A, T, C, G 
+#'    + *Column/s 4 to n*:  colon-delimited allele counts: A:T:C:G:DEL:N, where "DEL" refers to insertion/deletion, and "N" is unclassified. A pool or population or polyploid individual is represented by a single column of this colon-delimited allele counts.
+#' - allele frequency table
+#'    + comma-delimited
+#'    + *Header line*: ` #chr,pos,allele,<pool_name_1>,...,<pool_name_n>`
+#'    + each locus is represented by 2 or more rows, i.e. 2 for biallelic loci, and >2 for multi-allelic loci
+#' @returns 
+#' imputed allele frequency table with the following filename: `<fname_out_prefix>-<time>-<random_id>-IMPUTED.csv`. Additionally, a sync file is generated if the input was an uncompressed vcf.
 #' @export
 mvi = function(fname, 
                 min_coverage=0,
@@ -125,7 +125,7 @@ mvi = function(fname,
 #'         n_threads=2,
 #'         fname_out_prefix="")
 #' @param fname
-#' filename of the uncompressed vcf file which has the AD (allele depth) field and may or may not have genotypes called (e.g. generated via `bctools mpileup -a AD,DP ...`). If the GT field is present but the AD field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool.
+#' name of the genotype file to be imputed in uncompressed vcf, sync or allele frequency table format. See genotype format details below.
 #' @param min_coverage
 #' minimum coverage per locus, i.e. if at a locus, a pool falls below this value (does not skip missing data, i.e. missing locus has a depth of zero), then the whole locus is omitted. Set this to zero if the vcf has been filtered and contains missing values, i.e. `./.` or `.|.`. [Default=0]
 #' @param min_allele_frequency
@@ -169,32 +169,36 @@ mvi = function(fname,
 #' @param n_threads
 #' number of computing threads or processor cores to use in the computations. [Default=2]
 #' @param fname_out_prefix
-#' prefix of the output files (sync and csv files). [Default="" which will use the entire name of the input vcf file as the prefix]
+#' prefix of the output files including the [imputed allele frequency table](#allele-frequency-table-csv) (`<fname_out_prefix>-<time>-<random_id>-IMPUTED.csv`). [Default="" which will use the name of the input file as the prefix including the path]
 #' @details
 #' This is an attempt to extend the [LD-kNNi method of Money et al, 2015, i.e. LinkImpute](https://doi.org/10.1534/g3.115.021667), which was an extension of the [kNN imputation of Troyanskaya et al, 2001](https://doi.org/10.1093/bioinformatics/17.6.520). 
-#' Similar to LD-kNNi, LD is estimated using Pearson's product moment correlation across loci per pair of samples, but instead of computing this across all the loci, we divide the genome into windows which respect chromosomal/scaffold boundaries. 
-#' We use the mean absolute difference (MAD or MAE where E stands for error) between allele frequencies as an estimate of distance between samples. 
-#' This function prints out the expected mean absolute error (MAE) of the imputation using 10% simulated missing data. Repeat the imputation manually to get a range of these MAEs for a better estimate of the expected MAE.
-#' This function does not import any genotype data into R. Most processes are multi-threaded and outputs are written into disk as text files, i.e. sync file, and csv file of imputed allele frequencies. 
-#' It converts vcf into sync with locus filtering based on minimum depth, minimum allele frequency, and maximum missingness rate with minimal memory footprint as large vcf files are split into chunks equivalent to the number of threads and processed line-by-line.
+#' Similar to LD-kNNi, linkage disequilibrium (LD) is estimated using Pearson's product moment correlation per pair of loci, which is computed per chromosome by default, but can be computed across the entire genome. 
+#' We use the mean absolute difference/error (MAE) between allele frequencies among linked loci as an estimate of genetic distance between samples. 
+#' Fixed values for the minimum correlation to identify loci used in distance estimation, and maximum genetic distance to select the k-nearest neighbours can be defined. 
+#' Additionally, minimum number of loci to include in distance estimation, and minimum number of nearest neighbours can be set. 
+#' Moreover, all four parameters can be optimised, i.e. the minimum correlation and/or maximum distance and/or minimum number of loci and/or minimum number of nearest neighbours which minimises the MAE between predicted and expected allele frequencies after simulating 10% missing data are identified.
 #' The allele depth information (AD), i.e. the unfiltered allele depth which includes the reads which did not pass the variant caller filters are used to calculate allele frequencies. 
-#' The genotype calls, if present are not used, hence the variant caller filtering is unimportant as only the allele frequencies are extracted from the the vcf file. 
-#' However, if the AD information is absent then the GT information will be used assuming that each sample is an individual diploid, i.e., neither a polyploid nor a pool.
-#' The entire sync file is then loaded into memory and imputed in parallel across windows. 
-#' The structs, traits, methods, and functions defined in this library are subsets of [poolgen](https://github.com/jeffersonfparil/poolgen), and will eventually be merged with it. 
-#' @returns sync file:
-#' - filename: <fname_out_prefix>-<time>.sync
-#' - an extension of [popoolation2's](https://academic.oup.com/bioinformatics/article/27/24/3435/306737) sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
-#'  + *Header line/s*:  optional header line/s including the names of the pools, e.g. `# chr pos ref pool1 pool2 pool3 pool4 pool5`
-#'  + *Column 1*:       chromosome or scaffold name
-#'  + *Column 2*:       locus position 
-#'  + *Column 3*:       reference allele, e.g. A, T, C, G 
-#'  + *Column/s 4 to n*:  colon-delimited allele counts: A:T:C:G:DEL:N, where "DEL" refers to insertion/deletion, and "N" is unclassified. A pool or population or polyploid individual is represented by a single column of this colon-delimited allele counts.
-#' @returns imputed allele frequencies:
-#' - filename: <fname_out_prefix>-<time>-<random_id>-IMPUTED.csv
-#' - comma-separated file
-#' - header: ` #chr,pos,allele,<pool_names>,...`
-#' - each locus is represented by 2 or more rows, i.e. 2 for biallelic loci, and >2 for multi-allelic loci
+#' If the GT field is present but the AD field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool. 
+#' Optional filtering steps based on minimum depth, minimum allele frequency, and maximum sparsity are available. 
+#' Genotype data are not imported into R, LD estimation and imputation per se are multi-threaded, and imputation output is written into disk as an allele frequency table. 
+#' The structs, traits, methods, and functions defined in this library are subsets of [poolgen](https://github.com/jeffersonfparil/poolgen), and will eventually be merged. 
+#' Genotype file formats:
+#' - vcf: canonical variant calling or genotype data format for individual samples. This should include the `AD` field (allele depth), and may or may not have genotypes called (e.g. generated via bctools mpileup -a AD,DP ...). If the `GT` field is present but the `AD` field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool. The [`vcf2sync`](#vcf2sync) utility is expected to work with vcf versions 4.2 and 4.3. See [VCFv4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf) and [VCFv4.3](https://samtools.github.io/hts-specs/VCFv4.3.pdf) for details in the format specifications.
+#' - sync: an extension of [popoolation2's](https://academic.oup.com/bioinformatics/article/27/24/3435/306737) sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
+#'    + tab-delimited
+#'    + *Header line/s*:  optional header line/s including the names of the pools, e.g. `# chr pos ref pool1 pool2 pool3 pool4 pool5`
+#'    + *Column 1*:       chromosome or scaffold name
+#'    + *Column 2*:       locus position 
+#'    + *Column 3*:       reference allele, e.g. A, T, C, G 
+#'    + *Column/s 4 to n*:  colon-delimited allele counts: A:T:C:G:DEL:N, where "DEL" refers to insertion/deletion, and "N" is unclassified. A pool or population or polyploid individual is represented by a single column of this colon-delimited allele counts.
+#' - allele frequency table
+#'    + comma-delimited
+#'    + *Header line*: ` #chr,pos,allele,<pool_name_1>,...,<pool_name_n>`
+#'    + each locus is represented by 2 or more rows, i.e. 2 for biallelic loci, and >2 for multi-allelic loci
+#' @returns 
+#' imputed allele frequency table with the following filename: `<fname_out_prefix>-<time>-<random_id>-IMPUTED.csv`. Additionally, a sync file is generated if the input was an uncompressed vcf.
+
+
 #' @export
 aldknni = function(fname, 
                     min_coverage=0,