trying to fix issues #3 and #4

jeffersonfparil · Jan 11, 2024 · 72ceea4 · 72ceea4
1 parent 4f7748d
commit 72ceea4
Show file tree

Hide file tree

Showing 5 changed files with 276 additions and 234 deletions.
diff --git a/README.md b/README.md
@@ -69,7 +69,7 @@ Rscript src/main.R \
     -g "test/grape.vcf" \
     -p "test/grape_pheno.txt" \
     --phenotype-delimiter "\t" \
-    --phenotype-column-data 4 \
+    --phenotype-column-data "3,4" \
     --k-folds 5 \
     --n-reps 1 \
     --n-threads 8 \
@@ -100,7 +100,7 @@ Rscript src/main.R \
     -g "test/grape.vcf" \
     -p "test/grape_pheno.txt" \
     --phenotype-delimiter "\t" \
-    --phenotype-column-data 4 \
+    --phenotype-column-data "3,4" \
     --k-folds 5 \
     --n-reps 1 \
     --n-threads ${SLURM_CPUS_PER_TASK} \
@@ -113,7 +113,7 @@ Rscript src/main.R \
     - [Variant call format (`*.vcf`)](https://www.internationalgenome.org/wiki/Analysis/vcf4.0/#:~:text=VCF%20is%20a%20text%20file,for%20each%20position%20or%20not.) which requires either the genotype call (**GT**) or allele depth (**AD**) field, where the AD field has precedence if both are present (i.e., allele frequencies rather than binary genotypes are assumed by default). See [test/test.vcf](test/test.vcf) for a very small example.
     - [R object file format (`*.rds`)](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS) which codes for a matrix where each row correspond to a sample and each column to a locus (or an allele of a multi-allelic locus). See [test/test.rds](test/test.rds) for a very small example.
         + The rows are expected to have names of the samples corresponding to the names in the phenotype file.
-        + The columns are expected to contain the chromosome or scaffold names, the position in bases, and allele names separated by underscores, e.g., `chr1_12345_A` - specifying the locus is at chr1, position 12,345 and the data encoded in the column represent the dosage of allele A.
+        + The columns are expected to contain the chromosome or scaffold names, the position in bases, and allele names separated by underscores, e.g., `chr1_12345_A` - specifying the locus is at chr1, position 12,345 and the data encoded in the column represent the dosage of allele A. Each locus may be represented by all of its alleles or less one allele, e.g. one column per locus for biallelic loci. Additional control to this is available via the `--retain-minus-one-alleles-per-locus` parameter which may be `TRUE` of `FALSE`.
     - It is highly recommended to use genotype data without missing values. Missing data will be naively imputed using mean genotype value per allele/locus.
 2. Phenotype data in text-delimited format (`-p; --phenotype-file`)
     - e.g. tab-delimited (`*.txt`) and comma-separated (`*.csv`)
@@ -145,11 +145,13 @@ Rscript src/main.R \
 
 Tables of genomic prediction cross-validation results and predicted phenotypes (full path and filenames of these output files are printed out by the main Rscript):
 
-1. `*-METRICS_WITHIN_POP-{uniqueID}.csv`: table of genomic prediction accuracy metrics across replications, folds, and models per population.
-2. `*-YPRED_WITHIN_POP-{uniqueID}.csv`: table of predicted and expected phenotypes across samples, replications and models per population.
-3. `*-METRICS_ACROSS_POP-{uniqueID}.csv`: table of genomic prediction accuracy metrics across 1 replication, 1 fold, and all models per validation population **(present if there are at least 2 populations)**.
-4. `*-YPRED_ACROSS_POP-{uniqueID}.csv`: table of predicted and expected phenotypes across samples, 1 replication and all models per validation population **(present if there are at least 2 populations)**.
-5. `*-GENOMIC_PREDICTIONS-{uniqueID}.csv`: table of predicted phenotypes of samples with known genotype and missing phenotypes **(present if there are samples with known genotype and unknown phenotype data)**.
+-{time and random integer}.Rds
+
+1. `*-ETRICS_WITHIN_POP`: table of genomic prediction accuracy metrics across replications, folds, and models per population.
+2. `*-PRED_WITHIN_POP`: table of predicted and expected phenotypes across samples, replications and models per population.
+3. `*-ETRICS_ACROSS_POP`: table of genomic prediction accuracy metrics across 1 replication, 1 fold, and all models per validation population, i.e., via leave-one-population-out cross-validation **(present if there are at least 2 populations)**.
+4. `*-PRED_ACROSS_POP`: table of predicted and expected phenotypes across samples, 1 replication and all models per validation population, i.e., via leave-one-population-out cross-validation **(present if there are at least 2 populations)**.
+5. `*-ENOMIC_PREDICTIONS`: table of predicted phenotypes of samples with known genotype and missing phenotypes **(present if there are samples with known genotype and unknown phenotype data)**.
 
 With the following fields:
 

diff --git a/src/load.R b/src/load.R
@@ -115,7 +115,12 @@ fn_load_phenotype = function(fname_csv_txt, sep=",", header=TRUE, idx_col_id=1,
     pop = as.character(df[, idx_col_pop])
     y = df[, idx_col_y]
     names(y) = entry
-    return(list(y=y, pop=pop))
+    if (header==TRUE) {
+        trait_name = colnames(df)[idx_col_y]
+    } else {
+        trait_name = paste0("trait_", idx_col_y)
+    }
+    return(list(y=y, pop=pop, trait_name=trait_name))
 }
 
 fn_filter_outlying_phenotypes = function(list_y_pop, verbose=FALSE) {
@@ -124,6 +129,7 @@ fn_filter_outlying_phenotypes = function(list_y_pop, verbose=FALSE) {
     # sigma = 1
     y = list_y_pop$y
     pop = list_y_pop$pop
+    trait_name = list_y_pop$trait_name
     ### Identify outliers with boxplot, i.e. values beyond -2.698 standard deviations (definition of R::boxplot whiskers)
     b = boxplot(y, plot=FALSE)
     idx = which(!(y %in% b$out))
@@ -135,7 +141,7 @@ fn_filter_outlying_phenotypes = function(list_y_pop, verbose=FALSE) {
         print(paste0("n=", length(y[idx])))
         txtplot::txtdensity(y[idx][!is.na(y[idx])])
     }
-    return(list(y=y[idx], pop=pop[idx]))
+    return(list(y=y[idx], pop=pop[idx], trait_name=trait_name))
 }
 
 ### Merge genotype and phenotype data by their entry names, i.e. rownames for G and names for y
@@ -272,6 +278,7 @@ tests_load = function() {
             yex = c(0.32, 0.67, 0.93); names(yex) = paste0("Entry-", 1:3)
             expect_equal(y, yex)
             expect_equal(list_y_pop$pop, c("pop-A", "pop-A", "pop-A"))
+            expect_equal(list_y_pop$trait_name, "yield")
         }
     )