Skip to content

Commit

Permalink
trying to fix issues #3 and #4
Browse files Browse the repository at this point in the history
  • Loading branch information
jeffersonfparil committed Jan 11, 2024
1 parent 4f7748d commit 72ceea4
Show file tree
Hide file tree
Showing 5 changed files with 276 additions and 234 deletions.
18 changes: 10 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ Rscript src/main.R \
-g "test/grape.vcf" \
-p "test/grape_pheno.txt" \
--phenotype-delimiter "\t" \
--phenotype-column-data 4 \
--phenotype-column-data "3,4" \
--k-folds 5 \
--n-reps 1 \
--n-threads 8 \
Expand Down Expand Up @@ -100,7 +100,7 @@ Rscript src/main.R \
-g "test/grape.vcf" \
-p "test/grape_pheno.txt" \
--phenotype-delimiter "\t" \
--phenotype-column-data 4 \
--phenotype-column-data "3,4" \
--k-folds 5 \
--n-reps 1 \
--n-threads ${SLURM_CPUS_PER_TASK} \
Expand All @@ -113,7 +113,7 @@ Rscript src/main.R \
- [Variant call format (`*.vcf`)](https://www.internationalgenome.org/wiki/Analysis/vcf4.0/#:~:text=VCF%20is%20a%20text%20file,for%20each%20position%20or%20not.) which requires either the genotype call (**GT**) or allele depth (**AD**) field, where the AD field has precedence if both are present (i.e., allele frequencies rather than binary genotypes are assumed by default). See [test/test.vcf](test/test.vcf) for a very small example.
- [R object file format (`*.rds`)](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS) which codes for a matrix where each row correspond to a sample and each column to a locus (or an allele of a multi-allelic locus). See [test/test.rds](test/test.rds) for a very small example.
+ The rows are expected to have names of the samples corresponding to the names in the phenotype file.
+ The columns are expected to contain the chromosome or scaffold names, the position in bases, and allele names separated by underscores, e.g., `chr1_12345_A` - specifying the locus is at chr1, position 12,345 and the data encoded in the column represent the dosage of allele A.
+ The columns are expected to contain the chromosome or scaffold names, the position in bases, and allele names separated by underscores, e.g., `chr1_12345_A` - specifying the locus is at chr1, position 12,345 and the data encoded in the column represent the dosage of allele A. Each locus may be represented by all of its alleles or less one allele, e.g. one column per locus for biallelic loci. Additional control to this is available via the `--retain-minus-one-alleles-per-locus` parameter which may be `TRUE` of `FALSE`.
- It is highly recommended to use genotype data without missing values. Missing data will be naively imputed using mean genotype value per allele/locus.
2. Phenotype data in text-delimited format (`-p; --phenotype-file`)
- e.g. tab-delimited (`*.txt`) and comma-separated (`*.csv`)
Expand Down Expand Up @@ -145,11 +145,13 @@ Rscript src/main.R \

Tables of genomic prediction cross-validation results and predicted phenotypes (full path and filenames of these output files are printed out by the main Rscript):

1. `*-METRICS_WITHIN_POP-{uniqueID}.csv`: table of genomic prediction accuracy metrics across replications, folds, and models per population.
2. `*-YPRED_WITHIN_POP-{uniqueID}.csv`: table of predicted and expected phenotypes across samples, replications and models per population.
3. `*-METRICS_ACROSS_POP-{uniqueID}.csv`: table of genomic prediction accuracy metrics across 1 replication, 1 fold, and all models per validation population **(present if there are at least 2 populations)**.
4. `*-YPRED_ACROSS_POP-{uniqueID}.csv`: table of predicted and expected phenotypes across samples, 1 replication and all models per validation population **(present if there are at least 2 populations)**.
5. `*-GENOMIC_PREDICTIONS-{uniqueID}.csv`: table of predicted phenotypes of samples with known genotype and missing phenotypes **(present if there are samples with known genotype and unknown phenotype data)**.
-{time and random integer}.Rds

1. `*-ETRICS_WITHIN_POP`: table of genomic prediction accuracy metrics across replications, folds, and models per population.
2. `*-PRED_WITHIN_POP`: table of predicted and expected phenotypes across samples, replications and models per population.
3. `*-ETRICS_ACROSS_POP`: table of genomic prediction accuracy metrics across 1 replication, 1 fold, and all models per validation population, i.e., via leave-one-population-out cross-validation **(present if there are at least 2 populations)**.
4. `*-PRED_ACROSS_POP`: table of predicted and expected phenotypes across samples, 1 replication and all models per validation population, i.e., via leave-one-population-out cross-validation **(present if there are at least 2 populations)**.
5. `*-ENOMIC_PREDICTIONS`: table of predicted phenotypes of samples with known genotype and missing phenotypes **(present if there are samples with known genotype and unknown phenotype data)**.

With the following fields:

Expand Down
11 changes: 9 additions & 2 deletions src/load.R
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,12 @@ fn_load_phenotype = function(fname_csv_txt, sep=",", header=TRUE, idx_col_id=1,
pop = as.character(df[, idx_col_pop])
y = df[, idx_col_y]
names(y) = entry
return(list(y=y, pop=pop))
if (header==TRUE) {
trait_name = colnames(df)[idx_col_y]
} else {
trait_name = paste0("trait_", idx_col_y)
}
return(list(y=y, pop=pop, trait_name=trait_name))
}

fn_filter_outlying_phenotypes = function(list_y_pop, verbose=FALSE) {
Expand All @@ -124,6 +129,7 @@ fn_filter_outlying_phenotypes = function(list_y_pop, verbose=FALSE) {
# sigma = 1
y = list_y_pop$y
pop = list_y_pop$pop
trait_name = list_y_pop$trait_name
### Identify outliers with boxplot, i.e. values beyond -2.698 standard deviations (definition of R::boxplot whiskers)
b = boxplot(y, plot=FALSE)
idx = which(!(y %in% b$out))
Expand All @@ -135,7 +141,7 @@ fn_filter_outlying_phenotypes = function(list_y_pop, verbose=FALSE) {
print(paste0("n=", length(y[idx])))
txtplot::txtdensity(y[idx][!is.na(y[idx])])
}
return(list(y=y[idx], pop=pop[idx]))
return(list(y=y[idx], pop=pop[idx], trait_name=trait_name))
}

### Merge genotype and phenotype data by their entry names, i.e. rownames for G and names for y
Expand Down Expand Up @@ -272,6 +278,7 @@ tests_load = function() {
yex = c(0.32, 0.67, 0.93); names(yex) = paste0("Entry-", 1:3)
expect_equal(y, yex)
expect_equal(list_y_pop$pop, c("pop-A", "pop-A", "pop-A"))
expect_equal(list_y_pop$trait_name, "yield")
}
)

Expand Down
Loading

0 comments on commit 72ceea4

Please sign in to comment.