From f7e135891bf89901e643bce456cc9030bfd2edaa Mon Sep 17 00:00:00 2001 From: The Open Journals editorial robot <89919391+editorialbot@users.noreply.github.com> Date: Wed, 10 Aug 2022 13:45:13 +0100 Subject: [PATCH] Creating 10.21105.joss.04530.jats --- joss.04530/10.21105.joss.04530.jats | 649 ++++++++++++++++++++++++++++ 1 file changed, 649 insertions(+) create mode 100644 joss.04530/10.21105.joss.04530.jats diff --git a/joss.04530/10.21105.joss.04530.jats b/joss.04530/10.21105.joss.04530.jats new file mode 100644 index 0000000000..c27e7d92c0 --- /dev/null +++ b/joss.04530/10.21105.joss.04530.jats @@ -0,0 +1,649 @@ + + +
Journal: Journal of Open Source Software (JOSS)
ISSN: 2475-9066
Publisher: Open Journals
Article ID: 4530
DOI: 10.21105/joss.04530

Title: rTASSEL: An R interface to TASSEL for analyzing genomic diversity

Authors:
Brandon Monier (ORCID 0000-0001-6797-1221)
Terry M. Casstevens (ORCID 0000-0001-7602-0487)
Peter J. Bradbury (ORCID 0000-0003-3825-8480)
Edward S. Buckler (ORCID 0000-0002-3100-371X)

Affiliations:
Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
United States Department of Agriculture-Agricultural Research Service, Robert W. Holley Center for Agriculture and Health, Ithaca, NY 14853

Publication date (day, month, year fields as given): 2 6 2022
Volume 7, issue 76, page 4530

License: Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0)
Copyright: 2022, The article authors

Keywords: R, GWAS, genetics, genomic diversity, genomic prediction

Summary

Efficient tools and applications for analyzing genomic diversity are essential for any genetics research or breeding program. One commonly used tool, TASSEL (Trait Analysis by aSSociation, Evolution, and Linkage), provides many core methods for genomic analyses. Despite its efficiency, TASSEL offers limited potential for automating reproducible research and for interacting with other analytical tools. Here we present an R package, rTASSEL, a front end that connects to a variety of widely used TASSEL methods and analytical tools. The goal of this package is to create a unified scripting workflow that combines the analytical power of TASSEL with R’s data handling and visualization capabilities, without requiring the user to switch between the two environments.

+
+ + Statement of need +

As breakthroughs in genotyping technologies expand the variant resources available, methods and implementations for analyzing complex traits across a diverse array of organisms are needed. One such resource is TASSEL (Trait Analysis by aSSociation, Evolution, and Linkage). This software suite contains functionality for association studies, linkage disequilibrium (LD), kinship, and dimensionality reduction (e.g., PCA and MDS) (Bradbury et al., 2007). Although TASSEL was initially released in 2001, the fifth version, TASSEL 5, has been optimized for handling large data sets and has added newer approaches to association analyses for many thousands of traits (Shabalin, 2012). Despite these improvements, interaction with TASSEL has been limited to either a graphical user interface, which offers limited workflow reproducibility, or a command-line interface, which has a steeper learning curve that can dissuade novice researchers and which produces unnecessary intermediate files in an analytics workflow (Zhang et al., 2009). To address this issue, we have created an R package, rTASSEL. This package combines the analytical power of TASSEL with R’s data formats and intuitive function handling.

+
+ + Approach
+ + Implementation

Figure 1. Overview of the rTASSEL workflow. Genotypic and phenotypic data (A) are used to create an R S4 object (B). From this object, TASSEL functionalities can be called to run various association, linkage disequilibrium, and relatedness functions (C). Outputs from these TASSEL analyses are returned to the R environment as data frame objects (D), Manhattan plot visualizations (E), or interactive visualizations for linkage disequilibrium analysis (F).

+ +
+

rTASSEL combines TASSEL’s strengths (half-byte storage of genotype data, bitwise arithmetic for kinship analyses, genotype filtration, extensive forms of linear modeling, multithreading, and access to a range of native libraries) with R’s scripting capabilities and commonly used Bioconductor classes (Gentleman et al., 2004; Lawrence et al., 2013; Morgan et al., 2021). Because TASSEL is written in Java, the Java-to-R interface is implemented via the rJava package (Urbanek, 2021).
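The half-byte storage mentioned above can be illustrated with a minimal Python sketch that packs the two alleles of a diploid call into one byte. The 4-bit code table and all function names here are hypothetical and chosen only for illustration; they are not taken from TASSEL's source, and TASSEL's actual encoding may differ.

```python
from typing import Tuple

# Hypothetical 4-bit allele codes for illustration only;
# TASSEL's real encoding table may differ.
ALLELE_CODES = {"A": 0x0, "C": 0x1, "G": 0x2, "T": 0x3, "N": 0xF}
CODE_TO_ALLELE = {code: allele for allele, code in ALLELE_CODES.items()}


def pack_genotype(allele1: str, allele2: str) -> int:
    """Pack two alleles into one byte: allele1 in the high nibble, allele2 in the low."""
    return (ALLELE_CODES[allele1] << 4) | ALLELE_CODES[allele2]


def unpack_genotype(packed: int) -> Tuple[str, str]:
    """Recover both alleles from a packed byte."""
    return CODE_TO_ALLELE[packed >> 4], CODE_TO_ALLELE[packed & 0x0F]


def is_heterozygous(packed: int) -> bool:
    """A call is heterozygous when its two nibbles differ."""
    return (packed >> 4) != (packed & 0x0F)


# Three diploid calls at one site, stored in three bytes.
calls = [("A", "A"), ("A", "G"), ("T", "T")]
packed_site = bytes(pack_genotype(a, b) for a, b in calls)
```

Storing one call per byte in this way keeps large genotype tables compact and lets simple bit operations answer questions such as heterozygosity without decoding to strings.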

+

rTASSEL allows for the rapid import, analysis, visualization, and export of various genomic data structures. Several genotype formats can be used as inputs for rTASSEL, including variant call format (.vcf), HapMap (.hmp.txt), and Flapjack (.flpjk.*). Phenotype data can be supplied either as TASSEL-formatted data sets or as R data frame objects (Figure 1A).

+

Once data are imported, the function readGenotypePhenotype is used to construct an S4 object, which is used for all downstream analyses (Figure 1B, Figure 1C). This object contains slots that exclusively hold references to objects held in the Java virtual machine, which downstream functions can call. Prior to analysis, genotype objects can be filtered in several ways to help reduce confounding errors. rTASSEL can filter genotype objects either by variant site properties (filterGenotypeTableSites) or by individuals (filterGenotypeTableTaxa).
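The two kinds of filtering described above can be sketched in plain Python. This is a conceptual illustration only: the function names, parameter names, and the 0/1/2 genotype coding are assumptions made for this example and do not reproduce the signatures of filterGenotypeTableSites or filterGenotypeTableTaxa.

```python
from typing import List, Optional

# 0, 1, or 2 copies of the alternate allele; None marks a missing call.
Genotype = Optional[int]


def minor_allele_frequency(site: List[Genotype]) -> float:
    """Minor allele frequency over the non-missing calls at one site."""
    called = [g for g in site if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_sites(sites: List[List[Genotype]], min_count: int, min_maf: float) -> List[int]:
    """Return indices of sites with enough called taxa and a high enough MAF."""
    kept = []
    for index, site in enumerate(sites):
        n_called = sum(g is not None for g in site)
        if n_called >= min_count and minor_allele_frequency(site) >= min_maf:
            kept.append(index)
    return kept


def filter_taxa(sites: List[List[Genotype]], min_not_missing: float) -> List[int]:
    """Return indices of taxa whose proportion of non-missing calls meets the threshold."""
    n_taxa = len(sites[0])
    kept = []
    for taxon in range(n_taxa):
        called = sum(site[taxon] is not None for site in sites)
        if called / len(sites) >= min_not_missing:
            kept.append(taxon)
    return kept


# Rows are sites, columns are taxa.
sites = [
    [0, 1, 2, 0],           # polymorphic, fully called
    [0, 0, 0, 0],           # monomorphic
    [2, None, None, None],  # mostly missing
    [0, 1, None, 0],        # polymorphic, one missing call
]
kept_sites = filter_sites(sites, min_count=3, min_maf=0.1)
kept_taxa = filter_taxa(sites, min_not_missing=0.75)
```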

+
+ + Association functions +

One of TASSEL’s most versatile features is its support for several association modeling techniques. rTASSEL allows several types of association studies to be conducted using one primary function, assocModelFitter, with different parameter inputs. This function implements both least-squares fixed-effects general linear models (GLM) and mixed linear models (MLM) via the $Q + K$ method (Yu et al., 2006). If no genotypic data is provided to the GLM, assocModelFitter can calculate best linear unbiased estimates (BLUEs). Additionally, fast GLM approaches are implemented in rTASSEL, which allow for the rapid analysis of many phenotypic traits (Shabalin, 2012).
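For reference, the Q+K mixed model of Yu et al. (2006) is commonly written as shown below. The notation follows the usual presentation of that method and is given here as background; it is not taken from the rTASSEL documentation.

```latex
y = X\beta + S\alpha + Qv + Zu + e,
\qquad \operatorname{Var}(u) = 2K\sigma^2_g,
\qquad \operatorname{Var}(e) = I\sigma^2_e
```

Here $y$ is the vector of phenotypes, $\beta$ holds fixed effects other than marker and population structure, $\alpha$ holds marker effects, $v$ holds population-structure effects with incidence matrix $Q$, $u$ holds polygenic background effects whose covariance is proportional to the kinship matrix $K$, and $e$ is the residual.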

+

Linear models can be specified following the format used by R’s + lm function:

+

$$y \sim A_1 + A_2 + \dots + A_n$$

+

where $y$ is phenotype data, and $A_n$ is any covariate or factor data. This formula parameter and several other parameters allow the user to run BLUE, GLM, or MLM modeling. Once association analysis is completed, TASSEL table reports of association statistics are generated as an R list, which can then be exported as flat files or converted to data frames (Figure 1D). rTASSEL can also visualize association statistics with the function manhattanPlot, which uses the graphical capabilities of the ggplot2 package (Wickham, 2016) (Figure 1E).
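To make the formula notation above concrete, the following Python sketch splits a formula string into a response and its additive terms and builds a design matrix with an intercept column. Both helper names are hypothetical, and the sketch handles only simple additive formulas; it does not reproduce how R or rTASSEL parses formula objects.

```python
from typing import Dict, List, Sequence, Tuple


def parse_formula(formula: str) -> Tuple[str, List[str]]:
    """Split 'y ~ A1 + A2' into the response name and the list of term names."""
    response, separator, right_side = formula.partition("~")
    if not separator or not response.strip() or not right_side.strip():
        raise ValueError("formula must look like 'response ~ term1 + term2'")
    terms = [term.strip() for term in right_side.split("+")]
    if any(not term for term in terms):
        raise ValueError("empty term in formula")
    return response.strip(), terms


def design_matrix(data: Dict[str, Sequence[float]], terms: List[str]) -> List[List[float]]:
    """Build rows of [1, term1, term2, ...]; the leading 1 is the intercept."""
    n_rows = len(data[terms[0]])
    return [[1.0] + [float(data[term][i]) for term in terms] for i in range(n_rows)]


response, terms = parse_formula("y ~ A1 + A2 + A3")
rows = design_matrix({"A1": [1, 2], "A2": [0, 5]}, ["A1", "A2"])
```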

+
+ + Linkage disequilibrium +

rTASSEL can also estimate linkage disequilibrium (LD) from genotype data via the function linkageDiseq. LD is reported as the standardized disequilibrium coefficient ($D'$), the squared correlation between alleles at two loci ($r^2$), and the corresponding $p$-values from a two-sided Fisher’s exact test. TASSEL table reports for all pairwise comparisons are generated as data.frame objects, and heatmap visualizations for each metric are generated via TASSEL’s legacy LD Java viewer or ggplot2 (Figure 1F).
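The three LD statistics named above can be computed for a single pair of biallelic loci from haplotype counts, as in this minimal Python sketch. It uses the textbook definitions and assumes phased haplotype counts as input; the function names are hypothetical, and the code is not a reproduction of TASSEL's implementation.

```python
from math import comb
from typing import Dict


def ld_stats(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> Dict[str, float]:
    """Return D, |D'|, and r^2 from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B
    # D' rescales D by its largest possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    variance_product = p_A * (1 - p_A) * p_B * (1 - p_B)
    nan = float("nan")
    return {
        "D": d,
        "D_prime": abs(d) / d_max if d_max > 0 else nan,
        "r2": d * d / variance_product if variance_product > 0 else nan,
    }


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k: int) -> float:
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    low, high = max(0, row1 - (n - col1)), min(row1, col1)
    observed = prob(a)
    # Sum every table that is no more probable than the observed one.
    tail = sum(prob(k) for k in range(low, high + 1) if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)


stats = ld_stats(3, 1, 1, 3)
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```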

+
+ + Relatedness functions +

MLM methods require relatedness estimates. rTASSEL can compute these efficiently on large data sets by processing blocks of sites at a time using bitwise operations. This is done with the function kinshipMatrix, which generates a kinship matrix from genotype data. Several of TASSEL’s methods for calculating kinship are implemented. By default, a “centered” identity by state (IBS) approach is used (Endelman & Jannink, 2012). Additionally, normalized IBS (Yang et al., 2011), dominance-centered IBS (Muñoz et al., 2014), and dominance-normalized IBS (Zhu et al., 2015) can be used. rTASSEL can generate either a reference object for association analysis or, via R’s as.matrix function, an R matrix object for additional analyses. In addition to kinship generation, principal components analysis and multidimensional scaling can be run on genotype data using the rTASSEL methods pca and mds, respectively. Finally, phylogenetic analysis can be performed on genotype data using the createTree method, which generates phylo objects commonly used by the ape package (Paradis & Schliep, 2019). The createTree method supports two clustering methods: neighbor joining and UPGMA (unweighted pair group method with arithmetic mean).
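As a rough illustration of the default centered IBS approach, the sketch below computes a kinship matrix as $WW^\top / (2\sum_k p_k(1 - p_k))$, where $W$ holds allele counts centered by twice the allele frequency, following the form given by Endelman & Jannink (2012). It assumes complete 0/1/2 genotype data and uses plain Python loops; the function name is hypothetical, and TASSEL's block-wise bitwise implementation and missing-data handling are not reproduced.

```python
from typing import List


def centered_ibs_kinship(genotypes: List[List[int]]) -> List[List[float]]:
    """Kinship from a taxa-by-sites matrix of allele counts (0, 1, or 2), no missing data."""
    n_taxa = len(genotypes)
    n_sites = len(genotypes[0])
    # Allele frequency at each site across all taxa.
    freqs = [sum(row[j] for row in genotypes) / (2 * n_taxa) for j in range(n_sites)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all sites are monomorphic; kinship is undefined")
    # Center each allele count by its expected value, 2p.
    centered = [[row[j] - 2 * freqs[j] for j in range(n_sites)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / scale for row_j in centered]
        for row_i in centered
    ]


kinship = centered_ibs_kinship([[0, 0], [1, 1], [2, 2]])
```

In this toy example the two homozygous taxa are maximally dissimilar, so their off-diagonal entry is negative, while the heterozygous taxon sits at the population mean and has zero entries throughout.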

+
+ + Genomic prediction +

The function genomicPrediction can be used to predict phenotypes from genotypes. To do this, genomicPrediction uses genomic best linear unbiased predictors (gBLUPs). It fits a mixed model that uses kinship to capture the covariance between taxa. Because kinship relates every pair of taxa, the mixed model can calculate BLUPs for taxa that lack phenotypes from the phenotypes of related lines.
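The model behind gBLUP is usually written in the standard form below. This is the general textbook formulation, given as background under the assumption that variance components are known or already estimated; it is not a description of the specific estimation procedure used by TASSEL.

```latex
y = X\beta + Zu + e,
\qquad u \sim N(0, K\sigma^2_u),
\qquad e \sim N(0, I\sigma^2_e)

\hat{u} = \sigma^2_u K Z^\top V^{-1} (y - X\hat{\beta}),
\qquad V = \sigma^2_u Z K Z^\top + \sigma^2_e I
```

Here $K$ is the kinship matrix over all taxa and $Z$ links observed phenotypes to taxa. Because $K$ includes rows for taxa without phenotypes, $\hat{u}$ contains predictions for them as well, driven entirely by their covariance with phenotyped taxa.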

+
+
+ + Additional resources +

More information about various functionalities and workflows can be found on our project webpage. Source code can be found in our GitHub repository. An interactive Jupyter notebook session detailing additional rTASSEL workflows can be found on Binder.

+
+ + Acknowledgements +

This project is supported by the USDA-ARS, the Bill and Melinda Gates Foundation, and NSF IOS #1822330. We thank Sara J. Miller, Guillaume Ramstein, and Joseph Gage for their insightful suggestions on this manuscript and for pipeline testing.

+
References

Bradbury, Peter J.; Zhang, Zhiwu; Kroon, Dallas E.; Casstevens, Terry M.; Ramdoss, Yogesh; Buckler, Edward S. (2007). TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics, 23(19), 2633-2635. doi:10.1093/bioinformatics/btm308

Zhang, Zhiwu; Buckler, Edward S.; Casstevens, Terry M.; Bradbury, Peter J. (2009). Software engineering the mixed model for genome-wide association studies on large samples. Briefings in Bioinformatics, 10(6), 664-675. doi:10.1093/bib/bbp050

Gentleman, Robert C.; Carey, Vincent J.; Bates, Douglas M.; Bolstad, Ben; Dettling, Marcel; Dudoit, Sandrine; Ellis, Byron; Gautier, Laurent; Ge, Yongchao; Gentry, Jeff; Hornik, Kurt; Hothorn, Torsten; Huber, Wolfgang; Iacus, Stefano; Irizarry, Rafael; Leisch, Friedrich; Li, Cheng; Maechler, Martin; Rossini, Anthony J.; Sawitzki, Gunther; Smith, Colin; Smyth, Gordon; Tierney, Luke; Yang, Jean YH; Zhang, Jianhua (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology, 5(10), R80. doi:10.1186/gb-2004-5-10-r80

Lawrence, Michael; Huber, Wolfgang; Pagès, Hervé; Aboyoun, Patrick; Carlson, Marc; Gentleman, Robert; Morgan, Martin T.; Carey, Vincent J. (2013). Software for computing and annotating genomic ranges. PLOS Computational Biology, 9(8), e1003118. doi:10.1371/journal.pcbi.1003118

Morgan, Martin; Obenchain, Valerie; Hester, Jim; Pagès, Hervé (2021). SummarizedExperiment: SummarizedExperiment container. https://bioconductor.org/packages/SummarizedExperiment

Urbanek, Simon (2021). rJava: Low-level R to Java interface. https://CRAN.R-project.org/package=rJava

Yu, Jianming; Pressoir, Gael; Briggs, William H.; Vroh Bi, Irie; Yamasaki, Masanori; Doebley, John F.; McMullen, Michael D.; Gaut, Brandon S.; Nielsen, Dahlia M.; Holland, James B.; Kresovich, Stephen; Buckler, Edward S. (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics, 38(2), 203-208. doi:10.1038/ng1702

Shabalin, Andrey A. (2012). Matrix eQTL: Ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28(10), 1353-1358. doi:10.1093/bioinformatics/bts163

Wickham, Hadley (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4. https://ggplot2.tidyverse.org

Endelman, Jeffrey B.; Jannink, Jean-Luc (2012). Shrinkage estimation of the realized relationship matrix. G3: Genes, Genomes, Genetics, 2(11), 1405-1413. doi:10.1534/g3.112.004259

Yang, Jian; Lee, S. Hong; Goddard, Michael E.; Visscher, Peter M. (2011). GCTA: A tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82. doi:10.1016/j.ajhg.2010.11.011

Muñoz, Patricio R.; Resende, Marcio F. R.; Gezan, Salvador A.; Resende, Marcos Deon Vilela; Campos, Gustavo de los; Kirst, Matias; Huber, Dudley; Peter, Gary F. (2014). Unraveling additive from nonadditive effects using genomic relationship matrices. Genetics, 198(4), 1759-1768. doi:10.1534/genetics.114.171322

Zhu, Zhihong; Bakshi, Andrew; Vinkhuyzen, Anna A. E.; Hemani, Gibran; Lee, Sang Hong; Nolte, Ilja M.; Vliet-Ostaptchouk, Jana V. van; Snieder, Harold; Esko, Tonu; Milani, Lili; Mägi, Reedik; Metspalu, Andres; Hill, William G.; Weir, Bruce S.; Goddard, Michael E.; Visscher, Peter M.; Yang, Jian (2015). Dominance genetic variation contributes little to the missing heritability for human complex traits. The American Journal of Human Genetics, 96(3), 377-385. doi:10.1016/j.ajhg.2015.01.001

Paradis, E.; Schliep, K. (2019). Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics, 35, 526-528. doi:10.1093/bioinformatics/bty633