---
title: 'statsExpressions: R Package for Tidy Dataframes and Expressions with Statistical Details'
tags:
- R
- parametric statistics
- nonparametric statistics
- robust statistics
- Bayesian statistics
- tidy
authors:
- name: Indrajeet Patil
  orcid: 0000-0003-1995-6531
  affiliation: 1
affiliations:
- name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
date: "`r Sys.Date()`"
year: 2021
bibliography: paper.bib
output: rticles::joss_article
csl: apa.csl
journal: JOSS
link-citations: yes
header-includes:
- \usepackage{tabularx}
- \usepackage{booktabs}
- \usepackage{tikz}
---
```{r echo=FALSE}
# to pretty-print all columns in the output tibble
options(
  tibble.width = Inf,
  pillar.bold = TRUE,
  pillar.neg = TRUE,
  pillar.subtle_num = TRUE,
  pillar.min_chars = Inf
)
knitr::opts_chunk$set(
  collapse = TRUE,
  dpi = 300,
  warning = FALSE,
  message = FALSE,
  out.width = "100%",
  comment = "#>"
)
```
# Summary
The `statsExpressions` package is designed to facilitate producing dataframes
with rich statistical details for the most common types of statistical
approaches and tests: parametric, nonparametric, robust, and Bayesian versions
of the *t*-test, one-way ANOVA, correlation analyses, contingency table
analyses, and meta-analyses. The functions are pipe-friendly and provide a
consistent syntax to work with tidy data. These dataframes additionally contain
expressions with statistical details, which can be used in graphing packages to
display these details.
# Statement of need
The aim of this package is to provide an approachable and intuitive syntax to
carry out common statistical tests across diverse statistical approaches.
# Comparison to Other Packages
Behind the scenes, `statsExpressions` uses the `stats` package for parametric
and non-parametric [@base2021], the `WRS2` package for robust [@Mair2020], and
the `BayesFactor` package for Bayesian statistics [@Morey2020]. Additionally,
random-effects meta-analysis is carried out using the `metafor` (parametric)
[@Viechtbauer2010], `metaplus` (robust) [@Beath2016], and `metaBMA` (Bayesian)
[@Heck2019] packages. One can therefore naturally ask why there needs to be
another package that wraps around these packages.
These packages differ considerably in their syntax and expected input types,
which can make it difficult to switch from one statistical approach to another.
For example, some functions expect vectors as inputs, while others expect
dataframes. Depending on whether it is a repeated measures design or not,
different functions might expect data to be in wide or long format. Some
functions can internally omit missing values, while other functions error in
their presence. Furthermore, utilizing the objects returned by these packages
downstream in a workflow is not straightforward either, because even functions
from the same package can return a list, a matrix, an array, a dataframe, etc.,
depending on the function.
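A few of these inconsistencies can be seen with base R alone. The following is a minimal sketch using only the `stats` package and the built-in `mtcars` data; the object names are chosen here for illustration, and the calls are examples rather than an exhaustive survey.

```r
# Base R only; `mtcars` ships with R.

# Input conventions differ: a formula plus a data frame here ...
tt <- t.test(wt ~ am, data = mtcars)
# ... while `cor()` has no formula interface and takes vectors or matrices.
r <- cor(mtcars$wt, mtcars$mpg)

# Return types differ as well, even within the `stats` package.
cor_mat <- cor(mtcars[, c("wt", "mpg")])
agg <- aggregate(wt ~ am, data = mtcars, FUN = mean)
is.list(tt)        # an "htest" object, which is a list
is.matrix(cor_mat) # a numeric matrix
is.data.frame(agg) # a data frame

# Missing values are handled differently, too.
x <- c(mtcars$wt[1:10], NA)
y <- c(mtcars$mpg[1:10], NA)
df_na <- t.test(x)$parameter # the NA is dropped silently
r_na <- cor(x, y)            # NA under the default `use = "everything"`
```

Each call is reasonable on its own, but switching between them means remembering a different input convention, return type, and missing-data behaviour every time.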
The result of sustained exposure to such inconsistencies is that data
exploration can become a cognitively demanding task and discourage users from
exploring different statistical approaches. In the long run, this might even
solidify into a habit of sticking to the defaults without giving much thought to
the alternative approaches (e.g., exploring if Bayesian hypothesis testing is to
be preferred over null hypothesis significance testing in the context of the
problem).
This is where `statsExpressions` comes in: It can be thought of as a unified
portal through which most of the functionality in these underlying packages can
be accessed, with little to no cognitive overhead. The package offers just six
primary functions that let users choose a statistical approach without changing
the syntax. The users are always expected to provide a dataframe in tidy format
[@Wickham2019] to functions, all functions work with missing data, and they
always return a dataframe that can be further utilized downstream in the
pipeline (e.g., for a visualization).
Function | Parametric | Non-parametric | Robust | Bayesian
------------------ | ---- | ----- | ----| -----
`one_sample_test` | \checkmark | \checkmark | \checkmark | \checkmark
`two_sample_test` | \checkmark | \checkmark | \checkmark | \checkmark
`oneway_anova` | \checkmark | \checkmark | \checkmark | \checkmark
`corr_test` | \checkmark | \checkmark | \checkmark | \checkmark
`contingency_table` | \checkmark | \checkmark | - | \checkmark
`meta_analysis` | \checkmark | - | \checkmark | \checkmark
: A summary table listing the primary functions in the package and the
statistical approaches they support. For a more detailed description of the
tests and outputs from these functions, the readers are encouraged to read
vignettes on the package website: <https://indrajeetpatil.github.io/statsExpressions/articles/>.
Note that, unlike `broom` [@Robinson2021] or `parameters`
[@Lüdecke2020parameters], the goal of `statsExpressions` is not to convert
model objects into tidy dataframes, but to provide a consistent and easy syntax
to carry out statistical tests.
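The idea of selecting a statistical approach through a single argument, while always receiving a dataframe back, can be sketched in base R. The helper below, `toy_two_sample_test`, is hypothetical and is not how `statsExpressions` is implemented; it takes column names as strings for simplicity and covers only two approaches.

```r
# Hypothetical helper, NOT part of `statsExpressions`: one function, one
# `type` argument, and always a data frame as the return value.
toy_two_sample_test <- function(data, x, y,
                                type = c("parametric", "nonparametric")) {
  type <- match.arg(type)
  # build the formula `y ~ x` from column names given as strings
  f <- stats::reformulate(x, response = y)
  res <- switch(type,
    # Welch's t-test
    parametric = stats::t.test(f, data = data),
    # Mann-Whitney U test (normal approximation, since the data contain ties)
    nonparametric = stats::wilcox.test(f, data = data, exact = FALSE)
  )
  data.frame(
    method = res$method,
    statistic = unname(res$statistic),
    p.value = res$p.value
  )
}

# same call, different approach
out_p <- toy_two_sample_test(mtcars, "am", "wt", type = "parametric")
out_np <- toy_two_sample_test(mtcars, "am", "wt", type = "nonparametric")
```

Because both results share the same columns, they can be stacked with `rbind(out_p, out_np)` or passed on to later steps without any type-specific handling.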
# Tidy Dataframes from Statistical Analysis
All functions return dataframes containing exhaustive details from inferential
statistics, and appropriate effect size/posterior estimates and their
confidence/credible intervals. The package internally relies on the `easystats`
ecosystem of packages to achieve this [@Ben-Shachar2020; @Lüdecke2020parameters;
@Lüdecke2020performance; @Lüdecke2019; @Makowski2019; @Makowski2020].
To illustrate the simplicity of this syntax, let's say we want to compare a
measure between two independent groups. We can use the `two_sample_test`
function here.
If we first run a parametric *t*-test:
```{r df_p}
# loading needed package
library(statsExpressions)
# Welch's t-test
mtcars %>% two_sample_test(am, wt, type = "parametric")
```
If we then decide to run a robust *t*-test instead, the syntax remains the same:
```{r df_r}
# Yuen's t-test
mtcars %>% two_sample_test(am, wt, type = "robust")
```
These functions also play nicely with other popular data manipulation packages.
For example, we can use `dplyr` to repeat the same analysis across *all* levels
of a certain grouping variable:
```{r grouped_df}
# needed to do grouped analysis
suppressPackageStartupMessages(library(dplyr))
# running one-sample proportion test for all levels of `cyl`
mtcars %>%
  group_by(cyl) %>%
  group_modify(~ contingency_table(.x, am), .keep = TRUE) %>%
  ungroup()
```
# Expressions for Plots
In addition to other details contained in the dataframe, there is also a column
titled `expression`, which contains an expression with statistical details that
can be displayed in a plot (Figure 1). Displaying statistical results in the
context of a visualization is indeed a philosophy adopted by the `ggstatsplot`
package [@Patil2021], and `statsExpressions` functions as its statistical
processing backend.
```{r robanova, fig.cap="Example illustrating how `statsExpressions` functions can be used to display results from a statistical test in a plot."}
# loading needed packages
library(ggplot2)
library(palmerpenguins) # for data
library(ggridges) # for creating a ridgeplot
# creating a dataframe with results and expression
res <- oneway_anova(penguins, species, bill_length_mm, type = "robust")
# create a ridgeplot using `ggridges` package
ggplot(penguins, aes(x = bill_length_mm, y = species)) +
  geom_density_ridges(
    jittered_points = TRUE, quantile_lines = TRUE,
    scale = 0.9, vline_size = 1, vline_color = "red",
    position = position_raincloud(adjust_vlines = TRUE)
  ) + # use 'expression' column to display results in the subtitle
  labs(
    title = "A heteroscedastic one-way ANOVA for trimmed means",
    subtitle = res$expression[[1]]
  )
```
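For readers unfamiliar with expressions in plots, the mechanism can be illustrated with base R graphics, assuming the `expression` column holds plotmath-style language objects. The expression below is written by hand and its numbers are placeholders, not the output of any test.

```r
# A hand-written plotmath expression; the values are placeholders and do
# not come from a statistical test.
stat_expr <- quote(
  paste(italic("t"), "(28) = 2.50, ", italic("p"), " = 0.018")
)

# A formula with a factor on the right-hand side draws a boxplot; the
# expression is then rendered with italic symbols in the subtitle.
plot(wt ~ factor(am), data = mtcars, xlab = "am", ylab = "wt")
title(sub = stat_expr)
```

Storing the formatted result as a language object rather than a plain string is what allows italic symbols and other mathematical notation to be rendered in the plot.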
The details contained in these expressions (Figure 2) attempt to follow the gold
standard in statistical reporting for both Bayesian [@van2020jasp] and
non-Bayesian [@american1985publication] framework tests.
```{r expr_template, echo=FALSE, fig.cap="The templates used in `statsExpressions` to display statistical details in a plot."}
knitr::include_graphics("stats_reporting_format.png")
```
# Licensing and Availability
`statsExpressions` is licensed under the GNU General Public License (v3.0), with all
source code stored at [GitHub](https://github.com/IndrajeetPatil/statsExpressions/).
In the spirit of honest and open science, requests and suggestions for fixes,
feature updates, as well as general questions and concerns are encouraged via
direct interaction with contributors and developers by filing an
[issue](https://github.com/IndrajeetPatil/statsExpressions/issues) while respecting
[*Contribution Guidelines*](https://indrajeetpatil.github.io/statsExpressions/CONTRIBUTING.html).
# Acknowledgements
I would like to acknowledge the support of Mina Cikara, Fiery Cushman, and Iyad
Rahwan during the development of this project. `statsExpressions` relies heavily
on the [`easystats`](https://github.com/easystats/easystats) ecosystem, a
collaborative project created to facilitate the usage of `R` for statistical
analyses. Thus, I would like to thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.
# References