-
Notifications
You must be signed in to change notification settings - Fork 88
/
README.Rmd
226 lines (173 loc) · 7.4 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
output: github_document
---
```{r setup, echo = FALSE}
library(DataExplorer)
library(knitr)
library(ggplot2)
library(gridExtra)
set.seed(1)
knitr::opts_chunk$set(
eval = FALSE,
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# DataExplorer <img src="man/figures/logo.png" align="right" width="130" height="150"/>
[![CRAN Version](http://www.r-pkg.org/badges/version/DataExplorer)](https://cran.r-project.org/package=DataExplorer)
[![Downloads](http://cranlogs.r-pkg.org/badges/DataExplorer)](https://cran.r-project.org/package=DataExplorer)
[![Total Downloads](http://cranlogs.r-pkg.org/badges/grand-total/DataExplorer)](https://cran.r-project.org/package=DataExplorer)
[![GitHub Stars](https://img.shields.io/github/stars/boxuancui/DataExplorer.svg?style=social)](https://github.com/boxuancui/DataExplorer)
[![R-CMD-check](https://github.com/boxuancui/DataExplorer/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/boxuancui/DataExplorer/actions/workflows/check-standard.yaml)
[![codecov](https://codecov.io/gh/boxuancui/DataExplorer/graph/badge.svg?token=w8eMGjF8Jw)](https://app.codecov.io/gh/boxuancui/DataExplorer)
[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/2053/badge)](https://bestpractices.coreinfrastructure.org/projects/2053)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](http://boxuancui.github.io/DataExplorer/code_of_conduct.html)
## Background
[Exploratory Data Analysis (EDA)](https://en.wikipedia.org/wiki/Exploratory_data_analysis) is the initial and an important phase of data analysis/predictive modeling. During this process, analysts/modelers will have a first look of the data, and thus generate relevant hypotheses and decide next steps. However, the EDA process could be a hassle at times. This [R](https://cran.r-project.org/) package aims to automate most of data handling and visualization, so that users could focus on studying the data and extracting insights.
## Installation
The package can be installed directly from CRAN.
```{r install-cran}
install.packages("DataExplorer")
```
However, the latest stable version (if any) could be found on [GitHub](https://github.com/boxuancui/DataExplorer), and installed using `devtools` package.
```{r install-github}
if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer")
```
If you would like to install the latest [development version](https://github.com/boxuancui/DataExplorer/tree/develop), you may install the develop branch.
```{r install-github-develop}
if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")
```
## Examples
The package is extremely easy to use. Almost everything could be done in one line of code. Please refer to the package manuals for more information. You may also find the package vignettes [here](https://boxuancui.github.io/DataExplorer/articles/dataexplorer-intro.html).
#### Report
To get a report for the [airquality](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html) dataset:
```{r create-report-airquality}
library(DataExplorer)
create_report(airquality)
```
To get a report for the [diamonds](https://ggplot2.tidyverse.org/reference/diamonds.html) dataset with response variable **price**:
```{r create-report-diamonds}
library(ggplot2)
create_report(diamonds, y = "price")
```
#### Visualization
Instead of running `create_report`, you may also run each function individually for your analysis, e.g.,
```{r introduce-template}
## View basic description for airquality data
introduce(airquality)
```
```{r introduce, echo=FALSE, eval=TRUE}
kable(t(introduce(airquality)), row.names = TRUE, col.names = "", format.args = list(big.mark = ","))
```
```{r plot-intro, eval=TRUE}
## Plot basic description for airquality data
plot_intro(airquality)
```
```{r plot-missing, eval=TRUE}
## View missing value distribution for airquality data
plot_missing(airquality)
```
```{r plot-bar-template}
## Left: frequency distribution of all discrete variables
plot_bar(diamonds)
## Right: `price` distribution of all discrete variables
plot_bar(diamonds, with = "price")
```
```{r plot-bar-prepare, include=FALSE, echo=FALSE, eval=TRUE}
plot_bar_a <- plot_bar(diamonds, nrow = 3L, ncol = 1L)
plot_bar_b <- plot_bar(diamonds, with = "price", nrow = 3L, ncol = 1L)
```
```{r plot-bar, echo=FALSE, eval=TRUE}
DataExplorer:::plotDataExplorer.grid(
plot_obj = list(plot_bar_a[[1]], plot_bar_b[[1]]),
page_layout = list("Page 1" = seq.int(2L)),
nrow = 1L,
ncol = 2L,
ggtheme = theme_gray(),
theme_config = list(),
title = NULL
)
```
```{r plot-bar-by, eval=TRUE}
## View frequency distribution by a discrete variable
plot_bar(diamonds, by = "cut")
```
```{r plot-histogram-template}
## View histogram of all continuous variables
plot_histogram(diamonds)
```
```{r plot-histogram, echo=FALSE, eval=TRUE}
plot_histogram(diamonds, nrow = 3L, ncol = 3L)
```
```{r plot-density-template}
## View estimated density distribution of all continuous variables
plot_density(diamonds)
```
```{r plot-density, echo=FALSE, eval=TRUE}
plot_density(diamonds, nrow = 3L, ncol = 3L)
```
```{r plot-qq-template}
## View quantile-quantile plot of all continuous variables
plot_qq(diamonds)
```
```{r plot-qq, echo=FALSE, eval=TRUE}
plot_qq(diamonds, sampled_rows = 500L, geom_qq_args = list("size" = 0.5))
```
```{r plot-qq-cut-template}
## View quantile-quantile plot of all continuous variables by feature `cut`
plot_qq(diamonds, by = "cut")
```
```{r plot-qq-cut, echo=FALSE, eval=TRUE}
plot_qq(diamonds, by = "cut", sampled_rows = 500L, geom_qq_args = list("size" = 0.5))
```
```{r plot_correlation, eval=TRUE}
## View overall correlation heatmap
plot_correlation(diamonds)
```
```{r plot_boxplot-template}
## View bivariate continuous distribution based on `cut`
plot_boxplot(diamonds, by = "cut")
```
```{r plot_boxplot, echo=FALSE, eval=TRUE}
plot_boxplot(diamonds, by = "cut", nrow = 3L, ncol = 3L)
```
```{r plot_scatterplot-template}
## Scatterplot `price` with all other continuous features
plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)
```
```{r plot_scatterplot, echo=FALSE, eval=TRUE}
plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L, geom_point_args = list("size" = 0.5))
```
```{r plot_prcomp-template}
## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)
```
```{r plot_prcomp, echo=FALSE, eval=TRUE}
plot_prcomp(diamonds, maxcat = 5L, nrow = 2L, ncol = 2L)
```
#### Feature Engineering
To make quick updates to your data:
```{r, eval=FALSE}
## Group bottom 20% `clarity` by frequency
group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)
## Group bottom 20% `clarity` by `price`
group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)
## Dummify diamonds dataset
dummify(diamonds)
dummify(diamonds, select = "cut")
## Set values for missing observations
df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
df[sample.int(260, 50), ] <- NA
set_missing(df, list(0L, "unknown"))
## Update columns
update_columns(airquality, c("Month", "Day"), as.factor)
update_columns(airquality, 1L, function(x) x^2)
## Drop columns
drop_columns(diamonds, 8:10)
drop_columns(diamonds, "clarity")
```
## Articles
See [article wiki page](https://github.com/boxuancui/DataExplorer/wiki/Articles).