Skip to content

Commit

Permalink
Fixed #90 #94 #98 #100
Browse files Browse the repository at this point in the history
  • Loading branch information
boxuancui committed Mar 10, 2019
1 parent 31faa23 commit 219a17a
Show file tree
Hide file tree
Showing 14 changed files with 151 additions and 119 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Package: DataExplorer
Title: Automate data exploration and treatment
Version: 0.7.1
Version: 0.7.1.9000
Authors@R: person("Boxuan", "Cui", email = "[email protected]", role = c("aut", "cre"))
Description: Automated data exploration process for analytical tasks and predictive modeling, so
that users could focus on understanding data and extracting insights. The package scans and
Expand Down Expand Up @@ -31,6 +31,6 @@ License: MIT + file LICENSE
LazyData: true
URL: http://boxuancui.github.io/DataExplorer/
BugReports: https://github.com/boxuancui/DataExplorer/issues
RoxygenNote: 6.1.0
RoxygenNote: 6.1.1
Encoding: UTF-8
VignetteBuilder: knitr
11 changes: 10 additions & 1 deletion NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,12 @@
# DataExplorer 0.7.1.9000
## Bug Fixes
* [#90](https://github.com/boxuancui/DataExplorer/issues/90): Added first plot in `plot_prcomp` to output and `page_0`.
* [#94](https://github.com/boxuancui/DataExplorer/issues/94): Fixed typo.

## Enhancements
* [#98](https://github.com/boxuancui/DataExplorer/issues/98): Added band customization to `plot_missing`.
* [#100](https://github.com/boxuancui/DataExplorer/issues/100): Switched `geom_text` to `geom_label`.

# DataExplorer 0.7.1
## Enhancements
* [#86](https://github.com/boxuancui/DataExplorer/issues/86): Replaced `gridExtra::grid.arrange` with facets.
Expand Down Expand Up @@ -38,7 +47,7 @@

# DataExplorer 0.6.0
## New Features
* [#15](https://github.com/boxuancui/DataExplorer/issues/15): Added `plot_prcomp` to visualize principle component analysis.
* [#15](https://github.com/boxuancui/DataExplorer/issues/15): Added `plot_prcomp` to visualize principal component analysis.
* [#54](https://github.com/boxuancui/DataExplorer/issues/54): Extracted `dummify` from `plot_correlation` as a new function.
* [#59](https://github.com/boxuancui/DataExplorer/issues/59): Added `introduce` for basic metadata.

Expand Down
2 changes: 1 addition & 1 deletion R/plot_intro.r
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ plot_intro <- function(data, title = NULL, ggtheme = theme_gray(), theme_config
## Create ggplot object
output <- ggplot(intro2, aes(x = reorder(variable, -id), y = value, fill = dimension)) +
geom_bar(stat = "identity") +
geom_text(aes(label = percent(value))) +
geom_label(aes(label = percent(value))) +
scale_y_continuous(labels = percent) +
scale_fill_discrete("Dimension") +
coord_flip() +
Expand Down
51 changes: 31 additions & 20 deletions R/plot_missing.r
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
#'
#' This function returns and plots frequency of missing values for each feature.
#' @param data input data
#' @param group missing profile band taking a list of group name and group upper bounds. Default is \code{list("Good" = 0.05, "OK" = 0.4, "Bad" = 0.8, "Remove" = 1)}.
#' @param title plot title
#' @param ggtheme complete ggplot2 themes. The default is \link{theme_gray}.
#' @param theme_config a list of configurations to be passed to \link{theme}.
Expand All @@ -12,25 +13,35 @@
#' @seealso \link{profile_missing}
#' @examples
#' plot_missing(airquality)
#' plot_missing(airquality, group = list("B1" = 0, "B2" = 0.06, "B3" = 1))

plot_missing <- function(data, title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))) {
## Declare variable first to pass R CMD check
pct_missing <- NULL
## Profile missing values
missing_value <- profile_missing(data)
## Create ggplot object
output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
coord_flip() +
xlab("Features") + ylab("Missing Rows")
## Plot object
class(output) <- c("single", class(output))
plotDataExplorer(
plot_obj = output,
title = title,
ggtheme = ggtheme,
theme_config = theme_config
)
plot_missing <- function(data, group = list("Good" = 0.05, "OK" = 0.4, "Bad" = 0.8, "Remove" = 1), title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))) {
## Declare variable first to pass R CMD check
pct_missing <- Band <- NULL
## Profile missing values
missing_value <- data.table(profile_missing(data))
## Sort group based on value
group <- group[sort.list(unlist(group))]
invisible(lapply(seq_along(group), function(i) {
if (i == 1) {
missing_value[pct_missing <= group[[i]], Band := names(group)[i]]
} else {
missing_value[pct_missing > group[[i-1]] & pct_missing <= group[[i]], Band := names(group)[i]]
}
}))
## Create ggplot object
output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "Band")) +
geom_bar(stat = "identity") +
geom_label(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
scale_fill_discrete("Band") +
coord_flip() +
xlab("Features") + ylab("Missing Rows")
## Plot object
class(output) <- c("single", class(output))
plotDataExplorer(
plot_obj = output,
title = title,
ggtheme = ggtheme,
theme_config = theme_config
)
}
163 changes: 87 additions & 76 deletions R/plot_prcomp.r
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
#' Visualize principle component analysis
#' Visualize principal component analysis
#'
#' Visualize output of \link{prcomp}.
#' @param data input data
#' @param variance_cap maximum cumulative explained variance allowed for all principle components. Default is 80\%.
#' @param variance_cap maximum cumulative explained variance allowed for all principal components. Default is 80\%.
#' @param maxcat maximum categories allowed for each discrete feature. The default is 50.
#' @param prcomp_args a list of other arguments to \link{prcomp}
#' @param title plot title starting from page 2.
Expand All @@ -13,7 +13,7 @@
#' @param parallel enable parallel? Default is \code{FALSE}.
#' @return invisibly return the named list of ggplot objects
#' @keywords plot_prcomp
#' @details When cumulative explained variance exceeds \code{variance_cap}, remaining principle components will be ignored. Set \code{variance_cap} to 1 for all principle components.
#' @details When cumulative explained variance exceeds \code{variance_cap}, remaining principal components will be ignored. Set \code{variance_cap} to 1 for all principal components.
#' @details Discrete features containing more categories than \code{maxcat} specifies will be ignored.
#' @note Discrete features will be \link{dummify}-ed first before passing to \link{prcomp}.
#' @note Missing values may create issues in \link{prcomp}. Consider \link{na.omit} your input data first.
Expand All @@ -29,77 +29,88 @@
#' plot_prcomp(diamonds, maxcat = 7L)

plot_prcomp <- function(data, variance_cap = 0.8, maxcat = 50L, prcomp_args = list("scale." = TRUE), title = NULL, ggtheme = theme_gray(), theme_config = list(), nrow = 3L, ncol = 3L, parallel = FALSE) {
## Declare variable first to pass R CMD check
pc <- pct <- cum_pct <- Feature <- variable <- value <- NULL
## Check if input is data.table
if (!is.data.table(data)) data <- data.table(data)
## Dummify data
dt <- suppressWarnings(split_columns(dummify(data, maxcat = maxcat))$continuous)
prcomp_args_list <- list("x" = dt, "retx" = FALSE)
## Analyze principle components
pca <- tryCatch(
do.call("prcomp", c(prcomp_args_list, prcomp_args)),
error = function(e) {
message(e$message)
if (grepl("missing", e$message)) stop("\nConsider passing `na.omit(data)` as input.")
}
)

## Calcualte principle components standard deviation
var_exp <- pca$sdev ^ 2
pc_var <- data.table(
"pc" = paste0("PC", seq_along(pca$sdev)),
"var" = var_exp,
"pct" = var_exp / sum(var_exp),
"cum_pct" = cumsum(var_exp) / sum(var_exp)
)
min_cum_pct <- min(pc_var$cum_pct)
pc_var2 <- pc_var[cum_pct <= max(variance_cap, min_cum_pct)]
## Create explained variance plot
varexp_plot <- ggplot(pc_var2, aes(x = reorder(pc, pct), y = pct)) +
geom_bar(stat = "identity") +
geom_text(aes(label = percent(cum_pct)), color = "white", hjust = 1.1) +
scale_y_continuous(labels = percent) +
coord_flip() +
ggtitle(
label = "% Variance Explained By Principle Components",
subtitle = "Note: White texts indicate cumulative % explained variance"
) +
labs(x = "Principle Components", y = "% Variance Explained")
print(varexp_plot)
## Format rotation data
rotation_dt <- data.table(
"Feature" = rownames(pca$rotation),
data.table(pca$rotation)[, seq.int(nrow(pc_var2)), with = FALSE]
)
melt_rotation_dt <- melt.data.table(rotation_dt, id.vars = "Feature")
feature_names <- rotation_dt[["Feature"]]
## Calculate number of pages
layout <- .getPageLayout(nrow, ncol, ncol(rotation_dt) - 1L)
## Create list of ggplot objects
plot_list <- .lapply(
parallel = parallel,
X = layout,
FUN = function(x) {
ggplot(melt_rotation_dt[variable %in% paste0("PC", x)], aes(x = Feature, y = value)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Relative Importance")
}
)
## Plot objects
class(plot_list) <- c("multiple", class(plot_list))
plotDataExplorer(
plot_obj = plot_list,
page_layout = layout,
title = title,
ggtheme = ggtheme,
theme_config = theme_config,
facet_wrap_args = list(
"facet" = ~ variable,
"nrow" = nrow,
"ncol" = ncol,
"scales" = "free_x"
)
)
## Declare variable first to pass R CMD check
pc <- pct <- cum_pct <- Feature <- variable <- value <- NULL
## Check if input is data.table
if (!is.data.table(data)) data <- data.table(data)
## Dummify data
dt <- suppressWarnings(split_columns(dummify(data, maxcat = maxcat))$continuous)
prcomp_args_list <- list("x" = dt, "retx" = FALSE)
## Analyze principal components
pca <- tryCatch(
do.call("prcomp", c(prcomp_args_list, prcomp_args)),
error = function(e) {
message(e$message)
if (grepl("missing", e$message)) stop("\nConsider passing `na.omit(data)` as input.")
}
)

## Calcualte principal components standard deviation
var_exp <- pca$sdev ^ 2
pc_var <- data.table(
"pc" = paste0("PC", seq_along(pca$sdev)),
"var" = var_exp,
"pct" = var_exp / sum(var_exp),
"cum_pct" = cumsum(var_exp) / sum(var_exp)
)
min_cum_pct <- min(pc_var$cum_pct)
pc_var2 <- pc_var[cum_pct <= max(variance_cap, min_cum_pct)]
## Create explained variance plot
varexp_plot <- ggplot(pc_var2, aes(x = reorder(pc, pct), y = pct)) +
geom_bar(stat = "identity") +
geom_label(aes(label = percent(cum_pct))) +
scale_y_continuous(labels = percent) +
coord_flip() +
labs(x = "Principal Components", y = "% Variance Explained")
# ggtitle(
# label = "% Variance Explained By Principal Components",
# subtitle = "Note: Labels indicate cumulative % explained variance"
# ) +
#
# print(varexp_plot)
## Format rotation data
rotation_dt <- data.table(
"Feature" = rownames(pca$rotation),
data.table(pca$rotation)[, seq.int(nrow(pc_var2)), with = FALSE]
)
melt_rotation_dt <- melt.data.table(rotation_dt, id.vars = "Feature")
feature_names <- rotation_dt[["Feature"]]
## Calculate number of pages
layout <- .getPageLayout(nrow, ncol, ncol(rotation_dt) - 1L)
## Create list of ggplot objects
plot_list <- .lapply(
parallel = parallel,
X = layout,
FUN = function(x) {
ggplot(melt_rotation_dt[variable %in% paste0("PC", x)], aes(x = Feature, y = value)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Relative Importance")
}
)
## Plot objects
class(varexp_plot) <- c("single", class(varexp_plot))
class(plot_list) <- c("multiple", class(plot_list))
invisible(c(
list(
"page_0" = plotDataExplorer(
plot_obj = varexp_plot,
title = "% Variance Explained By Principal Components\n(Note: Labels indicate cumulative % explained variance)",
ggtheme = ggtheme,
theme_config = theme_config
)),
plotDataExplorer(
plot_obj = plot_list,
page_layout = layout,
title = title,
ggtheme = ggtheme,
theme_config = theme_config,
facet_wrap_args = list(
"facet" = ~ variable,
"nrow" = nrow,
"ncol" = ncol,
"scales" = "free_x"
)
)
))
}
6 changes: 1 addition & 5 deletions R/profile_missing.r
Original file line number Diff line number Diff line change
Expand Up @@ -25,11 +25,7 @@ profile_missing <- function(data) {
"num_missing" = sapply(data, function(x) {sum(is.na(x))})
)
missing_value[, feature := factor(feature, levels = feature[order(-rank(num_missing))])]
missing_value[, pct_missing := num_missing / nrow(data)]
missing_value[pct_missing < 0.05, group := "Good"]
missing_value[pct_missing >= 0.05 & pct_missing < 0.4, group := "OK"]
missing_value[pct_missing >= 0.4 & pct_missing < 0.8, group := "Bad"]
missing_value[pct_missing >= 0.8, group := "Remove"][]
missing_value[, pct_missing := num_missing / nrow(data)][]
## Set data class back to original
if (!is_data_table) class(missing_value) <- data_class

Expand Down
2 changes: 1 addition & 1 deletion README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -181,7 +181,7 @@ plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows
```

```{r plot_prcomp-template}
## Visualize principle component analysis
## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)
```

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -93,7 +93,7 @@ plot_boxplot(diamonds, by = "cut")
## Scatterplot `price` with all other continuous features
plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)

## Visualize principle component analysis
## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)
```

Expand Down
6 changes: 3 additions & 3 deletions inst/rmd_template/report.rmd
Original file line number Diff line number Diff line change
Expand Up @@ -204,9 +204,9 @@ if ("plot_correlation" %in% names(report_config)) {
}
```

```{r principle_component_analysis}
```{r principal_component_analysis}
if ("plot_prcomp" %in% names(report_config)) {
cat("### Principle Component Analysis", fill = TRUE)
cat("### Principal Component Analysis", fill = TRUE)
if (intro[["complete_rows"]] > 0) {
if (!is.null(response)) {
dt <- na.omit(data[, -response, with = FALSE])
Expand All @@ -215,7 +215,7 @@ if ("plot_prcomp" %in% names(report_config)) {
do_call("plot_prcomp", na_omit = TRUE)
}
} else {
message("Insufficient complete rows! Skipping principle component analysis.")
message("Insufficient complete rows! Skipping principal component analysis.")
}
}
```
Expand Down
6 changes: 5 additions & 1 deletion man/plot_missing.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions man/plot_prcomp.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit 219a17a

Please sign in to comment.