Batch creation of variable with mutate_over #4834
I don't think this belongs in dplyr, but will let @hadley follow up. Another way to approach this would be to:

library(dplyr, warn.conflicts = FALSE)
prodz <- function(prefixes) {
xps <- purrr::map(prefixes, function(prefix) {
rlang::parse_expr(glue::glue("{prefix}.Width * {prefix}.Length"))
})
rlang::set_names(xps, prefixes)
}
prodz(c("Sepal", "Petal"))
#> $Sepal
#> Sepal.Width * Sepal.Length
#>
#> $Petal
#> Petal.Width * Petal.Length
iris %>%
as_tibble() %>%
mutate(!!!prodz(c("Sepal", "Petal")))
#> # A tibble: 150 x 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 17.8 0.280
#> 2 4.9 3 1.4 0.2 setosa 14.7 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 15.0 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 14.3 0.3
#> 5 5 3.6 1.4 0.2 setosa 18 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 21.1 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 15.6 0.42
#> 8 5 3.4 1.5 0.2 setosa 17 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 12.8 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 15.2 0.15
#> # … with 140 more rows

Created on 2020-02-10 by the reprex package (v0.3.0.9000)
Thank you for your follow-up. Let me elaborate a bit more on my motivation for this request/issue:
Here is a (very) quick workaround shot at a function called over:

library(tidyverse)
over <- function(str, xps = NULL) {
  # peek at the data frame currently being mutated (uses a dplyr internal)
  mask <- dplyr:::peek_mask()
  data <- mask$full_data()
  if (is.null(xps)) {
    data
  } else if (is.character(xps)) {
    # build one expression per string in `str` from the glue template in `xps`
    prodz <- function(str) {
      out <- purrr::map(str, function(.x) {
        rlang::parse_expr(glue::glue(xps))
      })
      rlang::set_names(out, str)
    }
    mutate(data, !!!prodz(str))
  } else {
    rlang::abort("`xps` must be NULL, a function, a formula, or a named list of functions/formulas")
  }
}
iris %>%
as_tibble() %>%
mutate(over(c("Sepal", "Petal"), "{.x}.Width * {.x}.Length"))
#> # A tibble: 150 x 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 17.8 0.280
#> 2 4.9 3 1.4 0.2 setosa 14.7 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 15.0 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 14.3 0.3
#> 5 5 3.6 1.4 0.2 setosa 18 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 21.1 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 15.6 0.42
#> 8 5 3.4 1.5 0.2 setosa 17 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 12.8 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 15.2 0.15
#> # … with 140 more rows

Created on 2020-02-10 by the reprex package (v0.3.0)
Can you please try explaining what you want another way? I’m not currently understanding what you want.
I’ll give it another try, although it might be repetitive in some parts: A common data-wrangling operation is to create a new variable based on two or more other existing variables. In dplyr this can be done with mutate and case_when:

library(tidyverse)
iris_tbl <-
iris %>%
as_tibble()
# simple example of what we want to do repeatedly
iris_tbl %>%
mutate(Sepal.Ratio = case_when(
Sepal.Length / Sepal.Width > 2 ~ T,
T ~ F)
)

Often, data-sets have a bunch of variables with similar patterns on which one wants to perform similar data-wrangling tasks. I often see people who use the tidyverse either repeat similar mutate calls or fall back on a standard for loop:

# standard for loop for creating many variables with similar patterns
for (var in c("Sepal", "Petal")) {
iris_tbl[[paste0(var,".Ratio")]] <- ifelse(iris_tbl[[paste0(var,".Length")]] / iris_tbl[[paste0(var,".Width")]] > 2, T, F)
}

There are pipe-friendly tidyverse workarounds to tackle this kind of data-wrangling task, but they are less known. One that I found uses a custom function together with purrr::reduce:

# tidyverse workaround: custom function + reduce
gen_vars <- function(df, x) {
mutate(df,
"{x}.Ratio" := case_when(
!! sym(paste0(x, ".Length")) / !! sym(paste0(x, ".Width")) > 2 ~ T,
T ~ F
)
)
}
my_vars <- c("Petal", "Sepal")
iris_tbl %>%
reduce(my_vars, gen_vars, .init = .)

For more complex cases these workarounds usually take three steps. It would be great if dplyr had an official function for this kind of batch creation, for example a mutate_over function which mutates a tibble using a string vector:

# dplyr feature request:
# pseudo code (not working!) basic version
iris_tbl %>%
mutate_over(c("Sepal", "Petal"),
"{x}.Ratio" := case_when(
!! sym(paste0(x, ".Length")) / !! sym(paste0(x, ".Width")) > 2 ~ T,
T ~ F)
)

Since the scoped variants of mutate are being superseded, an alternative could be an over function used inside mutate, similar to dplyr::across:

# or over as alternative to dplyr::across
# pseudo code - not working!
iris_tbl %>%
mutate(over(c("Sepal", "Petal"),
"{x}.Ratio" := case_when(
!! sym(paste0(x, ".Length")) / !! sym(paste0(x, ".Width")) > 2 ~ T,
T ~ F)
)
)

Ideally, such a function would further accept helpers that generate the string vector (for example a regex helper) and make the !! sym construction less verbose, maybe via the curly-curly operator:

# example with advanced functionality of the curly-curly operator and a regex helper
# pseudo code - not working!
iris_tbl %>%
mutate(over(suffix(".Length|.Width"), # <- fictional regex helper function
"{x}.Ratio" := case_when(
{{ glue("{x}.Length") }} / {{ glue("{x}.Width") }} > 2 ~ T,
T ~ F)
)
)
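As an aside, the basic version of this wish can already be approximated with existing rlang tools, along the same lines as the prodz() helper earlier in the thread: build one case_when() expression per prefix from a glue template and splice the named list into mutate() with !!!. This is only a sketch, not a proposed dplyr API; ratio_exprs() is a made-up name.

library(dplyr, warn.conflicts = FALSE)

# made-up helper: one case_when() expression per prefix, named "<prefix>.Ratio"
ratio_exprs <- function(prefixes) {
  xps <- purrr::map(prefixes, function(.x) {
    rlang::parse_expr(glue::glue(
      "case_when({.x}.Length / {.x}.Width > 2 ~ TRUE, TRUE ~ FALSE)"
    ))
  })
  rlang::set_names(xps, paste0(prefixes, ".Ratio"))
}

iris %>%
  as_tibble() %>%
  mutate(!!!ratio_exprs(c("Sepal", "Petal")))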
That feels pretty special purpose to me, so I don’t think it belongs in dplyr, but it would be a great function for another package.
A short follow-up to this old issue: I followed the advice and created a dplyr add-on package that helps with the problems described above (see here). Since the scope of the package has widened somewhat compared to the very special use case I presented above in this issue, I just wanted to report back and let the dplyr team know about the package and its major functions. Nevertheless, I still assume that only a fraction of dplyr users would benefit from its functionality. In a nutshell, the package builds a small family of functions around dplyr's across().
Some use cases would be:

Creating many lags / leads of one variable:

library(dplyr)
library(dplyover)
iris_tbl <- as_tibble(iris)
tibble(a = 1:25) %>%
mutate(over(c(1:3),
list(lag = ~ lag(a, .x),
lead = ~ lead(a, .x)),
.names = "a_{fn}{x}"))
#> # A tibble: 25 x 7
#> a a_lag1 a_lead1 a_lag2 a_lead2 a_lag3 a_lead3
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 NA 2 NA 3 NA 4
#> 2 2 1 3 NA 4 NA 5
#> 3 3 2 4 1 5 NA 6
#> 4 4 3 5 2 6 1 7
#> 5 5 4 6 3 7 2 8
#> 6 6 5 7 4 8 3 9
#> 7 7 6 8 5 9 4 10
#> 8 8 7 9 6 10 5 11
#> 9 9 8 10 7 11 6 12
#> 10 10 9 11 8 12 7 13
#> # … with 15 more rows

Creating many lags / leads of a set of variables:

iris_tbl %>%
transmute(
crossover(starts_with("sepal"),
1:5,
list(lag = ~ lag(.x, .y)),
.names = "{xcol}_{fn}{y}")) %>%
glimpse
#> Rows: 150
#> Columns: 10
#> $ Sepal.Length_lag1 <dbl> NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.…
#> $ Sepal.Length_lag2 <dbl> NA, NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4…
#> $ Sepal.Length_lag3 <dbl> NA, NA, NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0,…
#> $ Sepal.Length_lag4 <dbl> NA, NA, NA, NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, …
#> $ Sepal.Length_lag5 <dbl> NA, NA, NA, NA, NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4…
#> $ Sepal.Width_lag1 <dbl> NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.…
#> $ Sepal.Width_lag2 <dbl> NA, NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9…
#> $ Sepal.Width_lag3 <dbl> NA, NA, NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4,…
#> $ Sepal.Width_lag4 <dbl> NA, NA, NA, NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, …
#> $ Sepal.Width_lag5 <dbl> NA, NA, NA, NA, NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3…

Applying functions to variable pairs (two sets of variables):

iris_tbl %>%
transmute(across2(ends_with("Length"),
ends_with("Width"),
.fns = list(product = ~ .x * .y,
sum = ~ .x + .y),
.names = "{pre}_{fn}",
.names_fn = tolower))
#> # A tibble: 150 x 4
#> sepal_product sepal_sum petal_product petal_sum
#> <dbl> <dbl> <dbl> <dbl>
#> 1 17.8 8.6 0.280 1.60
#> 2 14.7 7.9 0.280 1.60
#> 3 15.0 7.9 0.26 1.5
#> 4 14.3 7.7 0.3 1.7
#> 5 18 8.6 0.280 1.60
#> 6 21.1 9.3 0.68 2.1
#> 7 15.6 8 0.42 1.7
#> 8 17 8.4 0.3 1.7
#> 9 12.8 7.3 0.280 1.60
#> 10 15.2 8 0.15 1.6
#> # … with 140 more rows
I want to suggest that wrangling longitudinal data could frequently use this functionality. Often there are variable sets with a common wave-specific suffix, and new variables with that suffix are created from expressions of other variables sharing the suffix. The across2 function delivers this for 2 variables (thanks @TimTeaFan). I have also used the method of across() over the var1-suffix columns with a get() of some string function of the var2 stem and cur_column()'s suffix (see: https://stackoverflow.com/questions/65543579/can-you-use-dplyr-across-to-iterate-across-pairs-of-columns). I realize that I could pivot the data long by suffix, do the mutate once, and then pivot back to wide. However this seems inefficient if there are many columns not being pivoted (unless a select is done inside the mutate).

df <- data.frame(cats1=1:5, dogs1=6:10, cats2=11:15, dogs2=16:20)
mutate(df, across(.cols=matches("cats"),
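The snippet above is cut off. As a sketch of how the across()/cur_column()/get() approach from that Stack Overflow question could be completed (the summing operation is just illustrative, not necessarily what the commenter had in mind):

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(cats1 = 1:5, dogs1 = 6:10, cats2 = 11:15, dogs2 = 16:20)

# for each "cats" column, look up the "dogs" column with the same wave suffix
# via cur_column() + get(); with a named list of functions the new columns get
# across()'s default names (cats1_sum, cats2_sum)
mutate(df, across(.cols = matches("cats"),
                  .fns = list(sum = ~ .x + get(sub("cats", "dogs", cur_column())))))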
When doing data-wrangling, a common task is to create new variables on the basis of several other variables. In dplyr this can be done using mutate & case_when. Often (though maybe not as common) several variables have to be created in similar ways. At the moment, it seems, mutate_at does not allow this kind of batch creation of variables with case_when (or at least I am not aware of it - on SO similar questions have been unanswered or at least do not seem to offer a way to do it with mutate_at - see for example here and here).

It would be great if dplyr had an official way of dealing with this kind of data-wrangling task. This could be a function called mutate_over, which mutates a tibble using a string vector, or it could be just an argument within mutate_at which would allow this kind of operation. There is a pipe-friendly workaround for this kind of batch creation of variables using purrr::reduce. However, I think that this way is not very obvious, and, being a pure data-wrangling task, this functionality would fit into dplyr. Although I understand that it would lead to inconsistencies when having a mutate_over function but not filter_over etc. Below I provide three examples with increasing complexity.

Created on 2020-02-05 by the reprex package (v0.3.0)
Feature wishlist:

- A regex helper (like the fictional suffix() above) that generates the string vector passed to mutate_over.
- Making the construction of calls with !! sym less verbose, maybe by extending the functionality of the curly-curly operator or by introducing another similar operator.
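A rough idea of how the regex helper from the first wishlist item could look (suffix_stems() is a made-up name, not an existing dplyr or dplyover function): take all column names matching a suffix pattern and return their unique stems, ready to be passed to a mutate_over()/over()-style function.

# made-up helper: unique stems of all columns whose names match `pattern`
suffix_stems <- function(data, pattern) {
  nms <- grep(pattern, names(data), value = TRUE)
  unique(sub(pattern, "", nms))
}

suffix_stems(iris, "\\.Length$|\\.Width$")  # c("Sepal", "Petal")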