Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch creation of variable with mutate_over #4834

Closed
TimTeaFan opened this issue Feb 4, 2020 · 8 comments
Closed

Batch creation of variable with mutate_over #4834

TimTeaFan opened this issue Feb 4, 2020 · 8 comments
Assignees

Comments

@TimTeaFan
Copy link

TimTeaFan commented Feb 4, 2020

When doing data-wrangling, a common task is to create new variables on the basis of several other variables. In dplyr this can be done using mutate & case_when. Often (but not as common maybe) several variables have to be created in similar ways.

At the moment, it seems,mutate_at does not allow this kind batch creation of variables with case_when (or at least I am not aware of it - on SO similar questions have been unanswered or at least do not seem to offer a way to do it with mutate_at - see for example here and here).

It would be great, if dplyr had an official way of dealing with this kind of data-wrangling task, this could be a function called mutate_over, which mutates a tibble using a string vector. Or it could be just an argument within mutate_at which would allow this kind of operations.

There is a pipe-friendly workaround for this kind of batch creation of variables using purrr::reduce. However, I think that this way is not very obvious, plus being a pure data-wrangling task, this functionality would fit to dplyr. Although I understand that it would lead to inconsistencies when having a mutate_over function but not filter_over etc.

Below I provide three examples with increasing complexity.

library(tidyverse)

iris <- as_tibble(iris)

# example 1
# generate product of width and length for each string variable ("Sepal", "Petal") 
gen_vars1 <- function(df, x) {
  
  mutate(df,
         !! x := !! sym(paste0(x, ".Length")) * !! sym(paste0(x, ".Width")))
}

iris %>% 
  reduce(c("Sepal", "Petal"), gen_vars1, .init = .)
#> # A tibble: 150 x 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl> <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa   17.8 0.280
#>  2          4.9         3            1.4         0.2 setosa   14.7 0.280
#>  3          4.7         3.2          1.3         0.2 setosa   15.0 0.26 
#>  4          4.6         3.1          1.5         0.2 setosa   14.3 0.3  
#>  5          5           3.6          1.4         0.2 setosa   18   0.280
#>  6          5.4         3.9          1.7         0.4 setosa   21.1 0.68 
#>  7          4.6         3.4          1.4         0.3 setosa   15.6 0.42 
#>  8          5           3.4          1.5         0.2 setosa   17   0.3  
#>  9          4.4         2.9          1.4         0.2 setosa   12.8 0.280
#> 10          4.9         3.1          1.5         0.1 setosa   15.2 0.15 
#> # … with 140 more rows


#  example 2
# generate logical vector based on two conditions
gen_vars2 <- function(df, x) {
  
  mutate(df,
         "{x}.Ratio" := case_when( 
          !! sym(paste0(x, ".Length")) >= 4 & !! sym(paste0(x, ".Width")) <= 2.2 ~ T,
          T ~ F
           )
         )
}

iris %>% 
  reduce(c("Sepal", "Petal"), gen_vars2, .init = .)
#> # A tibble: 150 x 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <lgl>      
#>  1          5.1         3.5          1.4         0.2 setosa  FALSE      
#>  2          4.9         3            1.4         0.2 setosa  FALSE      
#>  3          4.7         3.2          1.3         0.2 setosa  FALSE      
#>  4          4.6         3.1          1.5         0.2 setosa  FALSE      
#>  5          5           3.6          1.4         0.2 setosa  FALSE      
#>  6          5.4         3.9          1.7         0.4 setosa  FALSE      
#>  7          4.6         3.4          1.4         0.3 setosa  FALSE      
#>  8          5           3.4          1.5         0.2 setosa  FALSE      
#>  9          4.4         2.9          1.4         0.2 setosa  FALSE      
#> 10          4.9         3.1          1.5         0.1 setosa  FALSE      
#> # … with 140 more rows, and 1 more variable: Petal.Ratio <lgl>



# example 3
# generate logical vector based on two conditions with different conditions per input
gen_vars3 <- function(df, x) {
  
  l = switch(x,
            "Sepal" = 4,
            "Petal" = 2) 
  
  w = switch(x,
            "Sepal" = 3,
            "Petal" = 1) 
  
  mutate(df,
         "{x}.Ratio" := case_when( 
           !! sym(paste0(x, ".Length")) >= l & !! sym(paste0(x, ".Width")) <= w ~ T,
           T ~ F
         )
  )
}

iris %>% 
  reduce(c("Sepal", "Petal"), gen_vars3, .init = .)
#> # A tibble: 150 x 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <lgl>      
#>  1          5.1         3.5          1.4         0.2 setosa  FALSE      
#>  2          4.9         3            1.4         0.2 setosa  TRUE       
#>  3          4.7         3.2          1.3         0.2 setosa  FALSE      
#>  4          4.6         3.1          1.5         0.2 setosa  FALSE      
#>  5          5           3.6          1.4         0.2 setosa  FALSE      
#>  6          5.4         3.9          1.7         0.4 setosa  FALSE      
#>  7          4.6         3.4          1.4         0.3 setosa  FALSE      
#>  8          5           3.4          1.5         0.2 setosa  FALSE      
#>  9          4.4         2.9          1.4         0.2 setosa  TRUE       
#> 10          4.9         3.1          1.5         0.1 setosa  FALSE      
#> # … with 140 more rows, and 1 more variable: Petal.Ratio <lgl>

Created on 2020-02-05 by the reprex package (v0.3.0)

Feature wishlist:

  1. It would be great to have this functionality in dplyr as a function (for example, mutate_over).
  2. Such a function could provide further regex helpers, which would help extract strings from a tibbles colnames.
  3. Not a direct dplyr request, but it would be great to make the!! sym less verbose, maybe by extending the functionality of the curly-curly operator or by introducing another similar operator.
@romainfrancois
Copy link
Member

I don't think this belongs to dplyr, but will let @hadley follow up. Another way to approach this would be to parse() expressions made up with glue() and then !!! into mutate() but this has its own catches.

library(dplyr, warn.conflicts = FALSE)

prodz <- function(prefixes) {
  xps <- purrr::map(prefixes, function(prefix) {
    rlang::parse_expr(glue::glue("{prefix}.Width * {prefix}.Length"))  
  })
  rlang::set_names(xps, prefixes)
}
prodz(c("Sepal", "Petal"))
#> $Sepal
#> Sepal.Width * Sepal.Length
#> 
#> $Petal
#> Petal.Width * Petal.Length

iris %>% 
  as_tibble() %>% 
  mutate(!!!prodz(c("Sepal", "Petal")))
#> # A tibble: 150 x 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl> <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa   17.8 0.280
#>  2          4.9         3            1.4         0.2 setosa   14.7 0.280
#>  3          4.7         3.2          1.3         0.2 setosa   15.0 0.26 
#>  4          4.6         3.1          1.5         0.2 setosa   14.3 0.3  
#>  5          5           3.6          1.4         0.2 setosa   18   0.280
#>  6          5.4         3.9          1.7         0.4 setosa   21.1 0.68 
#>  7          4.6         3.4          1.4         0.3 setosa   15.6 0.42 
#>  8          5           3.4          1.5         0.2 setosa   17   0.3  
#>  9          4.4         2.9          1.4         0.2 setosa   12.8 0.280
#> 10          4.9         3.1          1.5         0.1 setosa   15.2 0.15 
#> # … with 140 more rows

Created on 2020-02-10 by the reprex package (v0.3.0.9000)

@TimTeaFan
Copy link
Author

TimTeaFan commented Feb 10, 2020

Thank you for your follow-up. Let me elaborate a bit more on my motivation for this request/issue:

  1. I consider it to be a common data-wrangling operation …
  2. … for which there is no easy accessible tidyverse way (only more or less good work-arounds).
  3. I know many people who like the tidyverse, but who fall back to base R/ for loops with eval/parse for this kind of operation, because of 2.
  4. And while 3 is anecdotal evidence, I think it wouldn’t be that complicated to integrate this functionality in dplyr or some other tidyverse package.
  5. At the heart of the issue is that:
    1. mutate_at and summarise_at can only work on existing variables …
    2. and they do not allow to refer to the variable name (~ .x only refers to a variables values).
  6. I think 5 also holds true for the newer across(). Maybe a solution would be a similar function like across (e.g. over()) which does not select variables, but string parts contained in variables and allows the functions in fns to access their names (just as an idea).

@TimTeaFan
Copy link
Author

Here is a (very) quick workaround shot at a function called over() equivalent to across() just to give an example of what I had in mind in terms of look and feel. However, I have not yet thought about how to make it work with more complex cases and also with summarise and other verbs etc.

library(tidyverse)

over <- function(str, xps = NULL) {
  mask <- dplyr:::peek_mask()
  data <- mask$full_data()
  if (is.null(xps)) {
    data
  }
  else if (is.character(xps)) {
  prodz <- function(str) {
    xps <- purrr::map(str, function(.x) {
      rlang::parse_expr(glue::glue(xps))  
    })
   rlang::set_names(xps, str)
  }
  mutate(data, !!!prodz(str))
  }
  else {
    abort("`xps` must be NULL, a function, a formula, or a named list of functions/formulas")
  }
}

iris %>% 
  as_tibble() %>% 
  mutate(over(c("Sepal", "Petal"), "{.x}.Width * {.x}.Length"))
#> # A tibble: 150 x 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl> <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa   17.8 0.280
#>  2          4.9         3            1.4         0.2 setosa   14.7 0.280
#>  3          4.7         3.2          1.3         0.2 setosa   15.0 0.26 
#>  4          4.6         3.1          1.5         0.2 setosa   14.3 0.3  
#>  5          5           3.6          1.4         0.2 setosa   18   0.280
#>  6          5.4         3.9          1.7         0.4 setosa   21.1 0.68 
#>  7          4.6         3.4          1.4         0.3 setosa   15.6 0.42 
#>  8          5           3.4          1.5         0.2 setosa   17   0.3  
#>  9          4.4         2.9          1.4         0.2 setosa   12.8 0.280
#> 10          4.9         3.1          1.5         0.1 setosa   15.2 0.15 
#> # … with 140 more rows

Created on 2020-02-10 by the reprex package (v0.3.0)

@hadley
Copy link
Member

hadley commented Feb 11, 2020

Can you please try explaining what you want another way? I’m not currently understanding what you want.

@TimTeaFan
Copy link
Author

I’ll give it another try, although it might be repetitive in some parts:

A common data-wrangling operation is to create a new variable based on two or more other existing variables. In dplyr we often use mutate/case_when for that:

library(tidyverse)

iris_tbl <- 
  iris %>% 
  as_tibble()

# simple example of what we want to do repeatedly
iris_tbl %>%  
  mutate(Sepal.Ratio = case_when(
                          Sepal.Length / Sepal.Width > 2 ~ T,
                          T ~ F)
         )

Often, data-sets have a bunch of variables with similar patterns on which one wants to perform similar data-wrangling tasks. I often see people using the tidyverse either using repeated mutate/case_when calls or fall back to base R for this kind of task. A for loop is a standard way to approach this:

# standard for loop for creating many variables with similar patterns
for (var in c("Sepal", "Petal")) {
  
  iris_tbl[[paste0(var,".Ratio")]] <- ifelse(iris_tbl[[paste0(var,".Length")]] / iris_tbl[[paste0(var,".Width")]] > 2, T, F)
  
}

There are pipe-friendly tidyverse workarounds to tackle this kind of data-wrangling task, but they are less known. I found purrr::reduce a good helper function, romainfrancois proposed another possible solution above.

# tidyverse workaround: custom function + reduce

gen_vars <- function(df, x) {
  
  mutate(df,
         "{x}.Ratio" := case_when( 
           !! sym(paste0(x, ".Length")) / !! sym(paste0(x, ".Width")) > 2 ~ T,
           T ~ F
         )
  )
}

my_vars <- c("Petal", "Sepal")

iris_tbl %>% 
  reduce(my_vars, gen_vars, .init = .)  

For more complex cases these workarounds usually take three steps:

  1. create a custom function
  2. select string patterns
  3. apply the workaround

It would be great if dplyr had an official one-step method for this type of data-wrangling task. Originally, I suggested a scoped version of mutate called mutate_over:

# dplyr feature request:
# pseudo code (not working!) basic version
iris_tbl %>%  
  mutate_over(c("Sepal", "Petal"),
              "{x}.Ratio" := case_when( 
                  !! sym(paste0(x, ".Length")) / !! sym(paste0(x, ".Width")) > 2 ~ T,
                   T ~ F)
  )

Since the scoped dplyr verbs are about to retire, another helpful option would be a function similar to across (here called over()):

# or over as alternative to dplyr::across
# pseudo code - not working!
iris_tbl %>%  
  mutate(over(c("Sepal", "Petal"),
              "{x}.Ratio" := case_when( 
                !! sym(paste0(x, ".Length")) / !! sym(paste0(x, ".Width")) > 2 ~ T,
                T ~ F)
              )
  )

Ideally, such a function would further:

  1. provide regex helpers, which help extract string parts from a tibbles colnames
  2. use an advanced version of the curly-curly operator making the !! sym syntax less verbose.
# example with advanced functionality of the curly-curly operater and a regex helper
# pseudo code - not working!
iris_tbl %>%  
  mutate(over(suffix(".Length|.Width"), # <- fictional regex helper function 
              "{x}.Ratio" := case_when( 
                {{ glue(“{x}.Length“) }} / {{ glue(“{x}.Width“) }} > 2 ~ T,
                T ~ F)
              )
  )

@hadley
Copy link
Member

hadley commented Feb 12, 2020

That feels pretty special purpose to me — so I don’t think it belongs in dplyr, but it would be a great function for another package.

@hadley hadley closed this as completed Feb 12, 2020
@TimTeaFan
Copy link
Author

A short follow-up to this old issue:

I followed the advise and created an dplyr add-on package that helps with the problems described above (see here).

Since the scope of the package has widen somewhat, compared to the very special use case I presented above in this issue, I just wanted to report back and let the dplyr team know about the package and its major functions. Nevertheless, I still assume that only a fraction of dplyr user's would benefit from its functionality.

In a nutshell, the package builds a small family of functions around dplyr's across(). This function family helps to create columns by applying one or several functions to:

  • over() a vector
  • across2() two sets of columns
  • crossover() a set of columns and a vector

Some use cases would be:

creating many lags / leads of one variable

library(dplyr)
library(dplyover)

iris_tbl <- as_tibble(iris)

tibble(a = 1:25) %>%
  mutate(over(c(1:3),
              list(lag  = ~ lag(a, .x),
                   lead = ~ lead(a, .x)),
              .names = "a_{fn}{x}"))
#> # A tibble: 25 x 7
#>        a a_lag1 a_lead1 a_lag2 a_lead2 a_lag3 a_lead3
#>    <int>  <int>   <int>  <int>   <int>  <int>   <int>
#>  1     1     NA       2     NA       3     NA       4
#>  2     2      1       3     NA       4     NA       5
#>  3     3      2       4      1       5     NA       6
#>  4     4      3       5      2       6      1       7
#>  5     5      4       6      3       7      2       8
#>  6     6      5       7      4       8      3       9
#>  7     7      6       8      5       9      4      10
#>  8     8      7       9      6      10      5      11
#>  9     9      8      10      7      11      6      12
#> 10    10      9      11      8      12      7      13
#> # … with 15 more rows

creating many lags / leads of a set of variables

iris_tbl %>%
  transmute(
    crossover(starts_with("sepal"),
              1:5,
              list(lag = ~ lag(.x, .y)),
              .names = "{xcol}_{fn}{y}")) %>% 
  glimpse
#> Rows: 150
#> Columns: 10
#> $ Sepal.Length_lag1 <dbl> NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.…
#> $ Sepal.Length_lag2 <dbl> NA, NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4…
#> $ Sepal.Length_lag3 <dbl> NA, NA, NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0,…
#> $ Sepal.Length_lag4 <dbl> NA, NA, NA, NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, …
#> $ Sepal.Length_lag5 <dbl> NA, NA, NA, NA, NA, 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4…
#> $ Sepal.Width_lag1  <dbl> NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.…
#> $ Sepal.Width_lag2  <dbl> NA, NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9…
#> $ Sepal.Width_lag3  <dbl> NA, NA, NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4,…
#> $ Sepal.Width_lag4  <dbl> NA, NA, NA, NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, …
#> $ Sepal.Width_lag5  <dbl> NA, NA, NA, NA, NA, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3…

applying functions to variable pairs (two sets of variables)

iris_tbl %>%
  transmute(across2(ends_with("Length"),
                    ends_with("Width"),
                    .fns = list(product = ~ .x * .y,
                                sum = ~ .x + .y),
                    .names = "{pre}_{fn}",
                    .names_fn = tolower))
#> # A tibble: 150 x 4
#>    sepal_product sepal_sum petal_product petal_sum
#>            <dbl>     <dbl>         <dbl>     <dbl>
#>  1          17.8       8.6         0.280      1.60
#>  2          14.7       7.9         0.280      1.60
#>  3          15.0       7.9         0.26       1.5 
#>  4          14.3       7.7         0.3        1.7 
#>  5          18         8.6         0.280      1.60
#>  6          21.1       9.3         0.68       2.1 
#>  7          15.6       8           0.42       1.7 
#>  8          17         8.4         0.3        1.7 
#>  9          12.8       7.3         0.280      1.60
#> 10          15.2       8           0.15       1.6 
#> # … with 140 more rows

@deb193
Copy link

deb193 commented Feb 8, 2022

I want to suggest that wrangling longitudinal data could frequently use this functionality. Often there are variable sets with common wave-specific suffix, and new variables with the suffix are created from expressions of other variables with the common suffix. The accross2 function delivers this for 2 variables (Thanks @TimTeaFan). I have also used the method of across(var1-sufix) with a get() of some string function of var2-stem and cur_column()'s suffix (See: https://stackoverflow.com/questions/65543579/can-you-use-dplyr-across-to-iterate-across-pairs-of-columns)

I realize that I could pivot the data long by suffix, do the mutate once, and then pivot back to wide. However this seem inefficient if there are many columns not being pivoted, unless a select is done inside the mutate).

`df <- data.frame(cats1=1:5, dogs1=6:10, cats2=11:15, dogs2=16:20)
df

mutate(df, across(.cols=matches("cats"),
.fns= ~.x + get(str_c('dogs', str_extract(string = cur_column(),
pattern = "[0-9]$"
) # end str_extract
) # end str_c
), # end get
.names='total{str_extract(string=.col, pattern="[0-9]$")}')
) # end mutate
`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants