Smart handling of rowwise vs. vectorized operations. #1380

Closed
abalter opened this issue Oct 25, 2023 · 2 comments
Labels
reprex needs a minimal reproducible example

Comments

abalter commented Oct 25, 2023

In the following reprex I show a very simple SQL query that I can't figure out how to replicate using dbplyr translation.

The operation is to calculate a random number based on the value in an existing column. Because there is no vectorized function for this in R, using dplyr I need to either call map or use rowwise. However, neither of these translates properly to SQL.

This is a general issue in that SQL is naturally rowwise, while R tries to be vectorized as much as possible for efficiency.
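For reference, here is a minimal sketch of one way to write the local computation without `rowwise()` or `map()`. It assumes that scaling a uniform draw by `A` is an acceptable substitute for `sample(1:A, 1)`; that substitution is an assumption of this sketch and is not part of the original report.

```r
library(dplyr)

df <- tibble(A = 1:10)

# runif(n()) draws one uniform value per row, so multiplying by A
# gives each row its own upper bound.
# floor(u * A) + 1 maps u in (0, 1) to an integer in 1..A.
res <- df %>%
  mutate(B = as.integer(floor(runif(n()) * A)) + 1L)
```

Each `B` falls between 1 and the `A` of its own row, which is the same range that `sample(1:A, 1)` covers under `rowwise()`.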

library(dbplyr)
library(tidyverse)

### Create in-memory DB
con = dbConnect(RSQLite::SQLite(), location=":memory:")
#> Error in dbConnect(RSQLite::SQLite(), location = ":memory:"): could not find function "dbConnect"

### Create local table
df = tibble(A = 1:10)

### Write table to database
dbWriteTable(con, "tmp", df, overwrite=T)
#> Error in dbWriteTable(con, "tmp", df, overwrite = T): could not find function "dbWriteTable"

### Read lazy table from DB
dbf = tbl(con, "tmp")
#> Error in eval(expr, envir, enclos): object 'con' not found


### Perform operation in the database:
### Create random integer based on value in column A
dbGetQuery(con, "select A, abs(random() % A) as B from tmp")
#> Error in dbGetQuery(con, "select A, abs(random() % A) as B from tmp"): could not find function "dbGetQuery"

### Try doing the same with local table
df %>% mutate(B = sample(1:A, 1))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `B = sample(1:A, 1)`.
#> Caused by warning in `1:A`:
#> ! numerical expression has 10 elements: only the first used
#> # A tibble: 10 × 2
#>        A     B
#>    <int> <int>
#>  1     1     1
#>  2     2     1
#>  3     3     1
#>  4     4     1
#>  5     5     1
#>  6     6     1
#>  7     7     1
#>  8     8     1
#>  9     9     1
#> 10    10     1
df %>% mutate(across(A, ~sample(1:.x, 1), .names="B"))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(A, ~sample(1:.x, 1), .names = "B")`.
#> Caused by warning in `1:A`:
#> ! numerical expression has 10 elements: only the first used
#> # A tibble: 10 × 2
#>        A     B
#>    <int> <int>
#>  1     1     1
#>  2     2     1
#>  3     3     1
#>  4     4     1
#>  5     5     1
#>  6     6     1
#>  7     7     1
#>  8     8     1
#>  9     9     1
#> 10    10     1
df %>% rowwise() %>% mutate(B = sample(1:A, 1))
#> # A tibble: 10 × 2
#> # Rowwise: 
#>        A     B
#>    <int> <int>
#>  1     1     1
#>  2     2     1
#>  3     3     3
#>  4     4     4
#>  5     5     5
#>  6     6     1
#>  7     7     4
#>  8     8     6
#>  9     9     2
#> 10    10     5
df %>% mutate(B = map_dbl(A, ~sample(1:.x, 1)))
#> # A tibble: 10 × 2
#>        A     B
#>    <int> <dbl>
#>  1     1     1
#>  2     2     2
#>  3     3     1
#>  4     4     4
#>  5     5     2
#>  6     6     1
#>  7     7     1
#>  8     8     5
#>  9     9     3
#> 10    10     4

### Try doing the same with lazy table
dbf %>% mutate(B = sample(1:A, 1)) %>% show_query()
#> Error in eval(expr, envir, enclos): object 'dbf' not found
dbf %>% mutate(across(A, ~sample(1:.x, 1), .names="B")) %>% show_query()
#> Error in eval(expr, envir, enclos): object 'dbf' not found
dbf %>% rowwise() %>% mutate(B = sample(1:A, 1)) %>% show_query()
#> Error in eval(expr, envir, enclos): object 'dbf' not found
dbf %>% mutate(B = map_dbl(A, ~sample(1:., 1))) %>% show_query()
#> Error in eval(expr, envir, enclos): object 'dbf' not found

Created on 2023-10-25 with reprex v2.0.2
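The "could not find function" errors in the output above arise because `dbConnect()`, `dbWriteTable()`, and `dbGetQuery()` come from DBI, which neither `library(dbplyr)` nor `library(tidyverse)` attaches. Below is a sketch of the same setup with DBI attached. It assumes RSQLite is installed. The `sql()` line is one possible way to get the same SQL out of dbplyr; it is a suggestion of this sketch, not something proposed in the thread.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# In-memory SQLite database (assumes the RSQLite package is installed)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

df <- tibble(A = 1:10)
dbWriteTable(con, "tmp", df, overwrite = TRUE)
dbf <- tbl(con, "tmp")

# Raw SQL from the report. In SQLite, random() returns a signed integer,
# so abs(random() % A) is an integer in 0..(A - 1).
raw <- dbGetQuery(con, "select A, abs(random() % A) as B from tmp")

# sql() passes the fragment through verbatim instead of translating R code
lazy <- dbf %>% mutate(B = sql("abs(random() % A)"))
show_query(lazy)
res <- collect(lazy)

dbDisconnect(con)
```

Note that the SQL expression yields values in 0..(A - 1), while `sample(1:A, 1)` yields values in 1..A, so the two are not exact equivalents.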

Quite honestly, my preference would be for dplyr to map non-vectorized operations by default. I'm having trouble thinking of a situation where the meaning isn't implicitly clear, though that doesn't mean there aren't any.

Alternatively, find a way to handle rowwise in translation so that the same code can be used for both local and lazy tables.

Or maybe in dplyr have a way to explicitly specify whether a given operation is to be done row-wise or column-wise. Consider this:

library(tidyverse)
tb = tibble(A = list(c(1,2), c(3,4)))
tb %>% rowwise() %>% mutate(B = sum(A))
#> # A tibble: 2 × 2
#> # Rowwise: 
#>   A             B
#>   <list>    <dbl>
#> 1 <dbl [2]>     3
#> 2 <dbl [2]>     7
tb %>% mutate(B = map_int(A, ~sum(.)))
#> # A tibble: 2 × 2
#>   A             B
#>   <list>    <int>
#> 1 <dbl [2]>     3
#> 2 <dbl [2]>     7

Created on 2023-10-25 with reprex v2.0.2

I think the following should produce the outputs shown:

tb %>% mutate(B = rowwise(sum(A)))
#> # A tibble: 2 × 2
#> # Rowwise: 
#>   A             B
#>   <list>    <dbl>
#> 1 <dbl [2]>     3
#> 2 <dbl [2]>     7
tb %>% summarize(s = colwise(sum(A)))
#> # A tibble: 1 × 1
#>   s        
#>   <list>   
#> 1 <dbl [2]>
tb %>% summarize(s = colwise(sum(A))) %>% pull(s)
#> [[1]]
#> [1] 4 6

There would be no question about how those are to be evaluated or translated to SQL.
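The `rowwise(...)` and `colwise(...)` expression wrappers above are proposed syntax; as far as I know, dplyr's existing `rowwise()` takes a data frame rather than an expression. As a rough sketch, the two proposed outputs can be reproduced with existing purrr functions, assuming the intended column-wise result is the element-wise sum of the list elements:

```r
library(dplyr)
library(purrr)

tb <- tibble(A = list(c(1, 2), c(3, 4)))

# "Row-wise": one sum per list element
row_res <- tb %>% mutate(B = map_dbl(A, sum))

# "Column-wise": add the list elements together element by element,
# then wrap the resulting vector in a list so it fits in one cell
col_res <- tb %>% summarize(s = list(reduce(A, `+`)))
col_res %>% pull(s)
```

This only covers the local case; it does not address how such operations would translate to SQL.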

hadley (Member) commented Nov 2, 2023

I'm having a hard time following your reprex as there seems to be a lot of content that's not germane to the problem. But let's start with a simple problem: what does abs(random() % A) do?

hadley added the reprex (needs a minimal reproducible example) label Nov 2, 2023

hadley (Member) commented Dec 21, 2023

I've closed this issue due to lack of requested reprex. If you still care about this bug, please open a new issue with a reprex.

hadley closed this as completed Dec 21, 2023