Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Support joining using NA as join key #37902

Open
thisisnic opened this issue Sep 27, 2023 · 0 comments
Open

[R] Support joining using NA as join key #37902

thisisnic opened this issue Sep 27, 2023 · 0 comments

Comments

@thisisnic
Copy link
Member

Describe the enhancement requested

As described in #14907 (comment). Reprex copied below.

library(dplyr)
library(arrow)

tbl1 <- tibble::tibble(
  a = 1:3,
  b = c("a", "b", NA),
  d = c(letters[4:6])
)

tbl2 <- tibble::tibble(
  b = c("b", NA),
  c = c("a should be 2", "a should be 3")
)


# Left join tibbles, NAs matched
left_join(tbl2, tbl1)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3     3 f


# Left join arrow table & tibble, NAs NOT matched
left_join(as_arrow_table(tbl2), tbl1) %>% collect()
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3    NA <NA>


# Left join arrow table & arrow table, NAs NOT matched
left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3    NA <NA>

The problem here is the question of "can we join on NA values in arrow?"

Not right now! But, here are a couple of workarounds. The first uses extra code, and the second passes the data to duckdb and back.

library(arrow)
library(dplyr)

tbl1 <- tibble::tibble(
  a = 1:3,
  b = c("a", "b", NA),
  d = c(letters[4:6])
)

tbl2 <- tibble::tibble(
  b = c("b", NA),
  c = c("a should be 2", "a should be 3")
)

as_arrow_table(tbl2) |>
  # replace NAs in tbl2 with alternative value
  mutate(b = ifelse(is.na(b), "temp_value", b)) |>
  left_join(
    as_arrow_table(tbl1) |>
      # replace NAs in tbl1 with alternative value
      mutate(b = ifelse(is.na(b), "temp_value", b))
  ) |>
  # replace alternative value in results with NA
  mutate(b = ifelse(b == "temp_value", NA, b)) |>
  collect()
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3     3 f

Or with DuckDB:

tbl1_duckdb <- as_arrow_table(tbl1) |>
  to_duckdb()

as_arrow_table(tbl2) |>
  to_duckdb() |>
  left_join(tbl1_duckdb, na_matches = "na") |>
  collect()
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3     3 f

Note that we have to pass in na_matches = "na" explicitly in the example there as the default value when working with duckdb/dbplyr is "never" - basically reflecting that in SQL we can't join on NULL (NA) values.

I also had a look to try to work out whether we can implement this in Arrow or not. dbplyr implements a function sql_expr_matches() which is what allows matching on NAs, and here's a snippet from it:

sql_expr_matches.DBIConnection <- function(con, x, y, ...) {
  glue_sql2(
    con,
    "CASE WHEN ({x} = {y}) OR ({x} IS NULL AND {y} IS NULL) ",
    "THEN 0 ",
    "ELSE 1 ",
    "END = 0"
  )
}

The C++ changes in #11579 may allow us to implement something like this in arrow - the unit tests in that PR certainly make it look feasible, though non-trivial.

Component(s)

R

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant