You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
library(dplyr)
library(arrow)
tbl1<-tibble::tibble(
a=1:3,
b= c("a", "b", NA),
d= c(letters[4:6])
)
tbl2<-tibble::tibble(
b= c("b", NA),
c= c("a should be 2", "a should be 3")
)
# Left join tibbles, NAs matched
left_join(tbl2, tbl1)
#> Joining with `by = join_by(b)`#> # A tibble: 2 × 4#> b c a d #> <chr> <chr> <int> <chr>#> 1 b a should be 2 2 e #> 2 <NA> a should be 3 3 f# Left join arrow table & tibble, NAs NOT matched
left_join(as_arrow_table(tbl2), tbl1) %>% collect()
#> # A tibble: 2 × 4#> b c a d #> <chr> <chr> <int> <chr>#> 1 b a should be 2 2 e #> 2 <NA> a should be 3 NA <NA># Left join arrow table & arrow table, NAs NOT matched
left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
#> # A tibble: 2 × 4#> b c a d #> <chr> <chr> <int> <chr>#> 1 b a should be 2 2 e #> 2 <NA> a should be 3 NA <NA>
The problem here is the question of "can we join on NA values in arrow?"
Not right now! But, here are a couple of workarounds. The first uses extra code, and the second passes the data to duckdb and back.
library(arrow)
library(dplyr)
tbl1<-tibble::tibble(
a=1:3,
b= c("a", "b", NA),
d= c(letters[4:6])
)
tbl2<-tibble::tibble(
b= c("b", NA),
c= c("a should be 2", "a should be 3")
)
as_arrow_table(tbl2) |># replace NAs in tbl2 with alternative value
mutate(b= ifelse(is.na(b), "temp_value", b)) |>
left_join(
as_arrow_table(tbl1) |># replace NAs in tbl1 with alternative value
mutate(b= ifelse(is.na(b), "temp_value", b))
) |># replace alternative value in results with NA
mutate(b= ifelse(b=="temp_value", NA, b)) |>
collect()
#> # A tibble: 2 × 4#> b c a d #> <chr> <chr> <int> <chr>#> 1 b a should be 2 2 e #> 2 <NA> a should be 3 3 f
Or with DuckDB:
tbl1_duckdb <- as_arrow_table(tbl1) |>
to_duckdb()
as_arrow_table(tbl2) |>
to_duckdb() |>
left_join(tbl1_duckdb, na_matches = "na") |>
collect()
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 3 f
Note that we have to pass in na_matches = "na" explicitly in the example there as the default value when working with duckdb/dbplyr is "never" - basically reflecting that in SQL we can't join on NULL (NA) values.
I also had a look to try to work out whether we can implement this in Arrow or not. dbplyr implements a function sql_expr_matches() which is what allows matching on NAs, and here's a snippet from it:
sql_expr_matches.DBIConnection <- function(con, x, y, ...) {
glue_sql2(
con,
"CASE WHEN ({x} = {y}) OR ({x} IS NULL AND {y} IS NULL) ",
"THEN 0 ",
"ELSE 1 ",
"END = 0"
)
}
The C++ changes in #11579may allow us to implement something like this in arrow - the unit tests in that PR certainly make it look feasible, though non-trivial.
Component(s)
R
The text was updated successfully, but these errors were encountered:
Describe the enhancement requested
As described in #14907 (comment). Reprex copied below.
The problem here is the question of "can we join on
NA
values in arrow?"Not right now! But, here are a couple of workarounds. The first uses extra code, and the second passes the data to duckdb and back.
Or with DuckDB:
Note that we have to pass in
na_matches = "na"
explicitly in the example there as the default value when working with duckdb/dbplyr is "never" - basically reflecting that in SQL we can't join on NULL (NA) values.I also had a look to try to work out whether we can implement this in Arrow or not. dbplyr implements a function
sql_expr_matches()
which is what allows matching on NAs, and here's a snippet from it:The C++ changes in #11579 may allow us to implement something like this in arrow - the unit tests in that PR certainly make it look feasible, though non-trivial.
Component(s)
R
The text was updated successfully, but these errors were encountered: