-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R] right_join() function does not produce the expected outcome #14907
Comments
Hi @abduazizR , thanks for reporting this! Just to check, which version of Arrow are you using? I understand that your data may be confidential, but is it possible to create a reproducible example using in-built datasets instead? |
Hi Nic, Thanks for responding. I am using the latest version of Arrow 10.0.1 In the example I posted, library(tidyverse)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(duckdb)
#> Loading required package: DBI
data1 <- tibble(
id = 1:100,
x = rnorm(100)
)
data2 <- tibble(
id = 52:69,
z = rbinom(18, size = 18, prob = 0.5)
)
data2_arrow <- data2 |> as_arrow_table()
# The weired right_join output
data2_arrow |>
right_join(data1) |> collect() |>
# The ids from data 1 are missing
tail(10)
#> # A tibble: 10 × 3
#> id z x
#> <int> <int> <dbl>
#> 1 NA NA -0.369
#> 2 NA NA 0.470
#> 3 NA NA -0.372
#> 4 NA NA 0.505
#> 5 NA NA 1.47
#> 6 NA NA -0.858
#> 7 NA NA -0.330
#> 8 NA NA 2.02
#> 9 NA NA 1.53
#> 10 NA NA -0.399
# The expected right_join output
data2 |>
right_join(data1) |>
tail(10)
#> Joining, by = "id"
#> # A tibble: 10 × 3
#> id z x
#> <int> <int> <dbl>
#> 1 91 NA -0.369
#> 2 92 NA 0.470
#> 3 93 NA -0.372
#> 4 94 NA 0.505
#> 5 95 NA 1.47
#> 6 96 NA -0.858
#> 7 97 NA -0.330
#> 8 98 NA 2.02
#> 9 99 NA 1.53
#> 10 100 NA -0.399 Created on 2022-12-11 with reprex v2.0.2 With my data, Thank you. |
Thanks for the reprex there, this looks similar to an issue we had back in ARROW-15838 though there has been work done to fix it - I'm suspecting we might need to do something extra to make this work for I'll take a look into what's going on in the example in the reprex, which hopefully should also fix the original problem you had, though it's not guaranteed - if you do find a reprex that captures that, feel free to add that here. I can confirm that I can reproduce the error from the reprex in the latest dev version of Arrow. |
Thanks for the response. Can I rely on |
@abduazizR It looks like that issue is restricted to library(arrow)
library(dplyr)
tbl <- tibble::tibble(x = 1:10, y = 1:10)
tbl2 <- tibble::tibble(y = 3:5, z = c("a", "b", "c"))
# right join using dplyr
tbl2 %>%
right_join(tbl)
#> Joining, by = "y"
#> # A tibble: 10 × 3
#> y z x
#> <int> <chr> <int>
#> 1 3 a 3
#> 2 4 b 4
#> 3 5 c 5
#> 4 1 <NA> 1
#> 5 2 <NA> 2
#> 6 6 <NA> 6
#> 7 7 <NA> 7
#> 8 8 <NA> 8
#> 9 9 <NA> 9
#> 10 10 <NA> 10
# right join using Arrow
tbl2 %>%
arrow_table() %>%
right_join(tbl) %>%
collect()
#> # A tibble: 10 × 3
#> y z x
#> <int> <chr> <int>
#> 1 3 a 3
#> 2 4 b 4
#> 3 5 c 5
#> 4 NA <NA> 1
#> 5 NA <NA> 2
#> 6 NA <NA> 6
#> 7 NA <NA> 7
#> 8 NA <NA> 8
#> 9 NA <NA> 9
#> 10 NA <NA> 10
# left join using dplyr
tbl %>%
left_join(tbl2)
#> Joining, by = "y"
#> # A tibble: 10 × 3
#> x y z
#> <int> <int> <chr>
#> 1 1 1 <NA>
#> 2 2 2 <NA>
#> 3 3 3 a
#> 4 4 4 b
#> 5 5 5 c
#> 6 6 6 <NA>
#> 7 7 7 <NA>
#> 8 8 8 <NA>
#> 9 9 9 <NA>
#> 10 10 10 <NA>
# left join using Arrow
tbl %>%
arrow_table() %>%
left_join(tbl2) %>%
collect()
#> # A tibble: 10 × 3
#> x y z
#> <int> <int> <chr>
#> 1 3 3 a
#> 2 4 4 b
#> 3 5 5 c
#> 4 1 1 <NA>
#> 5 2 2 <NA>
#> 6 6 6 <NA>
#> 7 7 7 <NA>
#> 8 8 8 <NA>
#> 9 9 9 <NA>
#> 10 10 10 <NA> |
…come (#15077) * Closes: #14907 Authored-by: Nic Crane <[email protected]> Signed-off-by: Will Jones <[email protected]>
…ed outcome (apache#15077) * Closes: apache#14907 Authored-by: Nic Crane <[email protected]> Signed-off-by: Will Jones <[email protected]>
…ed outcome (apache#15077) * Closes: apache#14907 Authored-by: Nic Crane <[email protected]> Signed-off-by: Will Jones <[email protected]>
Hello! I was wondering if this has been resolved as I'm still coming up against behaviour that differs from What appears to me to be the difference is that while Am I missing sth? library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
tbl1 <- tibble::tibble(
a = 1:3,
b = c("a", "b", NA),
d = c(letters[4:6])
)
tbl2 <- tibble::tibble(
b = c("b", NA),
c = c("a should be 2", "a should be 3")
)
# Left join tibbles, NAs matched
left_join(tbl2, tbl1)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 3 f
# Left join arrow table & tibble, NAs NOT matched
left_join(as_arrow_table(tbl2), tbl1) %>% collect()
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 NA <NA>
# Left join arrow table & arrow table, NAs NOT matched
left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 NA <NA>
# Same with right_join. It appears that the NAs are the problem as arrow doesn't seem to count them as a match
right_join(tbl1, tbl2)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#> a b d c
#> <int> <chr> <chr> <chr>
#> 1 2 b e a should be 2
#> 2 3 <NA> f a should be 3
right_join(as_arrow_table(tbl1), as_arrow_table(tbl2)) %>% collect()
#> # A tibble: 2 × 4
#> a b d c
#> <int> <chr> <chr> <chr>
#> 1 2 b e a should be 2
#> 2 NA <NA> <NA> a should be 3
right_join(as_arrow_table(tbl1), tbl2) %>% collect()
#> # A tibble: 2 × 4
#> a b d c
#> <int> <chr> <chr> <chr>
#> 1 NA <NA> <NA> a should be 3
#> 2 2 b e a should be 2 Created on 2023-09-26 with reprex v2.0.2 Session infosessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os macOS Ventura 13.5.2
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Athens
#> date 2023-09-26
#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> arrow * 12.0.1.1 2023-07-18 [1] CRAN (R 4.2.0)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.0)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0)
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.2.0)
#> dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.2.0)
#> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.0)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.0)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.2.0)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [3] CRAN (R 4.2.0)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.0)
#> rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
#> sessioninfo 1.2.2 2021-12-06 [3] CRAN (R 4.2.0)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.0)
#> vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.39 2023-04-20 [1] CRAN (R 4.2.0)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0)
#>
#> [1] /Users/Anna/Library/R/arm64/4.2/library
#> [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/site-library
#> [3] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ────────────────────────────────────────────────────────────────────────────── |
Hi @annakrystalli, the problem described in the original issue has been fixed, but the problem you are describing in your reprex there is something else. I've opened up #37902 as a feature request, and have added some more explanation there - happy to discuss further on that ticket |
Great! Thank you. |
Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I noticed something strange today when I was using arrow datasets. I cannot give a reproducible example but you can get the idea from the code below. I have
ccaei
as an arrow dataset. When I try to useright_join()
with an R tibble before usingcollect()
, it gives me wrong numbers (the number of distinctENROLID
is less than that present inoutpatients
). I get the correct number when I useright_join()
after usingcollect()
, although this is computationally inefficient. Could you help me with this?This gives a really weird number
This makes sense
Not sure where the mistake came from.
Component(s)
R
The text was updated successfully, but these errors were encountered: