-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-17252: [R] Intermittent valgrind failure #13773
Conversation
|
@github-actions crossbow submit test-r-linux-valgrind |
Revision: 3647aa3 Submitted crossbow builds: ursacomputing/crossbow @ actions-de212b2c40
|
r/R/compute.R
Outdated
#' as_arrow_table(mtcars) %>% | ||
#' transmute(mpg, mpg_predicted = mtcars_predict_mpg(disp, cyl)) %>% | ||
#' collect() %>% | ||
#' head() | ||
#' Sys.unsetenv("R_ARROW_COLLECT_WITH_UDF") | ||
#' | ||
register_scalar_function <- function(name, fun, in_type, out_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if you put Sys.setenv(R_ARROW_COLLECT_WITH_UDF = "true")
inside of register_scalar_function()
? You're already opting-in to UDFs by calling this function, and there's no reason you'd want to call this but not have COLLECT_WITH_UDF working.
r/tests/testthat/test-compute.R
Outdated
dplyr::mutate(fun_result = times_32(value)) %>% | ||
dplyr::collect() %>% | ||
dplyr::arrange(row_num) | ||
withr::with_envvar(list(R_ARROW_COLLECT_WITH_UDF = "true"), { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to remove this now
r/tests/testthat/test-compute.R
Outdated
@@ -209,6 +213,7 @@ test_that("user-defined functions work during multi-threaded execution", { | |||
skip_if_not(CanRunWithCapturedR()) | |||
skip_if_not_available("dataset") | |||
# Snappy has a UBSan issue: https://github.com/google/snappy/pull/148 | |||
# TODO(ARROW-17178): remove when user-defined function execution is stabilized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the skip is for snappy ubsan too, so both need to be resolved before we can unskip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(See also ARROW-17283)
r/tests/testthat/test-compute.R
Outdated
dplyr::collect(), | ||
tibble::tibble(a = 1L, b = 32.0) | ||
) | ||
withr::with_envvar(list(R_ARROW_COLLECT_WITH_UDF = "true"), { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can revert this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we? To avoid the leaks for any subsequent tests we need the variable to not be "true"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking you can revert this because you've already called register_scalar_function()
so the env var will be set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! with_envvar()
was definitely not doing the right thing here (which is hopefully why the last valgrind check failed). I fixed it (I think) and requested another check!
Co-authored-by: Neal Richardson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the one hand, I can see why there might be some cases an extra collect is needed, to avoid valgrind errors. On the other hand:
- The valgrind errors you are receiving are not the errors I would expect (I would expect direct leaks of scan-related objects).
- I'm not entirely sure how this would be related to UDFs.
If this workaround fixes things for a little while then so be it. However, I'd really like to know exactly what is going on. You mentioned you had gotten better at reproducing. Could you share some kind of minimal(ish?) reproducer?
I agree that this is a temporary workaround...I don't think anything all that insidious is going on that affects normal usage, but I also don't want us to get an angry CRAN note about valgrind that attracts attention to anything else we're doing.
Slight progress there: I tried waiting for the thread pools to finish before unloading the package (#13779) and that seems to remove the errors as well, although it's a bit of a hack in its own way (getting the IO thread pool is not exported in the public headers). That seems consistent with the shutting down of the thread pools leaking something?
From the arrow/r directory,
The error isn't, I don't think, but was exposed by a change that we needed to make UDFs work (we need to evaluate the entire exec plan within one |
@github-actions crossbow submit test-r-linux-valgrind |
Revision: 3b71868 Submitted crossbow builds: ursacomputing/crossbow @ actions-511c82c07f
|
The last failure here is disappointing and leaves me stumped as to which change introduced the problem. I will spend some time tomorrow trying to reproduce this so that it's maybe possible to diagnose. |
@github-actions crossbow submit test-r-linux-valgrind |
Revision: 2efade7 Submitted crossbow builds: ursacomputing/crossbow @ actions-d22fe5d8c4
|
r/tests/testthat/test-compute.R
Outdated
@@ -143,6 +148,9 @@ test_that("arrow_scalar_function() with bad return type errors", { | |||
call_function("times_32_bad_return_type_scalar", Array$create(1L)), | |||
"Expected return Array or Scalar with type 'double'" | |||
) | |||
|
|||
# TODO(ARROW-17178) remove the need for this! | |||
Sys.unsetenv("R_ARROW_COLLECT_WITH_UDF") | |||
}) | |||
|
|||
test_that("register_user_defined_function() can register multiple kernels", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_that("register_user_defined_function() can register multiple kernels", { | |
test_that("register_scalar_function() can register multiple kernels", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
r/tests/testthat/test-compute.R
Outdated
@@ -109,6 +112,8 @@ test_that("register_scalar_function() adds a compute function to the registry", | |||
dplyr::collect(), | |||
tibble::tibble(a = 1L, b = 32.0) | |||
) | |||
|
|||
Sys.unsetenv("R_ARROW_COLLECT_WITH_UDF") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NBD but should this more properly go inside of on.exit
above? like
on.exit({
unregister_binding("times_32", update_cache = TRUE)
# TODO(ARROW-17178) remove the need for this!
Sys.unsetenv("R_ARROW_COLLECT_WITH_UDF")
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
… previous function name
@github-actions crossbow submit test-r-linux-valgrind |
Revision: 37e6e2e Submitted crossbow builds: ursacomputing/crossbow @ actions-d72df52d33
|
@paleolimbot is this good to merge now? |
Yes! Sorry, forgot to check the valgrind result. |
Benchmark runs are scheduled for baseline = 3148884 and contender = 7448322. 7448322 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
This PR fixes intermittent leaks that occur after one of the changes from ARROW-16444: when we drain the
RecordBatchReader
that is emitted from the plan too quickly, it seems, some parts of the plan can leak (I don't know why this happens).I tried removing various pieces of the
RunWithCapturedR()
changes (see #13746) but the only thing that removes the errors completely is draining the resultingRecordBatchReader
from R (i.e.,reader$read_table()
) instead of in C++ (i.e.,reader->ToTable()
). Unfortunately, for user-defined functions to work in a plan we need a C++ levelreader->ToTable()
. I took the approach here of disabling the C++ level read by default, requiring a user to opt in to the version ofcollect()
that works with a UDF. It's not ideal, but definitely safer (and more clearly marks the user-defined function behaviour as experimental).I was able to replicate the original leaks but they are few and far between...our tests just happen to create and destroy many, many exec plans and something about the CI environment seems to trigger these more reliably (although the errors don't always occur there, either). Most of the leaks are small but there were some instances where an entire
Table
leaked.