Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17252: [R] Intermittent valgrind failure #13773

Merged
merged 8 commits into from
Aug 9, 2022

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Aug 2, 2022

This PR fixes intermittent leaks that occur after one of the changes from ARROW-16444: when we drain the RecordBatchReader that is emitted from the plan too quickly, it seems, some parts of the plan can leak (I don't know why this happens).

I tried removing various pieces of the RunWithCapturedR() changes (see #13746) but the only thing that removes the errors completely is draining the resulting RecordBatchReader from R (i.e., reader$read_table()) instead of in C++ (i.e., reader->ToTable()). Unfortunately, for user-defined functions to work in a plan we need a C++ level reader->ToTable(). I took the approach here of disabling the C++ level read by default, requiring a user to opt in to the version of collect() that works with a UDF. It's not ideal, but definitely safer (and more clearly marks the user-defined function behaviour as experimental).

I was able to replicate the original leaks but they are few and far between...our tests just happen to create and destroy many, many exec plans and something about the CI environment seems to trigger these more reliably (although the errors don't always occur there, either). Most of the leaks are small but there were some instances where an entire Table leaked.

@github-actions
Copy link

github-actions bot commented Aug 2, 2022

@github-actions
Copy link

github-actions bot commented Aug 2, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@paleolimbot
Copy link
Member Author

@github-actions crossbow submit test-r-linux-valgrind

@github-actions
Copy link

github-actions bot commented Aug 2, 2022

Revision: 3647aa3

Submitted crossbow builds: ursacomputing/crossbow @ actions-de212b2c40

Task Status
test-r-linux-valgrind Azure

@paleolimbot paleolimbot marked this pull request as ready for review August 2, 2022 11:43
r/R/compute.R Outdated
#' as_arrow_table(mtcars) %>%
#' transmute(mpg, mpg_predicted = mtcars_predict_mpg(disp, cyl)) %>%
#' collect() %>%
#' head()
#' Sys.unsetenv("R_ARROW_COLLECT_WITH_UDF")
#'
register_scalar_function <- function(name, fun, in_type, out_type,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you put Sys.setenv(R_ARROW_COLLECT_WITH_UDF = "true") inside of register_scalar_function()? You're already opting-in to UDFs by calling this function, and there's no reason you'd want to call this but not have COLLECT_WITH_UDF working.

dplyr::mutate(fun_result = times_32(value)) %>%
dplyr::collect() %>%
dplyr::arrange(row_num)
withr::with_envvar(list(R_ARROW_COLLECT_WITH_UDF = "true"), {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to remove this now

@@ -209,6 +213,7 @@ test_that("user-defined functions work during multi-threaded execution", {
skip_if_not(CanRunWithCapturedR())
skip_if_not_available("dataset")
# Snappy has a UBSan issue: https://github.com/google/snappy/pull/148
# TODO(ARROW-17178): remove when user-defined function execution is stabilized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the skip is for snappy ubsan too, so both need to be resolved before we can unskip

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(See also ARROW-17283)

dplyr::collect(),
tibble::tibble(a = 1L, b = 32.0)
)
withr::with_envvar(list(R_ARROW_COLLECT_WITH_UDF = "true"), {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can revert this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we? To avoid the leaks for any subsequent tests we need the variable to not be "true".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking you can revert this because you've already called register_scalar_function() so the env var will be set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! with_envvar() was definitely not doing the right thing here (which is hopefully why the last valgrind check failed). I fixed it (I think) and requested another check!

r/R/compute.R Outdated Show resolved Hide resolved
@paleolimbot
Copy link
Member Author

(See #13779 and #13780 for some other potential fixes, both have their checks pending)

paleolimbot and others added 2 commits August 2, 2022 14:40
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the one hand, I can see why there might be some cases an extra collect is needed, to avoid valgrind errors. On the other hand:

  • The valgrind errors you are receiving are not the errors I would expect (I would expect direct leaks of scan-related objects).
  • I'm not entirely sure how this would be related to UDFs.

If this workaround fixes things for a little while then so be it. However, I'd really like to know exactly what is going on. You mentioned you had gotten better at reproducing. Could you share some kind of minimal(ish?) reproducer?

@paleolimbot
Copy link
Member Author

If this workaround fixes things for a little while then so be it.

I agree that this is a temporary workaround...I don't think anything all that insidious is going on that affects normal usage, but I also don't want us to get an angry CRAN note about valgrind that attracts attention to anything else we're doing.

I'd really like to know exactly what is going on.

Slight progress there: I tried waiting for the thread pools to finish before unloading the package (#13779) and that seems to remove the errors as well, although it's a bit of a hack in its own way (getting the IO thread pool is not exported in the public headers). That seems consistent with the shutting down of the thread pools leaking something?

Could you share some kind of minimal(ish?) reproducer?

From the arrow/r directory, echo 'devtools::test()' | R --no-save -d "valgrind --tool=memcheck --leak-check=full". Not very minimal, but an improvement over the 5 hours it takes the crossbow job. You can run fewer test files using devtools::test(filter = "some_regex") but I was never able to get an error using a few obvious filters ("dplyr", for example). I wonder if the CI just runs threads really really slowly which is why leaks show up more frequently then.

I'm not entirely sure how this would be related to UDFs.

The error isn't, I don't think, but was exposed by a change that we needed to make UDFs work (we need to evaluate the entire exec plan within one RunWithCapturedR(), which meant calling reader->ToTable() instead of sending the RecordBatchReader to R, then converting it to a table there). Before this PR, all exec plan results were routed through a C++-level reader->ToTable() unless an R-level RecordBatchReader was explicitly requested; after this PR, all exec plans go through an R-level RecordBatchReader unless somebody actually registers a user-defined function.

@paleolimbot
Copy link
Member Author

@github-actions crossbow submit test-r-linux-valgrind

@github-actions
Copy link

github-actions bot commented Aug 4, 2022

Revision: 3b71868

Submitted crossbow builds: ursacomputing/crossbow @ actions-511c82c07f

Task Status
test-r-linux-valgrind Azure

@paleolimbot
Copy link
Member Author

The last failure here is disappointing and leaves me stumped as to which change introduced the problem. I will spend some time tomorrow trying to reproduce this so that it's maybe possible to diagnose.

@paleolimbot
Copy link
Member Author

@github-actions crossbow submit test-r-linux-valgrind

@github-actions
Copy link

github-actions bot commented Aug 7, 2022

Revision: 2efade7

Submitted crossbow builds: ursacomputing/crossbow @ actions-d22fe5d8c4

Task Status
test-r-linux-valgrind Azure

@@ -143,6 +148,9 @@ test_that("arrow_scalar_function() with bad return type errors", {
call_function("times_32_bad_return_type_scalar", Array$create(1L)),
"Expected return Array or Scalar with type 'double'"
)

# TODO(ARROW-17178) remove the need for this!
Sys.unsetenv("R_ARROW_COLLECT_WITH_UDF")
})

test_that("register_user_defined_function() can register multiple kernels", {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test_that("register_user_defined_function() can register multiple kernels", {
test_that("register_scalar_function() can register multiple kernels", {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@@ -109,6 +112,8 @@ test_that("register_scalar_function() adds a compute function to the registry",
dplyr::collect(),
tibble::tibble(a = 1L, b = 32.0)
)

Sys.unsetenv("R_ARROW_COLLECT_WITH_UDF")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NBD but should this more properly go inside of on.exit above? like

on.exit({
  unregister_binding("times_32", update_cache = TRUE)
  # TODO(ARROW-17178) remove the need for this!
  Sys.unsetenv("R_ARROW_COLLECT_WITH_UDF")
})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

nealrichardson added a commit to nealrichardson/arrow that referenced this pull request Aug 7, 2022
@paleolimbot
Copy link
Member Author

@github-actions crossbow submit test-r-linux-valgrind

@github-actions
Copy link

github-actions bot commented Aug 7, 2022

Revision: 37e6e2e

Submitted crossbow builds: ursacomputing/crossbow @ actions-d72df52d33

Task Status
test-r-linux-valgrind Azure

@nealrichardson
Copy link
Member

@paleolimbot is this good to merge now?

@paleolimbot
Copy link
Member Author

Yes! Sorry, forgot to check the valgrind result.

@ursabot
Copy link

ursabot commented Aug 9, 2022

Benchmark runs are scheduled for baseline = 3148884 and contender = 7448322. 7448322 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.54% ⬆️0.0%] test-mac-arm
[Finished ⬇️1.09% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7448322e ec2-t3-xlarge-us-east-2
[Finished] 7448322e test-mac-arm
[Finished] 7448322e ursa-i9-9960x
[Finished] 7448322e ursa-thinkcentre-m75q
[Finished] 3148884e ec2-t3-xlarge-us-east-2
[Failed] 3148884e test-mac-arm
[Finished] 3148884e ursa-i9-9960x
[Finished] 3148884e ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Aug 9, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants