Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve classes in pmap() #554

Merged
merged 1 commit into from
Nov 15, 2018
Merged

Preserve classes in pmap() #554

merged 1 commit into from
Nov 15, 2018

Conversation

mikmart
Copy link
Contributor

@mikmart mikmart commented Sep 26, 2018

Following this suggestion, changed the calls constructed by pmap() to use two double brackets for indexing rather than a single double bracket with a vector argument. This retains class for S3 atomic vector types (whose [[ method preserves the class) in the input. Fixes #358.

However, list-based S3 classes, e.g. POSIXlt still fail.

@mungojam
Copy link

Does this cover pwalk too?

@mikmart
Copy link
Contributor Author

mikmart commented Sep 27, 2018

@mungojam not as it is, as (I think) the problem with pwalk() comes from the implementation of transpose(). But maybe this should, I'll have a look.

@mikmart
Copy link
Contributor Author

mikmart commented Sep 27, 2018

Okay I don't have any ideas on fixing transpose() and thus pwalk() directly. However, if we just rewrite pwalk() using pmap() (latest commit) it should also be fixed.

@hadley do you think it's OK to change pwalk() to call pmap()? Looks like they have been separate from the beginning but I don't see why.

src/map.c Outdated
@@ -174,7 +174,7 @@ SEXP pmap_impl(SEXP env, SEXP l_name_, SEXP f_name_, SEXP type_) {
SEXP i = Rf_install("i");
SEXP one = PROTECT(Rf_ScalarInteger(1));

// Construct call like f(.x[[c(1, i)]], .x[[c(2, i)]], ...)
// Construct call like f(.x[[1]][[i]], .x[[2]][[i]], ...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here saying that x[[c(1, i)]] is buggy for S3 vectors

@@ -60,3 +60,17 @@ test_that("pmap on data frames performs rowwise operations", {
expect_is(pmap_chr(mtcars2, paste), "character")
expect_is(pmap_raw(mtcars2, function(mpg, cyl) as.raw(cyl)), "raw")
})

test_that("preserves S3 class", {
fctr <- factor("a")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you only need to test a single S3 class here, since the other options don't cause the tests to exercise any more code.

@hadley
Copy link
Member

hadley commented Sep 27, 2018

The downside of using the same code is that it makes pwalk() less efficient than it could be, because it has to initialise and store the results of each call, even when that isn't necessary. But I could probably be persuaded if you did a little benchmarking showing that had hardly any affect.

OTOH I think transpose() will need a complete rewrite, once I've fully internalised the lessons from vctrs, so it's also ok to leave as is for now, and we'll fix more holistically later.

@mikmart
Copy link
Contributor Author

mikmart commented Sep 27, 2018

Oh yeah I see now, thanks for the explanation! I'll do some benchmarks to see what the hit would be like.

@mikmart
Copy link
Contributor Author

mikmart commented Sep 28, 2018

@hadley here's one benchmark (with a contrived use-case). Looks like using pmap() is actually a little faster:

library(purrr)

pwalk2 <- function(.l, .f, ...) {
  pmap(.l, .f, ...)
  invisible(.l)
}

res <- bench::press(
  i = 0:6,
  {
    args <- list(n = rep(10^(6 - i), 10^i))
    bench::mark(
      pwalk(args, runif),
      pwalk2(args, runif)
    )
  }
)

res
#> # A tibble: 14 x 15
#>    expression     i      min     mean   median      max `itr/sec` mem_alloc
#>  * <chr>      <int> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#>  1 pwalk(arg~     0  29.11ms  29.77ms  29.39ms  32.86ms    33.6      7.63MB
#>  2 pwalk2(ar~     0  29.14ms  31.42ms  29.52ms  57.91ms    31.8      7.63MB
#>  3 pwalk(arg~     1  29.19ms  30.18ms  30.28ms  33.13ms    33.1      7.65MB
#>  4 pwalk2(ar~     1  28.62ms  29.35ms  29.02ms  31.57ms    34.1      7.65MB
#>  5 pwalk(arg~     2   31.6ms  32.55ms  32.02ms  36.14ms    30.7      7.88MB
#>  6 pwalk2(ar~     2  28.64ms  29.25ms  28.94ms  31.24ms    34.2      7.88MB
#>  7 pwalk(arg~     3  32.46ms  33.03ms  32.81ms  35.63ms    30.3     10.12MB
#>  8 pwalk2(ar~     3  30.42ms  31.07ms  30.93ms  32.99ms    32.2     10.12MB
#>  9 pwalk(arg~     4  61.77ms  62.77ms  62.77ms  63.77ms    15.9     32.54MB
#> 10 pwalk2(ar~     4  47.58ms  49.49ms  48.24ms  56.41ms    20.2      32.5MB
#> 11 pwalk(arg~     5 560.56ms 560.56ms 560.56ms 560.56ms     1.78   244.52MB
#> 12 pwalk2(ar~     5 233.72ms 247.71ms 252.81ms  256.6ms     4.04   244.14MB
#> 13 pwalk(arg~     6    4.74s    4.74s    4.74s    4.74s     0.211    2.39GB
#> 14 pwalk2(ar~     6    2.36s    2.36s    2.36s    2.36s     0.423    2.38GB
#> # ... with 7 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>,
#> #   result <list>, memory <list>, time <list>, gc <list>

Real use-cases would probably include I/O operations, where I would imagine the time taken to interact with the output device would dominate the results anyway; so I didn't continue with testing as it looks like the pmap() implementation is noticeably faster in this simple case already.

(I have to say though, I find the memory allocation results quite strange.. Considering that each run has a fixed number (1e6) of doubles being created. The larger lists with larger i of course account for something, but a 300x increase..?)

@mikmart
Copy link
Contributor Author

mikmart commented Sep 28, 2018

Okay I also tried some I/O by saving plots. Very little difference here, too:

library(purrr)

pwalk2 <- function(.l, .f, ...) {
  pmap(.l, .f, ...)
  invisible(.l)
}

f <- function(file, n) {
  pdf(file)
  plot(runif(n))
  on.exit({dev.off(); unlink(file)})
}

bench::press(
  i = 0:3,
  {
    n <- rep(10^(4 - i), 10^i)
    tf <- map_chr(seq_along(n), ~ tempfile())
    args <- list(file = tf, n = n)
    bench::mark(
      pwalk(args, f),
      pwalk2(args, f)
    )
  }
)
#> Running with:
#>       i
#> 1     0
#> 2     1
#> 3     2
#> 4     3
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 8 x 11
#>   expression     i      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>      <int> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 pwalk(arg~     0  35.89ms  40.62ms  38.12ms   57.5ms    24.6      1.56MB
#> 2 pwalk2(ar~     0  34.39ms  36.94ms  35.98ms  41.56ms    27.1    807.95KB
#> 3 pwalk(arg~     1  53.31ms  58.11ms     58ms  63.17ms    17.2      1.41MB
#> 4 pwalk2(ar~     1  52.56ms  54.72ms  54.18ms  60.73ms    18.3      1.41MB
#> 5 pwalk(arg~     2 299.54ms 299.54ms 299.54ms 299.54ms     3.34     7.88MB
#> 6 pwalk2(ar~     2  280.8ms  280.8ms  280.8ms  280.8ms     3.56     7.88MB
#> 7 pwalk(arg~     3     2.4s     2.4s     2.4s     2.4s     0.416   71.37MB
#> 8 pwalk2(ar~     3    2.43s    2.43s    2.43s    2.43s     0.412   71.36MB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

Created on 2018-09-28 by the reprex package (v0.2.0.9000).

@hadley
Copy link
Member

hadley commented Sep 28, 2018

Ok, that's compelling!

Do you want to change the walk2() and walk() to wrappers around the map functions in order to be consistent?

@mikmart
Copy link
Contributor Author

mikmart commented Sep 28, 2018

Absolutely. Should be good now.

@lionel-
Copy link
Member

lionel- commented Sep 28, 2018

This will make walk() use more memory when the function produces large objects, but for () loops can be used in that case as well, or the function can be changed to avoid returning the large object.

@mikmart
Copy link
Contributor Author

mikmart commented Sep 28, 2018

Ah yeah the benchmark memory allocation numbers don't tell the whole story here as they don't show peak memory demand, just the total allocated.

@cderv cderv mentioned this pull request Oct 9, 2018
@mikmart
Copy link
Contributor Author

mikmart commented Nov 4, 2018

Perhaps the walk() changes should be decoupled from this PR? I could revert this to just include the pmap() fix.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the benefits of using the same code for pwalk() and pmap() outweight the small downside.

Constructing a call like .l[[c(j, i)]] drops attributes while .l[[j]][[i]] preserves them.

Write pwalk() and walk() in terms of pmap() and map()
@lionel- lionel- merged commit fe5ee0c into tidyverse:master Nov 15, 2018
@lionel-
Copy link
Member

lionel- commented Nov 15, 2018

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants