Speedup #90

lorenzwalthert · 2017-07-25T20:38:42Z

On an example file, the changes introduced with this PR speed up styler by up to 4x. For styling a whole package, the speed gains turn out to be much smaller, only about 2.5x. The main bottleneck is now tidyr::nest().

This PR is based on #78

codecov · 2017-07-25T20:57:15Z

Codecov Report

Merging #90 into master will increase coverage by 0.8%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master      #90     +/-   ##
=========================================
+ Coverage   91.47%   92.28%   +0.8%     
=========================================
  Files          17       17             
  Lines         610      596     -14     
=========================================
- Hits          558      550      -8     
+ Misses         52       46      -6

Impacted Files           Coverage Δ
R/rules-replacement.R    100% <100%> (ø)           ⬆️
R/nested.R               85.29% <100%> (ø)         ⬆️
R/visit.R                100% <100%> (ø)           ⬆️
R/parsed.R               97.72% <100%> (-0.1%)     ⬇️
R/unindent.R             96% <100%> (ø)            ⬆️
R/modify_pd.R            100% <100%> (+10.52%)     ⬆️
R/utils.R                100% <100%> (ø)           ⬆️
R/rules-spacing.R        93.87% <100%> (+1.02%)    ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 59de93f...c339fa8.

krlmlr

Looks good. Would you mind sharing the output of a relevant profvis run, e.g. on Rpubs?

krlmlr · 2017-07-25T22:09:38Z

R/modify_pd.R

  }

  pd %>%
    set_unindention_child(token = "')'", unindent_by = indent_by)
 }
 #' @rdname update_indention
 indent_curly <- function(pd, indent_by) {
-  indention_needed <- needs_indention(pd, token = "'{'")
+  opening <- which(pd$token == "'{'")
+  indention_needed <- needs_indention(pd, token = "'{'", opening[1])


Can we define a function get_indent_indices() instead that returns a (possibly empty) integer vector of positions that need to have indention added?

indent_indices <- get_indent_indices(pd, ...)
if (length(indent_indices) > 0L)
  pd$indent[indent_indices] <- pd$indent[indent_indices] + 2L

Also, the benchmarks suggest that this doesn't buy us anything.

Ok, sounds good. I think since dplyr::between() returns boolean values, we better call it compute_indent_flags() and do boolean subsetting instead of integer subsetting. Does that sound reasonable?

indent_round <- function(pd, indent_by) {
  indent_flags <- compute_indent_flags(pd, token = "'('")
  pd$indent[indent_flags] <- pd$indent[indent_flags] + indent_by
  pd %>%
    set_unindention_child(token = "')'", unindent_by = indent_by)
}

compute_indent_flags <- function(pd, token = "'('") {
  npd <- nrow(pd)
  opening <- which(pd$token == token)
  if (!needs_indention(pd, token, opening[1])) return()
  start <- opening + 1
  stop <- npd - 1
  between(seq_len(npd), start, stop)
}

It has only slightly worse performance (<1%) so I think we should do it. It also reduces code duplication.

Indices have the advantage that they can be checked for length zero, and that they only contain the positions that we care about. Essentially, compute_indent_indices <- function(...) which(compute_indent_flags(...)). I'm not sure about performance, because you need to allocate an extra integer vector.

Ok. I can try that. When I checked the profiling I just got the impression that which() is expensive.

I'm fine with flags if it works well enough.

Ok no, I just figured out that which() is not expensive. It was the comparison inside which(), which we can't avoid anyways. Also, always having a numerical vector is better than having NULL sometimes, so I use indices anyways.
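The trade-off discussed in this thread can be sketched with a toy parse table. This is a minimal illustration only: the two helpers below are simplified, hypothetical versions (the `needs_indention()` check is left out, and the table has just two columns), not the code that was merged.

```r
# Toy stand-in for a parse table; the real styler table has more columns.
pd <- data.frame(
  token = c("'('", "SYMBOL", "','", "SYMBOL", "')'"),
  indent = 0L,
  stringsAsFactors = FALSE
)

# Hypothetical helper: logical flags for the rows strictly between the first
# opening token and the last row.
compute_indent_flags <- function(pd, token = "'('") {
  npd <- nrow(pd)
  opening <- which(pd$token == token)
  if (length(opening) == 0L) return(rep(FALSE, npd))
  positions <- seq_len(npd)
  positions > opening[1] & positions < npd
}

# Hypothetical helper: the same information as integer positions.
compute_indent_indices <- function(pd, token = "'('") {
  which(compute_indent_flags(pd, token))
}

flags <- compute_indent_flags(pd)
indices <- compute_indent_indices(pd)

# Both forms select the same rows for the update.
by_flags <- pd
by_flags$indent[flags] <- by_flags$indent[flags] + 2L
by_indices <- pd
by_indices$indent[indices] <- by_indices$indent[indices] + 2L

# Without an opening token the index vector is empty, never NULL.
no_paren <- data.frame(
  token = c("SYMBOL", "SYMBOL"),
  indent = 0L,
  stringsAsFactors = FALSE
)
empty_indices <- compute_indent_indices(no_paren)
```

An empty integer vector can be used for subsetting and checked with `length()`, which is the property the indices variant relies on.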

krlmlr · 2017-07-25T22:13:04Z

R/modify_pd.R

@@ -127,6 +121,6 @@ token_is_multi_line <- function(pd) {
 #' @param pd_flat A flat parse table.
 #' @return A nested parse table.
 strip_eol_spaces <- function(pd_flat) {
-  pd_flat %>%
-    mutate(spaces = spaces * (lead(lag_newlines, default = 0) == 0))
+  pd_flat$spaces <- pd_flat$spaces * (lead(pd_flat$lag_newlines, default = 0) == 0)


The following pattern seems slightly easier to read:

idx <- which(...)
pd_flat$spaces[idx] <- 0L
pd_flat

Or flags instead of the (expensive) which?

idx <- lead(pd_flat$lag_newlines, default = 0) != 0
pd_flat$spaces[idx] <- 0
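As a rough check that the mask multiplication from the diff and the flag-based assignment suggested in the review give the same result, here is a small sketch on made-up data. `lead_one()` is a hypothetical base R stand-in for `dplyr::lead()` with `n = 1`, used only to keep the example dependency-free.

```r
# Hypothetical stand-in for dplyr::lead(x, n = 1, default = default).
lead_one <- function(x, default = NA) c(x[-1], default)

# Made-up flat parse data: the last token starts on a new line.
pd_flat <- data.frame(
  text = c("a", "<-", "1", "b"),
  spaces = c(1L, 1L, 3L, 0L),
  lag_newlines = c(0L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Variant from the diff: multiply by a TRUE/FALSE mask.
by_mask <- pd_flat$spaces *
  (lead_one(pd_flat$lag_newlines, default = 0) == 0)

# Variant from the review: flag tokens followed by a line break, then reset.
followed_by_newline <- lead_one(pd_flat$lag_newlines, default = 0) != 0
by_flags <- pd_flat$spaces
by_flags[followed_by_newline] <- 0L
```

Both variants set the trailing spaces of the token before the line break to zero and leave the other entries untouched.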

krlmlr · 2017-07-25T22:14:21Z

R/nested.R

@@ -29,10 +29,10 @@ compute_parse_data_nested <- function(text) {
    add_terminal_token_before() %>%
    add_terminal_token_after()

+  parse_data$child <- rep(list(NULL), length(parse_data$text))
+  parse_data$short <- substr(parse_data$text, 1, 5)


The short is optional and could be added to the "flat" parse data, too.

Do we still need short?

We don't use it further, no. It just helps when working interactively. Should we drop it?

If it helps we should keep it for now.

Moved it into tokenize().

krlmlr · 2017-07-25T22:17:34Z

R/nested.R

    left_join(pd_flat, ., by = "id")
 }

 #' @rdname add_token_terminal
 add_terminal_token_before <- function(pd_flat) {
-  pd_flat %>%
+  terminals <- pd_flat %>%


Maybe faster:

terminals <- which(pd_flat$terminals)
order <- order(pd_flat$line1, pd_flat$col1)[terminals]
data_frame(id = pd_flat$id[order], token_before = ...) %>%
  ...

Or:

terminals <- which(pd_flat$terminals)
order <- order(pd_flat$line1[terminals], pd_flat$col1[terminals])
data_frame(id = pd_flat$id[terminals][order], token_before = ...) %>%
  ...

I tried that but I felt since this function is only called once and it seems pretty inexpensive (10 ms out of 15'460 for the whole run, file R/nested.R), I left it as is, for better legibility. Or do you prefer the rearrangement anyways?

I didn't know that, I just noticed you changed it and assumed that performance matters here. Never mind.
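For reference, the subset-then-order idea from this thread can be tried on a toy table. This is only a sketch under assumptions: the column is called `terminal` here, the data are invented, and the `token_before` computation is left out.

```r
# Invented flat parse data; only the columns needed for ordering.
pd_flat <- data.frame(
  id = c(10L, 11L, 12L, 13L),
  line1 = c(1L, 2L, 1L, 1L),
  col1 = c(1L, 1L, 5L, 3L),
  terminal = c(FALSE, TRUE, TRUE, TRUE)
)

# Restrict to terminals first, then order only those rows by position.
terminals <- which(pd_flat$terminal)
ord <- order(pd_flat$line1[terminals], pd_flat$col1[terminals])
ordered_ids <- pd_flat$id[terminals][ord]
```

`ordered_ids` lists the terminal ids in source order (line first, then column).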

krlmlr · 2017-07-25T22:23:21Z

R/nested.R

  split <- pd_flat %>%
-    mutate_(internal = ~ (id %in% parent) | (parent <= 0)) %>%
-    nest_("data", names(pd_flat))
+    nest_("data", setdiff(names(pd_flat), "internal"))


You could try split_data <- split(pd_flat[...], pd_flat$internal) and use split_data instead of split$data below.

yes, that works well.
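A minimal sketch of the `split()` suggestion, using an invented four-row table rather than real parse data. The `internal` flag is computed as in the removed `mutate_()` call; everything else is an assumption for illustration.

```r
# Invented flat parse data: one top-level expression with three children.
pd_flat <- data.frame(
  id = c(1L, 2L, 3L, 4L),
  parent = c(0L, 1L, 1L, 1L),
  text = c("", "a", "+", "b"),
  stringsAsFactors = FALSE
)

# Internal rows are those that act as a parent or sit at the top level.
pd_flat$internal <- (pd_flat$id %in% pd_flat$parent) | (pd_flat$parent <= 0)

# split() returns a named list with one data frame per value of `internal`.
cols <- setdiff(names(pd_flat), "internal")
split_data <- split(pd_flat[cols], pd_flat$internal)

leaves <- split_data[["FALSE"]]
internal_nodes <- split_data[["TRUE"]]
```

The list elements are named after the values of the grouping variable converted to character, so the two groups are available as `"FALSE"` and `"TRUE"`.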

krlmlr · 2017-07-25T22:28:18Z

R/rules-spacing.R


-  non_comments <- pd %>%
-    filter(token != "COMMENT")
+  non_comments <-pd[pd$token != "COMMENT", ]

  comments <- comments %>%


A pipe with just one step?

Yes. Should we rather do the following?

extract(comments, text, ...)

krlmlr · 2017-07-25T22:29:01Z

R/rules-spacing.R

    arrange(line1, col1)
 }


 set_space_before_comments <- function(pd_flat) {
  comment_after <- pd_flat$token == "COMMENT"
+  if (all(!comment_after)) return(pd_flat)


!any() might be slightly faster (also elsewhere).

Ok, I can try that. I had it before, but then I felt it was less legible; I did not think in terms of speed.

Changed it, but it's not really faster.
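The two spellings are logically equivalent, which a quick check on a few hand-picked inputs (including the empty and `NA` cases) illustrates. The inputs are made up for this sketch.

```r
# Hand-picked logical vectors covering the usual edge cases.
cases <- list(
  some = c(FALSE, TRUE, FALSE),
  none = c(FALSE, FALSE, FALSE),
  empty = logical(0),
  with_na = c(FALSE, NA)
)

# Evaluate both spellings of "no element is TRUE" on every case.
via_all <- vapply(cases, function(x) all(!x), logical(1L))
via_any <- vapply(cases, function(x) !any(x), logical(1L))
```

`!any(x)` avoids allocating the negated copy of `x`, which is presumably why it was suggested, even though the difference was not measurable here.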

krlmlr · 2017-07-25T22:29:47Z

R/unindent.R

@@ -5,7 +5,7 @@
 #' @inheritParams unindent_child
 #' @importFrom purrr map
 set_unindention_child <- function(pd, token = "')'", unindent_by) {
-  if(all(pd$terminal) | all(pd$indent == 0)) return(pd)
+  if(all(pd$indent == 0) | all(pd$terminal) ) return(pd)


|| is slightly clearer, no performance difference here.

👍 Will try.
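To illustrate the difference between the two operators for scalar conditions, the sketch below records which side gets evaluated. The helper functions and the two-row table are hypothetical and exist only to make the evaluation order visible.

```r
pd <- data.frame(indent = c(0L, 0L), terminal = c(TRUE, FALSE))

# Log of which condition was evaluated, filled in by the helpers below.
evaluated <- character(0)
all_unindented <- function(pd) {
  evaluated <<- c(evaluated, "indent")
  all(pd$indent == 0)
}
all_terminal <- function(pd) {
  evaluated <<- c(evaluated, "terminal")
  all(pd$terminal)
}

# `||` stops as soon as the left-hand side is TRUE.
evaluated <- character(0)
short_circuit <- all_unindented(pd) || all_terminal(pd)
evaluated_short <- evaluated

# `|` always evaluates both sides.
evaluated <- character(0)
elementwise <- all_unindented(pd) | all_terminal(pd)
evaluated_both <- evaluated
```

With `||`, putting the more likely (or cheaper) condition first means the second one is often skipped entirely.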

krlmlr · 2017-07-25T22:30:56Z

R/utils.R

 rep_char <- function(char, times) {
-  lapply(times, rep.int, x = char) %>%
-    vapply(paste, collapse = "", character(1L))
+  map(times, rep.int, x = char) %>%


Can times be a vector here?

For the flat serialization, newlines_and_spaces() uses rep_char() with a vectorised times input and for setting spaces at the beginning of comments we do that. I think we can use map() in these two places. It speeds things up quite a bit (~15%). Thanks for the hint.
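A dependency-free sketch of a `rep_char()` that accepts a vectorised `times`, written with base R only. It is not the version from the diff (which uses `map()`); `strrep()` from base R (available since R 3.3.0) is shown alongside as a built-in alternative with the same result for a single character.

```r
# Repeat `char` once per element of `times` and collapse each result
# into a single string.
rep_char <- function(char, times) {
  vapply(
    times,
    function(n) paste(rep.int(char, n), collapse = ""),
    character(1L)
  )
}

spaces <- rep_char(" ", c(0L, 2L, 4L))

# Base R's strrep() is vectorised over `times` as well.
via_strrep <- strrep(" ", c(0L, 2L, 4L))
```

A count of zero yields an empty string in both cases, so callers do not need to special-case tokens without leading spaces.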

krlmlr · 2017-07-25T22:31:28Z

R/visit.R

@@ -17,18 +17,19 @@ NULL
 pre_visit <- function(pd_nested, funs) {
  if (is.null(pd_nested)) return()
  pd_transformed <- pd_nested %>%
-    visit_one(funs) %>%
-    mutate(child = map(child, pre_visit, funs = funs))
+    visit_one(funs)


One-step pipe?

I can change this to a one-liner without pipe.
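The shape of the pre-order visitor, written without a pipe, can be sketched on plain nested lists. This is a simplified model under assumptions: styler's nested parse table is a data frame with a `child` list column, whereas the nodes here are hypothetical lists with `text` and `child` fields.

```r
# Apply every function in `funs` to the node first, then recurse into
# its children (pre-order traversal).
pre_visit <- function(node, funs) {
  if (is.null(node)) return(NULL)
  for (f in funs) node <- f(node)
  node$child <- lapply(node$child, pre_visit, funs = funs)
  node
}

# Hypothetical constructor for a node without children.
leaf <- function(text) list(text = text, child = list())

tree <- list(text = "root", child = list(leaf("a"), leaf("b")))

# Example transformer: upper-case the text of the node it receives.
shout <- function(node) {
  node$text <- toupper(node$text)
  node
}

visited <- pre_visit(tree, list(shout))
```

Because the parent is transformed before its children, each transformer sees a node whose children are still in their original state.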

lorenzwalthert · 2017-07-26T20:02:31Z

I added a profiling run for commit cbf94f2 on a package (with 10 styled files) here. With this example, styling is about 4 times faster than before optimising for performance (commit 59de93f).

krlmlr

Looks good. Eventually we might want to use visitors that work in a loop without recursion, to make profiling results easier to understand.

lorenzwalthert added 9 commits July 25, 2017 18:36

remove mutate statments

ad70d87

outsource opening

2c004b7

transmute / select

1497a46

replace dplyr with base R

f79bbbd

early return

8b4debf

purrr over base R

3fa0a13

more likely condition first

bc3be9a

primer on performance

3d41115

use dplyr::between(). Closes r-lib#83

1b1792d

instead of start:stop and start > stop return condition.

lorenzwalthert requested a review from krlmlr July 25, 2017 20:38

drop space in vignette name.

5ea6961

krlmlr reviewed Jul 25, 2017

lorenzwalthert added 6 commits July 26, 2017 17:45

make times argument in rep_char scalar.

bdb625b

use base R split instead of nest()

32ef3a4

speedup of ~15%

!any(cond) instead of all(!cond)

c3c5a88

No speed improvement

lazy

3fe76fe

pipe -> nested call

02742a6

simplify

cbf94f2

lorenzwalthert requested a review from krlmlr July 26, 2017 20:02

krlmlr reviewed Jul 26, 2017

lorenzwalthert added 4 commits July 27, 2017 16:38

use compute_indent_indices that simplifies code

30aa043

drop one-step pipe

1690b8f

document

9b9af35

move short to tokenize

65d99f7

lorenzwalthert requested a review from krlmlr July 27, 2017 14:50

update documentation for needs_indention and drop unnecessary paramet…

c339fa8

…er `token`.

krlmlr approved these changes Jul 27, 2017

lorenzwalthert merged commit 0a9f468 into r-lib:master Jul 27, 2017

lorenzwalthert mentioned this pull request Jul 27, 2017

make styler faster #78

Closed

lorenzwalthert mentioned this pull request Sep 5, 2017

Consider removing dplyr and possibly purrr dependencies? #181

Closed
