-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup #90
Speedup #90
Conversation
instead of start:stop and start > stop return condition.
Codecov Report
@@ Coverage Diff @@
## master #90 +/- ##
=========================================
+ Coverage 91.47% 92.28% +0.8%
=========================================
Files 17 17
Lines 610 596 -14
=========================================
- Hits 558 550 -8
+ Misses 52 46 -6
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Would you mind sharing the output of a relevant profvis run, e.g. on Rpubs?
R/modify_pd.R
Outdated
} | ||
|
||
pd %>% | ||
set_unindention_child(token = "')'", unindent_by = indent_by) | ||
} | ||
#' @rdname update_indention | ||
indent_curly <- function(pd, indent_by) { | ||
indention_needed <- needs_indention(pd, token = "'{'") | ||
opening <- which(pd$token == "'{'") | ||
indention_needed <- needs_indention(pd, token = "'{'", opening[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define a function get_indent_indices()
instead that returns a (possibly empty) integer vector of positions that need to have indention added?
indent_indices <- get_indent_indices(pd, ...)
if (length(indent_indices) > 0L) pd$indent[indent_indices] <- pd$indent[indent_indices] + 2L
Also, the benchmarks suggest that this doesn't buy us anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sounds good. I think since dplyr::between()
returns boolean values, we better call it compute_indent_flags()
and do boolean subsetting instead of integer subsetting. Does that sound reasonable?
indent_round <- function(pd, indent_by) {
indent_flags <- compute_indent_flags(pd, token = "'('")
pd$indent[indent_flags] <- pd$indent[indent_flags] + indent_by
pd %>%
set_unindention_child(token = "')'", unindent_by = indent_by)
}
compute_indent_flags <- function(pd, token = "'('") {
npd <- nrow(pd)
opening <- which(pd$token == token)
if (!needs_indention(pd, token, opening[1])) return()
start <- opening + 1
stop <- npd - 1
between(seq_len(npd), start, stop)
}
It has only slightly worse performance (<1%) so I think we should do it. It also reduces code duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indices have the advantage that they can be checked for length zero, and that they only contain the positions that we care about. Essentially, compute_indent_indices <- function(...) which(compute_indent_flags(...))
. I'm not sure about performance, because you need to allocate an extra integer vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I can try that. When I checked the profiling I just got the impression that which()
is expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with flags if it works well enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok no, I just figured out that which()
is not expensive. It was the comparison inside which()
, which we can't avoid anyways. Also, always having a numerical vector is better than having NULL sometimes, so I use indices anyways.
R/modify_pd.R
Outdated
@@ -127,6 +121,6 @@ token_is_multi_line <- function(pd) { | |||
#' @param pd_flat A flat parse table. | |||
#' @return A nested parse table. | |||
strip_eol_spaces <- function(pd_flat) { | |||
pd_flat %>% | |||
mutate(spaces = spaces * (lead(lag_newlines, default = 0) == 0)) | |||
pd_flat$spaces <- pd_flat$spaces * (lead(pd_flat$lag_newlines, default = 0) == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following pattern seems slightly easier to read:
idx <- which(...)
pd_flat$spaces[idx] <- 0L
pd_flat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or flags instead of the (expensive) which?
idx <- lead(pd_flat$lag_newlines, default = 0) != 0
pd_flat$spaces[idx] <- 0
R/nested.R
Outdated
@@ -29,10 +29,10 @@ compute_parse_data_nested <- function(text) { | |||
add_terminal_token_before() %>% | |||
add_terminal_token_after() | |||
|
|||
parse_data$child <- rep(list(NULL), length(parse_data$text)) | |||
parse_data$short <- substr(parse_data$text, 1, 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The short
is optional and could be added to the "flat" parse data, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need short
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use it further no. I It just helps when working interactively. Should we drop it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it helps we should keep it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved it into tokenize()
left_join(pd_flat, ., by = "id") | ||
} | ||
|
||
#' @rdname add_token_terminal | ||
add_terminal_token_before <- function(pd_flat) { | ||
pd_flat %>% | ||
terminals <- pd_flat %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe faster:
terminals <- which(pd_flat$terminals)
order <- order(pd_flat$line1, pd_flat$col1)[terminals]
data_frame(id = pd_flat$id[order], token_before = ...) %>% ...
Or:
terminals <- which(pd_flat$terminals)
order <- order(pd_flat$line1[terminals], pd_flat$col1[terminals])
data_frame(id = pd_flat$id[terminals][order], token_before = ...) %>% ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that but I felt since this function is only called once and it seems pretty inexpensive (10 ms out of 15'460 for the whole run, file R/nested.R), I left it as is, for better legibility. Or do you prefer the rearrangement anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know that, I just noticed you changed it and assumed that performance matters here. Never mind.
R/nested.R
Outdated
split <- pd_flat %>% | ||
mutate_(internal = ~ (id %in% parent) | (parent <= 0)) %>% | ||
nest_("data", names(pd_flat)) | ||
nest_("data", setdiff(names(pd_flat), "internal")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could try split_data <- split(pd_flat[...], pd_flat$internal)
and use split_data
instead of split$data
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that works well.
R/rules-spacing.R
Outdated
|
||
non_comments <- pd %>% | ||
filter(token != "COMMENT") | ||
non_comments <-pd[pd$token != "COMMENT", ] | ||
|
||
comments <- comments %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A pipe with just one step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Should we rather do?
extract(comments, text, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
R/rules-spacing.R
Outdated
arrange(line1, col1) | ||
} | ||
|
||
|
||
set_space_before_comments <- function(pd_flat) { | ||
comment_after <- pd_flat$token == "COMMENT" | ||
if (all(!comment_after)) return(pd_flat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!any()
might be slightly faster (also elsewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, Can try that. Had it before, but then I felt it's less legible, but did not think in terms of speed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it, but it's not really faster.
R/unindent.R
Outdated
@@ -5,7 +5,7 @@ | |||
#' @inheritParams unindent_child | |||
#' @importFrom purrr map | |||
set_unindention_child <- function(pd, token = "')'", unindent_by) { | |||
if(all(pd$terminal) | all(pd$indent == 0)) return(pd) | |||
if(all(pd$indent == 0) | all(pd$terminal) ) return(pd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
||
is slightly clearer, no performance difference here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Will try.
R/utils.R
Outdated
rep_char <- function(char, times) { | ||
lapply(times, rep.int, x = char) %>% | ||
vapply(paste, collapse = "", character(1L)) | ||
map(times, rep.int, x = char) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can times
be a vector here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the flat serialization, newlines_and_spaces()
uses rep_char()
with a vectorised times
input and for setting spaces at the beginning of comments we do that. I think we can use map()
in these two places. It speeds things up quite a bit (~15%). Thanks for the hint.
R/visit.R
Outdated
@@ -17,18 +17,19 @@ NULL | |||
pre_visit <- function(pd_nested, funs) { | |||
if (is.null(pd_nested)) return() | |||
pd_transformed <- pd_nested %>% | |||
visit_one(funs) %>% | |||
mutate(child = map(child, pre_visit, funs = funs)) | |||
visit_one(funs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One-step pipe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change this to a one-liner without pipe.
speedup of ~15%
No speed improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Eventually we might want to use visitors that works in a loop without recursion, for easier to understand profiling results.
On an example file, the changes introduced with this PR speed up styler up to 4x. For styling a whole package, the speed gains turn out to be much less, only about 2.5x. Main bottle neck now is
tidyr::nest()
.This PR is based on #78