Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For performance, consider accessing user supplied context by reference #298

Closed
patefacio opened this issue Jun 4, 2018 · 9 comments
Closed

Comments

@patefacio
Copy link
Contributor

patefacio commented Jun 4, 2018

In this branch (https://github.com/patefacio/tera/tree/less_clone) I've updated the Renderer code to, where possible, not do deep copies (ie clones). It required some refactor. The strategy was the Renderer keeps a borrow of user supplied context and passes that on to an AstProcessor. This way an instance of a class is not required to have reference to another field in same struct - which I think is impossible in safe rust. I added a RefOrOwned class to treat an immutable ref and a owned type transparently. Then the actual processor can clone those RefOrOwned instances and it is a large pointer copy. I have not finished all features yet so I have more work to do, but the initial results are promising.

Looking for comments, suggestions.

Perhaps my usage is pathological; I have a large context to generate this html: (https://patefacio.github.io/plusauri.github.io/html/forecast/this_just_in_bogle_sample.plusauri.as_modeled.uai_true.html)

For my particular template the performance change is stark: (durations in milliseconds).

7 matches for "took" in buffer: plus_release_server.out
2400:Render (less_clone) for template took 67: For template (forecast) with parents ([])
4867:Render (less_clone) for template took 62: For template (forecast) with parents ([])
7332:Render (less_clone) for template took 56: For template (forecast) with parents ([])
9797:Render (less_clone) for template took 59: For template (forecast) with parents ([])
12262:Render (less_clone) for template took 65: For template (forecast) with parents ([])
14727:Render (less_clone) for template took 59: For template (forecast) with parents ([])
17192:Render (less_clone) for template took 58: For template (forecast) with parents ([])

8 matches for "took" in buffer: plus_release_server.out
2:Render (tera_ref) for template took 1909: For template (forecast) with parents ([])
70:Render (tera_ref) for template took 2105: For template (forecast) with parents ([])
137:Render (tera_ref) for template took 1999: For template (forecast) with parents ([])
204:Render (tera_ref) for template took 2003: For template (forecast) with parents ([])
271:Render (tera_ref) for template took 1931: For template (forecast) with parents ([])
338:Render (tera_ref) for template took 2062: For template (forecast) with parents ([])
405:Render (tera_ref) for template took 2073: For template (forecast) with parents ([])
472:Render (tera_ref) for template took 2034: For template (forecast) with parents ([])

@Keats
Copy link
Owner

Keats commented Jun 6, 2018

Can you make a PR for the benches you've added (although I would keep the current huge_loop one as is since it was an example from handlebars and create another one for loop nesting)?

That's a pretty huge rewrite of the renderer though and it would need to pass all the tests (currently 24 failing) before I take a good look at it though. Is rewriting the whole renderer necessary or is mostly related to for loops?

@patefacio
Copy link
Contributor Author

Maybe a more skillful simple replacement could be done. I tried first to simply replace Value with &'a Value and just could not make it work. I think RefOrOwned concept, which may exist elsewhere in rust as standard, is required to stop deep cloning and that required an extra class. That is, the work could not be done in Renderer::render because you need that main function to hold ref to user context to ensure lifetime is long enough struct doing for real work, which I called AstProcessor.

I understand your reservations and have them also. I'll be glad to try to get those tests working - I just wanted some thoughts before continuing. I will also make a PR for the benches leaving original huge_loop.

@patefacio
Copy link
Contributor Author

I've got almost all the tests working. I believe the only ones not working are related to error processing where the tests look for specific error text. For some, I updated the required text to match slightly new output. I imagine the text of errors is not considered part of the api? Anyway, I never really acclimated to rust error_chain; Honestly I find it difficult to use/understand.

I believe the updates will be faster for many/most renderings that involve any significant context data. For very small contexts in tight loops it might be worse. Here are new benchmarks vs current as run on my macbook. For my specific use case it takes from 2,000ms down to under 100ms - but I may be doing something atypical in my templates. @Keats - Let me know what you think:

less_clone

running 2 tests
test bench_big_loop_big_object ... bench:     375,215 ns/iter (+/- 67,384)
test bench_macro_big_object    ... bench:     704,519 ns/iter (+/- 64,984)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

     Running target/release/deps/tera-6527d5d84c556326

running 13 tests
test access_deep_object                        ... bench:       6,721 ns/iter (+/- 1,537)
test access_deep_object_with_literal           ... bench:       7,856 ns/iter (+/- 550)
test bench_build_inheritance_chains            ... bench:       3,185 ns/iter (+/- 225)
test bench_escape_html                         ... bench:          98 ns/iter (+/- 5)
test bench_huge_loop                           ... bench:     769,062 ns/iter (+/- 90,359)
test bench_parsing_basic_template              ... bench:      22,946 ns/iter (+/- 3,289)
test bench_parsing_with_inheritance_and_macros ... bench:      44,286 ns/iter (+/- 4,726)
test bench_rendering_basic_template            ... bench:       7,290 ns/iter (+/- 606)
test bench_rendering_inheritance_and_macros    ... bench:       8,220 ns/iter (+/- 489)
test bench_rendering_only_inheritance          ... bench:       3,074 ns/iter (+/- 617)
test bench_rendering_only_macro_call           ... bench:       1,127 ns/iter (+/- 103)
test bench_rendering_only_parent               ... bench:       1,932 ns/iter (+/- 182)
test bench_rendering_only_variable             ... bench:       2,248 ns/iter (+/- 136)

Original

running 2 tests
test bench_big_loop_big_object ... bench:   3,027,435 ns/iter (+/- 1,185,983)
test bench_macro_big_object    ... bench:   1,817,381 ns/iter (+/- 123,192)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

     Running /Users/dbdavidson/dev/open_source/tera_bench/target/release/deps/tera-dd530086e3e4cc51

running 13 tests
test access_deep_object                        ... bench:       8,280 ns/iter (+/- 852)
test access_deep_object_with_literal           ... bench:      11,206 ns/iter (+/- 1,858)
test bench_build_inheritance_chains            ... bench:       3,355 ns/iter (+/- 448)
test bench_escape_html                         ... bench:          92 ns/iter (+/- 5)
test bench_huge_loop                           ... bench:   2,265,449 ns/iter (+/- 700,556)
test bench_parsing_basic_template              ... bench:      21,018 ns/iter (+/- 3,146)
test bench_parsing_with_inheritance_and_macros ... bench:      44,102 ns/iter (+/- 3,045)
test bench_rendering_basic_template            ... bench:       6,980 ns/iter (+/- 894)
test bench_rendering_inheritance_and_macros    ... bench:       9,637 ns/iter (+/- 2,183)
test bench_rendering_only_inheritance          ... bench:       3,022 ns/iter (+/- 287)
test bench_rendering_only_macro_call           ... bench:       1,098 ns/iter (+/- 88)
test bench_rendering_only_parent               ... bench:       1,699 ns/iter (+/- 208)
test bench_rendering_only_variable             ... bench:       1,835 ns/iter (+/- 113)

test result: ok. 0 passed; 0 failed; 0 ignored; 13 measured; 0 filtered out

@Keats
Copy link
Owner

Keats commented Jun 15, 2018

It's interesting that looking at the benchmarks it only really affects the loop so that's really the bottleneck, nothing else. I don't think I will have time this week but I should be able to try having a look on my end next week.

I imagine the text of errors is not considered part of the api?

It is not.

I'll probably switch from error-chain to basic errors but that's a breaking change afaik so that will have to wait for a major version. Also waiting to see where failure is going before doing any error refactoring.

@patefacio
Copy link
Contributor Author

I think it is more than the loop. Maybe an example of "correlation is not causation"? I believe the issue really is the combination of (1) storing by value and (2) using Value throughout as return on all eval_.... Because of that macro invocations, set assignments and loops have no recourse but to clone. For example, I've added two new benchmarks in tera_less_clone () with no loops. One with a set ... statement, which a client could assume is like a reference but is really a deep copy and another with a macro invocation.

Also, originally I had assumed it was Rocket that was the culprit. The reason is I was timing my rocket route handler and it was fast (millis) yet it took two seconds to make it to browser - so I assumed rocket middleware on transit. It turns out that the request to render in rocket really only deferred the call to tera rendering until after return of the route. Anyway, I was discussing some with Rocket community and in the process I created this gist: https://gist.github.com/patefacio/d861650e7297b4d27c6e0c591a43cb12

If you look at the last file you'll see an image in the call stack and when I rummaged through it a very large percentage of time was in clone. That is why I think it is more than looping. Of course, with looping it will show big because it does clone there as well.

tera_less_clone

running 2 tests
test bench_macro_big_object_no_loop_macro_call ... bench:   6,393,256 ns/iter (+/- 563,981)
test bench_macro_big_object_no_loop_with_set   ... bench:   6,702,669 ns/iter (+/- 626,915)

original

running 2 tests
test bench_macro_big_object_no_loop_macro_call ... bench:  13,463,490 ns/iter (+/- 1,026,651)
test bench_macro_big_object_no_loop_with_set   ... bench:  13,553,011 ns/iter (+/- 1,198,475)

@Keats
Copy link
Owner

Keats commented Jun 25, 2018

Let me know when everything is working and i'll have a look, that does look promising

@patefacio
Copy link
Contributor Author

Sorry for long delay - vacation and stuff. All tests pass and although some tests I changed slightly, I believe they capture the same intent. I look forward to comments/suggestions.

@Keats
Copy link
Owner

Keats commented Aug 21, 2018

no worries.
Can you open a PR for easy review?

@Keats Keats closed this as completed Sep 2, 2018
@Keats
Copy link
Owner

Keats commented Sep 2, 2018

The code has been released with 0.11.13, thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants