-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize CGI.escapeHTML by reducing buffer extension and branches #2226
Conversation
and switch-case branches.
751b4bf
to
4d23502
Compare
4d23502
to
0bc4478
Compare
ext/cgi/escape/escape.c
Outdated
break; | ||
} | ||
#define HTML_ESCAPE(c, str) do { \ | ||
html_escape_table[c] = str; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, define variable int len = strlen(str);
here, and use it following?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might be robust and easier to read. I did so 9bb706a
This is great! I was trying to understand why escape_utils was still faster than |
Thanks for your comment. escape_utils/benchmark/html_escape.rbHere's the benchmark result of escape_utils/benchmark/html_escape.rb using this
Note: benchmark/cgi_escape_html.ymlSame environment as above, but with benchmarks in this PR:
WhyLet me clarify my understanding of characteristics of each implementation:
|
(It included my wrong assumption for |
and switch-case branches. Buffer allocation optimization using `ALLOCA_N` would be the main benefit of patch. It eliminates the O(N) buffer extensions. It also reduces the number of branches using escape table like https://mattn.kaoriya.net/software/lang/c/20160817011915.htm. Closes: #2226 Co-authored-by: Nobuyoshi Nakada <[email protected]> Co-authored-by: Yasuhiro MATSUMOTO <[email protected]>
I committed the fixed version in 0a29dc8 and updated all above benchmark results again. |
See #2226 for benchmark results.
Thank you for such detailed explanation. I'm glad that we can either kill escape_utils or at least shrink its implementation in Ruby 2.7. |
This is great! I wonder why Rack's implementation doesn't use @rafaelfranca I think we use the JavaScript / URI escaping stuff in escape_utils still (I wanted to remove this dependency, but couldn't) |
I didn't do all the optimisations here, but I did do this one which I'm not sure if it's implemented here or not: https://github.com/ioquatix/trenni/blob/master/ext/trenni/escape.c#L83-L84 Basically, search the string to see if there are any characters to be escaped. If you find index to escape, then start from there (you might be able to minimise stack allocation too if you know length of remainder). If you don't find any symbol to escape, you can return early with no overhead. |
ext/cgi/escape/escape.c
Outdated
} while (0) | ||
HTML_ESCAPE('\'', "'"); | ||
HTML_ESCAPE('&', "&"); | ||
HTML_ESCAPE('"', """); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also use "
which is one character less :p
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know there are some variations, but I intended to use exactly these characters for backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with you, it was half-joke :)
Before merging this, I did the experiment to skip a buffer allocation at all when nothing is escaped k0kubun#16. It slightly improved the benchmark, but it also complicates implementation (if we don't care maintainability, why not use SIMD? :p). So I intentionally skipped that for now. |
I don't think the implementation has to be much complicated to avoid buffer allocation. Just make "Find next token" method that takes current The reason why it's good optimisation is because it avoids allocation, GC pressure, etc. I would say many strings have no sequence that requires escape. |
For the record: my complicated patch.
diff --git a/ext/cgi/escape/escape.c b/ext/cgi/escape/escape.c
index 76d8f0d067..5fa8463c75 100644
--- a/ext/cgi/escape/escape.c
+++ b/ext/cgi/escape/escape.c
@@ -34,35 +34,87 @@ preserve_original_state(VALUE orig, VALUE dest)
RB_OBJ_INFECT_RAW(dest, orig);
}
+static inline char *
+proceed_one_char(char *dest, const unsigned char c)
+{
+ uint8_t len = html_escape_table[c].len;
+ if (len) {
+ memcpy(dest, html_escape_table[c].str, len);
+ dest += len;
+ }
+ else {
+ *dest++ = c;
+ }
+ return dest;
+}
+
+#define FAST_EACH_CHAR() \
+ /* Manual loop unrolling to align word access */ \
+ for (; end - cstr >= 4; cstr += 4) { \
+ /* Prefetch four bytes */ \
+ const unsigned char c0 = cstr[0]; \
+ const unsigned char c1 = cstr[1]; \
+ const unsigned char c2 = cstr[2]; \
+ const unsigned char c3 = cstr[3]; \
+ /* return cstr instead of cstr + N for alignment */ \
+ BLOCK(c0, cstr); \
+ BLOCK(c1, cstr); \
+ BLOCK(c2, cstr); \
+ BLOCK(c3, cstr); \
+ } \
+ /* The original loop */ \
+ while (cstr < end) { \
+ const unsigned char c = cstr[0]; \
+ BLOCK(c, cstr); \
+ cstr++; \
+ }
+
+static inline const char *
+scout_escape_char(const char *cstr, const char *end) {
+#define BLOCK(c, p) if (html_escape_table[c].len) return (p);
+ FAST_EACH_CHAR();
+#undef BLOCK
+ return NULL;
+}
+
+static inline char *
+escape_cstr(char *dest, const char *cstr, const char *end)
+{
+#define BLOCK(c, p) dest = proceed_one_char(dest, (c));
+ FAST_EACH_CHAR();
+#undef BLOCK
+ return dest;
+}
+
static VALUE
optimized_escape_html(VALUE str)
{
- VALUE vbuf;
- char *buf = ALLOCV_N(char, vbuf, RSTRING_LEN(str) * HTML_ESCAPE_MAX_LEN);
const char *cstr = RSTRING_PTR(str);
- const char *end = cstr + RSTRING_LEN(str);
-
- char *dest = buf;
- while (cstr < end) {
- const unsigned char c = *cstr++;
- uint8_t len = html_escape_table[c].len;
- if (len) {
- memcpy(dest, html_escape_table[c].str, len);
- dest += len;
- }
- else {
- *dest++ = c;
- }
- }
+ long len = RSTRING_LEN(str);
+ const char *end = cstr + len;
+
+ const char *first = scout_escape_char(cstr, end);
+
+ if (!first) return rb_str_dup(str);
+
+ if (len < 20) {
+ char *buf = ALLOCA_N(char, len * HTML_ESCAPE_MAX_LEN);
+ memcpy(buf, cstr, first - cstr);
+ char *dest = escape_cstr(buf + (first - cstr), first, end);
- VALUE escaped;
- if (RSTRING_LEN(str) < (dest - buf)) {
- escaped = rb_str_new(buf, dest - buf);
+ VALUE escaped = rb_str_new(buf, dest - buf);
preserve_original_state(str, escaped);
+ return escaped;
}
- else {
- escaped = rb_str_dup(str);
- }
+
+ VALUE vbuf;
+ char *buf = ALLOCV_N(char, vbuf, len * HTML_ESCAPE_MAX_LEN);
+
+ memcpy(buf, cstr, first - cstr);
+ char *dest = escape_cstr(buf + (first - cstr), first, end);
+
+ VALUE escaped = rb_str_new(buf, dest - buf);
+ preserve_original_state(str, escaped);
ALLOCV_END(vbuf);
return escaped;
} |
My point is that having the "Find next token" is already not as simple as the current implementation, and the benefit should be big enough to accept it. My patch did not improve the no-escape performance that much, but yours might do 🙂
The argument sounds fair, but I'd also say many strings are shorter than 170 characters (170 * 6 < |
What is the reason to call |
I'm not sure and I'm not Rubyist but calling rb_str_dup is required.
|
I think that this is a great improvement and I think this implementation is fast enough. I have existing benchmarks, so I added
This implementation probably doesn't beat Trenni's implementation, but I will test it once it's merged. Ruby does have some basic string CoW so maybe performance hit is not so bad when calling In my experience, typical use case is appending to an output buffer. So, I think it's silly to duplicate a string in memory for the sole purpose of appending to another buffer. In my testing, avoiding this operation was a huge performance win, to then point where all my operations became appends: https://github.com/ioquatix/trenni/blob/master/ext/trenni/escape.h#L11-L12 Overall, this and several other optimisations allow Trenni templates to be 10x or more faster than ERB, even while using escaping by default. While we can't utilise the append & escape operation without changing the existing method, maybe it's not silly to add it, e.g. |
mattn's comment is right. Also I already explained that in #2226 (comment):
|
@k0kubun sorry I didn't clearly read all your detailed notes. Thanks for such information. |
On this point, do you think it makes sense to add something like |
At least I think it does not help Oh by the way, when fixing Anyway I think |
and switch-case branches. Buffer allocation optimization using `ALLOCA_N` would be the main benefit of patch. It eliminates the O(N) buffer extensions. It also reduces the number of branches using escape table like https://mattn.kaoriya.net/software/lang/c/20160817011915.htm. Closes: ruby/ruby#2226 Co-authored-by: Nobuyoshi Nakada <[email protected]> Co-authored-by: Yasuhiro MATSUMOTO <[email protected]>
and switch-case branches. Buffer allocation optimization using `ALLOCA_N` would be the main benefit of patch. It eliminates the O(N) buffer extensions. It also reduces the number of branches using escape table like https://mattn.kaoriya.net/software/lang/c/20160817011915.htm. Closes: ruby/ruby#2226 Co-authored-by: Nobuyoshi Nakada <[email protected]> Co-authored-by: Yasuhiro MATSUMOTO <[email protected]>
Benchmark
Intel 4.0GHz i7-4790K with 16GB memory under x86-64 Ubuntu 8 Cores, gcc 7.3.0
escape_utils.gem's benchmark
Here's the benchmark result of escape_utils/benchmark/html_escape.rb using this
CGI.escapeHTML
. OriginallyCGI.escapeHTML
was about 0.72x ofEscapeUtils.escape_html
, and now it's 2.48x.Note:
Haml::Helpers.html_escape
usesERB::Util.html_escape
which usesCGI.escapeHTML
, so those 3 are the same.Other scenarios
When there's at least one escaped character (
_one
,_all
,_real
), it becomes 1.91~5.35x.When there's nothing to be escaped (
_blank
,_none
), unfortunately it becomes 1.08~1.12x slower.