string_decoder: rewrite implementation #6777

mscdex · 2016-05-15T19:44:58Z

Checklist

tests and code linting passes
a test and/or benchmark is included
the commit message follows commit guidelines

Affected core subsystem(s)

string_decoder

Description of change

This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability.

Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking.

Benchmark results:

string_decoder/string-decoder.js encoding=utf8 inlen=32 chunk=16 n=2500000: ./node: 2608500 ./node-master: 2411100 ............. 8.19%
string_decoder/string-decoder.js encoding=utf8 inlen=32 chunk=64 n=2500000: ./node: 3812400 ./node-master: 3689900 ............. 3.32%
string_decoder/string-decoder.js encoding=utf8 inlen=32 chunk=256 n=2500000: ./node: 3995900 ./node-master: 3666700 ............ 8.98%
string_decoder/string-decoder.js encoding=utf8 inlen=32 chunk=1024 n=2500000: ./node: 3990400 ./node-master: 3734300 ........... 6.86%
string_decoder/string-decoder.js encoding=utf8 inlen=128 chunk=16 n=2500000: ./node: 657720 ./node-master: 593080 ............. 10.90%
string_decoder/string-decoder.js encoding=utf8 inlen=128 chunk=64 n=2500000: ./node: 1454800 ./node-master: 1356500 ............ 7.25%
string_decoder/string-decoder.js encoding=utf8 inlen=128 chunk=256 n=2500000: ./node: 1821500 ./node-master: 1741000 ........... 4.62%
string_decoder/string-decoder.js encoding=utf8 inlen=128 chunk=1024 n=2500000: ./node: 1819900 ./node-master: 1750200 .......... 3.98%
string_decoder/string-decoder.js encoding=utf8 inlen=1024 chunk=16 n=2500000: ./node: 83475 ./node-master: 75012 .............. 11.28%
string_decoder/string-decoder.js encoding=utf8 inlen=1024 chunk=64 n=2500000: ./node: 175280 ./node-master: 172150 ............. 1.82%
string_decoder/string-decoder.js encoding=utf8 inlen=1024 chunk=256 n=2500000: ./node: 266000 ./node-master: 247300 ............ 7.56%
string_decoder/string-decoder.js encoding=utf8 inlen=1024 chunk=1024 n=2500000: ./node: 260120 ./node-master: 252230 ........... 3.13%
string_decoder/string-decoder.js encoding=utf8 inlen=4096 chunk=16 n=2500000: ./node: 20576 ./node-master: 17684 .............. 16.35%
string_decoder/string-decoder.js encoding=utf8 inlen=4096 chunk=64 n=2500000: ./node: 46176 ./node-master: 43044 ............... 7.27%
string_decoder/string-decoder.js encoding=utf8 inlen=4096 chunk=256 n=2500000: ./node: 66595 ./node-master: 63134 .............. 5.48%
string_decoder/string-decoder.js encoding=utf8 inlen=4096 chunk=1024 n=2500000: ./node: 67692 ./node-master: 64320 ............. 5.24%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=32 chunk=16 n=2500000: ./node: 792560 ./node-master: 376330 ...... 110.60%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=32 chunk=64 n=2500000: ./node: 2776000 ./node-master: 1388800 ..... 99.88%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=32 chunk=256 n=2500000: ./node: 2552900 ./node-master: 1250700 ... 104.12%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=32 chunk=1024 n=2500000: ./node: 2791300 ./node-master: 1263100 .. 120.99%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=128 chunk=16 n=2500000: ./node: 242150 ./node-master: 100900 ..... 139.99%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=128 chunk=64 n=2500000: ./node: 731910 ./node-master: 308100 ..... 137.56%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=128 chunk=256 n=2500000: ./node: 1880800 ./node-master: 1005100 ... 87.14%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=128 chunk=1024 n=2500000: ./node: 2031900 ./node-master: 1067100 .. 90.41%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=1024 chunk=16 n=2500000: ./node: 28221 ./node-master: 13268 ...... 112.70%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=1024 chunk=64 n=2500000: ./node: 110520 ./node-master: 48485 ..... 127.95%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=1024 chunk=256 n=2500000: ./node: 281630 ./node-master: 155010 .... 81.69%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=1024 chunk=1024 n=2500000: ./node: 509790 ./node-master: 329090 ... 54.91%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=4096 chunk=16 n=2500000: ./node: 7454.5 ./node-master: 3540.4 .... 110.56%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=4096 chunk=64 n=2500000: ./node: 24644 ./node-master: 12107 ...... 103.56%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=4096 chunk=256 n=2500000: ./node: 74954 ./node-master: 40305 ...... 85.97%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=4096 chunk=1024 n=2500000: ./node: 125180 ./node-master: 92216 .... 35.75%
string_decoder/string-decoder.js encoding=base64-ascii inlen=32 chunk=16 n=2500000: ./node: 985010 ./node-master: 451520 ..... 118.15%
string_decoder/string-decoder.js encoding=base64-ascii inlen=32 chunk=64 n=2500000: ./node: 2830500 ./node-master: 1421600 .... 99.10%
string_decoder/string-decoder.js encoding=base64-ascii inlen=32 chunk=256 n=2500000: ./node: 2853400 ./node-master: 1380000 .. 106.77%
string_decoder/string-decoder.js encoding=base64-ascii inlen=32 chunk=1024 n=2500000: ./node: 2389400 ./node-master: 1483300 .. 61.08%
string_decoder/string-decoder.js encoding=base64-ascii inlen=128 chunk=16 n=2500000: ./node: 268370 ./node-master: 126350 .... 112.40%
string_decoder/string-decoder.js encoding=base64-ascii inlen=128 chunk=64 n=2500000: ./node: 756360 ./node-master: 448280 ..... 68.73%
string_decoder/string-decoder.js encoding=base64-ascii inlen=128 chunk=256 n=2500000: ./node: 2099500 ./node-master: 1075000 .. 95.29%
string_decoder/string-decoder.js encoding=base64-ascii inlen=128 chunk=1024 n=2500000: ./node: 1978000 ./node-master: 1120100 . 76.59%
string_decoder/string-decoder.js encoding=base64-ascii inlen=1024 chunk=16 n=2500000: ./node: 30721 ./node-master: 14321 ..... 114.52%
string_decoder/string-decoder.js encoding=base64-ascii inlen=1024 chunk=64 n=2500000: ./node: 132060 ./node-master: 64124 .... 105.94%
string_decoder/string-decoder.js encoding=base64-ascii inlen=1024 chunk=256 n=2500000: ./node: 324520 ./node-master: 185200 ... 75.23%
string_decoder/string-decoder.js encoding=base64-ascii inlen=1024 chunk=1024 n=2500000: ./node: 581660 ./node-master: 425450 .. 36.71%
string_decoder/string-decoder.js encoding=base64-ascii inlen=4096 chunk=16 n=2500000: ./node: 9096.6 ./node-master: 3610.9 ... 151.92%
string_decoder/string-decoder.js encoding=base64-ascii inlen=4096 chunk=64 n=2500000: ./node: 30843 ./node-master: 14610 ..... 111.11%
string_decoder/string-decoder.js encoding=base64-ascii inlen=4096 chunk=256 n=2500000: ./node: 90907 ./node-master: 46596 ..... 95.10%
string_decoder/string-decoder.js encoding=base64-ascii inlen=4096 chunk=1024 n=2500000: ./node: 150890 ./node-master: 98642 ... 52.97%
string_decoder/string-decoder.js encoding=utf16le inlen=32 chunk=16 n=2500000: ./node: 3743100 ./node-master: 2683600 ......... 39.48%
string_decoder/string-decoder.js encoding=utf16le inlen=32 chunk=64 n=2500000: ./node: 6516300 ./node-master: 4930200 ......... 32.17%
string_decoder/string-decoder.js encoding=utf16le inlen=32 chunk=256 n=2500000: ./node: 6235800 ./node-master: 5263600 ........ 18.47%
string_decoder/string-decoder.js encoding=utf16le inlen=32 chunk=1024 n=2500000: ./node: 6349900 ./node-master: 4837100 ....... 31.28%
string_decoder/string-decoder.js encoding=utf16le inlen=128 chunk=16 n=2500000: ./node: 973190 ./node-master: 777490 .......... 25.17%
string_decoder/string-decoder.js encoding=utf16le inlen=128 chunk=64 n=2500000: ./node: 3124800 ./node-master: 2180500 ........ 43.31%
string_decoder/string-decoder.js encoding=utf16le inlen=128 chunk=256 n=2500000: ./node: 4903100 ./node-master: 3918900 ....... 25.11%
string_decoder/string-decoder.js encoding=utf16le inlen=128 chunk=1024 n=2500000: ./node: 4909800 ./node-master: 3802800 ...... 29.11%
string_decoder/string-decoder.js encoding=utf16le inlen=1024 chunk=16 n=2500000: ./node: 131990 ./node-master: 95528 .......... 38.17%
string_decoder/string-decoder.js encoding=utf16le inlen=1024 chunk=64 n=2500000: ./node: 401520 ./node-master: 323660 ......... 24.05%
string_decoder/string-decoder.js encoding=utf16le inlen=1024 chunk=256 n=2500000: ./node: 964750 ./node-master: 824650 ........ 16.99%
string_decoder/string-decoder.js encoding=utf16le inlen=1024 chunk=1024 n=2500000: ./node: 1424400 ./node-master: 1378400 ...... 3.33%
string_decoder/string-decoder.js encoding=utf16le inlen=4096 chunk=16 n=2500000: ./node: 33139 ./node-master: 24806 ........... 33.59%
string_decoder/string-decoder.js encoding=utf16le inlen=4096 chunk=64 n=2500000: ./node: 101440 ./node-master: 77741 .......... 30.48%
string_decoder/string-decoder.js encoding=utf16le inlen=4096 chunk=256 n=2500000: ./node: 242900 ./node-master: 203300 ........ 19.48%
string_decoder/string-decoder.js encoding=utf16le inlen=4096 chunk=1024 n=2500000: ./node: 358530 ./node-master: 336270 ........ 6.62%


string_decoder/string-decoder-create.js encoding=ascii n=25000000: ./node: 17123000 ./node-master: 7115100 .. 140.66%
string_decoder/string-decoder-create.js encoding=utf8 n=25000000: ./node: 3912500 ./node-master: 2699900 ..... 44.91%
string_decoder/string-decoder-create.js encoding=utf-8 n=25000000: ./node: 3843400 ./node-master: 2118000 .... 81.46%
string_decoder/string-decoder-create.js encoding=base64 n=25000000: ./node: 3112700 ./node-master: 2427900 ... 28.21%
string_decoder/string-decoder-create.js encoding=ucs2 n=25000000: ./node: 3626400 ./node-master: 2487700 ..... 45.77%
string_decoder/string-decoder-create.js encoding=UTF-8 n=25000000: ./node: 2867000 ./node-master: 1915200 .... 49.70%
string_decoder/string-decoder-create.js encoding=AscII n=25000000: ./node: 7559200 ./node-master: 4634900 .... 63.09%
string_decoder/string-decoder-create.js encoding=UTF-16LE n=25000000: ./node: 2690700 ./node-master: 1747700 . 53.96%

jasnell · 2016-05-15T19:46:16Z

lib/string_decoder.js

-    throw new Error('Unknown encoding: ' + encoding);
+// Do not cache `Buffer.isEncoding` when checking encoding names as some
+// modules monkey-patch it to support additional encodings
+function normalizeEncoding(enc) {


I'm thinking it would make sense to move this into an internal util

Perhaps. I was working on some Buffer changes that might end up doing something similar.

fwiw, I had been experimenting with something similar in the speculative icu module work. See https://github.com/jasnell/node/blob/icu-module/lib/icu.js#L113-L134.

jasnell · 2016-05-15T20:52:49Z

Overall there's nothing that stands out as being a concern here. I would note that the documentation for string_decoder leaves much to be desired but that can be fixed up later. If CI is green then LGTM

mscdex · 2016-05-15T20:53:16Z

CI: https://ci.nodejs.org/job/node-test-pull-request/2651/

addaleax · 2016-05-16T00:15:38Z

lib/string_decoder.js

+// characters.
+exports.StringDecoder = StringDecoder;
+function StringDecoder(encoding) {
+  this.encoding = normalizeEncoding(encoding);


This should default to utf8 when encoding is empty/undefined/generally false-y, no?

Fixed + test added.

addaleax · 2016-05-16T00:57:12Z

CI is green, and LGTM if the fallback to 'utf8' is restored as stated in the docs.

mscdex · 2016-05-16T04:42:15Z

CI after a few nits were addressed: https://ci.nodejs.org/job/node-test-pull-request/2652/

mscdex · 2016-05-16T05:33:59Z

I had to change the UTF-8 implementation due to being unable to distinguish between explicit and implicit replacement characters. The new UTF-8 implementation now does incomplete character checking similar to the original implementation. UTF-8 performance numbers are still the same however.

addaleax · 2016-05-16T09:10:28Z

CI with the most recent changes: https://ci.nodejs.org/job/node-test-commit/3352/
LGTM if it’s green, nice work!

mscdex · 2016-05-18T05:06:37Z

@jasnell Does this still LGTY (after the changes I made to the UTF-8 implementation)?

jasnell · 2016-05-18T05:14:52Z

Yes. I'd like to see if we can make further improvements and optimizations to this but this LGTM

mscdex · 2016-05-18T05:23:07Z

@jasnell I'm not sure much more is possible for UTF-8 on the js-side of things. AFAIK the only way now to optimize UTF-8 further would be to try to skip v8's UTF-8 decoder and use a custom C/C++ UTF-8 to UTF-16 decoder and pass the resulting output to v8? I'm assuming v8 doesn't do any additional processing on two-byte strings (created via String::NewFromTwoByte).

mscdex · 2016-05-18T05:26:11Z

I'm being extra cautious and marking this as not for v4.x. If it's one thing I've learned, it's that node's tests aren't always up to snuff and may not (always) be good enough for checking rewrites like this. Let's let it stew in current branches first.

jasnell · 2016-05-18T05:26:56Z

ICU includes a number of highly optimized UTF8/UTF16 utilities that could potentially be leveraged. In fact, we might be able to drop down to a full native implementation of string_decoder that is based entirely on ICU's utilities.

bnoordhuis · 2016-05-18T09:46:22Z

lib/string_decoder.js

+    i = 0;
+  }
+  if (i < buf.length)
+    return (r ? r + this.text(buf, i) : this.text(buf, i));


EDIT: Removed, misread.

bnoordhuis · 2016-05-18T12:49:56Z

LGTM

This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability. Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking. PR-URL: nodejs#6777 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Ben Noordhuis <[email protected]>

mscdex · 2016-05-29T18:51:30Z

One last CI before landing: https://ci.nodejs.org/job/node-test-pull-request/2855/

This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability. Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking. PR-URL: nodejs#6777 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Ben Noordhuis <[email protected]>

This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability. Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking. PR-URL: #6777 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Ben Noordhuis <[email protected]>

gagern · 2016-06-15T13:00:42Z

See #7308 for some breakage apparently caused by this change here.

nodejs-github-bot added the buffer Issues and PRs related to the buffer subsystem. label May 15, 2016

mscdex added string_decoder Issues and PRs related to the string_decoder subsystem. and removed buffer Issues and PRs related to the buffer subsystem. labels May 15, 2016

jasnell reviewed May 15, 2016
View reviewed changes

addaleax reviewed May 16, 2016
View reviewed changes

mscdex force-pushed the stringdecoder-rewrite branch from e034539 to 21cc740 Compare May 16, 2016 04:37

mscdex force-pushed the stringdecoder-rewrite branch from 21cc740 to efd0a38 Compare May 16, 2016 05:27

mscdex added the dont-land-on-v4.x label May 18, 2016

bnoordhuis reviewed May 18, 2016
View reviewed changes

mscdex force-pushed the stringdecoder-rewrite branch from efd0a38 to d23b7d2 Compare May 29, 2016 18:49

mscdex merged commit d23b7d2 into nodejs:master May 29, 2016

mscdex deleted the stringdecoder-rewrite branch May 29, 2016 19:33

gagern mentioned this pull request Jun 15, 2016

string_decoder: RangeError introduced by 6.2.1 #7308

Closed

gagern mentioned this pull request Jun 15, 2016

Set travis node version to 6.2.0 CindyJS/CindyJS#374

Merged

jasnell mentioned this pull request Jun 27, 2016

buffer,string_decoder: consolidate encoding validation logic #7207

Closed

3 tasks

mscdex mentioned this pull request Jul 12, 2016

string_decoder: fix bad utf8 character handling #7310

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

string_decoder: rewrite implementation #6777

string_decoder: rewrite implementation #6777

mscdex commented May 15, 2016

jasnell May 15, 2016

mscdex May 15, 2016

jasnell May 15, 2016

jasnell commented May 15, 2016

mscdex commented May 15, 2016

addaleax May 16, 2016

mscdex May 16, 2016

addaleax commented May 16, 2016

mscdex commented May 16, 2016

mscdex commented May 16, 2016 •

edited

Loading

addaleax commented May 16, 2016

mscdex commented May 18, 2016

jasnell commented May 18, 2016

mscdex commented May 18, 2016

mscdex commented May 18, 2016 •

edited

Loading

jasnell commented May 18, 2016 •

edited

Loading

bnoordhuis May 18, 2016 •

edited

Loading

bnoordhuis commented May 18, 2016

mscdex commented May 29, 2016

gagern commented Jun 15, 2016

string_decoder: rewrite implementation #6777

string_decoder: rewrite implementation #6777

Conversation

mscdex commented May 15, 2016

Checklist

Affected core subsystem(s)

Description of change

jasnell May 15, 2016

Choose a reason for hiding this comment

mscdex May 15, 2016

Choose a reason for hiding this comment

jasnell May 15, 2016

Choose a reason for hiding this comment

jasnell commented May 15, 2016

mscdex commented May 15, 2016

addaleax May 16, 2016

Choose a reason for hiding this comment

mscdex May 16, 2016

Choose a reason for hiding this comment

addaleax commented May 16, 2016

mscdex commented May 16, 2016

mscdex commented May 16, 2016 • edited Loading

addaleax commented May 16, 2016

mscdex commented May 18, 2016

jasnell commented May 18, 2016

mscdex commented May 18, 2016

mscdex commented May 18, 2016 • edited Loading

jasnell commented May 18, 2016 • edited Loading

bnoordhuis May 18, 2016 • edited Loading

Choose a reason for hiding this comment

bnoordhuis commented May 18, 2016

mscdex commented May 29, 2016

gagern commented Jun 15, 2016

mscdex commented May 16, 2016 •

edited

Loading

mscdex commented May 18, 2016 •

edited

Loading

jasnell commented May 18, 2016 •

edited

Loading

bnoordhuis May 18, 2016 •

edited

Loading