
stream: support decoding buffers for Writables #7425

Closed
wants to merge 2 commits

Conversation

addaleax
Member

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • documentation is changed or added
  • the commit message follows commit guidelines
Affected core subsystem(s)

stream

Description of change

See #7315 for motivation.

Support decoding the input of writable streams to a specific encoding before passing it to _write().

By default, all data written to a writable stream is encoded into Buffers. This change enables the reverse: a stream implementer can choose to receive all input as strings, whether it was passed to write() as a string or not.

This makes sense for multi-byte character encodings where the buffers that are written using write() may contain partial characters, so calling chunk.toString() is not universally applicable.
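For illustration only (this is not code from the PR), here is a minimal sketch of that failure mode: a Writable that naively calls chunk.toString() in _write() garbles a multi-byte character that is split across two write() calls.

'use strict';
const Writable = require('stream').Writable;

const pieces = [];
const w = new Writable({
  write(chunk, encoding, callback) {
    // Naive decoding: breaks if a UTF-8 character straddles a chunk boundary.
    pieces.push(chunk.toString('utf8'));
    callback();
  }
});

const euro = Buffer.from('€', 'utf8');      // 3 bytes: e2 82 ac
w.write(euro.slice(0, 2));                  // first write ends mid-character
w.write(euro.slice(2));
w.end(() => console.log(pieces.join('')));  // prints replacement characters, not '€'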

@addaleax added the stream (Issues and PRs related to the stream subsystem) and semver-minor (PRs that contain new features and should be released in the next minor version) labels on Jun 26, 2016
@addaleax
Member Author

CI: https://ci.nodejs.org/job/node-test-commit/3854/

/cc @mscdex @nodejs/streams

@@ -11,6 +11,7 @@ const util = require('util');
const internalUtil = require('internal/util');
const Stream = require('stream');
const Buffer = require('buffer').Buffer;
const StringDecoder = require('string_decoder').StringDecoder;
Contributor

You might lazy-load this as is done in Readable?

@mcollina
Member

mcollina commented Jun 26, 2016

Can you please check if this causes any performance issue? Please fix the nits @mscdex pointed out first.
I'm a bit concerned about decodeChunk, as it now has many ifs in it, and it is probably a very hot function.

@addaleax
Member Author

Addressed the above comments (both function bodies should be below 600 characters now) & added a benchmark. Output from comparing with master using #7094 shows no significant difference for the existing functionality (no automatic decoding), but feel invited to suggest other/better ways to benchmark:

                                                                                            improvement significant      p.value
 streams/writable-simple.js decodeAs="" inputEncoding="ucs2" inputType="buffer" n=50000         -2.37 %             3.675244e-01
 streams/writable-simple.js decodeAs="" inputEncoding="ucs2" inputType="string" n=50000         -0.45 %             8.177667e-01
 streams/writable-simple.js decodeAs="" inputEncoding="utf8" inputType="buffer" n=50000          1.72 %             4.993509e-01
 streams/writable-simple.js decodeAs="" inputEncoding="utf8" inputType="string" n=50000          0.10 %             9.567027e-01

@sonewman
Contributor

If the stream is in objectMode, does that just override the decodeBuffer option?

@mcollina
Member

Can you please check the "old" net & http benchmarks?
LGTM if there are no issues.

@sonewman
Contributor

The only reason I ask is that, if it does, it looks like we would still create the string decoder anyway, even though it would not be used. Please feel free to correct me if I am wrong.

@mafintosh
Member

I'm 👎 on adding more features to streams for things that can be solved in userland. It's already complicated as it is. You can do this now using a string decoder in a pipeline or in your write method impl.

@mscdex
Contributor

mscdex commented Jun 27, 2016

@mafintosh The same could be said about string decoder in Readable streams ;-)

@mafintosh
Member

@mscdex if anything we should remove that from readable streams / deprecate it as well hehe

'use strict';

const common = require('../common');
const v8 = require('v8');
Contributor

Unused

@addaleax
Member Author

@sonewman Yes, the behaviour for object mode is unaltered (no encoding/decoding in any case). If you feel that it is important, I can add a check for object mode.

@mcollina Running the benchmarks now, this might take a while.

@mafintosh

You can do this now using a string decoder in a pipeline

Not sure what you mean. StringDecoder is not actually a stream, even if it looks like it, so piping through it is not really feasible.

And yes, of course this can be done in the _write implementation, but arguably string decoding/encoding is actually part of the stream’s state, so implementing it in core seems reasonable to me, too.
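For reference, a rough sketch of the _write()-based userland approach being discussed here; the class name and the _processString() helper are made up for illustration, and _final() is the flush hook available in later Node.js versions.

'use strict';
const Writable = require('stream').Writable;
const StringDecoder = require('string_decoder').StringDecoder;

class StringWritable extends Writable {
  constructor(options) {
    super(options);
    this._decoder = new StringDecoder('utf8');
  }

  _write(chunk, encoding, callback) {
    // With the default decodeStrings: true, chunk is always a Buffer here;
    // routing it through the decoder keeps a partial multi-byte character
    // buffered until the rest of it arrives in the next chunk.
    const str = typeof chunk === 'string' ? chunk : this._decoder.write(chunk);
    this._processString(str);
    callback();
  }

  _final(callback) {
    // Flush a trailing partial character, if any.
    const rest = this._decoder.end();
    if (rest) this._processString(rest);
    callback();
  }

  _processString(str) {
    // Stream-specific handling of the decoded text goes here.
  }
}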

@mcollina
Member

@addaleax the general sentiment is that there are already too many features in streams.
I think that strings should not be supported at all in binary streams, as they are something that was introduced when we had no Buffer.

@mafintosh Pragmatically, I don't see us removing any features from streams, and this brings parity between Readable and Writable. I had to do some gymnastics to support utf8 characters in split2: https://github.com/mcollina/split2/blob/master/index.js#L20-L26 mcollina/split2#7.

@addaleax any news on those benchmarks?

@addaleax
Member Author

If the outcome here is a decision based on a general “no new features in streams” rule (and I would totally understand that), could we formalize that by setting the stability index of streams to Locked?

@mcollina Sorry, had to restart the HTTP benchmarks (and running each one 30× in a row takes a lot of time). Full output of the analysis for the net ones is here:
net-benchmarks-1.txt

I’ll look into the net/net-c2s.js (where there actually seems to be an improvement?) and net/tcp-raw-pipe.js ones in more detail.

@mcollina
Member

Please do, +/- 1% is definitely in the range of tolerance.
Maybe we can just wait for #7094 to land, and then rebase this one on top of that.

@addaleax
Member Author

Maybe we can just wait for #7094 to land, and then rebase this one on top of that.

I would expect #7094 to land first anyway. :)

And sorry, I had to re-start the HTTP benchmarks again (didn’t have wrk installed… 🙈)… this time I should have gotten it right.

@addaleax
Member Author

addaleax commented Aug 4, 2016

Btw, here is the benchmark comparison output now that #7094 has landed (raw data). It’s quite the mix of positive and negative changes; I wouldn’t say there’s a noticeable overall impact from this PR.

Again, there are some obviously unrelated but significant results (e.g. in check_is_http_token). /cc @AndreasMadsen for ideas on how to “fix” that?

@mcollina
Member

mcollina commented Aug 4, 2016

@addaleax there are a couple of benchmarks on net/tcp-raw-pipe.js that are -10%. Those need to be on par.

@addaleax
Member Author

addaleax commented Aug 4, 2016

I’ll run those again with more iterations to get a clearer picture there (btw, the full 30× run of all HTTP & net benchmarks took over 1½ days… I’m not doing that again :D)

@AndreasMadsen
Member

AndreasMadsen commented Aug 4, 2016

@addaleax OMG, that is some bad luck. There are 270 benchmarks, so let's just consider *** to be significant, to reduce type 1 errors. That means we can expect 0.27 false positives overall, or 0.031 if just considering the check_is_http_token benchmarks. Getting 4 or more significant results is just ridiculously unlikely (1 - pbinom(3, 31, 0.001) = 3.079267e-08).
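(For reference, the R expression above corresponds to the following plain-JavaScript calculation; the helper functions are just illustrative.)

function choose(n, k) {
  // Binomial coefficient, computed iteratively to avoid large factorials.
  let c = 1;
  for (let i = 0; i < k; i++) c = c * (n - i) / (i + 1);
  return c;
}

function binomialTail(kMin, n, p) {
  // P(X >= kMin) for X ~ Binomial(n, p): the chance of kMin or more
  // false positives among n benchmarks with a per-benchmark rate of p.
  let prob = 0;
  for (let k = kMin; k <= n; k++)
    prob += choose(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
  return prob;
}

console.log(binomialTail(4, 31, 0.001)); // ≈ 3.08e-8, matching 1 - pbinom(3, 31, 0.001)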

This can of course happen, but I think we should assume it isn't just an accident. Looking at the histogram and the kernel density estimate of the distribution, it is clear that the values aren't normally distributed.

[Plots: histogram and kernel density estimate of the benchmark measurements]

The t-test assumes a normal distribution; however, the error is often small when considering 30 or more samples (central limit theorem). I'm not sure what causes this behaviour (maybe external changes). Come to think of it, I'm not sure the measurements are even theoretically normally distributed: each is essentially an inverse mean, which by the central limit theorem should make them inverse-normal distributed rather than normally distributed. (A fix could be to measure in sec/op; I need to try it out.) However, this still doesn't explain the grouped (sum of two normals) distribution, which I think can only be caused by external changes (maybe we should just randomize the order).

(btw, the full 30× run of all HTTP & net benchmarks took over 1½ days… I’m not doing that again :D)

Yes. The benchmarks are "calibrated" for eyeballing. Now that we have the statistical tools, the number of iterations in each benchmark needs to be reduced.

@addaleax
Member Author

addaleax commented Aug 4, 2016

Okay, net/tcp-raw-pipe only:
raw data, compare.js output:

                                                   improvement significant   p.value
 net/tcp-raw-pipe.js dur=5 type="asc" len=102400        0.04 %             0.8971511
 net/tcp-raw-pipe.js dur=5 type="asc" len=16777216      0.29 %             0.1865769
 net/tcp-raw-pipe.js dur=5 type="buf" len=102400        0.34 %             0.2119530
 net/tcp-raw-pipe.js dur=5 type="buf" len=16777216      0.17 %             0.5240107
 net/tcp-raw-pipe.js dur=5 type="utf" len=102400        0.30 %             0.4999322
 net/tcp-raw-pipe.js dur=5 type="utf" len=16777216      0.09 %             0.8339992

About what one would expect if there’s no actual difference.

@AndreasMadsen
Member

AndreasMadsen commented Aug 4, 2016

Took a closer look at some of the benchmarks:

                                                                improvement significant      p.value     sw.old.p     sw.new.p
 http/check_is_http_token.js n=500000000 key="Content-Encoding"      0.88 %         *** 2.229408e-05 6.667391e-05 3.628832e-06
 http/check_is_http_token.js n=500000000 key="content-length"        0.33 %             1.936083e-01 2.188204e-05 2.578660e-04
 http/check_is_http_token.js n=500000000 key="Content-Location"      0.93 %         *** 1.668739e-05 6.907060e-05 5.897051e-07
 http/check_is_http_token.js n=500000000 key="Content-Type"          0.89 %             1.769491e-01 3.938976e-10 2.517831e-04
 http/check_is_http_token.js n=500000000 key="content-type"          0.67 %             2.333239e-01 1.960339e-09 4.177647e-04

[Plot: kernel density estimates for the check_is_http_token benchmarks]

From both the Shapiro-Wilk test (sw) and just eyeballing the kernel density estimates, the data is definitely not normally distributed, and the worst cases are also those that show significance. To investigate external effects, I have plotted the measurements against the run index (roughly, time).

[Plot: measurements plotted against run index (approximate time)]

It is very hard to eyeball, but I would say that something happened around 140-180 (and a few other places) that caused Content-Encoding and Content-Location to get slower.

Just for fun I tried running almost the same compare script as you (with n decreased by a factor of 10):

./node benchmark/compare.js --old ./node-master --new ./node-stream --filter check_is_http_token --set n=50000000 --set key='Content-Encoding' -- http | tee compare.csv

which gives:

                                                               improvement significant p.value  sw.old.p  sw.new.p
 http/check_is_http_token.js n=50000000 key="Content-Encoding"     -0.60 %             0.12222 0.2888716 0.9960188

[Plot: kernel density estimate for the reduced-n comparison run]

This is much better and does not show significance, so my best guess is that there was too much external input.

edit: I did some reading; apparently the reciprocal normal distribution is a bimodal distribution (it has two peaks) and has an undefined mean (and thus also an undefined variance). So it could explain the bad results (it is always hard to say what happens in practice when the statistical assumptions aren't met). I will spend some more time on it :)

edit 2: while it would theoretically be better to measure in sec/obs, it doesn't matter from a statistical perspective when the obs/sec mean is sufficiently high and its variance is sufficiently low. Only a normal distribution that is somewhat centered around 0 (not the case here) becomes bimodal under the reciprocal transformation; otherwise it remains approximately normally distributed.

@rvagg force-pushed the master branch 2 times, most recently from c133999 to 83c7a88 on October 18, 2016 17:02
@jasnell
Member

jasnell commented Jan 6, 2017

Ping @addaleax ... what do you want to do with this one?

@jasnell added the stalled (Issues and PRs that are stalled) label on Jan 6, 2017
Support decoding the input of writable streams to a specific
encoding before passing it to `_write()`.

By default, all data written to a writable stream is encoded
into Buffers. This change enables the reverse situation, i.e.
when it is desired by the stream implementer to process all
input as strings, whether it was passed to `write()` as a string
or not.

This makes sense for multi-byte character encodings where
the buffers that are written using `write()` may contain
partial characters, so calling `chunk.toString()` is not universally
applicable.

Fixes: nodejs#7315
@addaleax
Member Author

addaleax commented Jan 6, 2017

Kinda still think this should happen, kinda tired of waiting for benchmarks to finish. I’ve rebased this and appreciate reviews… I can try to run the benchmarks sometime soon. But if any @nodejs/collaborators want to help (or even take over), please feel free to do that.

@mscdex
Contributor

mscdex commented Jan 6, 2017

@addaleax IMHO it'd be better to just write a stream benchmark instead of relying on the net benchmarks.

@addaleax
Member Author

addaleax commented Jan 6, 2017

@mscdex I had run the stream benchmarks for this, and all that net/http benchmarking stuff was just because it was requested in this thread to make sure. If you think the pre-existing stream benchmarks plus the one included in this PR don’t cover enough ground, it would be cool to get some hints as to how they could be extended.

@mscdex
Contributor

mscdex commented Jan 6, 2017

@addaleax Oh oops, nevermind, I didn't see that you added one already in this PR.

@mscdex
Contributor

mscdex commented Jan 7, 2017

FWIW here's what I get after simplifying the code a bit (see diff below) and increasing the number of runs to 60 and increasing n:

                                                                                          improvement significant      p.value
 streams/writable-simple.js decodeAs="" inputEncoding="ucs2" inputType="buffer" n=3000000     -3.74 %          ** 0.0048280826
 streams/writable-simple.js decodeAs="" inputEncoding="ucs2" inputType="string" n=3000000      1.44 %             0.1239054500
 streams/writable-simple.js decodeAs="" inputEncoding="utf8" inputType="buffer" n=3000000     -4.55 %         *** 0.0008835026
 streams/writable-simple.js decodeAs="" inputEncoding="utf8" inputType="string" n=3000000     -0.15 %             0.7084000024

diff --git a/lib/_stream_writable.js b/lib/_stream_writable.js
index deb6e08..8d786c0 100644
--- a/lib/_stream_writable.js
+++ b/lib/_stream_writable.js
@@ -295,42 +295,40 @@ Writable.prototype.setDefaultEncoding = function setDefaultEncoding(encoding) {
   return this;
 };
 
-function decodeChunk(state, chunk, encoding) {
-  if (state.objectMode)
-    return chunk;
-
-  var sd = state.stringDecoder;
-  if (typeof chunk === 'string') {
-    if (sd !== null && encoding === sd.encoding && sd.lastNeed === 0)
-      return chunk; // No re-encoding encessary.
-
-    if (state.decodeStrings !== false || sd !== null)
-      chunk = Buffer.from(chunk, encoding);
-  }
-
-  if (sd !== null) {
-    // chunk is always a Buffer now.
-    if (state.flushing) {
-      chunk = sd.end(chunk);
-    } else {
-      chunk = sd.write(chunk);
-    }
+function decodeString(state, str, encoding) {
+  if (!state.objectMode &&
+      state.decodeStrings !== false &&
+      typeof str === 'string' &&
+      (!state.stringDecoder ||
+       state.stringDecoder.encoding !== encoding ||
+       state.stringDecoder.lastNeed)) {
+    str = Buffer.from(str, encoding);
+    if (state.stringDecoder)
+      str = decodeBuffer(state, str);
   }
+  return str;
+}
 
-  return chunk;
+function decodeBuffer(state, buf) {
+  if (state.flushing)
+    return state.stringDecoder.end(buf);
+  else
+    return state.stringDecoder.write(buf);
 }
 
 // if we're already writing something, then just put this
 // in the queue, and wait our turn.  Otherwise, call _write
 // If we return false, then we need a drain event, so set that flag.
 function writeOrBuffer(stream, state, isBuf, chunk, encoding, cb) {
-  var sd = state.stringDecoder;
-  if (!isBuf || sd) {
-    chunk = decodeChunk(state, chunk, encoding);
+  if (!isBuf) {
+    chunk = decodeString(state, chunk, encoding);
     if (chunk instanceof Buffer)
       encoding = 'buffer';
-    else if (sd)
-      encoding = sd.encoding;
+    else if (state.stringDecoder)
+      encoding = state.stringDecoder.encoding;
+  } else if (state.stringDecoder && !state.objectMode) {
+    chunk = decodeBuffer(state, chunk);
+    encoding = state.stringDecoder.encoding;
   }
 
   var len = state.objectMode ? 1 : chunk.length;

The reason I made the changes is that they seemed to perform closer to (or better than) current master, at least when running the benchmarks and comparing the results visually.

@jasnell
Member

jasnell commented Aug 24, 2017

Closing due to lack of any visible forward progress. We can reopen if necessary.
