
stream: support decoding buffers for Writables #7425

Closed
wants to merge 2 commits

Conversation

addaleax
Member

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • documentation is changed or added
  • the commit message follows commit guidelines
Affected core subsystem(s)

stream

Description of change

See #7315 for motivation.

Support decoding the input of writable streams to a specific encoding before passing it to _write().

By default, all data written to a writable stream is encoded into Buffers. This change enables the reverse: a stream implementer can choose to receive all input as strings, whether it was passed to write() as a string or not.

This makes sense for multi-byte character encodings where the buffers that are written using write() may contain partial characters, so calling chunk.toString() is not universally applicable.
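For illustration only (this is not code from the PR), here is a minimal sketch of that failure mode: a Writable that naively calls chunk.toString() in _write() garbles a multi-byte character that is split across two write() calls.

'use strict';
const Writable = require('stream').Writable;

const pieces = [];
const w = new Writable({
  write(chunk, encoding, callback) {
    // Naive decoding: breaks if a UTF-8 character straddles a chunk boundary.
    pieces.push(chunk.toString('utf8'));
    callback();
  }
});

const euro = Buffer.from('€', 'utf8');      // 3 bytes: e2 82 ac
w.write(euro.slice(0, 2));                  // first write ends mid-character
w.write(euro.slice(2));
w.end(() => console.log(pieces.join('')));  // prints replacement characters, not '€'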

@addaleax added the stream (Issues and PRs related to the stream subsystem) and semver-minor (PRs that contain new features and should be released in the next minor version) labels on Jun 26, 2016
@addaleax
Member Author

CI: https://ci.nodejs.org/job/node-test-commit/3854/

/cc @mscdex @nodejs/streams

@@ -11,6 +11,7 @@ const util = require('util');
const internalUtil = require('internal/util');
const Stream = require('stream');
const Buffer = require('buffer').Buffer;
const StringDecoder = require('string_decoder').StringDecoder;
Contributor

You might lazy-load this as is done in Readable?

@mcollina
Member

mcollina commented Jun 26, 2016

Can you please check if this causes any performance issue? Please fix the nits @mscdex pointed out first.
I'm a bit concerned about decodeChunk, as it now has many ifs in it, and it is probably a very hot function.

@addaleax
Member Author

Addressed the above comments (both function bodies should be below 600 characters now) & added a benchmark. Output from comparing with master using #7094 shows no significant difference for the existing functionality (no automatic decoding), but feel invited to suggest other/better ways to benchmark:

                                                                                            improvement significant      p.value
 streams/writable-simple.js decodeAs="" inputEncoding="ucs2" inputType="buffer" n=50000         -2.37 %             3.675244e-01
 streams/writable-simple.js decodeAs="" inputEncoding="ucs2" inputType="string" n=50000         -0.45 %             8.177667e-01
 streams/writable-simple.js decodeAs="" inputEncoding="utf8" inputType="buffer" n=50000          1.72 %             4.993509e-01
 streams/writable-simple.js decodeAs="" inputEncoding="utf8" inputType="string" n=50000          0.10 %             9.567027e-01

@sonewman
Contributor

If the stream is in objectMode, does that just override the decodeBuffer option?

@mcollina
Member

Can you please check the "old" net & http benchmarks?
LGTM if there are no issues.

@sonewman
Contributor

The only reason I ask is that, if it does, it looks like we would still create the string decoder anyway, even though it would not be used. Please feel free to correct me if I am wrong.

@mafintosh
Member

I'm 👎 on adding more features to streams for things that can be solved in userland. It's already complicated as it is. You can do this now using a string decoder in a pipeline or in your write method impl.

@mscdex
Contributor

mscdex commented Jun 27, 2016

@mafintosh The same could be said about string decoder in Readable streams ;-)

@mafintosh
Member

@mscdex if anything we should remove that from readable streams / deprecate it as well hehe

'use strict';

const common = require('../common');
const v8 = require('v8');
Contributor

Unused

@addaleax
Member Author

@sonewman Yes, the behaviour for object mode is unaltered (no encoding/decoding in any case). If you feel that it is important, I can add a check for object mode.

@mcollina Running the benchmarks now, this might take a while.

@mafintosh

You can do this now using a string decoder in a pipeline

Not sure what you mean. StringDecoder is not actually a stream, even if it looks like it, so piping through it is not really feasible.

And yes, of course this can be done in the _write implementation, but arguably string decoding/encoding is actually part of the stream’s state, so implementing it in core seems reasonable to me, too.
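For reference, a rough sketch of the _write()-based userland approach being discussed here; the class name and the _processString() helper are made up for illustration, and _final() is the flush hook available in later Node.js versions.

'use strict';
const Writable = require('stream').Writable;
const StringDecoder = require('string_decoder').StringDecoder;

class StringWritable extends Writable {
  constructor(options) {
    super(options);
    this._decoder = new StringDecoder('utf8');
  }

  _write(chunk, encoding, callback) {
    // With the default decodeStrings: true, chunk is always a Buffer here;
    // routing it through the decoder keeps a partial multi-byte character
    // buffered until the rest of it arrives in the next chunk.
    const str = typeof chunk === 'string' ? chunk : this._decoder.write(chunk);
    this._processString(str);
    callback();
  }

  _final(callback) {
    // Flush a trailing partial character, if any.
    const rest = this._decoder.end();
    if (rest) this._processString(rest);
    callback();
  }

  _processString(str) {
    // Stream-specific handling of the decoded text goes here.
  }
}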

@mcollina
Member

@addaleax the general sentiment is that there are already too many features in streams.
I think that strings should not be supported at all in binary streams, as they are something that was introduced when we had no Buffer.

@mafintosh Pragmatically, I don't see us removing any features from streams, and this brings parity between Readable and Writable. I had to do some gymnastics to support utf8 characters in split2: https://github.com/mcollina/split2/blob/master/index.js#L20-L26 mcollina/split2#7.

@addaleax any news on those benchmarks?

@addaleax
Member Author

If the outcome here is a decision based on a general “no new features in streams” rule (and I would totally understand that), could we formalize that by setting the stability index of streams to Locked?

@mcollina Sorry, had to restart the HTTP benchmarks (and running each one 30× in a row takes a lot of time). Full output of the analysis for the net ones is here:
net-benchmarks-1.txt

I’ll look into the net/net-c2s.js (where there actually seems to be an improvement?) and net/tcp-raw-pipe.js ones in more detail.

@mcollina
Member

Please do, +/- 1% is definitely in the range of tolerance.
Maybe we can just wait for #7094 to land, and then rebase this one on top of that.

@addaleax
Member Author

Maybe we can just wait for #7094 to land, and then rebase this one on top of that.

I would expect #7094 to land first anyway. :)

And sorry, I had to re-start the HTTP benchmarks again (didn’t have wrk installed… 🙈)… this time I should have gotten it right.

@addaleax
Member Author

addaleax commented Aug 4, 2016

Btw, here is the benchmark comparison output now that #7094 has landed (raw data). It’s quite the mix of positive and negative changes; I wouldn’t say there’s a noticeable overall impact from this PR.

Again, there are some obviously unrelated but significant results (e.g. in check_is_http_token). /cc @AndreasMadsen for ideas on how to “fix” that?

@mcollina
Member

mcollina commented Aug 4, 2016

@addaleax there are a couple of benchmarks on net/tcp-raw-pipe.js that are -10%. Those need to be on par.

@addaleax
Member Author

addaleax commented Aug 4, 2016

I’ll run those again with more iterations to get a clearer picture there (btw, the full 30× run of all HTTP & net benchmarks took over 1½ days… I’m not doing that again :D)

@AndreasMadsen
Member

AndreasMadsen commented Aug 4, 2016

@addaleax OMG, that is some bad luck. There are 270 benchmarks, so let's just consider *** to be significant, to reduce type 1 errors. That means we can expect 0.27 false positives overall, or 0.031 if just considering the check_is_http_token benchmarks. Getting 4 or more significant results is just ridiculously unlikely (1 - pbinom(3, 31, 0.001) = 3.079267e-08).
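(For reference, the R expression above corresponds to the following plain-JavaScript calculation; the helper functions are just illustrative.)

function choose(n, k) {
  // Binomial coefficient, computed iteratively to avoid large factorials.
  let c = 1;
  for (let i = 0; i < k; i++) c = c * (n - i) / (i + 1);
  return c;
}

function binomialTail(kMin, n, p) {
  // P(X >= kMin) for X ~ Binomial(n, p): the chance of kMin or more
  // false positives among n benchmarks with a per-benchmark rate of p.
  let prob = 0;
  for (let k = kMin; k <= n; k++)
    prob += choose(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
  return prob;
}

console.log(binomialTail(4, 31, 0.001)); // ≈ 3.08e-8, matching 1 - pbinom(3, 31, 0.001)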

This can of course happen, but I think we should assume it isn't just an accident. Looking at the histogram and the kernel density estimate of the distribution, it is clear that the values aren't normally distributed.

[Plots: histogram and kernel density estimate of the benchmark measurements]

The t-test assumes a normal distribution; however, the error is often small when considering 30 or more samples (central limit theorem). I'm not sure what causes this behaviour (maybe external changes). Come to think of it, I'm not sure the measurements are even theoretically normally distributed: each is essentially an inverse mean, which by the central limit theorem should make them inverse-normal distributed rather than normally distributed. (A fix could be to measure in sec/op; I need to try it out.) However, this still doesn't explain the grouped (sum of two normals) distribution, which I think can only be caused by external changes (maybe we should just randomize the order).

(btw, the full 30× run of all HTTP & net benchmarks took over 1½ days… I’m not doing that again :D)

Yes. The benchmarks are "calibrated" for eyeballing. Now that we have the statistical tools, the number of iterations in each benchmark needs to be reduced.

@addaleax
Member Author

addaleax commented Aug 4, 2016

Okay, net/tcp-raw-pipe only:
raw data, compare.js output:

                                                   improvement significant   p.value
 net/tcp-raw-pipe.js dur=5 type="asc" len=102400        0.04 %             0.8971511
 net/tcp-raw-pipe.js dur=5 type="asc" len=16777216      0.29 %             0.1865769
 net/tcp-raw-pipe.js dur=5 type="buf" len=102400        0.34 %             0.2119530
 net/tcp-raw-pipe.js dur=5 type="buf" len=16777216      0.17 %             0.5240107
 net/tcp-raw-pipe.js dur=5 type="utf" len=102400        0.30 %             0.4999322
 net/tcp-raw-pipe.js dur=5 type="utf" len=16777216      0.09 %             0.8339992

About what one would expect if there’s no actual difference.

@AndreasMadsen
Member

AndreasMadsen commented Aug 4, 2016

Took a closer look at some of the benchmarks:

                                                                improvement significant      p.value     sw.old.p     sw.new.p
 http/check_is_http_token.js n=500000000 key="Content-Encoding"      0.88 %         *** 2.229408e-05 6.667391e-05 3.628832e-06
 http/check_is_http_token.js n=500000000 key="content-length"        0.33 %             1.936083e-01 2.188204e-05 2.578660e-04
 http/check_is_http_token.js n=500000000 key="Content-Location"      0.93 %         *** 1.668739e-05 6.907060e-05 5.897051e-07
 http/check_is_http_token.js n=500000000 key="Content-Type"          0.89 %             1.769491e-01 3.938976e-10 2.517831e-04
 http/check_is_http_token.js n=500000000 key="content-type"          0.67 %             2.333239e-01 1.960339e-09 4.177647e-04

[Plot: kernel density estimates for the check_is_http_token benchmarks]

From both the Shapiro-Wilk test (sw) and just eyeballing the kernel density estimates, the data is definitely not normally distributed, and the worst cases are also those that show significance. To investigate external effects, I have plotted the measurements against the run index (roughly, time).

[Plot: measurements plotted against run index (approximate time)]

It is very hard to eyeball, but I would say that something happened around 140-180 (and a few other places) that caused Content-Encoding and Content-Location to get slower.

Just for fun I tried running almost the same compare script as you (with n decreased by a factor of 10):

./node benchmark/compare.js --old ./node-master --new ./node-stream --filter check_is_http_token --set n=50000000 --set key='Content-Encoding' -- http | tee compare.csv

which gives:

                                                               improvement significant p.value  sw.old.p  sw.new.p
 http/check_is_http_token.js n=50000000 key="Content-Encoding"     -0.60 %             0.12222 0.2888716 0.9960188

[Plot: kernel density estimate for the reduced-n comparison run]

This is much better and does not show significance, so my best guess is that there was too much external input.

edit: I did some reading; apparently the reciprocal normal distribution is a bimodal distribution (it has two peaks) and has an undefined mean (and thus also an undefined variance). So it could explain the bad results (it is always hard to say what happens in practice when the statistical assumptions aren't met). I will spend some more time on it :)

edit 2: while it would theoretically be better to measure in sec/obs, it doesn't matter from a statistical perspective when the obs/sec mean is sufficiently high and its variance is sufficiently low. Only a normal distribution that is somewhat centered around 0 (not the case here) becomes bimodal under the reciprocal transformation; otherwise it remains approximately normally distributed.

@rvagg force-pushed the master branch 2 times, most recently from c133999 to 83c7a88 on October 18, 2016 17:02
@jasnell
Member

jasnell commented Jan 6, 2017

Ping @addaleax ... what do you want to do with this one?

@jasnell added the stalled (Issues and PRs that are stalled) label on Jan 6, 2017
Support decoding the input of writable streams to a specific
encoding before passing it to `_write()`.

By default, all data written to a writable stream is encoded
into Buffers. This change enables the reverse situation, i.e.
when it is desired by the stream implementer to process all
input as strings, whether it was passed to `write()` as a string
or not.

This makes sense for multi-byte character encodings where
the buffers that are written using `write()` may contain
partial characters, so calling `chunk.toString()` is not universally
applicable.

Fixes: nodejs#7315
@addaleax
Member Author

addaleax commented Jan 6, 2017

Kinda still think this should happen, kinda tired of waiting for benchmarks to finish. I’ve rebased this and appreciate reviews… I can try to run the benchmarks sometime soon. But if any @nodejs/collaborators want to help (or even take over), please feel free to do that.

@mscdex
Contributor

mscdex commented Jan 6, 2017

@addaleax IMHO it'd be better to just write a stream benchmark instead of relying on the net benchmarks.

@addaleax
Member Author

addaleax commented Jan 6, 2017

@mscdex I had run the stream benchmarks for this, and all that net/http benchmarking stuff was just because it was requested in this thread to make sure. If you think the pre-existing stream benchmarks plus the one included in this PR don’t cover enough ground, it would be cool to get some hints as to how they could be extended.

@mscdex
Contributor

mscdex commented Jan 6, 2017

@addaleax Oh oops, nevermind, I didn't see that you added one already in this PR.

@mscdex
Contributor

mscdex commented Jan 7, 2017

FWIW here's what I get after simplifying the code a bit (see diff below) and increasing the number of runs to 60 and increasing n:

                                                                                          improvement significant      p.value
 streams/writable-simple.js decodeAs="" inputEncoding="ucs2" inputType="buffer" n=3000000     -3.74 %          ** 0.0048280826
 streams/writable-simple.js decodeAs="" inputEncoding="ucs2" inputType="string" n=3000000      1.44 %             0.1239054500
 streams/writable-simple.js decodeAs="" inputEncoding="utf8" inputType="buffer" n=3000000     -4.55 %         *** 0.0008835026
 streams/writable-simple.js decodeAs="" inputEncoding="utf8" inputType="string" n=3000000     -0.15 %             0.7084000024

diff --git a/lib/_stream_writable.js b/lib/_stream_writable.js
index deb6e08..8d786c0 100644
--- a/lib/_stream_writable.js
+++ b/lib/_stream_writable.js
@@ -295,42 +295,40 @@ Writable.prototype.setDefaultEncoding = function setDefaultEncoding(encoding) {
   return this;
 };
 
-function decodeChunk(state, chunk, encoding) {
-  if (state.objectMode)
-    return chunk;
-
-  var sd = state.stringDecoder;
-  if (typeof chunk === 'string') {
-    if (sd !== null && encoding === sd.encoding && sd.lastNeed === 0)
-      return chunk; // No re-encoding encessary.
-
-    if (state.decodeStrings !== false || sd !== null)
-      chunk = Buffer.from(chunk, encoding);
-  }
-
-  if (sd !== null) {
-    // chunk is always a Buffer now.
-    if (state.flushing) {
-      chunk = sd.end(chunk);
-    } else {
-      chunk = sd.write(chunk);
-    }
+function decodeString(state, str, encoding) {
+  if (!state.objectMode &&
+      state.decodeStrings !== false &&
+      typeof str === 'string' &&
+      (!state.stringDecoder ||
+       state.stringDecoder.encoding !== encoding ||
+       state.stringDecoder.lastNeed)) {
+    str = Buffer.from(str, encoding);
+    if (state.stringDecoder)
+      str = decodeBuffer(state, str);
   }
+  return str;
+}
 
-  return chunk;
+function decodeBuffer(state, buf) {
+  if (state.flushing)
+    return state.stringDecoder.end(buf);
+  else
+    return state.stringDecoder.write(buf);
 }
 
 // if we're already writing something, then just put this
 // in the queue, and wait our turn.  Otherwise, call _write
 // If we return false, then we need a drain event, so set that flag.
 function writeOrBuffer(stream, state, isBuf, chunk, encoding, cb) {
-  var sd = state.stringDecoder;
-  if (!isBuf || sd) {
-    chunk = decodeChunk(state, chunk, encoding);
+  if (!isBuf) {
+    chunk = decodeString(state, chunk, encoding);
     if (chunk instanceof Buffer)
       encoding = 'buffer';
-    else if (sd)
-      encoding = sd.encoding;
+    else if (state.stringDecoder)
+      encoding = state.stringDecoder.encoding;
+  } else if (state.stringDecoder && !state.objectMode) {
+    chunk = decodeBuffer(state, chunk);
+    encoding = state.stringDecoder.encoding;
   }
 
   var len = state.objectMode ? 1 : chunk.length;

The reason I made the changes is that they seemed to perform closer to (or better than) current master, at least when running the benchmarks and comparing the results visually.

@jasnell
Member

jasnell commented Aug 24, 2017

Closing due to lack of any visible forward progress. We can reopen if necessary.
