http: interpret strings as utf8 #2629

vkurchatkin · 2015-08-31T18:09:28Z

At some point http began to interpret all strings as binary. This causes various incompatibilities in both incoming and outgoing messages. Incoming messages seem more problematic, since both URL and header values can be corrupted and there is no good way to tell whether they have been.

It's probably right that clients should properly encode urls and headers, but some don't and this used to work, so a lot of code is broken by this change.
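
To illustrate the corruption described above, here is a minimal sketch of what happens when UTF-8 bytes sent by a client are decoded one byte per character, and how the text can sometimes be recovered afterwards. The helper name `recoverUtf8` and the sample path are assumptions made for illustration; they are not part of http.

```javascript
'use strict';

// Bytes a client might send for the path "/привет" without percent-encoding.
const wire = Buffer.from('/привет', 'utf8');

// What a parser produces when it treats every byte as one character.
const asBinary = wire.toString('latin1');

// Hypothetical helper: reinterpret a one-byte string as UTF-8.
// Only correct if the sender really used UTF-8, which cannot be known for sure.
function recoverUtf8(oneByteString) {
  return Buffer.from(oneByteString, 'latin1').toString('utf8');
}

console.log(asBinary.length);       // 13: one character per byte on the wire
console.log(recoverUtf8(asBinary)); // "/привет"
```

The round trip works because latin-1 decoding is lossless for bytes, but application code has no reliable signal telling it when the reinterpretation is appropriate.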

R=@bnoordhuis

Fishrock123 · 2015-08-31T18:27:25Z

test/parallel/test-http-utf-incoming.js

+
+'use strict';
+var common = require('../common');
+var assert = require('assert');


Please use const for these. :)

trevnorris · 2015-09-02T05:26:51Z

At some point http began to interpret all strings as binary.

They have always been decoded as binary, and before that as ascii. This is the correct way to decode HTTP headers according to the standard.

-1 on the change.

trevnorris · 2015-09-02T05:36:29Z

@vkurchatkin I see your example in the linked issue. Am investigating something. I pointed out the change to use the one byte API and suggested encoding be changed to 'binary' instead of 'ascii'. So the results conflict with my known timeline. Will come back w/ the results.

trevnorris · 2015-09-02T06:44:51Z

The header parsing change came from f674b09, where String::New() was swapped for OneByteString(). For reference, this was released in v0.11.6.

The commit I was referring to is 1f9f863. This logic path works a little differently. Take this example code:

require('http').createServer(function(req, res) {
  res.setHeader('foo', '\u0222');
  res.end('hi all');
}).listen(3000);

Here's the response from curl -v http://localhost:3000/:

< HTTP/1.1 200 OK
< foo: Ȣ
< Date: Wed, 02 Sep 2015 06:30:23 GMT
< Connection: keep-alive
< Transfer-Encoding: chunked
<
hi all

Another example script with slight variation demonstrating how to truncate the header:

require('http').createServer(function(req, res) {
  res.setHeader('foo', '\u0222');
  res.end(new Buffer('hi all\n', 'binary'));
}).listen(3000);

Output from curl again:

< HTTP/1.1 200 OK
< foo: "                     <-- This header has been truncated
< Date: Wed, 02 Sep 2015 06:32:14 GMT
< Connection: keep-alive
< Transfer-Encoding: chunked
<
hi all

So this was the logic path I was referring to. My apologies.
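
The truncated header in the second curl output can be reproduced without a server. This is a small sketch, assuming a Node.js version that provides `Buffer.from` (the replacement for `new Buffer`); it shows that the 'binary' (latin-1) encoding keeps only the low byte of `'\u0222'`.

```javascript
'use strict';

const value = '\u0222'; // "Ȣ", the header value used in the examples above

// UTF-8 keeps the character as two bytes.
const utf8Bytes = Buffer.from(value, 'utf8');

// 'latin1' (alias 'binary') keeps only the low byte of each code unit.
const latin1Bytes = Buffer.from(value, 'latin1');

console.log(utf8Bytes);                      // <Buffer c8 a2>
console.log(latin1Bytes);                    // <Buffer 22>
console.log(latin1Bytes.toString('latin1')); // the double quote character
```

Byte 0x22 is the ASCII double quote, which matches what curl printed for the `foo` header.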

So I rescind my -1 and am now on the fence. Will browsers properly parse headers if sent in utf8?

Also, this is a performance hit. Is there a way we can still tell the server to interpret incoming data as 'binary' instead? This would help for the cases when we know utf8 decoding isn't necessary.

On the side, there's an interesting little bug if you use the flushHeaders() API:

require('http').createServer(function(req, res) {
  res.setHeader('foo', '\u0222');
  res.flushHeaders();
  res.end(new Buffer('hi all\n', 'binary'));
}).listen(3000);

The results are the same as if a string had been passed. This is because it only passes in the 'binary' encoding automatically if the data is a Buffer (or if the string encoding is 'hex'/'base64'). But if the encoding is still undefined when it's time for the headers to be written then it falls back to state.defaultEncoding. Which is utf8.
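
The selection logic described above can be modelled roughly as follows. This is a simplified sketch of the behaviour as described in this comment, not the actual http code; `pickHeaderEncoding` and its parameters are hypothetical names, and the handling of an explicit string encoding is an assumption.

```javascript
'use strict';

// Hypothetical model of which encoding the header block is written with,
// given the first body chunk passed to write()/end().
function pickHeaderEncoding(chunk, encoding, defaultEncoding = 'utf8') {
  // Buffers, and strings in 'hex'/'base64', make the headers go out as 'binary'.
  if (Buffer.isBuffer(chunk)) return 'binary';
  if (encoding === 'hex' || encoding === 'base64') return 'binary';
  // Assumption: an explicit string encoding is used as given. With no
  // encoding at all (the flushHeaders() case), the stream default applies.
  return encoding || defaultEncoding;
}

console.log(pickHeaderEncoding(Buffer.from('hi all\n'))); // "binary"
console.log(pickHeaderEncoding(undefined, undefined));    // "utf8"
```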

vkurchatkin · 2015-09-02T08:58:24Z

@trevnorris you are talking about writing utf-8 (which is also a problem), this PR is about parsing only.

Also, this is a performance hit.

I'm not sure there is a way around it. Maybe we can treat headers that are supposed to have only latin-1 values

trevnorris · 2015-09-02T09:07:52Z

@vkurchatkin

you are talking about writing utf-8 (which is also a problem), this PR is about parsing only.

Yes. I was correcting my previous erroneous comment about it having always been binary decoded.

Maybe we can treat headers that are supposed to have only latin-1 values

Not clear what you're getting at. All headers are supposed to be only ASCII (we switched to latin1 since it's faster). So I'm not sure how we'd determine this automatically. Hence my request of being able to set the header encoding if this change is made. So I can still parse my headers without the overhead. I think I have a way to make this conditional practically a noop. Though it might also be better left for another PR.

vkurchatkin · 2015-09-02T09:17:19Z

We can have a list of headers that are interpreted as ASCII because it's unlikely that someone sends UTF-8 in them. Custom headers or headers containing urls (like Referer) would still be interpreted as UTF-8.

Hence my request of being able to set the header encoding if this change is made.

We can have a strict option that defaults to false.
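
A rough sketch of how the two ideas could combine: a fixed list of headers that are always decoded one byte per character, plus a `strict` flag that forces that decoding everywhere. Everything here is hypothetical, including the `decodeHeaderValue` name, the contents of the list, and the option shape; no such API exists in http.

```javascript
'use strict';

// Hypothetical list of headers assumed never to carry UTF-8.
const ONE_BYTE_HEADERS = new Set([
  'content-length',
  'connection',
  'transfer-encoding',
]);

// Hypothetical helper, not a Node.js API.
// raw: Buffer holding the header value bytes as received.
// strict: when true, always decode as latin-1 (the current behaviour).
function decodeHeaderValue(name, raw, { strict = false } = {}) {
  if (strict || ONE_BYTE_HEADERS.has(name.toLowerCase())) {
    return raw.toString('latin1'); // cheap: one byte per character
  }
  return raw.toString('utf8'); // custom headers, Referer, and so on
}

const referer = Buffer.from('http://example.com/привет', 'utf8');
console.log(decodeHeaderValue('Referer', referer));
console.log(decodeHeaderValue('Referer', referer, { strict: true }));
```

With `strict: true` the second call returns the same mis-decoded string that the one-byte path produces today, which is the trade-off under discussion.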

ronkorving · 2015-09-02T23:59:10Z

src/node_http_parser.cc

@@ -136,7 +136,10 @@ struct StringPtr {

  Local<String> ToString(Environment* env) const {
    if (str_)
-      return OneByteString(env->isolate(), str_, size_);
+      return String::NewFromUtf8(env->isolate(),


Besides the question "is this more correct?", I am curious about the performance impact this may have.

The results would be biased by our current benchmarks. We need to set some up that use various header lengths.

I'm curious too though, aren't HTTP headers supposed to be ASCII and not unicode?

@bnoordhuis gave an explanation here: #1693 (comment). Basically that most headers use US-ASCII, but traditionally ISO-8859-1 (i.e. latin-1) is allowed. latin-1 is the encoding returned by OneByteString(). So is what we currently use. Note that ASCII is a subset of latin-1.

While our current implementation is definitely faster, there's a problem with developers/companies moving from v0.10 to v4 LTS being hit by the change in functionality. Any solution outside of what we currently use will be a performance hit. How much will depend on the chosen solution.

So going with what Ben said, why the change to String::NewFromUtf8? Does v0.10 read UTF-8 strings here?

@ronkorving Yes. v0.10 reads in headers as utf-8. Which is incorrect per the specification.

Thing is, if everyone followed US-ASCII then functionally we wouldn't notice a difference. It's not until someone uses the full latin-1 encoding (which is spec compliant) that they'd experience issues.
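
A short sketch of that point: pure US-ASCII bytes decode to the same string under either interpretation, and the results only diverge once a byte above 0x7F appears. The sample values are made up for illustration.

```javascript
'use strict';

// Pure US-ASCII bytes decode identically either way.
const ascii = Buffer.from('text/html; charset=utf-8', 'latin1');
console.log(ascii.toString('latin1') === ascii.toString('utf8')); // true

// "café" in latin-1: the final "é" is the single byte 0xE9.
const latin1 = Buffer.from([0x63, 0x61, 0x66, 0xe9]);
console.log(latin1.toString('latin1')); // "café"

// Decoded as UTF-8, the lone 0xE9 is an incomplete sequence, so the
// result is not "café" and contains the replacement character U+FFFD.
console.log(latin1.toString('utf8'));
```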

Got it, thanks.

Fishrock123 · 2015-09-14T13:26:10Z

Deferring to @trevnorris and @jasnell here. We discussed this at length at nodeconf eu, and it appears going back to using utf8 is a bad thing in regards to web standards and security or something. -1

jasnell · 2015-09-14T16:52:13Z

Header field values are very strictly defined for a reason (see https://tools.ietf.org/html/rfc7230#section-3.2.6). Allowing any value other than what is spec'd, especially UTF-8 or arbitrary binary, is dangerous. -1 on this change.
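
As a rough illustration of how narrow that definition is, the sketch below checks a header value against the character ranges RFC 7230 permits in a field value: VCHAR (0x21-0x7E), obs-text (0x80-0xFF), plus space and horizontal tab. The helper name `isValidFieldValue` is hypothetical, and this is a simplified character check on a latin-1 decoded string, not a full grammar parser.

```javascript
'use strict';

// Matches any character outside HTAB, SP, VCHAR (%x21-7E) and
// obs-text (%x80-FF). Control characters such as CR and LF, and
// anything above U+00FF, are rejected.
const INVALID_FIELD_VALUE_CHAR = /[^\t\x20-\x7e\x80-\xff]/;

// Hypothetical helper; expects a one-byte (latin-1 decoded) string.
function isValidFieldValue(value) {
  return !INVALID_FIELD_VALUE_CHAR.test(value);
}

console.log(isValidFieldValue('text/html'));    // true
console.log(isValidFieldValue('\u0222'));       // false
console.log(isValidFieldValue('a\r\nX-Evil: 1')); // false
```

The last call shows the security angle: CR and LF inside a value would let a caller inject additional header lines.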

vkurchatkin · 2015-09-16T08:56:08Z

It looks like we have consensus here. Just wanted to point out that this can cause hard-to-debug breakages after updating from 0.10.

Flimm · 2017-05-30T13:26:36Z

I've created a related issue against the current master of Node: #13296

http: interpret strings as utf8

6e3df44

Fishrock123 added the http Issues or PRs related to the http subsystem. label Aug 31, 2015

Fishrock123 mentioned this pull request Aug 31, 2015

incorrect URL parsing #2114

Closed

Fishrock123 reviewed Aug 31, 2015

ronkorving reviewed Sep 2, 2015

rvagg force-pushed the master branch from 11c25c2 to ba02bd0 Compare September 6, 2015 11:55

vkurchatkin closed this Sep 16, 2015

Fishrock123 mentioned this pull request Sep 16, 2015

Issue with cyrillic symbols in location headers #1693

Closed

Mickael-van-der-Beek mentioned this pull request Nov 29, 2017

Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Closed
