Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: interpret strings as utf8 #2629

Closed
wants to merge 1 commit into from

Conversation

vkurchatkin
Copy link
Contributor

See #1693

At some point http began to interpret all strings as binary. This causes various incompatibilities in both incoming and outgoing messages. Incoming message seem more problematic, since both url and header values could be corrupted and there is no good way to tell whether they are.

It's probably right that clients should properly encode urls and headers, but some don't and this used to work, so a lot of code is broken by this change.

R=@bnoordhuis

@Fishrock123 Fishrock123 added the http Issues or PRs related to the http subsystem. label Aug 31, 2015

'use strict';
var common = require('../common');
var assert = require('assert');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use const for these. :)

@trevnorris
Copy link
Contributor

At some point http began to interpret all strings as binary.

They have always been decoded as binary, and before that ascii. This is the correct way, following the standard to decode HTTP headers.

-1 on the change.

@trevnorris
Copy link
Contributor

@vkurchatkin I see your example in the linked issue. Am investigating something. I pointed out the change to use the one byte API and suggested encoding be changed to 'binary' instead of 'ascii'. So the results conflict with my known timeline. Will come back w/ the results.

@trevnorris
Copy link
Contributor

The header parsing change came from f674b09. Where String::New() was swapped to use OneByteString(). For reference, this was released in v0.11.6.

The commit I was referring to is 1f9f863. This logic path works a little differently. Take this example code:

require('http').createServer(function(req, res) {
  res.setHeader('foo', '\u0222');
  res.end('hi all');
}).listen(3000);

Here's the response from curl -v http://localhost:3000/:

< HTTP/1.1 200 OK
< foo: Ȣ
< Date: Wed, 02 Sep 2015 06:30:23 GMT
< Connection: keep-alive
< Transfer-Encoding: chunked
<
hi all

Another example script with slight variation demonstrating how to truncate the header:

require('http').createServer(function(req, res) {
  res.setHeader('foo', '\u0222');
  res.end(new Buffer('hi all\n', 'binary'));
}).listen(3000);

Output from curl again:

< HTTP/1.1 200 OK
< foo: "                     <-- This header has been truncated
< Date: Wed, 02 Sep 2015 06:32:14 GMT
< Connection: keep-alive
< Transfer-Encoding: chunked
<
hi all

So this was the logic path I was referring to. My apologies.

So I rescind my -1 and am now on the fence. Will browsers properly parse headers if sent in utf8?

Also, this is a performance hit. Is there a way we can still tell the server to interpret incoming data as 'binary' instead? This would help for the cases when we know utf8 decoding isn't necessary.


On the side, there's an interesting little bug if you use the flushHeaders() API:

require('http').createServer(function(req, res) {
  res.setHeader('foo', '\u0222');
  res.flushHeaders();
  res.end(new Buffer('hi all\n', 'binary'));
}).listen(3000);

The results are the same as if a string had been passed. This is because it only passes in the 'binary' encoding automatically if the data is a Buffer (or if the string encoding is 'hex'/'base64'). But if the encoding is still undefined when it's time for the headers to be written then it falls back to state.defaultEncoding. Which is utf8.

@vkurchatkin
Copy link
Contributor Author

@trevnorris you are talking about writing utf-8 (which is also a problem), this PR is about parsing only.

Also, this is a performance hit.

I'm not sure there is a way around it. Maybe we can treat headers that are supposed to have only latin-1 values

@trevnorris
Copy link
Contributor

@vkurchatkin

you are talking about writing utf-8 (which is also a problem), this PR is about parsing only.

Yes. I was correcting my previous erroneous comment about it having always been binary decoded.

Maybe we can treat headers that are supposed to have only latin-1 values

Not clear what you're getting at. All headers are supposed to be only ASCII (we switched to latin1 since it's faster). So I'm not sure how we'd determine this automatically. Hence my request of being able to set the header encoding if this change is made. So I can still parse my headers without the overhead. I think I have a way to make this conditional practically a noop. Though it might also be better left for another PR.

@vkurchatkin
Copy link
Contributor Author

We can have a list of headers that are interpreted as ASCII because it's unlikely that someone sends UTF-8 in them. Custom headers or headers containing urls (like Referer) would still be interpreted as UTF-8.

Hence my request of being able to set the header encoding if this change is made.

We can have a strict option that is default to false

@@ -136,7 +136,10 @@ struct StringPtr {

Local<String> ToString(Environment* env) const {
if (str_)
return OneByteString(env->isolate(), str_, size_);
return String::NewFromUtf8(env->isolate(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the question "is this more correct?", I am curious about the performance impact this may have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks would be bias in out current benchmarks. We need to set some up that have various header lengths.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious too though, aren't HTTP headers supposed to be ASCII and not unicode?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnoordhuis gave an explanation here: #1693 (comment). Basically that most headers use US-ASCII, but traditionally ISO-8859-1 (i.e. latin-1) is allowed. latin-1 is the encoding returned by OneByteString(). So is what we currently use. Note that ASCII is a subset of latin-1.

While our current implementation is definitely faster, there's a problem with developers/companies moving from v0.10 to v4 LTS being hit by the change in functionality. Any solution outside of what we currently use will be a performance hit. How much will depend on the chosen solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So going with what Ben said, why the change to String::NewFromUtf8? Does v0.10 read UTF-8 strings here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ronkorving Yes. v0.10 reads in headers as utf-8. Which is incorrect per the specification.

Thing is, if everyone followed US-ASCII then functionally we wouldn't notice a difference. It's not until someone uses the full latin-1 encoding (which is spec compliant) that they'd experience issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks.

@Fishrock123
Copy link
Contributor

Deferring to @trevnorris and @jasnell here. We discussed this at length at nodeconf eu, and it appears going back to using utf8 is a bad thing in regards to web standards and security or something. -1

@jasnell
Copy link
Member

jasnell commented Sep 14, 2015

Header field values are very strictly defined for a reason (See https://tools.ietf.org/html/rfc7230#section-3.2.6). Allowing any value other than that spec'd, especially UTF-8 or arbitrary binary is dangerous. -1 on this change.

@vkurchatkin
Copy link
Contributor Author

It looks like we have consensus here. Just wanted to point that this can cause hard to debug breakages after update from 0.10.

@Flimm
Copy link
Contributor

Flimm commented May 30, 2017

I've created a related issue with the current master of Node #13296

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants