-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
http: interpret strings as utf8 #2629
Conversation
|
||
'use strict'; | ||
var common = require('../common'); | ||
var assert = require('assert'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use const
for these. :)
They have always been decoded as binary, and before that ascii. This is the correct way, following the standard to decode HTTP headers. -1 on the change. |
@vkurchatkin I see your example in the linked issue. Am investigating something. I pointed out the change to use the one byte API and suggested encoding be changed to |
The header parsing change came from f674b09. Where The commit I was referring to is 1f9f863. This logic path works a little differently. Take this example code: require('http').createServer(function(req, res) {
res.setHeader('foo', '\u0222');
res.end('hi all');
}).listen(3000); Here's the response from
Another example script with slight variation demonstrating how to truncate the header: require('http').createServer(function(req, res) {
res.setHeader('foo', '\u0222');
res.end(new Buffer('hi all\n', 'binary'));
}).listen(3000); Output from curl again:
So this was the logic path I was referring to. My apologies. So I rescind my -1 and am now on the fence. Will browsers properly parse headers if sent in utf8? Also, this is a performance hit. Is there a way we can still tell the server to interpret incoming data as On the side, there's an interesting little bug if you use the require('http').createServer(function(req, res) {
res.setHeader('foo', '\u0222');
res.flushHeaders();
res.end(new Buffer('hi all\n', 'binary'));
}).listen(3000); The results are the same as if a string had been passed. This is because it only passes in the |
@trevnorris you are talking about writing utf-8 (which is also a problem), this PR is about parsing only.
I'm not sure there is a way around it. Maybe we can treat headers that are supposed to have only latin-1 values |
Yes. I was correcting my previous erroneous comment about it having always been binary decoded.
Not clear what you're getting at. All headers are supposed to be only ASCII (we switched to latin1 since it's faster). So I'm not sure how we'd determine this automatically. Hence my request of being able to set the header encoding if this change is made. So I can still parse my headers without the overhead. I think I have a way to make this conditional practically a noop. Though it might also be better left for another PR. |
We can have a list of headers that are interpreted as ASCII because it's unlikely that someone sends UTF-8 in them. Custom headers or headers containing urls (like Referer) would still be interpreted as UTF-8.
We can have a |
@@ -136,7 +136,10 @@ struct StringPtr { | |||
|
|||
Local<String> ToString(Environment* env) const { | |||
if (str_) | |||
return OneByteString(env->isolate(), str_, size_); | |||
return String::NewFromUtf8(env->isolate(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the question "is this more correct?", I am curious about the performance impact this may have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmarks would be bias in out current benchmarks. We need to set some up that have various header lengths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious too though, aren't HTTP headers supposed to be ASCII and not unicode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bnoordhuis gave an explanation here: #1693 (comment). Basically that most headers use US-ASCII, but traditionally ISO-8859-1 (i.e. latin-1) is allowed. latin-1 is the encoding returned by OneByteString()
. So is what we currently use. Note that ASCII is a subset of latin-1.
While our current implementation is definitely faster, there's a problem with developers/companies moving from v0.10 to v4 LTS being hit by the change in functionality. Any solution outside of what we currently use will be a performance hit. How much will depend on the chosen solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So going with what Ben said, why the change to String::NewFromUtf8
? Does v0.10 read UTF-8 strings here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ronkorving Yes. v0.10 reads in headers as utf-8. Which is incorrect per the specification.
Thing is, if everyone followed US-ASCII then functionally we wouldn't notice a difference. It's not until someone uses the full latin-1 encoding (which is spec compliant) that they'd experience issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks.
Deferring to @trevnorris and @jasnell here. We discussed this at length at nodeconf eu, and it appears going back to using utf8 is a bad thing in regards to web standards and security or something. -1 |
Header field values are very strictly defined for a reason (See https://tools.ietf.org/html/rfc7230#section-3.2.6). Allowing any value other than that spec'd, especially UTF-8 or arbitrary binary is dangerous. -1 on this change. |
It looks like we have consensus here. Just wanted to point that this can cause hard to debug breakages after update from 0.10. |
I've created a related issue with the current |
See #1693
At some point
http
began to interpret all strings asbinary
. This causes various incompatibilities in both incoming and outgoing messages. Incoming message seem more problematic, since both url and header values could be corrupted and there is no good way to tell whether they are.It's probably right that clients should properly encode urls and headers, but some don't and this used to work, so a lot of code is broken by this change.
R=@bnoordhuis