Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding headers with ASCII breaks applications expecting them to be encoded in latin-1 #1778

Closed
jrast opened this issue May 8, 2018 · 28 comments · Fixed by #1914
Closed

Comments

@jrast
Copy link

jrast commented May 8, 2018

Applications, like flask, which expect that the headers are using latin-1 encoding as specified in rfc5987 will break gunicorn because gunicorn uses ASCII encoding.

Issue #1353 was closed with the comment

I'm going to close this as it's well documented around the web that HTTP headers should be ASCII, unfortunately.

which is, according to rfc5987 wrong. Headers should be encoded ISO-8859-1 (aka latin-1). However, a comment in the issue also mentioned commit 5f4ebd2, in which the encoding of the headers was switched back from latin-1 to ascii based on a section in rfc7230 which states that US-ASCII should be used:

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. [...]

Now, I'm not into all this RFC specifications and don't know which is the right way to do it. However, as other applications expect latin-1 as specified in rfc5987 but gunicorn implements the version given in rfc7230, it might be worth to discuss this, even if it's just for a clarification on which track gunicorn is.

Related Issues in gunicorn:
#1353 : wsgi.py > send_headers: encoding problem.

Related Issues in Flask:
pallets/flask#2766 : send_file: latin-1 encoding not compatible with gunicorn
pallets/flask#2223 : Fix send_file's attachment_filename to work with non-ascii filenames

Related RFC:
https://tools.ietf.org/html/rfc5987#section-1

@benoitc
Copy link
Owner

benoitc commented May 9, 2018

gunicorn implements the last specification (and it used to work with previous versions of flask).

If we want to relax it for old applications maybe we may want yo try to encode to latin1 if encoding to ascii didn't work. Thoughts ? cc @tilgovi

@benoitc
Copy link
Owner

benoitc commented May 9, 2018

to be more complete, gunicorn use the updated HTTP 1.1 spec (7230 and related). That could be indeed documented.

@tilgovi
Copy link
Collaborator

tilgovi commented May 9, 2018

From my position, the spec says "SHOULD" and PEP 333 mentions latin-1 so I think Gunicorn could support latin-1 even if applications are advised not to use non-ASCII characters.

That's my initial reading, but I'll wait for others to weigh in.

@tilgovi
Copy link
Collaborator

tilgovi commented May 9, 2018

I think Gunicorn could support latin-1

By this I mean only that Gunicorn would not be breaking any spec. Right now Gunicorn is more strict than it needs to be. It is not necessarily wrong to be so strict. We can decide together.

@jrast
Copy link
Author

jrast commented May 9, 2018

RFC 7320 says that new headers "SHOULD" use ASCII. That most headers use ASCII right now seems to be just information about the current situation. But I think this sentence is more important:

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding.

So, latin-1 is clearly allowed.

If RFC7320 says that new headers "SHOULD" use ASCII, the question is, who defines the headers? Is it the application using gunicorn or is it gunicorn itself. As I understand, the application defines the headers.

@benoitc
Copy link
Owner

benoitc commented May 9, 2018

What annoyed me is that change has been done 3 years ago without much people beeing blocked by it on the different frameworks until that recent change in flask. I would have thought that in 2018 all applications would follow a specification published in 2014. Part of the issue is that PEP3333 allowing ISO-8859-1 aka latin 1.

Anyway I don't disagree to be more relaxed on it. I propose to do the following :

  • allow encoding to latin1
  • if latin 1 is spotted , log a warning to let the developer know he should instead encode the filenames using the RFC 2616 or use ascii to comply with the RFC 7320

Thoughts ?

@davidism
Copy link
Contributor

davidism commented May 9, 2018

@benoitc sorry for causing trouble! It's hard to keep track of all the different specs. 😓

If you show that warning, I predict a regular stream of issues being open on Flask about it though. I think it makes more sense to just allow Latin-1 silently, as it's part of the PEP.

@benoitc
Copy link
Owner

benoitc commented May 18, 2018

@davidism well that's kind of the point: bugging the user until it makes the correct change ;) I think that would be better but I have no strong opinion on it. @tilgovi @berkerpeksag ?

@tilgovi
Copy link
Collaborator

tilgovi commented May 18, 2018

@davidism no worries :) Thanks for bringing it up.

I'm inclined towards being tolerant in this case, since it aligns with the PEP and with popular frameworks.

@berkerpeksag
Copy link
Collaborator

I'm inclined towards being tolerant in this case, since it aligns with the PEP and with popular frameworks.

+1

@benoitc
Copy link
Owner

benoitc commented May 19, 2018

Until when the status quo would be kept though... yes users will open some bugs which prompted our change. I don’t see why it couldn’t be also fixed in other libraries and the pep updated.

At least we should document it somewhere so users using a strict client won’t be surprised as well. Also if we do the change it will have to be in a major version since it’s a breaking change.

@tilgovi
Copy link
Collaborator

tilgovi commented May 23, 2018

I'm happy to make this change. Do we need a major version? It makes requirements looser for applications. Existing applications should continue to work. Applications which did not work before, but which violate the "SHOULD" will now work.

In any case, I can document it somehow.

@tilgovi
Copy link
Collaborator

tilgovi commented May 23, 2018

Maybe 19.9? We can keep making 19.9.x or 19.x.y releases for Python 2 maintenance critical bug fixes for a bit, but still keep 20 for Python 3 only?

@davidism
Copy link
Contributor

davidism commented May 23, 2018

I've been thinking about this more, and I'm ok switching Flask and Werkzeug to use ASCII here.

I still think it's probably good if Gunicorn supports it, since it's part of the PEP. I'm assuming you've never had an issue reported about this before now so it seems like it's probably not well known anyway. If you're worried about breaking changes, I can handle it on my side only.

I do plan on implementing ASGI in Werkzeug, which leaves the header encoding to the application, so that would be a good time for me to fix inconsistent use of ASCII vs Latin-1 on my side.

@jrast
Copy link
Author

jrast commented May 23, 2018

I support the idea of @davidism: Flask should follow the recommendation and use ASCII, but gunicorn should also follow PEP3333 and allow Latin-1, even if other specifications do not recommend to do so.

@davidism
Copy link
Contributor

I guess I should have done this sooner, but I just tried sending a unicode filename with Flask and Gunicorn and it worked fine: pallets/flask#2766 (comment). Now I'm not clear where the issue is.

@benoitc
Copy link
Owner

benoitc commented May 26, 2018

@tilgovi 19.9 would be ok. also #1151 that was closed by the latest change will need to be addressed.

Related to the last comment of @davidism , @jrast what is the filename that was failing? I think we should have a test to reproduce the issue before making any change.

@jrast
Copy link
Author

jrast commented May 28, 2018

I can't remeber exactly and I don't have a logfile at hand at the moment. But the problem arised with filenames containing german umlauts, like äöü. For example a filename like "Sterntour-Glärnischhütte.xlsx" would cause problems, caused by the "ä" and "ü" in the filename.
I can search for the logfile if you need more details.

@davidism
Copy link
Contributor

Thanks, that filename caused the issue. The name used in the test in Flask triggered the filename* behavior, which ended up as ASCII. Your example encoded to Latin-1, so there was no extra encoding done and it was passed on as-is.

@jrast
Copy link
Author

jrast commented May 29, 2018

Just for the reference: @davidism fixed the encoding in flask with pull request pallets/flask#2804, which will be released in flask 1.0.3. This will make flask compatible with the current behaviour of gunicorn.

At this occasion, i want to thank @benoitc , @tilgovi and @davidism (and all other involved devs!) for their work on gunicorn and flask. I am positively surprised about the discussion which unfold in this issue! The issue was addressed very quick, the discussion was based on facts and a good solution for both packages was found very quickly! I'm allways surprised how quick issues are addressed in the open source world and that, most of the time, good solutions can be found for all involved parties. I think a fix would have taken much longer if both packages would have been closed source libraries, owned by some big coorperations. Thank you very much and keep up the great work!

@georgexsh
Copy link
Contributor

georgexsh commented Jul 12, 2018

flask/werkzeug will encode than decode cookie value to Unicode with latin1:
https://github.com/pallets/werkzeug/blob/master/werkzeug/http.py#L1107-L1108

If a unicode string is returned it’s tunneled through latin1 as required by PEP 3333.

@benoitc
Copy link
Owner

benoitc commented Jul 12, 2018

@georgexsh so what's your expectation?

@georgexsh
Copy link
Contributor

@benoitc to point out flask/werkzeug is still had incompatible behavior with gunicorn. I will be happier if #1778 (comment) get implemented.

btw, your reply is very agile, I admire your diligence.

@davidism
Copy link
Contributor

Given that it's never had a problem before between that code and Gunicorn, I'm not too concerned, although I may switch more things to ASCII eventually. Do you have a real failing example?

@georgexsh
Copy link
Contributor

georgexsh commented Jul 13, 2018

from werkzeug.wrappers import Request, Response

class MyResp(Response):
    charset = 'utf-8-sig'

@Request.application
def application(request):
    resp = MyResp()
    resp.set_cookie('foo', 'bar')
    return resp

run it with gunicorn yields an error:

Traceback (most recent call last):
  File "gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "gunicorn/workers/sync.py", line 183, in handle_request
    resp.close()
  File "gunicorn/http/wsgi.py", line 409, in close
    self.send_headers()
  File "gunicorn/http/wsgi.py", line 329, in send_headers
    util.write(self.sock, util.to_bytestring(header_str, "ascii"))
  File "gunicorn/util.py", line 507, in to_bytestring
    return value.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 155-157: ordinal not in range(128)

the error is from this header:

Set-Cookie: foo="\357\273\277bar"; Path=%EF%BB%BF/?%EF%BB%BF#%EF%BB%BF

when werkzeug dump cookies with utf-8-sig, BOM is added, it could be decoded with 8bit latin1 but can not reencode with as 7bit ASCII:

>>> ''.encode('utf-8-sig').decode('latin1')
''
>>> ''.encode('utf-8-sig').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

there is more detail in apache/superset#5377

but I think this is a bug in the application (superset), it should not rely on flask/werkzeug to encode CSV string to bytes but encode by itself.

@davidism
Copy link
Contributor

That's definitely an issue with that application.

@javabrett
Copy link
Collaborator

Some notes on the behaviour differences between Werkzeug(+Flask) and Gunicorn(+Flask):

Application attempts to set header value Werkzeug/0.14.1+Flask Gunicorn 19.9/master
b'f\xc3\xb6o'.decode("utf-8", "strict") = "föo" 200 OK, header value is sent, contains bytes \x66\xf6\x6f = föo 500, UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 164: ordinal not in range(128)
b'Swan: \xF0\x9F\xA6\xA2'.decode("utf-8", "strict") = "Swan: (swan glyph)" 200 but empty response, errors out while encoding headers - UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f9a2' in position 19: ordinal not in range(256) 500, UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f9a2' in position 169: ordinal not in range(128)

Note that when writing-out the response headers, Werkzeug delegates to Python builtin BaseHTTPRequestHandler:

https://github.com/pallets/werkzeug/blob/7477be2853df70a022d9613e765581b9411c3c39/werkzeug/serving.py#L235

... which uses latin-1:

https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Lib/http/server.py#L516

That will explain the behaviour differences.

My reading of the considerable number and history of RFCs on this suggests that applications and servers should be restricting themselves to US-ASCII, but that (perhaps for historical reasons) latin-1 is allowed, but clients should treat such octets as opaque.

Either way, ideally Gunicorn and Werkzeug should behave the same here if possible, and it seems that a latin-1 character should not error the response. Gunicorn has a superior way of serializing the header values, so if there is a non-latin-1 character present, it detects this before committing 200 OK (like happens in Werkzeug), and can send 500. But it appears to me that it should allow latin-1.

@javabrett
Copy link
Collaborator

Another possibly interesting check in corelibs:

https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Lib/wsgiref/validate.py#L118

header_re = re.compile(r'^[a-zA-Z][a-zA-Z0-9\-_]*$')
bad_header_value_re = re.compile(r'[\000-\037]')

So Python WSGI is very strict on characters allowed in the header name (alpha-num, hyphen, underscore), and bans non-printables from values.

javabrett added a commit to javabrett/gunicorn that referenced this issue Nov 9, 2018
This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed benoitc#1778.

Signed-off-by: Brett Randall <[email protected]>
javabrett added a commit to javabrett/gunicorn that referenced this issue Nov 9, 2018
This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed benoitc#1778.

Signed-off-by: Brett Randall <[email protected]>
javabrett added a commit to javabrett/gunicorn that referenced this issue Jan 23, 2019
This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed benoitc#1778.

Signed-off-by: Brett Randall <[email protected]>
javabrett added a commit to javabrett/gunicorn that referenced this issue Jan 31, 2019
This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed benoitc#1778.

Signed-off-by: Brett Randall <[email protected]>
javabrett added a commit to javabrett/gunicorn that referenced this issue Feb 22, 2019
This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed benoitc#1778.

Signed-off-by: Brett Randall <[email protected]>
berkerpeksag pushed a commit that referenced this issue Apr 18, 2019
This commit reverts one aspect changed by 5f4ebd2 (#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed #1778.

Signed-off-by: Brett Randall <[email protected]>
di pushed a commit to di/gunicorn that referenced this issue Sep 21, 2019
This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed benoitc#1778.

Signed-off-by: Brett Randall <[email protected]>
(cherry picked from commit 879651b)
tilgovi pushed a commit that referenced this issue Oct 13, 2019
This commit reverts one aspect changed by 5f4ebd2 (#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed #1778.

Signed-off-by: Brett Randall <[email protected]>
(cherry picked from commit 879651b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants