handle invalid accept-charset in requests - default to utf-8 #584
Conversation
```scala
"render text with requested encoding if an `Accept-Charset` request header requests a non-UTF-8 encoding" in {
  Get() ~> `Accept-Charset`(`ISO-8859-1`) ~> complete(foo) ~> check {
    responseEntity shouldEqual HttpEntity(ContentType(`text/plain`, `ISO-8859-1`), foo)
  }
}
"render text with UTF-8 encoding if an `Accept-Charset` request header requests an unknown encoding" in {
  Get() ~> `Accept-Charset`(HttpCharset("unknown")(Nil)) ~> complete(foo) ~> check {
    responseEntity shouldEqual HttpEntity(ContentType(`text/plain`, `UTF-8`), foo)
  }
}
```
I realize this is a comment on the current behavior and not on your change, but this seems strange to me: it looks like the default string marshaller will return the string passed to it as `text/plain`, claiming it is in whatever charset the user requested, without doing anything to make sure the string is actually in that charset.
This is only correct if you assume the implementation correctly took the charset into account, which seems extremely unlikely. I wonder if it wouldn't make more sense to leave out the optional charset parameter in the returned content type in this case. If people have specific requirements (which seems somewhat unlikely) they can specify the charset explicitly.
This would be a behavior change, but it seems like a reasonable one if we note it in the release notes.
Before the main source change in this PR, this test actually fails: without that change, the response is an error response.
Hi,
Thanks for having a look at this bug / feature request. I recently filed a duplicate issue (#583), unaware of this one. I worked on a similar PR over the past two days, and have now found and looked through your changes, which seem to fix it nicely. Great.
For your consideration, I'd like to suggest adding an extra unit test in `MarshallingSpec.scala`, something along the lines of:

```scala
val invalidAcceptCharsetHeader = `Accept-Charset`(HttpCharsetRange(HttpCharset.custom("invalid")))
val request = HttpRequest().withHeaders(Seq(invalidAcceptCharsetHeader))
marshalToResponse("Hello", request).entity.contentType.charsetOption should be(Some(HttpCharsets.`UTF-8`))
```
Some refactoring thoughts: perhaps the try/catch in `PredefinedToEntityMarshallers.scala` could be placed more subtly in `HttpCharset.scala`, as:

```scala
/** Returns this HttpCharset instance if the provided nio charset is available, or the given default otherwise */
def withNioCharsetOrElse(default: HttpCharset): HttpCharset =
  if (_nioCharset.isSuccess) {
    // Return this instance: the provided nioCharset did not result in a java.nio.charset.UnsupportedCharsetException
    this
  } else {
    default
  }
```

and then have `PredefinedToEntityMarshallers.scala` do:

```scala
Marshaller.withOpenCharset(mediaType) { (s, cs) =>
  HttpEntity(mediaType.withCharset(cs.withNioCharsetOrElse(HttpCharsets.`UTF-8`)), s)
}
```
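The fallback idea can also be sketched outside pekko-http using plain `java.nio`. This is just an illustration of the underlying mechanism; the method and class names here (`charsetOrElse`, `CharsetFallback`) are made up for the example and are not pekko-http API:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {
    // Look up the requested charset; fall back to a default when the JVM
    // does not support it. Charset.forName throws UnsupportedCharsetException
    // for unknown (but syntactically legal) names, and
    // IllegalCharsetNameException for malformed names.
    static Charset charsetOrElse(String requested, Charset fallback) {
        try {
            return Charset.forName(requested);
        } catch (Exception e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOrElse("ISO-8859-1", StandardCharsets.UTF_8)); // ISO-8859-1
        System.out.println(charsetOrElse("blop", StandardCharsets.UTF_8));       // UTF-8
    }
}
```

The same shape applies to the `HttpCharset.withNioCharsetOrElse` suggestion above: the "is this charset available?" check lives next to the charset type, and callers only decide which default to fall back to.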
Looking forward to seeing your changes merged.
Thanks again and best regards,
Michael
Thanks @michaelf-crwd - I have added a new HttpCharsets function similar to what you suggested (c8514e7).
I will look later at adding extra tests like the one you suggested.
@raboof would you be able to take a look at this again? I don't think this change breaks anything, and it makes the pekko-http support for `Accept-Charset` a bit more tolerant.
> without doing anything to make sure the string is actually in that charset

@raboof, WDYM with that? Java strings are basically UTF-16 encoded, so they are known to be Unicode. What else could go wrong?
What I meant is: when the route does `complete(someUnicodeString)`, and the marshalling logic results in a `Content-Type: text/plain; charset=ISO-8859-1` header, that response seems wrong: the header promises the body will be in ISO-8859-1, but the body is actually in Unicode. I think it would make more sense to make sure we produce a `Content-Type: text/plain; charset=UTF-8` header unless the developer explicitly marked the response as being in a different charset.
Or will the marshalling logic actually convert the Unicode input into bytes using the charset from the content type? In that case, of course, it's all good.
Indeed it does, sorry about the noise 🤦
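For the record, the conversion that resolves the concern is just standard string-to-bytes encoding on the JVM. A minimal plain-Java illustration (a sketch, not pekko-http code; the example string mirrors the non-ASCII test string used later in this thread):

```java
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        String s = "Hällö";
        // 'ä' and 'ö' take two bytes each in UTF-8, but only one in ISO-8859-1,
        // so the encoded bytes genuinely differ per charset.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);   // 7
        System.out.println(latin1.length); // 5
    }
}
```

So a body produced with the negotiated charset really is in that charset, and the `Content-Type` promise holds.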
👍
lgtm
Sorry for not finding time to look into this sooner, will do so now.
lgtm, I agree with the idea to treat the `Accept-XYZ` headers rather softly, especially `Accept-Charset`, which is losing relevance these days. As long as the server clearly indicates what representation it uses, it seems fine.
I tested various scenarios:

**Explicit charset in response**, i.e. a route with ``complete(HttpEntity(ContentTypes.`text/html(UTF-8)`, "Say Hällö to Pekko HTTP"))``:

- request: `curl -v localhost:8080/hello`
  - old behavior: response with `Content-Type: text/html; charset=UTF-8` 💚
  - new behavior: same 💚
- request: `curl -v -H "Accept-Charset: blop" localhost:8080/hello`
  - old behavior: 500 error and stacktrace in the logs 🛑
  - new behavior: 406 Not Acceptable, "Resource representation is only available with these types: text/html; charset=UTF-8" 💚
- request: `curl -v -H "Accept-Charset: US-ASCII" localhost:8080/hello`
  - old behavior: 406 Not Acceptable, "Resource representation is only available with these types: text/html; charset=UTF-8" 💚
  - new behavior: same 💚
- request: `curl -v -H "Accept-Charset: ISO-8859-1" localhost:8080/hello`
  - old behavior: 406 Not Acceptable 💚
  - new behavior: same 💚

**No explicit charset in response**, i.e. a route with `complete("Say Hällö to Pekko HTTP")`:

- request: `curl -v localhost:8080/hello`
  - old behavior: response in `text/plain; charset=UTF-8` 💚
  - new behavior: same 💚
- request: `curl -v -H "Accept-Charset: blop" localhost:8080/hello`
  - old behavior: 500 error and stacktrace in the logs 🛑
  - new behavior: response in `text/plain; charset=UTF-8` 💚
- request: `curl -v -H "Accept-Charset: US-ASCII" localhost:8080/hello`
  - old behavior: response with `Content-Type: text/plain; charset=US-ASCII` header, response converted to US-ASCII 💚
  - new behavior: same 💚
- request: `curl -v -H "Accept-Charset: ISO-8859-1" localhost:8080/hello`
  - old behavior: response with `Content-Type: text/plain; charset=ISO-8859-1` header, response converted to ISO-8859-1 💚
  - new behavior: same 💚
In summary, I think this PR is a definite improvement. Some behavior seems wrong to me, but it was already like that, so it doesn't need to delay merging this PR or releasing 1.1.0. I'll put my code where my mouth is and propose a fix for the behavior that looks wrong to me, but that's independent of 1.1.0. Sorry for taking so long before confirming ;) The strings are converted to bytes taking the encodings into account just fine, all good, sorry about the noise 🤦
The whole negotiation story is really sophisticated and tries to take that into account, indeed. It's also one of the performance hotspots of the routing layer.
Anyway, thanks for that comprehensive testing session, @raboof 👍
Thanks everyone. I'll merge as is, but if anyone has any requests for changes, I'll look at them.
see #300
I've added #585 to look into the XML case; this PR focuses on `text/plain`.