Skip to content
This repository has been archived by the owner on Mar 27, 2023. It is now read-only.

fix: guess charset naively #77

Merged
merged 1 commit into from
Oct 3, 2021
Merged

fix: guess charset naively #77

merged 1 commit into from
Oct 3, 2021

Conversation

weihanglo
Copy link
Owner

Fixes #76.

This bug was introduced by #69. In order not to perform any additional I/O, we guess charset naively from MIME. Mainly

  • text/*, */xml, */javascript, */json -> UTF-8
  • */*+xml, */*+json -> UTF-8
  • others -> leave it as is

src/extensions.rs Outdated Show resolved Hide resolved
@weihanglo weihanglo changed the title fix: guess chartset naively fix: guess charset naively Oct 3, 2021
@ZacJW
Copy link

ZacJW commented Oct 3, 2021

Just compiled this and this change does fix #76, so it's fine for my usecase. Can't help thinking that a more thorough check may later prove necessary.

As a quick test I had sfz serve a UTF-16 LE enocded .txt file and the content type header was mistakenly text/plain; charset=utf-8. I would argue that if sfz is not going to verify the file is of any particular charset, it should leave it out of the content-type altogether and let the client figure it out rather than lead them down the wrong path.

If reporting the charset proves necessary for some usecase, maybe sfz could check the encoding of the file (possibly using encoding_rs if introducing a dependency is acceptable) and cache the determined encoding to avoid checking every time.

Thanks for the quick fix, hope I've not made too much work for you.

@ZacJW
Copy link

ZacJW commented Oct 3, 2021

I've done a bit more investigation so I'll share it here. As I understand it, your decision to default to charset=utf-8 was informed by #68 in which they have an issue viewing yaml with non ASCII characters in chrome when served by sfz 0.6.0. However as per yaml 1.2.2 spec (and since 1.0) all yaml character streams are unicode, defaulting to utf-8 encoding. This leads me to believe that the bug in #68 is acutally a bug in chrome, not sfz. I've reproduced that bug with sfz 0.6.0 and with a modified sfz 0.6.1 that never includes a charset in the content-type header. This is all viewed with chromium 94.0.4606.71.

@weihanglo
Copy link
Owner Author

Thanks for the review and research! Your concern is reasonable. And yes, this is a thing that client can handle. Unfortunately, the market share of Chromium is large which we cannot simply ignore this inconvenience.

My concern is that Encoding detection is hard and error-prone. It's time to introduce additional dependency to handle this. I plan to add a flag to toggle this auto-detection feature.

@ZacJW
Copy link

ZacJW commented Oct 3, 2021

You're right about not being able to ignore chromium. Looks like they're still using Windows-1252 encoding for this by default 😆. Guess I'll go write a bug report for them. Thanks again.

@weihanglo
Copy link
Owner Author

I am going to merge this PR since it at least make the blind guess a bit better.
Next, I'll try to integrate chardetng, which is used in Firefox. Its design doc is very well-written and worth a read 👍🏾

Thanks for your review again!

@weihanglo weihanglo merged commit 0ec5ea6 into master Oct 3, 2021
@weihanglo weihanglo deleted the guess-charset branch October 3, 2021 16:04
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Content-type response header incorrect for .wasm
2 participants