
Can't tell non-blocking parser what charset to use for decoding input #596

Open
mizosoft opened this issue Feb 6, 2020 · 3 comments
Labels
documentation Issue that documentation improvements could solve or alleviate

Comments

@mizosoft

mizosoft commented Feb 6, 2020

I'm using Jackson's non-blocking parser to implement a BodySubscriber for use with Java's non-blocking HTTP client. The parser is created by JsonFactory#createNonBlockingByteArrayParser() using the factory instance associated with the ObjectMapper. It's working like a charm, but it seems to use UTF-8 by default, and there is no way to tell it to use a different encoding (such as one specified by the response headers).

I figured it might auto-detect the response body's encoding, as is the case with other parsers, but it turns out it assumes all input is UTF-8. For example, this snippet crashes:

    ObjectMapper mapper = new JsonMapper();
    // UTF-16 bytes, but the non-blocking parser will still assume UTF-8
    byte[] jsonBytes = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);
    JsonParser asyncParser = mapper.getFactory().createNonBlockingByteArrayParser();
    ByteArrayFeeder feeder = ((ByteArrayFeeder) asyncParser.getNonBlockingInputFeeder());
    feeder.feedInput(jsonBytes, 0, jsonBytes.length);
    feeder.endOfInput();
    // Throws here: the fed bytes are not valid UTF-8
    Map<String, String> map = mapper.readValue(asyncParser, new TypeReference<>() {});
    System.out.println(map);

It works fine if the JSON string is encoded with UTF-8.

@cowtowncoder
Member

Yes, you cannot specify other encodings, so it only works for UTF-8 and 7-bit ASCII (since that is a subset).
This is a fundamental limitation, and it is unlikely that implementations for other encodings will be added.
If support were to be added, it would likely require a version that handles byte-to-character decoding separately from tokenization, and that would be a full rewrite.

So: the non-blocking parser will only work on UTF-8 input. I should probably mention this more clearly in the Javadocs.
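(If the source charset is known up front, one workaround is to transcode the bytes to UTF-8 yourself before feeding them; the parser does not do this for you. A minimal sketch, not an official API:)

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.async.ByteArrayFeeder;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

public class TranscodeDemo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new JsonMapper();
        byte[] utf16 = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);

        // Transcode to UTF-8 first; only then is the input safe for the
        // non-blocking parser, which assumes UTF-8 (or 7-bit ASCII).
        byte[] utf8 = new String(utf16, StandardCharsets.UTF_16)
                .getBytes(StandardCharsets.UTF_8);

        JsonParser parser = mapper.getFactory().createNonBlockingByteArrayParser();
        ByteArrayFeeder feeder = (ByteArrayFeeder) parser.getNonBlockingInputFeeder();
        feeder.feedInput(utf8, 0, utf8.length);
        feeder.endOfInput();
        Map<String, String> map =
                mapper.readValue(parser, new TypeReference<Map<String, String>>() {});
        System.out.println(map);
    }
}
```

Note this defeats part of the purpose of a non-blocking parser, since transcoding buffers and decodes the whole chunk eagerly.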

@cowtowncoder cowtowncoder added the documentation Issue that documentation improvements could solve or alleviate label Feb 6, 2020
@mizosoft
Author

mizosoft commented Feb 7, 2020

I see...

I think that in my case I should use the non-blocking parser only if the response charset is UTF-8 (or a subset of it), and otherwise fall back to loading the response as a string and deserializing from there. I agree that the Javadocs should mention this to clear up the confusion.
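That fallback could look roughly like this (a sketch under my assumptions; readJson is a hypothetical helper, and detecting the charset from the response's Content-Type header is assumed to happen elsewhere):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.async.ByteArrayFeeder;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

public class CharsetFallback {
    private static final ObjectMapper MAPPER = new JsonMapper();

    // Hypothetical helper: use the non-blocking parser only for UTF-8/ASCII
    // input; for any other charset, decode to a String and parse from there.
    static Map<String, String> readJson(byte[] body, Charset charset) throws Exception {
        if (StandardCharsets.UTF_8.equals(charset)
                || StandardCharsets.US_ASCII.equals(charset)) {
            JsonParser parser = MAPPER.getFactory().createNonBlockingByteArrayParser();
            ByteArrayFeeder feeder = (ByteArrayFeeder) parser.getNonBlockingInputFeeder();
            feeder.feedInput(body, 0, body.length);
            feeder.endOfInput();
            return MAPPER.readValue(parser, new TypeReference<Map<String, String>>() {});
        }
        // Fallback: blocking decode of the whole body using the response charset
        return MAPPER.readValue(new String(body, charset),
                new TypeReference<Map<String, String>>() {});
    }

    public static void main(String[] args) throws Exception {
        byte[] utf16 = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);
        System.out.println(readJson(utf16, StandardCharsets.UTF_16));
    }
}
```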

@cowtowncoder
Member

Right. The vast majority of JSON really should be UTF-8, especially considering that the only officially legal charsets are UTF-8, UTF-16 and UTF-32 (as per the original JSON specification). But there are so many broken systems that emit other encodings (ISO-8859-x) that... it is frustrating. And since a JSON document itself has no mechanism for declaring its encoding -- unlike XML, which has this capability! -- such documents are not stand-alone any more.
But if you stick to the standard supported encodings, auto-detection does work for the blocking parsers (UTF-16 and UTF-32 can be auto-detected, as distinct from UTF-8; Latin-1 and the others cannot).
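(For reference, a minimal sketch of that auto-detection with the regular blocking byte-based path -- no charset is specified anywhere:)

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

public class AutoDetectDemo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new JsonMapper();
        // UTF-16 bytes: the blocking parser detects the encoding from the
        // byte pattern at the start of the document, so this succeeds even
        // though we never tell Jackson which charset was used.
        byte[] utf16 = "{\"key\": \"value\"}".getBytes(StandardCharsets.UTF_16);
        Map<String, String> map =
                mapper.readValue(utf16, new TypeReference<Map<String, String>>() {});
        System.out.println(map); // {key=value}
    }
}
```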
