
Can't tell non-blocking parser what charset to use for decoding input #596

Open
mizosoft opened this issue Feb 6, 2020 · 3 comments
Labels
documentation Issue that documentation improvements could solve or alleviate

Comments

@mizosoft

mizosoft commented Feb 6, 2020

I'm using Jackson's non-blocking parser to implement a BodySubscriber for use with Java's non-blocking HTTP client. The parser is created by JsonFactory#createNonBlockingByteArrayParser() using the factory instance associated with the ObjectMapper. It's working like a charm, but it seems to use UTF-8 by default, and there is no way to tell it to use a different encoding (such as one specified by the response headers).

I figured it might auto-detect the response body's encoding, as is the case with other parsers, but it turns out it assumes all input is UTF-8. For example, this snippet crashes:

    ObjectMapper mapper = new JsonMapper();
    // UTF-16 bytes, but the non-blocking parser will still assume UTF-8
    byte[] jsonBytes = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);
    JsonParser asyncParser = mapper.getFactory().createNonBlockingByteArrayParser();
    ByteArrayFeeder feeder = ((ByteArrayFeeder) asyncParser.getNonBlockingInputFeeder());
    feeder.feedInput(jsonBytes, 0, jsonBytes.length);
    feeder.endOfInput();
    // Throws here: the fed bytes are not valid UTF-8
    Map<String, String> map = mapper.readValue(asyncParser, new TypeReference<>() {});
    System.out.println(map);

It works fine if the JSON string is encoded with UTF-8.

@cowtowncoder
Member

Yes, you cannot specify other encodings, so it only works for UTF-8 and 7-bit ASCII (since that is a subset).
This is a fundamental limitation, and it is unlikely that implementations for other encodings will be added.
If support were to be added, it would likely require a version that handles byte-to-character decoding separately from tokenization, and that would be a full rewrite.

So: the non-blocking parser will only work on UTF-8 input. I should probably mention this more clearly in the Javadocs.
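(If the source charset is known up front, one workaround is to transcode the bytes to UTF-8 yourself before feeding them; the parser does not do this for you. A minimal sketch, not an official API:)

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.async.ByteArrayFeeder;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

public class TranscodeDemo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new JsonMapper();
        byte[] utf16 = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);

        // Transcode to UTF-8 first; only then is the input safe for the
        // non-blocking parser, which assumes UTF-8 (or 7-bit ASCII).
        byte[] utf8 = new String(utf16, StandardCharsets.UTF_16)
                .getBytes(StandardCharsets.UTF_8);

        JsonParser parser = mapper.getFactory().createNonBlockingByteArrayParser();
        ByteArrayFeeder feeder = (ByteArrayFeeder) parser.getNonBlockingInputFeeder();
        feeder.feedInput(utf8, 0, utf8.length);
        feeder.endOfInput();
        Map<String, String> map =
                mapper.readValue(parser, new TypeReference<Map<String, String>>() {});
        System.out.println(map);
    }
}
```

Note this defeats part of the purpose of a non-blocking parser, since transcoding buffers and decodes the whole chunk eagerly.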

@cowtowncoder cowtowncoder added the documentation Issue that documentation improvements could solve or alleviate label Feb 6, 2020
@mizosoft
Author

mizosoft commented Feb 7, 2020

I see...

I think that in my case I should use the non-blocking parser only if the response charset is UTF-8 (or a subset of it), and otherwise fall back to loading the response as a string and deserializing from there. I agree that the Javadocs should mention this to clear up the confusion.
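That fallback could look roughly like this (a sketch under my assumptions; readJson is a hypothetical helper, and detecting the charset from the response's Content-Type header is assumed to happen elsewhere):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.async.ByteArrayFeeder;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

public class CharsetFallback {
    private static final ObjectMapper MAPPER = new JsonMapper();

    // Hypothetical helper: use the non-blocking parser only for UTF-8/ASCII
    // input; for any other charset, decode to a String and parse from there.
    static Map<String, String> readJson(byte[] body, Charset charset) throws Exception {
        if (StandardCharsets.UTF_8.equals(charset)
                || StandardCharsets.US_ASCII.equals(charset)) {
            JsonParser parser = MAPPER.getFactory().createNonBlockingByteArrayParser();
            ByteArrayFeeder feeder = (ByteArrayFeeder) parser.getNonBlockingInputFeeder();
            feeder.feedInput(body, 0, body.length);
            feeder.endOfInput();
            return MAPPER.readValue(parser, new TypeReference<Map<String, String>>() {});
        }
        // Fallback: blocking decode of the whole body using the response charset
        return MAPPER.readValue(new String(body, charset),
                new TypeReference<Map<String, String>>() {});
    }

    public static void main(String[] args) throws Exception {
        byte[] utf16 = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);
        System.out.println(readJson(utf16, StandardCharsets.UTF_16));
    }
}
```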

@cowtowncoder
Member

Right. The vast majority of JSON really should be UTF-8, especially considering that the only officially legal charsets are UTF-8, UTF-16 and UTF-32 (as per the original JSON specification). But there are so many broken systems that emit other encodings (ISO-8859-x) that... it is frustrating. And since a JSON document itself has no mechanism for declaring its encoding -- unlike XML, which has this capability! -- such documents are not stand-alone any more.
But if you stick to the standard supported encodings, auto-detection does work for the blocking parsers (UTF-16 and UTF-32 can be auto-detected, as distinct from UTF-8; Latin-1 and the others cannot).
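(For reference, a minimal sketch of that auto-detection with the regular blocking byte-based path -- no charset is specified anywhere:)

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

public class AutoDetectDemo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new JsonMapper();
        // UTF-16 bytes: the blocking parser detects the encoding from the
        // byte pattern at the start of the document, so this succeeds even
        // though we never tell Jackson which charset was used.
        byte[] utf16 = "{\"key\": \"value\"}".getBytes(StandardCharsets.UTF_16);
        Map<String, String> map =
                mapper.readValue(utf16, new TypeReference<Map<String, String>>() {});
        System.out.println(map); // {key=value}
    }
}
```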
