dump's parameter "ensure_ascii" creates too long sequences #656

nlohmann · 2017-07-12T19:39:05Z

I detected a problem in the code of PR #654:

The code seems to create \uxxxx sequences that are too long. Take the € sign, for instance: it is U+20AC and should be encoded as the string "\u20ac". The current code encodes it as "\u00e2\u0082\u00ac", that is, one escape per byte of the UTF-8 encoding (E2 82 AC). This is incorrect, as it does not round-trip.
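The faulty output is what one gets when each byte of the UTF-8 encoding of € (E2 82 AC) is escaped on its own. Here is a minimal sketch of that mistake; `escape_bytewise` is a hypothetical name used for illustration, not the actual code from PR #654:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not library code): escapes every non-ASCII *byte*
// separately instead of decoding the UTF-8 sequence into one code point.
std::string escape_bytewise(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (c < 0x80) {
            out += static_cast<char>(c);  // plain ASCII is copied unchanged
        } else {
            char buf[7];                  // "\uXXXX" plus terminating NUL
            std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

Calling `escape_bytewise("\xE2\x82\xAC")` yields the three escapes \u00e2, \u0082 and \u00ac. A parser reads them back as three separate characters, which is why the result does not round-trip.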

Example:

#include <iostream>
#include "json.hpp"

using json = nlohmann::json;

int main() {
    // string built from a UTF-8 literal
    json j1 = u8"€";
    std::cout << j1.dump(0, ' ', false) << std::endl;
    std::cout << j1.dump(0, ' ', true) << std::endl;

    // same string parsed from its \u escape
    json j2 = json::parse("\"\\u20ac\"");
    std::cout << j2.dump(0, ' ', false) << std::endl;
    std::cout << j2.dump(0, ' ', true) << std::endl;

    // roundtrip: parse the ensure_ascii output of j1
    json j3 = json::parse(j1.dump(0, ' ', true));
    std::cout << j3.dump(0, ' ', false) << std::endl;
    std::cout << j3.dump(0, ' ', true) << std::endl;
}

Output:

"€"
"\u00e2\u0082\u00ac"
"€"
"\u00e2\u0082\u00ac"
"â�¬"
"\u00c3\u00a2\u00c2\u0082\u00c2\u00ac"

Expected output:

"€"
"\u20ac"
"€"
"\u20ac"
"€"
"\u20ac"

Sorry for not detecting this earlier. The provided test case passed because it only dealt with emojis, which produce longer sequences anyway.


nlohmann · 2017-07-12T19:40:50Z

Example in Python:

import json
print(json.dumps('€'))

Output:

"\u20ac"

nlohmann · 2017-07-12T20:21:07Z

We basically need to convert the UTF-8 encoded chars to Unicode code points, then use the existing escaping if the code point is in 0..127, and \uxxxx escapes of its UTF-16 code unit(s) otherwise.
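The conversion described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `escape_ascii_sketch` and `append_u16` are hypothetical names, the existing escaping for 0..127 (quotes, backslashes, control characters) is left out, and overlong encodings or encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Append one UTF-16 code unit as a \uxxxx escape.
static void append_u16(std::string& out, std::uint32_t unit) {
    char buf[7];  // "\uXXXX" plus terminating NUL
    std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(unit));
    out += buf;
}

// Hypothetical sketch: decode UTF-8 into code points, copy ASCII unchanged,
// and escape every other code point as one or two UTF-16 code units.
std::string escape_ascii_sketch(const std::string& s) {
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { out += s[i]; ++i; continue; }  // 0..127: no \u escape

        // Derive sequence length and initial payload bits from the lead byte.
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }

        // Each continuation byte must look like 10xxxxxx.
        bool ok = (len != 0) && (i + len <= s.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out += s[i]; ++i; continue; }  // invalid byte: copy as-is

        if (cp < 0x10000) {
            append_u16(out, cp);                     // BMP: a single escape
        } else {
            cp -= 0x10000;                           // beyond BMP: surrogate pair
            append_u16(out, 0xD800 + (cp >> 10));
            append_u16(out, 0xDC00 + (cp & 0x3FF));
        }
        i += len;
    }
    return out;
}
```

With this sketch, the UTF-8 bytes E2 82 AC become the single escape \u20ac, and a four-byte sequence such as F0 9F 98 80 (U+1F600) becomes the surrogate pair \ud83d\ude00.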

ryanjmulder · 2017-07-12T20:21:47Z

yup, I agree. Sorry about the bug, my mistake.

nlohmann · 2017-07-12T20:24:32Z

No worries. I used the all_unicode.json file, created a serialization in Python with the ensure_ascii parameter and compared the output with the library's dump output.

Any help appreciated - I'm going to bed now ;)

A complete rewrite of the string escape function. It now provides codepoint-to-\uxxxx escaping. Invalid UTF-8 byte sequences are not escaped, but copied as-is. I haven’t spent much time optimizing the code - but the library now agrees with Python on every single Unicode character’s escaping (see file test/data/json_nlohmann_tests/all_unicode_ascii.json). Other minor changes: replaced "size_t" by "std::size_t"

nlohmann · 2017-07-17T05:55:35Z

I rewrote the escaping code. @ryanjmulder - if you can find the time, I would be happy if you could have a look at the diff.

nlohmann added the kind: bug label Jul 12, 2017

nlohmann mentioned this issue Jul 12, 2017

add ensure_ascii parameter to dump. #330 #654

Merged

nlohmann added the confirmed label Jul 12, 2017

nlohmann self-assigned this Jul 16, 2017

nlohmann added the solution: proposed fix label Jul 17, 2017

nlohmann added this to the Release 3.0.0 milestone Jul 17, 2017

nlohmann closed this as completed Jul 18, 2017

Convery mentioned this issue Nov 1, 2020

from_json<std::wstring> is treated as an array on latest MSVC #2453

Closed

3 tasks
