Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dump's parameter "ensure_ascii" creates too long sequences #656

Closed
nlohmann opened this issue Jul 12, 2017 · 5 comments
Closed

dump's parameter "ensure_ascii" creates too long sequences #656

nlohmann opened this issue Jul 12, 2017 · 5 comments
Assignees
Labels
confirmed kind: bug solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Milestone

Comments

@nlohmann
Copy link
Owner

I detected a problem in the code of PR #654:

The code seems to created too long \uxxxx sequences. Take the sign for instance. It is U+20AC and should be encoded as string "\u20ac". The current code encodes it as "\u00e2\u0082\u00ac". This is incorrect, as this does not roundtrip.

Example:

#include <iostream>
#include <fstream>
#include "json.hpp"

using json = nlohmann::json;

int main() {
    json j1 = u8"";
    std::cout << j1.dump(0, ' ', false) << std::endl;
    std::cout << j1.dump(0, ' ', true) << std::endl;
    
    json j2 = json::parse("\"\\u20ac\"");
    std::cout << j2.dump(0, ' ', false) << std::endl;
    std::cout << j2.dump(0, ' ', true) << std::endl;
    
    json j3 = json::parse(j1.dump(0, ' ', true));
    std::cout << j3.dump(0, ' ', false) << std::endl;
    std::cout << j3.dump(0, ' ', true) << std::endl;
}

Output:

"€"
"\u00e2\u0082\u00ac"
"€"
"\u00e2\u0082\u00ac"
"�"
"\u00c3\u00a2\u00c2\u0082\u00c2\u00ac"

Expected output:

"€"
"\u20ac"
"€"
"\u20ac"
"€"
"\u20ac"

Sorry for not detecting this earlier. The provided test case was correct as it coped with Emojis which created longer sequences anyway.

@nlohmann
Copy link
Owner Author

Example in Python:

import json
print json.dumps('€')

Output:

"\u20ac"

@nlohmann
Copy link
Owner Author

We basically need a conversion from UTF-8 encoded chars to the Unicode codepoint and then use the existing escaping if the codepoint is 0..127 and UTF-16 hex(es) otherwise.

@ryanjmulder
Copy link
Contributor

yup, I agree. Sorry about the bug, my mistake.

@nlohmann
Copy link
Owner Author

No worries. I used the all_unicode.json file, created a serialization in Python with the ensure_ascii parameter and compared the output with the library's dump output.

Any help appreciated - I'm going to bed now ;)

@nlohmann nlohmann self-assigned this Jul 16, 2017
nlohmann added a commit that referenced this issue Jul 17, 2017
A complete rewrite of the string escape function. It now provides codepoint-to-\uxxxx escaping. Invalid UTF-8 byte sequences are not escaped, but copied as-is. I haven’t spent much time optimizing the code - but the library now agrees with Python on every single Unicode character’s escaping (see file test/data/json_nlohmann_tests/all_unicode_ascii.json).

Other minor changes: replaced "size_t" by "std::size_t"
@nlohmann nlohmann added the solution: proposed fix a fix for the issue has been proposed and waits for confirmation label Jul 17, 2017
@nlohmann nlohmann added this to the Release 3.0.0 milestone Jul 17, 2017
@nlohmann
Copy link
Owner Author

I rewrote the escaping code. @ryanjmulder - if you can find the time, I would be happy if you could have a look at the diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed kind: bug solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Projects
None yet
Development

No branches or pull requests

2 participants