dump's parameter "ensure_ascii" creates too long sequences #656
I detected a problem in the code of PR #654: it seems to create too-long `\uxxxx` sequences. Take the `€` sign, for instance. It is U+20AC and should be encoded as the string `"\u20ac"`, but the current code encodes it as `"\u00e2\u0082\u00ac"`, escaping each of the three UTF-8 bytes individually rather than the code point. This is incorrect, as it does not round-trip (see the example below).

Sorry for not detecting this earlier. The provided test case was correct, as it covered emojis, which create longer sequences anyway.
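Example (a minimal reconstruction; the three-parameter `dump(indent, indent_char, ensure_ascii)` overload from PR #654 is assumed):

```cpp
#include <iostream>
#include "json.hpp"

int main()
{
    const nlohmann::json j = "€";  // U+20AC, UTF-8 bytes 0xE2 0x82 0xAC
    std::cout << j.dump(-1, ' ', true) << std::endl;
}
```

Output:

```
"\u00e2\u0082\u00ac"
```

Expected output:

```
"\u20ac"
```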
Example in Python:

import json
print json.dumps('€')

Output:

"\u20ac"
We basically need a conversion from UTF-8 encoded chars to the Unicode code point, and then use the existing escaping if the code point is in 0..127, and UTF-16 hex escape(s) otherwise.
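A minimal sketch of that approach, assuming valid UTF-8 input; the helper name `escape_non_ascii` is hypothetical and this is not the library's actual implementation:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: decode UTF-8 to code points and escape everything
// outside 0..127 as \uxxxx (with UTF-16 surrogate pairs above U+FFFF).
std::string escape_non_ascii(const std::string& s)
{
    std::string result;
    std::size_t i = 0;
    while (i < s.size())
    {
        const auto byte = static_cast<unsigned char>(s[i]);

        // Read the number of continuation bytes from the lead byte
        // (validation of malformed sequences is omitted for brevity).
        std::uint32_t codepoint;
        std::size_t extra;
        if (byte < 0x80)      { codepoint = byte;         extra = 0; }
        else if (byte < 0xE0) { codepoint = byte & 0x1Fu; extra = 1; }
        else if (byte < 0xF0) { codepoint = byte & 0x0Fu; extra = 2; }
        else                  { codepoint = byte & 0x07u; extra = 3; }

        if (i + extra >= s.size())
        {
            break;  // truncated sequence; real code would copy it as-is
        }
        for (std::size_t k = 0; k < extra; ++k)
        {
            codepoint = (codepoint << 6) |
                        (static_cast<unsigned char>(s[i + 1 + k]) & 0x3Fu);
        }
        i += 1 + extra;

        char buf[13];
        if (codepoint < 0x80)
        {
            // ASCII: the existing control-character/quote escaping applies here
            result += static_cast<char>(codepoint);
        }
        else if (codepoint <= 0xFFFF)
        {
            // One UTF-16 unit suffices inside the Basic Multilingual Plane
            std::snprintf(buf, sizeof buf, "\\u%04x",
                          static_cast<unsigned>(codepoint));
            result += buf;
        }
        else
        {
            // Code points above U+FFFF become a UTF-16 surrogate pair
            const std::uint32_t v = codepoint - 0x10000u;
            std::snprintf(buf, sizeof buf, "\\u%04x\\u%04x",
                          static_cast<unsigned>(0xD800u + (v >> 10)),
                          static_cast<unsigned>(0xDC00u + (v & 0x3FFu)));
            result += buf;
        }
    }
    return result;
}
```

Decoding the three bytes 0xE2 0x82 0xAC to the single code point U+20AC is what yields `\u20ac` instead of three byte-wise escapes.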
Yup, I agree. Sorry about the bug, my mistake.
No worries. I used the all_unicode.json file and created a serialization in Python. Any help appreciated - I'm going to bed now ;)
A complete rewrite of the string escape function. It now provides codepoint-to-`\uxxxx` escaping. Invalid UTF-8 byte sequences are not escaped, but copied as-is. I haven't spent much time optimizing the code, but the library now agrees with Python on the escaping of every single Unicode character (see file test/data/json_nlohmann_tests/all_unicode_ascii.json). Other minor changes: replaced `size_t` by `std::size_t`.
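For illustration, the described behavior can be checked like this (again assuming the three-parameter `dump(indent, indent_char, ensure_ascii)` overload; the outputs in the comments mirror Python's `json.dumps`):

```cpp
#include <iostream>
#include "json.hpp"

int main()
{
    const nlohmann::json bmp = "€";      // U+20AC, inside the Basic Multilingual Plane
    const nlohmann::json astral = "😀";  // U+1F600, outside the BMP

    // BMP code points become a single \uxxxx unit; code points above
    // U+FFFF are written as a UTF-16 surrogate pair, as in Python.
    std::cout << bmp.dump(-1, ' ', true) << '\n';     // "\u20ac"
    std::cout << astral.dump(-1, ' ', true) << '\n';  // "\ud83d\ude00"
}
```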
I rewrote the escaping code. @ryanjmulder - if you can find the time, I would be happy if you could have a look at the diff.