GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON. #649
…d in UnmarshalExtJSON test.
LGTM as long as new tests pass! 🎉
…mplify test types.
lgtm if tests pass!
Great work! My main question is what should happen when codepoints represented by surrogate pairs are encoded back.
Note: in case it is helpful, David Golden wrote a script, bsonview, that parses BSON bytes into color-coded output. I used it when looking over the ticket (colors lost in copy-paste):
% echo "1100000002610005000000f09d849e0000" | bson-corpus/tests/bsonview -x
11000000 02 "a" 00 05000000 "𝄞" 00 00
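The byte layout in that output can be checked by hand. Below is a minimal sketch in Go, using only the standard library, that walks the same 17 bytes: a little-endian int32 document length, the 0x02 string type byte, the NUL-terminated key, an int32 string length that counts the trailing NUL, the UTF-8 bytes, and the document terminator. The helper name readStringElement is hypothetical; it is not part of the driver or of bsonview, and it only handles a document with exactly one string element.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
)

// readStringElement is a hypothetical helper (an assumption for illustration).
// It decodes a hex-encoded BSON document that holds exactly one string
// element and returns that element's key and value.
func readStringElement(hexDoc string) (key, value string, err error) {
	doc, err := hex.DecodeString(hexDoc)
	if err != nil {
		return "", "", err
	}
	// The first four bytes are the total document length, little-endian.
	if len(doc) < 5 || int(binary.LittleEndian.Uint32(doc[:4])) != len(doc) {
		return "", "", errors.New("length prefix does not match document size")
	}
	// Element type 0x02 is a UTF-8 string.
	if doc[4] != 0x02 {
		return "", "", fmt.Errorf("expected string element (0x02), got 0x%02x", doc[4])
	}
	// The key is a NUL-terminated C string.
	rest := doc[5:]
	end := bytes.IndexByte(rest, 0x00)
	if end < 0 {
		return "", "", errors.New("unterminated element key")
	}
	key = string(rest[:end])
	rest = rest[end+1:]
	// The string length is an int32 that includes the trailing NUL byte.
	if len(rest) < 4 {
		return "", "", errors.New("missing string length")
	}
	n := int(binary.LittleEndian.Uint32(rest[:4]))
	rest = rest[4:]
	if n < 1 || n > len(rest) {
		return "", "", errors.New("string length out of range")
	}
	return key, string(rest[:n-1]), nil
}

func main() {
	key, value, err := readStringElement("1100000002610005000000f09d849e0000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The value is the single code point U+1D11E, stored as four UTF-8 bytes.
	fmt.Printf("key=%q value=%q bytes=% x\n", key, value, value)
}
```

The string length of 5 is the four UTF-8 bytes f0 9d 84 9e plus the terminating NUL, which is why the code point is stored directly as 4-byte UTF-8 rather than as a surrogate pair.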
…and 4-byte UTF-8 extJSON marshaling.
LGTM!
…nicode surrogate value.
…lExtJSON. (#649)
* GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON.
* Correct err handling in jsonScanner.ScanString and remove unused field in UnmarshalExtJSON test.
* Explicitly write unicode.ReplacementChar for invalid surrogate and simplify test types.
* Add tests for high surrogate followed by non-Unicode escape sequence and 4-byte UTF-8 extJSON marshaling.
* Explicitly write the Unicode replacement character for an un-paired Unicode surrogate value.
* 'master' of https://github.com/mongodb/mongo-go-driver: (39 commits)
  GODRIVER-2004 Add Versioned API connection examples for Docs (mongodb#665)
  GODRIVER-1961 Run OCSP tests against RHEL 7.0 (mongodb#664)
  GODRIVER-1844 finer precision for getSecondsSinceEpoch (mongodb#666)
  GODRIVER-1973 create internal copy of aws v4 signing code (mongodb#657)
  GODRIVER-1951 Update the Go version for Evergreen builds to 1.16 (mongodb#663)
  GODRIVER-1949 add more ignored killAllSessions errors for unified tes… (mongodb#658)
  GODRIVER-1963 remove dropDatabase result (mongodb#660)
  GODRIVER-1180 Remove legacy transform functions from mongo (mongodb#583)
  GODRIVER-1937 Update legacy ListCollections to support the BatchSize option for server version 2.6 (mongodb#656)
  GODRIVER-1933 remove xtrace from shell scripts (mongodb#661)
  fix README error handling of FindOne (mongodb#636)
  GODRIVER-1938 update mongocryptd serverSelectionTimeout to 10 seconds (mongodb#659)
  GODRIVER-1925 Surface cursor errors in DownloadStream fillBuffer (mongodb#653)
  GODRIVER-1955 create labeledError interface (mongodb#651)
  GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON. (mongodb#649)
  Changed order of actions in ObjectIDFromHex func (mongodb#637)
  GODRIVER-1750 Ensure contexts are always cancelled during server monitoring (mongodb#654)
  GODRIVER-1931 Sync new cursors and SDAM LB tests (mongodb#655)
  GODRIVER-1981 Sync new transactions tests (mongodb#652)
  GODRIVER-1931 Run tests against LBs in Evergreen (mongodb#648)
  ...
Fix incorrect handling of surrogate pairs in UnmarshalExtJSON. Currently, UnmarshalExtJSON converts each value in the surrogate pair to a Unicode replacement character.

Correct handling:
RFC 8259 section 7 requires special handling of surrogate pairs like "\uD834\uDd1e", which should decode to 𝄞.

Changes:
* Decode Unicode escape sequences (e.g. "\u00BF") as runes instead of strings using the getu4 function copied and lightly modified from the Go "encoding/json" package.
* Use utf16.IsSurrogate to check if the decoded Unicode rune is a high or low surrogate value and, if true, attempt to read and decode the surrogate pair.
* … jsonScanner.scanString
* … jsonScanner and UnmarshalExtJSON
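
The decoding approach described in the changes can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the driver's actual scanner: getu4 is modeled on the helper of the same name in encoding/json, decodeUnicodeEscapes is a hypothetical function name, and the sketch only handles \uXXXX escapes (it is not a full JSON string scanner and does not treat an escaped backslash specially). An unpaired surrogate is written as the Unicode replacement character, as the commit messages on this PR describe.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf16"
)

// getu4 decodes a \uXXXX escape at the start of s and returns its value as a
// rune, or -1 if s does not begin with a valid escape. Modeled on the helper
// of the same name in encoding/json; this copy is illustrative only.
func getu4(s string) rune {
	if len(s) < 6 || s[0] != '\\' || s[1] != 'u' {
		return -1
	}
	var r rune
	for i := 2; i < 6; i++ {
		c := s[i]
		switch {
		case '0' <= c && c <= '9':
			c -= '0'
		case 'a' <= c && c <= 'f':
			c = c - 'a' + 10
		case 'A' <= c && c <= 'F':
			c = c - 'A' + 10
		default:
			return -1
		}
		r = r*16 + rune(c)
	}
	return r
}

// decodeUnicodeEscapes is a hypothetical helper: it replaces every \uXXXX
// escape in s with its UTF-8 encoding and copies all other bytes unchanged.
func decodeUnicodeEscapes(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		r := getu4(s[i:])
		if r < 0 {
			b.WriteByte(s[i])
			i++
			continue
		}
		i += 6
		if utf16.IsSurrogate(r) {
			// Try to combine with a following escape into one code point.
			r2 := getu4(s[i:])
			if dec := utf16.DecodeRune(r, r2); dec != unicode.ReplacementChar {
				b.WriteRune(dec)
				i += 6
				continue
			}
			// Unpaired surrogate: emit U+FFFD and leave the next escape,
			// if any, to be decoded on its own.
			r = unicode.ReplacementChar
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(decodeUnicodeEscapes(`\uD834\uDd1e`))  // surrogate pair for U+1D11E
	fmt.Println(decodeUnicodeEscapes(`\u00BF`))        // single BMP code point
	fmt.Printf("%q\n", decodeUnicodeEscapes(`\uD834`)) // unpaired high surrogate
}
```

Decoding each escape to a rune first is what makes the pairing step possible: utf16.DecodeRune needs both halves as numeric values, and it returns U+FFFD whenever the two values do not form a valid high/low pair.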