Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON. #649

Merged

Conversation

matthewdale
Copy link
Collaborator

Fix UnmarshalExtJSON incorrect handling of surrogate pairs. Currently, UnmarshalExtJSON converts each value in the surrogate pair to a Unicode replacement character.

Correct handling:
RFC 8259 section 7 requires special handling of surrogate pairs like "\uD834\uDd1e", which should decode to 𝄞:

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

Changes:

  • Decode Unicode escape sequences in JSON (e.g. "\u00BF") as runes instead of strings using the getu4 function copied and lightly modified from the Go "encoding/json" package.
  • Use utf16.IsSurrogate to check if the decoded Unicode rune is a high or low surrogate value and, if true, attempt to read and decode the surrogate pair.
  • Add a missing error handling block when reading input bytes in jsonScanner.scanString
  • Add valid and invalid surrogate pair test cases for jsonScanner and UnmarshalExtJSON

@matthewdale matthewdale force-pushed the godriver1947-fix-surrogate-pairs branch from 00bf1be to 91263a9 Compare April 26, 2021 18:20
bson/bsonrw/json_scanner.go Show resolved Hide resolved
bson/bsonrw/json_scanner.go Show resolved Hide resolved
bson/bsonrw/json_scanner.go Outdated Show resolved Hide resolved
bson/bsonrw/json_scanner.go Show resolved Hide resolved
bson/unmarshaling_cases_test.go Outdated Show resolved Hide resolved
@benjirewis benjirewis self-requested a review April 27, 2021 20:09
Copy link
Contributor

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as long as new tests pass! 🎉

bson/bsonrw/json_scanner.go Outdated Show resolved Hide resolved
bson/unmarshal_test.go Outdated Show resolved Hide resolved
@matthewdale matthewdale requested a review from iwysiu April 28, 2021 16:36
@divjotarora divjotarora removed their request for review April 28, 2021 17:06
Copy link
Contributor

@iwysiu iwysiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm if tests pass!

Copy link
Contributor

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! My main question is what should happen when codepoints represented by surrogate pairs are encoded back.

Note: in case it is helpful David Golden wrote a script for parsing BSON bytes as color coded output bsonview, which I used when looking over the ticket (colors lost in copy-paste):

% echo "1100000002610005000000f09d849e0000" | bson-corpus/tests/bsonview -x 
 11000000 02 "a" 00 05000000 "𝄞" 00 00

bson/bsonrw/json_scanner_test.go Show resolved Hide resolved
bson/bsonrw/json_scanner.go Show resolved Hide resolved
bson/bsonrw/json_scanner.go Show resolved Hide resolved
bson/bsonrw/json_scanner.go Outdated Show resolved Hide resolved
bson/bsonrw/json_scanner_test.go Show resolved Hide resolved
bson/bsonrw/json_scanner.go Show resolved Hide resolved
@matthewdale matthewdale requested a review from kevinAlbs May 3, 2021 17:37
Copy link
Contributor

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

bson/bsonrw/json_scanner.go Show resolved Hide resolved
bson/bsonrw/json_scanner.go Show resolved Hide resolved
bson/bsonrw/json_scanner.go Outdated Show resolved Hide resolved
@matthewdale matthewdale merged commit 0810758 into mongodb:master May 3, 2021
@matthewdale matthewdale deleted the godriver1947-fix-surrogate-pairs branch May 3, 2021 20:05
matthewdale added a commit that referenced this pull request May 4, 2021
…lExtJSON. (#649)

* GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON.

* Correct err handling in jsonScanner.ScanString and remove unused field in UnmarshalExtJSON test.

* Explicitly write unicode.ReplacementChar for invalid surrogate and simplify test types.

* Add tests for high surrogate followed by non-Unicode escape sequence and 4-byte UTF-8 extJSON marshaling.

* Explicitly write the Unicode replacement character for an un-paired Unicode surrogate value.
stlimtat pushed a commit to stlimtat/mongo-go-driver that referenced this pull request May 17, 2021
* 'master' of https://github.com/mongodb/mongo-go-driver: (39 commits)
  GODRIVER-2004 Add Versioned API connection examples for Docs (mongodb#665)
  GODRIVER-1961 Run OCSP tests against RHEL 7.0 (mongodb#664)
  GODRIVER-1844 finer precision for getSecondsSinceEpoch (mongodb#666)
  GODRIVER-1973 create internal copy of aws v4 signing code (mongodb#657)
  GODRIVER-1951 Update the Go version for Evergreen builds to 1.16 (mongodb#663)
  GODRIVER-1949 add more ignored killAllSessions errors for unified tes… (mongodb#658)
  GODRIVER-1963 remove dropDatabase result (mongodb#660)
  GODRIVER-1180 Remove legacy transform functions from mongo (mongodb#583)
  GODRIVER-1937 Update legacy ListCollections to support the BatchSize option for server version 2.6 (mongodb#656)
  GODRIVER-1933 remove xtrace from shell scripts (mongodb#661)
  fix README error handling of FindOne (mongodb#636)
  GODRIVER-1938 update mongocryptd serverSelectionTimeout to 10 seconds (mongodb#659)
  GODRIVER-1925 Surface cursor errors in DownloadStream fillBuffer (mongodb#653)
  GODRIVER-1955 create labeledError interface (mongodb#651)
  GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON. (mongodb#649)
  Changed order of actions in ObjectIDFromHex func (mongodb#637)
  GODRIVER-1750 Ensure contexts are always cancelled during server monitoring (mongodb#654)
  GODRIVER-1931 Sync new cursors and SDAM LB tests (mongodb#655)
  GODRIVER-1981 Sync new transactions tests (mongodb#652)
  GODRIVER-1931 Run tests against LBs in Evergreen (mongodb#648)
  ...
tsedgwick pushed a commit to mongodb-forks/mongo-go-driver that referenced this pull request May 27, 2021
…lExtJSON. (mongodb#649)

* GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON.

* Correct err handling in jsonScanner.ScanString and remove unused field in UnmarshalExtJSON test.

* Explicitly write unicode.ReplacementChar for invalid surrogate and simplify test types.

* Add tests for high surrogate followed by non-Unicode escape sequence and 4-byte UTF-8 extJSON marshaling.

* Explicitly write the Unicode replacement character for an un-paired Unicode surrogate value.
tsedgwick pushed a commit to mongodb-forks/mongo-go-driver that referenced this pull request Jun 1, 2021
…lExtJSON. (mongodb#649)

* GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON.

* Correct err handling in jsonScanner.ScanString and remove unused field in UnmarshalExtJSON test.

* Explicitly write unicode.ReplacementChar for invalid surrogate and simplify test types.

* Add tests for high surrogate followed by non-Unicode escape sequence and 4-byte UTF-8 extJSON marshaling.

* Explicitly write the Unicode replacement character for an un-paired Unicode surrogate value.
faem pushed a commit to kubedb/mongo-go-driver that referenced this pull request Mar 17, 2022
…lExtJSON. (mongodb#649)

* GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON.

* Correct err handling in jsonScanner.ScanString and remove unused field in UnmarshalExtJSON test.

* Explicitly write unicode.ReplacementChar for invalid surrogate and simplify test types.

* Add tests for high surrogate followed by non-Unicode escape sequence and 4-byte UTF-8 extJSON marshaling.

* Explicitly write the Unicode replacement character for an un-paired Unicode surrogate value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants