Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Stop splitting surrogate-pairs in diff-match-patch deltas #91

Merged
merged 1 commit into from
Nov 14, 2019

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented Nov 12, 2019

See google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
diff-match-patch when it's building the delta string. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b

This crashes when trying to encode the invalid sequence of UTF-16 code
units into URIEncode, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the node-simperium library should be able to process those
problematic changesets.

See google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-patch` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.
@dmsnell dmsnell force-pushed the fix/encoding-issue-by-updating-diff-match-patch branch from 38547c5 to 9ecfe99 Compare November 12, 2019 23:46
dmsnell added a commit to Automattic/simplenote-electron that referenced this pull request Nov 13, 2019
See: Simperium/node-simperium#91
See: google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-patch` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.

This patch updates the dependency on the patched version of
`node-simperium`.
@dmsnell dmsnell merged commit 6377429 into master Nov 14, 2019
@dmsnell dmsnell deleted the fix/encoding-issue-by-updating-diff-match-patch branch November 14, 2019 03:56
dmsnell added a commit to Automattic/simplenote-electron that referenced this pull request Nov 14, 2019
* Fix: No more crashing/skipped changes for certain changes

See: Simperium/node-simperium#91
See: google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-patch` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.

This patch updates the dependency on the patched version of
`node-simperium`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant