Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diff Match Patch #215

Merged
merged 3 commits into from
Jan 16, 2020
Merged

Diff Match Patch #215

merged 3 commits into from
Jan 16, 2020

Conversation

theck13
Copy link
Contributor

@theck13 theck13 commented Dec 13, 2019

Fix

Update the diff_match_patch class with the Java changes from google/diff-match-patch#80.

Review

Only one developer is required to review these changes, but anyone can perform the review.

@theck13 theck13 added the Bug label Dec 13, 2019
@theck13 theck13 requested a review from dmsnell December 13, 2019 04:30
throw new IllegalArgumentException(
"Invalid number in diff_fromDelta: " + tokens[x + 2].substring(1), e);
}
tokens[x + 2] = tokens[x + 2].charAt(0) + String.valueOf(m + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's my understanding that @jleandroperez did not incorporate this into the iOS library, meaning that while I'm pretty confident in it, it's untested in practice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing these changes, the (null) part was not removed from a corrupted note with the high-surrogate (null) low-surrogate pattern. According to the comment above this block, this code should be attempting to repair notes with that corrupted content, right?

Here is how I tested it.

  1. Create corrupted note using another platform.
  2. Launch app and open corrupted note.
  3. Edit corrupted note before or after corruption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ChannelTest.testRequestObjectForInvalidChange() test is failing when running ./gradlew connectedCheck on this branch. Interestingly, removing the block in question here makes the test pass.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theck13 @dmsnell reason why I didn't add the (repair) part of the fix, is that:

  • Once a string is badly split, (null) is generated by the iOS SDK.
  • And we do get data loss (the (null) is sent over the wire).

Meaning that... once the issue shows up, it's irrecoverable. That is, as long as the badly split string is generated ios-side.

Now, that being said, I couldn't trigger any test case in which a malformed string could be patched up, other than unit tests. For that reason, for the time being, I've opted to change as little as possible in DMP (it's been quite stable for the past couple releases!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine @jleandroperez - the patch doesn't attempt to recover lost data - it attempts to undo the change vs. leave the unexpected (null) in there - something I'm willing to acknowledge is more of a "how do you prefer to proceed when things are broken" question than "this is the right choice"

that being said I'm surprised that tests failed and I'll be looking into the code to try and ifgure out what's up, also surprised that it didn't remove the (null)

my guess is that I got something wrong in the Java code but I do have at least one test confirming the behavior in the diff-match-patch library PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create corrupted note using another platform.
Launch app and open corrupted note.

After investigating why we had unexpected corruption on Android and pinpointing sqlite as the perpetrator, I believe that it makes sense that the reason we're not seeing the (null) issue fixed here is because we're not seeing the corrupted patch come in. This fix will only adjust patching which when applied would produce that bug and it appears like we're starting with a copy of the note from the server which won't present the buggy patch.

In other words, in those steps I don't think we're reproducing the (null) bug.

}
throw new IllegalArgumentException();
}
return decoded.toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely appreciate it if you could scrutinize this block @theck13. I'm making some assumptions about how URLDecoder() works and it's worth pointing out that this entirely replaces that behavior. In the JavaScript library we're only switching over to the custom parser once the native one fails with the Unicode error.

I ran the speedtest and couldn't find any significant performance impact but I wasn't sure what exceptional cases could exist here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm making some assumptions about how URLDecoder() works

Did you look at the URLDecoder.decode(String s, String enc) source code?

In the JavaScript library we're only switching over to the custom parser once the native one fails with the Unicode error.

Do you mean something like URLDecoder.decode(String s, String enc) when you refer to "the native one" above? If so, we could do something similar by trying URLDecoder.decode(param, "UTF-8") first then reverting to decodeURI(param) if the try fails.

try {
  param = URLDecoder.decode(param, "UTF-8");
} catch (UnsupportedEncodingException e) {
  param = this.decodeURI(param);
} catch (IllegalArgumentException e) {
  // Malformed URI sequence.
  throw new IllegalArgumentException(
      "Illegal escape in diff_fromDelta: " + param, e);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried exploring using another decoder but didn't get anything working - it was mainly due to the combined work of percent-encoding decoding and UTF-8 decoding.

if that's working for you it'd make the patch better. we would want to try/catch this.decodeURI() as well because it can also throw upon malformed input

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would want to try/catch this.decodeURI() as well because it can also throw upon malformed input

I'm not sure I understand this part. Are you saying we should add UnsupportedEncodingException to the second catch since this.decodeURI() could throw that exception as well?

The way that snippet above is written, the param = this.decodeURI(param); statement will only be run if the param = URLDecoder.decode(param, "UTF-8"); throws an UnsupportedEncodingException. Isn't that what we want?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have two levels of exceptions here and it's confusing.

we can percent-encode a surrogate half's code unit as if it were valid Unicode and the UTF-8 algorithm can convert that value into three percent-hex-digit-pairs.

\ud83c for example would turn into %ED%A0%BC even though what we think would be the corresponding code point, 0xD83C, is marked as entirely off-limits because it would cause confusion with surrogate pairs in UTF-16.

that being said, there's no fundamental issue in encoding that number as there is with encoding negative numbers or numbers bigger than 0x10000000

if we had \ud83c\udd70 as two UTF16 code units then they would represent \U0001F70 as the code point and that would encode into four percent-hex-digit-pairs %F0%9F%85%B0

so this is the kind of exception we want to handle here. we specifically don't want to fail just because someone encoded half a character. we can recover from that.

on the other hand, there are structural issues we can't deal with, such as if someone sends %123 or %F%D%C and there are other issues such as if they send an invalid byte sequence through the percent-hex-digit-pairs. for instance, if we got %30 instead of %F0 as the first pair in that encoding - %30%9F%85%B0 - then we wouldn't be able to decode it according to the algorithm because it would specify the wrong continuation and length bits.

does that help clarify what's happening? the main point is that this.decodeURI() can also fail for reasons that URLDecoder.decode() would fail which also have nothing to do with surrogates

@dmsnell
Copy link
Contributor

dmsnell commented Jan 15, 2020

@theck13 I've reconfirmed all the tests in google/diff-match-patch#80 and copied the entire Java file from there into here. also I rebased against develop to incorporate the "ignore the corrupted ghost" changes into this branch. if you think I shouldn't have done this we can git reset --hard f705baac94ec7e43976ebc8ef9650eb5d3a5f11f to where it was.

there are other updates from diff-match-patch including a couple changed data structures. if this is an "update the library" PR those are probably welcome, but you can make the call if you'd rather split them out; I'm happy to backup and manually apply the changes from google/diff-match-patch#80 specifically

@dmsnell
Copy link
Contributor

dmsnell commented Jan 16, 2020

Merging as discussed in Slack

@dmsnell dmsnell merged commit 89ecb33 into develop Jan 16, 2020
@dmsnell dmsnell deleted the diff-match-patch branch January 16, 2020 00:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants