diffLines seems broken #68

wfalkwallace · 2015-08-10T03:20:03Z

wfalkwallace · 2015-08-10T04:26:28Z

UPDATE: It seems like it's treating the trailing newline incorrectly:

and diffTrimmedLines is doing the same. It's seeing the trailing newline difference and treating the line as changed. Is that really intended?

kpdecker · 2015-08-10T05:50:15Z

What browser are you seeing this on?

wfalkwallace · 2015-08-10T05:55:09Z

Chrome 44.0.2403.130 (64-bit) on a MacBook Pro

The diffLines result seemed weird, but ¯\_(ツ)_/¯ -- I'm not sure how to count the newline with respect to added/removed. But diffTrimmedLines should definitely fix the above situation, no?

kpdecker · 2015-08-10T06:34:55Z

Thanks. This is not browser specific but an issue with the newline being included in the tokenizer's output, so the rest of the algorithm considers them to be distinct.

[ 'restaurant' ] [ 'restaurant\n', 'hello' ]

Under the trimmed case, we even go as far as explicitly adding these back in after the trim:

jsdiff/src/diff/line.js

Line 23 in 470ed65

line += '\n';

The reason all of this is done is that the string cannot be reconstructed without these newlines. All of the operations that combine tokens just do + (more accurately, a join('')), and omitting the \n from the token would cause all of the lines to merge together.
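To make the two points above concrete, here is a minimal sketch in plain JavaScript. It does not use jsdiff itself; tokenizeKeepingNewlines is a hypothetical helper written for illustration, assumed to behave like a line tokenizer that keeps each terminator attached to its line.

```javascript
// Hypothetical helper, NOT jsdiff's actual source: split text into lines,
// keeping each line's terminator attached to the token.
function tokenizeKeepingNewlines(value) {
  // Either a (possibly empty) run of characters ending in \n,
  // or a final run of characters with no terminator.
  return value.match(/[^\n]*\n|[^\n]+/g) || [];
}

const oldTokens = tokenizeKeepingNewlines('restaurant');
const newTokens = tokenizeKeepingNewlines('restaurant\nhello');

console.log(oldTokens); // [ 'restaurant' ]
console.log(newTokens); // [ 'restaurant\n', 'hello' ]

// The first tokens differ as plain strings, so a token-by-token
// comparison treats the line as changed.
console.log(oldTokens[0] === newTokens[0]); // false

// With terminators attached, plain concatenation rebuilds the input.
const rebuilt = newTokens.join('');
console.log(rebuilt === 'restaurant\nhello'); // true

// Without them, the lines run together.
const merged = newTokens.map((t) => t.replace(/\r?\n$/, '')).join('');
console.log(merged); // restauranthello
```

The trade-off shown here is the one described above: attaching the newline makes reconstruction trivial but makes 'restaurant' and 'restaurant\n' unequal.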

It is possible to do a custom comparison that ignores the newline for the sake of comparison checks, but this too could run into issues with the newline values sometimes being included and sometimes not, based on the diff path that is selected for two given inputs.
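A rough sketch of that idea and its pitfall, in plain JavaScript. The names stripLineEnding and equalsIgnoringLineEnding are hypothetical and are not part of jsdiff; the point is only that once two tokens compare as equal, the output depends on which of the two values is kept.

```javascript
// Hypothetical helpers for illustration only.
function stripLineEnding(token) {
  // Remove one trailing \n or \r\n, if present.
  return token.replace(/\r?\n$/, '');
}

function equalsIgnoringLineEnding(left, right) {
  return stripLineEnding(left) === stripLineEnding(right);
}

const oldToken = 'restaurant';
const newToken = 'restaurant\n';

console.log(oldToken === newToken); // false
console.log(equalsIgnoringLineEnding(oldToken, newToken)); // true

// The tokens now count as "the same line", but the reported value depends
// on which side is kept, so the newline is sometimes there and sometimes not.
const keptFromOld = [oldToken, 'hello'].join('');
const keptFromNew = [newToken, 'hello'].join('');

console.log(JSON.stringify(keptFromOld)); // "restauranthello"
console.log(JSON.stringify(keptFromNew)); // "restaurant\nhello"
```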

I suspect that this is basically the reason that "\ No newline at end of file" exists in GNU diff's output, and I'm not sure that I have a good answer to this other than inserting a newline at the end of the string if it is omitted.

wfalkwallace · 2015-08-10T16:24:00Z

I figured it was the line tokenizer. I see this as a real bug though, since it's arbitrarily adding back newlines, which means the value isn't necessarily valid (i.e. it may add a trailing newline that the original didn't have, and it will break non-Unix newlines).

Do you need to reconstruct the string from the tokens at any point? Why not use character offsets attached to your tokens, the way esprima (for example) does in its range property? That would allow you to grab valid substrings from the original content.
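For illustration, a minimal sketch of what offset-carrying line tokens could look like. The token shape { text, range: [start, end] } is an assumption modeled loosely on esprima's range arrays, and tokenizeWithRanges is a hypothetical function, not something jsdiff provides.

```javascript
// Hypothetical tokenizer: each token holds the line text WITHOUT its
// terminator, plus a [start, end) range that covers the terminator.
function tokenizeWithRanges(source) {
  const tokens = [];
  const terminator = /\r\n|\n/g;
  let start = 0;
  let match;
  while ((match = terminator.exec(source)) !== null) {
    const end = match.index + match[0].length;
    tokens.push({ text: source.slice(start, match.index), range: [start, end] });
    start = end;
  }
  // Final line with no terminator, if any.
  if (start < source.length) {
    tokens.push({ text: source.slice(start), range: [start, source.length] });
  }
  return tokens;
}

const source = 'restaurant\r\nhello';
const rangedTokens = tokenizeWithRanges(source);

// Comparison can use the terminator-free text...
console.log(rangedTokens.map((t) => t.text)); // [ 'restaurant', 'hello' ]

// ...while the exact original substrings, CRLF included, come from the ranges.
const slices = rangedTokens.map((t) => source.slice(t.range[0], t.range[1]));
console.log(slices.join('') === source); // true
```

This keeps comparison and reconstruction separate, at the cost of every consumer needing access to the original string.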

Even if you don't want to go that far, I still think it's worth it to add the newlines back after the comparison, or trim again when comparing -- right now diffTrimmedLines('restaurant', 'restaurant\nhello') spits out [{value: 'restaurant', added: undefined, removed: true, count: 1}, {value: 'restaurant\nhello', added: true, removed: undefined, count: 2}] which just isn't correct.

kpdecker · 2015-08-11T04:44:57Z

Regarding the arbitrary newline insertion, I merged the patch and line diff implementations last night, since I noticed they were basically the same while investigating this issue. This change should resolve that particular issue: 1597705

Regarding ranges, that's a ton of work that will break all custom diff implementations. That's not going to be feasible anytime in the near term. It is possible to tokenize or compare ignoring the newline characters, but this ties into the issue above of newlines sometimes being included and sometimes not (ranges suffer from the same problem).

Regarding correctness, the code is operating correctly under the definition that a line is inclusive of its newline character. This is needed to properly create patches, etc. Unfortunately, this directly conflicts with making human-readable diffs, which I presume is what you are going for.

Using a tokenizer similar to the following, the behavior is closer to what I think you are looking for:

// Split on line terminators; the capture group keeps them as their own tokens.
function tokenize(value) {
  return value.split(/(\n|\r\n)/);
}

This results in [ 'restaurant' ] and [ 'restaurant', '\n', 'hello' ] and an eventual diff of restaurant<ins>\nhello</ins>.
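The token lists quoted above can be checked directly in plain JavaScript, without jsdiff. The function name tokenizeSeparatingNewlines is made up for this sketch; the regular expression is the one from the comment.

```javascript
// Because the separator pattern is wrapped in a capture group,
// String.prototype.split keeps each terminator as its own array element.
function tokenizeSeparatingNewlines(value) {
  return value.split(/(\n|\r\n)/);
}

const separatedOld = tokenizeSeparatingNewlines('restaurant');
const separatedNew = tokenizeSeparatingNewlines('restaurant\nhello');

console.log(separatedOld); // [ 'restaurant' ]
console.log(separatedNew); // [ 'restaurant', '\n', 'hello' ]

// The first tokens are now identical, so only '\n' and 'hello' are additions.
console.log(separatedOld[0] === separatedNew[0]); // true

// Concatenation still rebuilds the input, CRLF included.
const crlfTokens = tokenizeSeparatingNewlines('a\r\nb');
console.log(crlfTokens); // [ 'a', '\r\n', 'b' ]
console.log(crlfTokens.join('') === 'a\r\nb'); // true

// One thing to be aware of: a trailing newline yields a final empty token.
console.log(tokenizeSeparatingNewlines('restaurant\n')); // [ 'restaurant', '\n', '' ]
```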

Basically I think a technical solution has been found... but implementing it without breaking other use cases might be a concern. I need to think about the best way to expose this as an API but will try to include something in the next release that exposes this behavior, one way or the other.

wfalkwallace · 2015-08-11T04:54:17Z

Sounds good on the arbitrary reinsertion, and the truth is I'm not particularly concerned with Microsoft CRLFs (but did notice it).

I agree the definition that a line is inclusive of its newline character makes sense for diffLines, but it doesn't for diffTrimmedLines (trim typically includes line breaks), which would mean diffTrimmedLines is not operating correctly. Is there any other way around this without a major overhaul, and without me forking the repo?

Your solution might work though - thanks for adding this to the roadmap!

kpdecker · 2015-08-11T06:29:01Z

For the trimmed case, I think we can reasonably go the route I was proposing, and this is not a major change. For the non-trimmed implementation, it's more that I need to figure out what to call it, as well as finalize the other features in this release. diffLinesWithoutNewLines is a bit clunky.

You can also do a non-trimmed implementation in the interim using something like the following:

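// Create a base Diff instance and override its tokenizer.
var myDiff = new JsDiff.Diff();
myDiff.tokenize = function(value) {
  // Emit line terminators as separate tokens instead of attaching them to lines.
  return value.split(/(\n|\r\n)/);
};
// oldStr and newStr are the two strings being compared.
myDiff.diff(oldStr, newStr);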

kpdecker · 2015-08-26T08:36:02Z

Functionality implemented via diffLinesNL. I'm not a huge fan of the name but wasn't sure of a better option. Open to suggestions if something better comes up before this is in a tagged release.

kpdecker · 2015-08-27T07:42:22Z

Released in 2.1.0

kpdecker added this to the 2.1.0 milestone Aug 11, 2015

wfalkwallace mentioned this issue Aug 20, 2015

diffWords treats \n at the end as significant whitespace #70

Closed

kpdecker closed this as completed in 4a0ba65 Aug 26, 2015

kpdecker mentioned this issue Aug 26, 2015

Consider using options object API for flag permutations #72

Closed

oBusk mentioned this issue Feb 5, 2022

Option to strip/ignore cr at end of line(?) #343

Closed
