diff: track total cost of search and bail if high
This is the last piece of the puzzle needed to get performance roughly comparable to GNU diff without implementing all of its tricks, although GNU diff also uses this one, in its own way. It brings a diff that still takes over a minute with the previous commit down to under a second:

    Benchmark 1: diff test-data/b.cpp test-data/c.cpp
      Time (mean ± σ):      2.533 s ±  0.011 s    [User: 2.494 s, System: 0.027 s]
      Range (min … max):    2.519 s …  2.553 s    10 runs

      Warning: Ignoring non-zero exit code.

    Benchmark 2: ./target/release/diffutils.local-heuristics diff test-data/b.cpp test-data/c.cpp
      Time (mean ± σ):     65.798 s ±  1.080 s    [User: 65.367 s, System: 0.053 s]
      Range (min … max):   64.962 s … 68.137 s    10 runs

      Warning: Ignoring non-zero exit code.

    Benchmark 3: ./target/release/diffutils diff test-data/b.cpp test-data/c.cpp
      Time (mean ± σ):     580.6 ms ±   6.5 ms    [User: 521.9 ms, System: 38.8 ms]
      Range (min … max):   570.7 ms … 589.6 ms    10 runs

      Warning: Ignoring non-zero exit code.

    Summary
      ./target/release/diffutils diff test-data/b.cpp test-data/c.cpp ran
        4.36 ± 0.05 times faster than diff test-data/b.cpp test-data/c.cpp
      113.33 ± 2.26 times faster than ./target/release/diffutils.local-heuristics diff test-data/b.cpp test-data/c.cpp

The change keeps track of how much work we have done overall for a diff job, and lets us give up completely on trying to find ideal split points if the accumulated cost implies we had to trigger the "too expensive" heuristic too often. From that point forward it only does naive splitting of the work.

This should not generate diffs which are much worse than those produced by the diagonal search, as it should only trigger in cases where the files are so different that the search would not find good split points anyway.
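
As a rough illustration of the idea described above (not the code in this commit), the bookkeeping could be sketched as follows. The names `SearchBudget` and `choose_split`, the cost unit, and the use of "halve both ranges" as the naive split are all assumptions made for this sketch; the real search and its cost accounting live in the diff implementation itself.

```rust
/// Hypothetical per-job budget. One instance is shared by every
/// split-point search that belongs to the same diff job.
struct SearchBudget {
    /// Cost accumulated so far across all searches of this job.
    total_cost: usize,
    /// Threshold above which we stop looking for ideal split points.
    limit: usize,
}

impl SearchBudget {
    fn new(limit: usize) -> Self {
        SearchBudget { total_cost: 0, limit }
    }

    /// Record the cost of one search (saturating, so it cannot overflow).
    fn charge(&mut self, cost: usize) {
        self.total_cost = self.total_cost.saturating_add(cost);
    }

    /// True once the job has spent more than its budget.
    fn exhausted(&self) -> bool {
        self.total_cost > self.limit
    }
}

/// Pick a split point for ranges of length `old_len` and `new_len`.
///
/// `search` stands in for the expensive diagonal search; it returns the
/// split it found plus the cost it spent. Once the shared budget is
/// exhausted the search is skipped entirely and both ranges are simply
/// halved (an assumed form of "naive splitting").
fn choose_split<F>(
    budget: &mut SearchBudget,
    old_len: usize,
    new_len: usize,
    search: F,
) -> (usize, usize)
where
    F: FnOnce() -> ((usize, usize), usize),
{
    if budget.exhausted() {
        return (old_len / 2, new_len / 2);
    }
    let (split, cost) = search();
    budget.charge(cost);
    split
}
```

The point of keeping the budget per job rather than per search is that a single expensive search is tolerated, but a job that keeps hitting expensive searches eventually stops paying for them at all.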
This is another case in which GNU diff's additional work (hashing, splitting large chunks of inclusion / deletion off from the diff work, and trying harder to find ideal splits) seems to cause it to perform slightly worse.

That said, GNU diff probably still generates better diffs, not because of this but because of its post-processing of the results, which tries to create more hunks with nearby changes staying close to each other. We do not do that (but we didn't do that before anyway).