Switch to a custom Myers-based LCS engine #103

Closed
wants to merge 6 commits into from
63 changes: 63 additions & 0 deletions BENCHMARKING.md
@@ -0,0 +1,63 @@
# Benchmarking diff

The engine used by our diff tool tries to balance execution time with patch
quality. It implements the Myers algorithm with a few heuristics which are also
used by GNU diff to avoid pathological cases.

The original paper can be found here:
- https://link.springer.com/article/10.1007/BF01840446
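
As a rough illustration of the core idea in that paper, the sketch below computes only the length of the shortest edit script using the greedy forward pass. It is a textbook version written for this document, not the engine from this change: it has none of the heuristics mentioned above and does not recover the edits themselves. The function name `myers_distance` is an assumption made for the example.

```rust
/// Length of the shortest edit script (insertions plus deletions) that turns
/// `a` into `b`, using the greedy O((N+M)D) forward pass from the Myers paper.
/// Illustrative only: no heuristics, no linear-space refinement.
fn myers_distance<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let n = a.len() as isize;
    let m = b.len() as isize;
    let max = n + m;
    if max == 0 {
        return 0;
    }
    // v[k + offset] holds the furthest x reached so far on diagonal k = x - y.
    let offset = max;
    let mut v = vec![0isize; (2 * max + 1) as usize];
    for d in 0..=max {
        let mut k = -d;
        while k <= d {
            let idx = (k + offset) as usize;
            // Extend whichever neighbouring diagonal already reaches further.
            let mut x = if k == -d || (k != d && v[idx - 1] < v[idx + 1]) {
                v[idx + 1] // move down: take an element from `b` (insertion)
            } else {
                v[idx - 1] + 1 // move right: drop an element of `a` (deletion)
            };
            let mut y = x - k;
            // Follow the "snake": matching elements cost nothing.
            while x < n && y < m && a[x as usize] == b[y as usize] {
                x += 1;
                y += 1;
            }
            v[idx] = x;
            if x >= n && y >= m {
                return d as usize;
            }
            k += 2;
        }
    }
    unreachable!("an edit script of length n + m always exists")
}
```

For the example used in the paper, `ABCABBA` against `CBABAC`, this returns 5. The cost grows with the number of differences `d`, which is why files that differ heavily are the expensive case.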

Currently, our implementation does not adopt all of the tricks used by GNU diff.
For instance, GNU diff sets aside lines that exist in only one of the files and
excludes them from the diffing process. It also post-processes the edits to
produce more cohesive hunks. Combined, these two techniques should make it
produce better patches for large files which are very different.
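
To make the first of those tricks concrete, here is a minimal sketch of setting aside lines that cannot match anything in the other file. It is a simplified illustration of the idea, not GNU diff's actual code and not part of this change; the function name `lines_shared_with` is hypothetical.

```rust
use std::collections::HashSet;

/// Returns the lines of `a` that also occur somewhere in `b`, together with
/// their original indices. Lines left out can only ever be deletions (or
/// insertions, when called the other way round), so the LCS engine does not
/// need to look at them.
fn lines_shared_with<'a>(a: &[&'a str], b: &[&str]) -> Vec<(usize, &'a str)> {
    let in_b: HashSet<&str> = b.iter().copied().collect();
    a.iter()
        .copied()
        .enumerate()
        .filter(|(_, line)| in_b.contains(*line))
        .collect()
}
```

Running it once in each direction gives two shorter sequences to diff. The kept indices let the resulting edits be mapped back onto the original line numbers.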

Run `cargo build --release` before benchmarking after you make a change!

## How to benchmark

It is recommended that you use the 'hyperfine' tool to run your benchmarks. This
is an example of how to run a comparison with GNU diff:

```
> hyperfine -N -i --warmup 2 --output=pipe 'diff t/huge t/huge.3' \
    './target/release/diffutils diff t/huge t/huge.3'
Benchmark 1: diff t/huge t/huge.3
Time (mean ± σ): 136.3 ms ± 3.0 ms [User: 88.5 ms, System: 17.9 ms]
Range (min … max): 131.8 ms … 144.4 ms 21 runs

Warning: Ignoring non-zero exit code.

Benchmark 2: ./target/release/diffutils diff t/huge t/huge.3
Time (mean ± σ): 74.4 ms ± 1.0 ms [User: 47.6 ms, System: 24.9 ms]
Range (min … max): 72.9 ms … 77.1 ms 41 runs

Warning: Ignoring non-zero exit code.

Summary
./target/release/diffutils diff t/huge t/huge.3 ran
1.83 ± 0.05 times faster than diff t/huge t/huge.3
>
```

As you can see, you should provide both commands you want to compare in a single
invocation of 'hyperfine'. Pass each command as a single argument, which is why
they are quoted. These are the relevant parameters:

- `-N`: avoids using a shell as an intermediary to run the command
- `-i`: ignores non-zero exit codes, which diff uses to mean the files differ
- `--warmup 2`: performs 2 runs before measuring, which warms up the I/O cache for large files
- `--output=pipe`: disables any potential optimizations based on the output destination

## Inputs

Performance will vary based on several factors, the main ones being:

- how large the files being compared are
- how different the files being compared are
- how large and far between sequences of equal lines are

When looking at performance improvements, it is important to test both small and
large (tens of MBs) files that have few differences, many differences, or are
completely different, so that all of the potential pathological cases are covered.
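
One way to get such inputs is to generate them. The sketch below builds a pair of texts in which roughly a chosen percentage of lines differ. Everything in it (the `Lcg` type, `make_pair`, the line contents) is made up for illustration and is not part of the repository; a small hand-rolled generator is used only so the example needs no external crates.

```rust
/// Tiny deterministic pseudo-random generator, so that generated inputs are
/// reproducible between benchmark runs.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Builds an original text with `lines` lines and a modified copy in which
/// roughly `percent_changed` percent of the lines have been rewritten.
fn make_pair(lines: usize, percent_changed: u64, seed: u64) -> (String, String) {
    let mut rng = Lcg(seed);
    let mut original = String::new();
    let mut modified = String::new();
    for i in 0..lines {
        let line = format!("line {i} of the original file\n");
        original.push_str(&line);
        if rng.next_u64() % 100 < percent_changed {
            // Rewritten line: it no longer matches anything in the original.
            modified.push_str(&format!("changed line {i}\n"));
        } else {
            modified.push_str(&line);
        }
    }
    (original, modified)
}
```

Writing the two strings to files with `std::fs::write` for a few combinations, for example 1%, 50% and 100% changed at both a small and a large line count, gives a set of inputs covering the cases listed above. Note that this only rewrites lines in place; it does not control how long the runs of equal lines are, so it covers the third factor only loosely.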