Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: x/tools/diff: a package for computing text differences #58893

Closed
adonovan opened this issue Mar 6, 2023 · 28 comments
Closed

proposal: x/tools/diff: a package for computing text differences #58893

adonovan opened this issue Mar 6, 2023 · 28 comments

Comments

@adonovan
Copy link
Member

adonovan commented Mar 6, 2023

Recent work in gopls resulted in the creation of an internal package for computing text differences in the manner of the UNIX diff command, for applying those differences to a file in the manner of the patch command, and for presenting line-oriented diffs using +/- prefix notation aka GNU "unified" diff format (diff -u). Diff functionality is invaluable for developer tools that transform source files, and for tests that compare expected and actual outputs. We propose to publish our diff package with the public API shown below.

// Package diff computes differences between text files or strings.
package diff // import "golang.org/x/tools/diff"

// -- diff --

// Strings computes the differences between two strings.
// The resulting edits respect rune boundaries.
func Strings(before, after string) []Edit

// Bytes computes the differences between two byte slices.
// The resulting edits respect rune boundaries.
func Bytes(before, after []byte) []Edit

// An Edit describes the replacement of a portion of a text file.
type Edit struct {
	Start, End int    // byte offsets of the region to replace
	New        string // the replacement
}

func (e Edit) String() string

// -- apply --

// Apply applies a sequence of edits to the src buffer and returns the
// result. Edits are applied in order of start offset; edits with the
// same start offset are applied in the order they were provided.
//
// Apply returns an error if any edit is out of bounds,
// or if any pair of edits is overlapping.
func Apply(src string, edits []Edit) (string, error)

// ApplyBytes is like Apply, but it accepts a byte slice.
// The result is always a new array.
func ApplyBytes(src []byte, edits []Edit) ([]byte, error)

// SortEdits orders a slice of Edits by (start, end) offset.
// This ordering puts insertions (end = start) before deletions
// (end > start) at the same point, but uses a stable sort to preserve
// the order of multiple insertions at the same point.
// (Apply detects multiple deletions at the same point as an error.)
func SortEdits(edits []Edit)

// -- unified --

// Unified returns a unified diff of the old and new strings.
// If the strings are equal, it returns the empty string.
// The old and new labels are the names of the old and new files.
func Unified(oldLabel, newLabel, old, new string) string

// ToUnified applies the edits to content and returns a unified diff.
// It returns an error if the edits are inconsistent; see [Apply].
// The old and new labels are the names of the content and result files.
func ToUnified(oldLabel, newLabel, content string, edits []Edit) (string, error)

@pjweinb @findleyr

@gopherbot gopherbot added this to the Proposal milestone Mar 6, 2023
@pjweinb
Copy link

pjweinb commented Mar 6, 2023

The idea of doing this seems fine, but perhaps the documentation should include a few caveats to cover the following points. (There's lots of ways to write diff, and sadly no best one. This one is for inputs that are fairly similar, or fairly short.)

  1. When the edit distance between the inputs is fairly small, the algorithm might find a minimal edit sequence, but it might not. All that is claimed is that applying the Edits to the 'before' gets the 'after'.
  2. There is an internal (inaccessible) parameter that controls the maximum length of the returned []Edit. If this bound is hit, the returned []Edit will have sensible edits at the beginning and end of the input, but a big replace in the middle.

@bcmills
Copy link
Contributor

bcmills commented Mar 6, 2023

See previously #23113, #41980.

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Mar 6, 2023
@ianlancetaylor
Copy link
Member

See also #45200.

@DeedleFake
Copy link

Instead of SortEdits(), maybe func (Edit) Less(Edit) bool would be more general. It's pretty straightforward to plug in via slices.SortFunc(edits, diff.Edit.Less).

@adonovan
Copy link
Member Author

adonovan commented Mar 6, 2023

I agree that we should guarantee only that the composition of diff.Strings and diff.Apply is the identity, and nothing about the specific edits that it returns. We should probably also define the Unified text form in more detail.

Thanks for the links to related proposals. There's a fair bit of interest in both the narrow concept of text diff as proposed here, and in richer kinds of structured value diff for use in testing. If some form of the latter is accepted into the standard library, then perhaps simple text diff, on which it would depend, would also need to be in the standard library, though not necessarily exposed. I'm going to resist the temptation to argue that this should be a standard package. We can always do that later.

@adonovan
Copy link
Member Author

adonovan commented Mar 6, 2023

Instead of SortEdits(), maybe func (Edit) Less(Edit) bool would be more general. It's pretty straightforward to plug in via slices.SortFunc(edits, diff.Edit.Less).

I tried that initially, but it turns out to be incorrect: it's imperative that you use sort.Stable for edits since insertions at the same point must preserve their relative order.

@DeedleFake
Copy link

I tried that initially, but it turns out to be incorrect: it's imperative that you use sort.Stable for edits since insertions at the same point must preserve their relative order.

There is also a slices.SortStableFunc().

Alternatively, you could add a mechanism to Edit that could help them maintain their relative ordering outside of external context such as a numbered priority. Otherwise, if you need a function to sort a []Edit, the implicit assumption is that an unsorted slice is likely to be obtained from somewhere, but if the relative ordering of the elements of that slice is important than there's also an assumption that that slice will always already be partially ordered correctly. That seems kind of error-prone to me.

@adonovan
Copy link
Member Author

adonovan commented Mar 6, 2023

the implicit assumption is that an unsorted slice is likely to be obtained from somewhere, but if the relative ordering of the elements of that slice is important than there's also an assumption that that slice will always already be partially ordered correctly. That seems kind of error-prone to me.

The definition of Apply makes clear that the slice of edits is a list, not a set: the relative ordering of insertions is important. But Apply can call SortEdits internally. Within gopls, we use SortEdits after merging lists of edits to the same file, but simple concatenation should suffice. It's also used to ensure to ensure a deterministic order, which some clients have mistakenly assumed.

Perhaps we should remove SortEdits from the API and let gopls implement its own copy of that function.

@earthboundkid
Copy link
Contributor

Can the Strings and Bytes functions be unified behind [byteseq string|[]byte]? Perhaps diff.Of?

@adonovan
Copy link
Member Author

adonovan commented Mar 6, 2023

Can the Strings and Bytes functions be unified behind [byteseq string|[]byte]? Perhaps diff.Of?

They could, but it seems like a lot of trouble just to achieve name overloading.

@earthboundkid
Copy link
Contributor

The other way around, I feel like it's a lot of work to have duplicate Strings and Bytes functions that work the same way instead of having callers cast their []byte to string or having a single generic function.

@adonovan
Copy link
Member Author

adonovan commented Mar 6, 2023

The other way around, I feel like it's a lot of work to have duplicate Strings and Bytes functions that work the same way instead of having callers cast their []byte to string or having a single generic function.

The allocation penalty of the bytes conversions for large files when the diff is empty (the common case, which needn't allocate any Edits) is a pretty significant multiple of the cost of the actual diff algorithm: 8x in one quick benchmark of a 10KB file.

@DeedleFake
Copy link

There's no need to maintain two versions of the functionality in either case. It's very easy as proposed to just use the same implementation since it doesn't perform any modification in Bytes():

func Strings(before, after string) []Edit {
  return Bytes(
    unsafe.Slice(unsafe.StringData(before), len(before)),
    unsafe.Slice(unsafe.StringData(after), len(after)),
  )
}

Or, so as to avoid the unsafe usage, you could have each of the exported functions call a third, unexported function that's generic like your suggestion. I don't think that's really necessary, though.

@adonovan
Copy link
Member Author

adonovan commented Mar 6, 2023

it doesn't perform any modification in Bytes():

That's true, but the proof requires auditing a thousand lines of code across half a dozen files, so I hesitate.

The generics approach is safer and more appealing, and it should nicely handle even the intentional differences between bytes and strings. For example, string(x)==string(y) is an efficient way to compare bytestring values, and Edit{New: string(x[i:j])} allocates if and only if x is a []byte. (Sadly golang.org/x/tools still can't use generics quite yet, but soon.)

It may cause significant expansion of the object code though; I should measure.

@adonovan
Copy link
Member Author

adonovan commented Mar 8, 2023

Chatting with @rsc yesterday he raised a number of interesting questions:

  • Should the Edit.New field be a string as proposed, potentially incurring allocation when the input is a []byte slice, or a pair of indices into the "new" text? The latter avoids allocation but requires that the new text be present in order to further interpret the slice of edits.

  • Should the order of arguments to Unified be (oldLabel string, oldContent []byte, newLabel string, newContent []byte), alternating labels and contents, so that it's impossible to switch a label and a content by mistake? Personally I'm not sure the risk is so great that it's worth doing. If you execute the statement even once with the arguments switched you'll notice immediately (e.g. your test for an empty diff will fail). And if we provided a string-content variant of Unified (as perhaps we should for both convenience and efficiency) then it would nullify the benefit.

  • What exactly is promised regarding the diff? I think we should promise only that the composition Apply(Diff(x, y), x) yields y. For unified diffs we don't have a parser; in practice the one that matters is GNU patch, so we could require that our output can be successfully parsed and applied by it, but GNU patch's "detailed description" of the unified diff format leaves some things unstated.

  • The Myers algorithm (without a depth bound) is quadratic. Our implementation does use a depth bound, but may yet generate very suboptimal diffs for large files with widespread edits. (Classic worst case: removing carriage returns before line feeds in Windows text files.) Is that acceptable? Russ added an implementation of Hunt-Szymanski internal to the standard library; it is O(n log n) and based on unique lines.

  • Our implementation returns sub-line edits (e.g. the diff for a fix to a single-letter typo may be minimal). Internally it merges and expands these to line granularity when printing unified output, but the function to do this is not exposed. Do real clients want sub-line edits? (gopls does)

  • Proposals to add a testing-specific diff function (e.g. testing.Diff) might have an easier time answering some of these questions because the domain is more constrained: typically, in a test, the inputs and outputs small, they are nearly always equal, and the desired output is just unified text, not a list-of-edits data structure. A general purpose API such as this one might need to consider more extreme performance outliers.

@bcmills
Copy link
Contributor

bcmills commented Mar 8, 2023

For unified diffs we don't have a parser; in practice the one that matters is GNU patch, so we could require that our output can be successfully parsed and applied by it, but GNU patch's "detailed description" of the unified diff format leaves some things unstated.

That sounds like an excellent application for a fuzz test! 🙃

@rsc rsc moved this from Incoming to Active in Proposals Mar 8, 2023
@rsc
Copy link
Contributor

rsc commented Mar 8, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@dsnet
Copy link
Member

dsnet commented Mar 8, 2023

On one hand, I would like this to be a generic Diff function that can compare any two []E, which is what cmp needs. However, I recognize that there are a lot of cool optimizations that only exist if we treat []E as essentially a sequence of runes (or bytes).

I think there's utility in providing a Diff that is specialized for text, but would still love to see a Diff that can find the difference between two lists of length n and m given only an equality function (similar to the internal Difference function used by cmp).

@dmitshur

This comment was marked as duplicate.

@dolmen
Copy link
Contributor

dolmen commented Apr 12, 2023

This package doesn't seem to be specific to the Go language like others packages in x/tools. From README:

This repository provides the golang.org/x/tools module, comprising various tools and packages mostly for static analysis of Go programs

So why not x/text/diff?

@adonovan
Copy link
Member Author

This package doesn't seem to be specific to the Go language like others packages in x/tools. [...] So why not x/text/diff?

The text module is primarily concerned with internationalization and localization, and deals with Unicode normalization, whereas this implementation computes differences over rune sequences without regard to such matters. Also, diff is an important component of typical code transformation tools, which is very much the domain of the x/tools module.

@rsc
Copy link
Contributor

rsc commented Apr 12, 2023

FWIW I think x/text/diff would be a fine home for a diff package - text doesn't have to be all about Unicode. For example text/template is not. But we probably still shouldn't commit to a diff API right now anyway.

@pjweinb
Copy link

pjweinb commented Apr 25, 2023

I think this package is too special purpose. It's designed for an editor, where sub-line diffs are useful, but there's no point in producing a large number of them.

@pjweinb pjweinb closed this as completed Apr 25, 2023
@mpx
Copy link
Contributor

mpx commented Apr 25, 2023

This issue is actively tracked through the proposal process, presumably it should be re-opened to allow that process to continue?

@DeedleFake
Copy link

@pjweinb doesn't seem to be a member of the Go team. How did he close the issue? Shouldn't that be something only the author of the issue or a moderator can do?

@pjweinb
Copy link

pjweinb commented Apr 25, 2023 via email

@adonovan
Copy link
Member Author

Hi, Peter is indeed a member of the Go team, and the main author of the package. The proposal was a joint one, and we are both happy to retract it in light of the various concerns raised. Sorry for the confusion.

@earthboundkid
Copy link
Contributor

@pjweinb, I think there's a good avatar you could use for your Github account online somewhere, so people recognize you.

@adonovan adonovan moved this from Active to Declined in Proposals Apr 25, 2023
@golang golang locked and limited conversation to collaborators Apr 24, 2024
@rsc rsc removed this from Proposals Apr 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests