Lorax TODO
- Signature: calculate and persist signatures and weights for nodes in a single document
- Match: represents a match between two nodes
- MatchSet: composed of Signatures and Matches
- Matcher: an algorithm that operates on a MatchSet statelessly to generate matches
- Generator: generates a DeltaSet from a MatchSet
- Delta: an atomic change step
- DeltaSet: an ordered set of Deltas
- Apply: f(doc1, DeltaSet) -> doc2
- too many web sites fuck up id uniqueness
- libxml2 allows duplicate ids
- algorithm would ignore changed content
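A rough sketch of the Signature item from the component list above: one bottom-up pass that records a signature and a weight for every node in a single document. `Node` and `SignatureTable` are hypothetical stand-ins, not Lorax's actual classes, and both the SHA1 choice and the weight formula (1 + text length + children's weights) are assumptions made for illustration.

```ruby
require 'digest'

# Hypothetical stand-in for a parsed document node (not Lorax's real model).
Node = Struct.new(:name, :text, :children)

# Computes and stores a signature and a weight for every node under a root.
class SignatureTable
  attr_reader :signatures, :weights

  def initialize(root)
    # Key by object identity: two equal-looking subtrees are still distinct nodes.
    @signatures = {}.compare_by_identity
    @weights = {}.compare_by_identity
    visit(root)
  end

  private

  # Post-order walk: children first, so their signatures feed the parent's.
  def visit(node)
    child_signatures = node.children.map { |child| visit(child) }
    text = node.text.to_s
    # Assumed weight: 1 for the node, plus text length, plus children's weights.
    @weights[node] = 1 + text.length + node.children.sum { |child| @weights[child] }
    # Assumed signature: SHA1 over name, text and the children's signatures.
    @signatures[node] = Digest::SHA1.hexdigest([node.name, text, *child_signatures].join("\0"))
  end
end
```

Identical subtrees end up with identical signatures, which is what a Matcher would look up. Persisting the table would additionally need a stable node key (a path, say), which this sketch leaves out.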
if we do “phase 3” in weight order and recursively match parents as we go, can’t we avoid the “propagate to parent” step of phase 4?
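One possible reading of the question above, as a sketch only: visit doc1 nodes heaviest first, match each by subtree signature, and climb to the parents right away while their names agree. `TreeNode`, `match_in_weight_order` and `pair_subtrees` are hypothetical names, and the signature and weight definitions are assumptions. It does not settle whether the phase 4 step can really be dropped.

```ruby
require 'digest'

# Hypothetical node with a parent pointer; signature and weight are assumed
# definitions (subtree SHA1, subtree size including text length).
class TreeNode
  attr_reader :name, :text, :children, :signature, :weight
  attr_accessor :parent

  def initialize(name, text = "", children = [])
    @name, @text, @children = name, text, children
    children.each { |child| child.parent = self }
    @signature = Digest::SHA1.hexdigest([name, text, *children.map(&:signature)].join("\0"))
    @weight = 1 + text.length + children.sum(&:weight)
  end

  # Pre-order traversal of this node and all descendants.
  def each_node(&block)
    yield self
    children.each { |child| child.each_node(&block) }
  end
end

# Pairs two subtrees with equal signatures node by node.
def pair_subtrees(n1, n2, matches, taken)
  matches[n1] = n2
  taken[n2] = true
  n1.children.zip(n2.children) { |c1, c2| pair_subtrees(c1, c2, matches, taken) }
end

# Returns a doc1-node => doc2-node map, built heaviest subtree first.
def match_in_weight_order(root1, root2)
  matches = {}.compare_by_identity
  taken = {}.compare_by_identity
  by_signature = Hash.new { |hash, key| hash[key] = [] }
  root2.each_node { |node| by_signature[node.signature] << node }

  nodes1 = []
  root1.each_node { |node| nodes1 << node }

  nodes1.sort_by { |node| -node.weight }.each do |n1|
    next if matches.key?(n1)
    n2 = by_signature[n1.signature].find { |candidate| !taken.key?(candidate) }
    next unless n2
    pair_subtrees(n1, n2, matches, taken)

    # Match parents immediately instead of propagating in a later phase.
    p1, p2 = n1.parent, n2.parent
    while p1 && p2 && p1.name == p2.name && !matches.key?(p1) && !taken.key?(p2)
      matches[p1] = p2
      taken[p2] = true
      p1, p2 = p1.parent, p2.parent
    end
  end
  matches
end
```

Here a changed leaf stays unmatched while its unchanged sibling pulls the shared ancestors into the match set. Whether climbing on name equality alone is too greedy is an open question.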
- signature hash candidates: ruby hash / md5 / sha1
- benchmark? collision rate?
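A quick way to start answering the two questions above: time each candidate and count collisions over a batch of strings. `HASHERS`, `collision_count` and `compare_hashers` are made-up names for this sketch, and the sample strings are synthetic (an assumption), so the numbers only hint at behaviour on real documents.

```ruby
require 'digest'
require 'benchmark'

# Candidate hash functions from the note above.
HASHERS = {
  "String#hash" => ->(s) { s.hash },
  "MD5"         => ->(s) { Digest::MD5.hexdigest(s) },
  "SHA1"        => ->(s) { Digest::SHA1.hexdigest(s) }
}

# Number of distinct inputs that share a hash value with another distinct input.
def collision_count(inputs, hasher)
  unique = inputs.uniq
  unique.length - unique.map { |s| hasher.call(s) }.uniq.length
end

# Returns one row per hasher with wall-clock time and collision count.
def compare_hashers(inputs)
  HASHERS.map do |name, hasher|
    seconds = Benchmark.realtime { inputs.each { |s| hasher.call(s) } }
    { name: name, seconds: seconds, collisions: collision_count(inputs, hasher) }
  end
end

# Synthetic sample data; real node content would be more telling.
samples = Array.new(10_000) { |i| "<p id=\"n#{i}\">paragraph #{i}</p>" }
compare_hashers(samples).each do |row|
  puts format("%-12s %.4fs  collisions=%d", row[:name], row[:seconds], row[:collisions])
end
```

One caveat worth keeping in mind: Ruby documents that `hash` values may differ between invocations of the interpreter, so they look unsuitable for signatures that are persisted, whatever the benchmark says.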
think about diffing any tree, e.g. AST, YAML
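To make the idea above a bit more concrete, here is a sketch of an adapter that turns parsed YAML into a uniform label-plus-children tree, which a signature pass could then treat like any other document. `GenericNode`, `to_tree` and `tree_signature` are hypothetical names, and the labelling scheme is an assumption.

```ruby
require 'yaml'
require 'digest'

# Hypothetical uniform node: a label plus ordered children.
GenericNode = Struct.new(:label, :children)

# Converts plain Ruby data (as returned by YAML.safe_load) into GenericNodes.
def to_tree(value, label = "root")
  case value
  when Hash
    GenericNode.new(label, value.map { |key, child| to_tree(child, key.to_s) })
  when Array
    GenericNode.new(label, value.each_with_index.map { |child, i| to_tree(child, "[#{i}]") })
  else
    # Scalars become leaves; the value is folded into the label.
    GenericNode.new("#{label}=#{value.inspect}", [])
  end
end

# Subtree signature over labels only, independent of the source format.
def tree_signature(node)
  child_signatures = node.children.map { |child| tree_signature(child) }
  Digest::SHA1.hexdigest([node.label, *child_signatures].join("\0"))
end

block_style = to_tree(YAML.safe_load("a: 1\nb:\n  - x\n  - y\n"))
flow_style  = to_tree(YAML.safe_load("a: 1\nb: [x, y]\n"))
puts tree_signature(block_style) == tree_signature(flow_style)
```

The two documents differ only in YAML syntax, so they produce the same tree and the same signature. Mapping keys are kept in document order here; since YAML mappings are unordered, sorting keys first might be the better choice.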