Lorax TODO

docs

rdocs

class description notes

Signature: calculates and persists signatures and weights for nodes in a single document
Match: represents a match between two nodes
MatchSet: composed of Signatures and Matches.
Matcher: an algorithm that operates on a MatchSet statelessly to generate matches.
Generator: generates a DeltaSet from a MatchSet
Delta: an atomic change step
DeltaSet: an ordered set of Deltas
Apply: f(doc1, DeltaSet) -> doc2
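
sketch of the Signature idea, not the real implementation. assumptions: `Node` is a hypothetical stand-in for a document node (the real code works on parsed documents), `SignatureTable` is a made-up name, SHA1 is only a placeholder until a hashing algorithm is picked, and weight is assumed to be 1 plus the sum of the children's weights.

```ruby
require "digest"

# Hypothetical stand-in for a document node: a name, optional text, child nodes.
Node = Struct.new(:name, :text, :children)

# Calculates and stores a signature and a weight for every node under a root.
class SignatureTable
  attr_reader :signatures, :weights

  def initialize(root)
    # Key by object identity, so two equal-looking nodes stay distinct entries.
    @signatures = {}.compare_by_identity
    @weights = {}.compare_by_identity
    visit(root)
  end

  private

  # Post-order walk: children first, so a parent's signature covers its subtree.
  def visit(node)
    child_signatures = node.children.map { |child| visit(child) }
    # Assumed weight rule: 1 for the node itself plus the weight of its children.
    @weights[node] = 1 + node.children.sum { |child| @weights[child] }
    payload = [node.name, node.text.to_s, *child_signatures].join("\0")
    @signatures[node] = Digest::SHA1.hexdigest(payload)
  end
end

doc1 = Node.new("div", nil, [Node.new("p", "hello", []), Node.new("p", "world", [])])
doc2 = Node.new("div", nil, [Node.new("p", "hello", []), Node.new("p", "there", [])])

table1 = SignatureTable.new(doc1)
table2 = SignatureTable.new(doc2)

# Identical subtrees share a signature, which is what a Matcher could key on.
same = table1.signatures[doc1.children[0]] == table2.signatures[doc2.children[0]]
puts "first <p> matches: #{same}"
puts "root weight: #{table1.weights[doc1]}"
```

because a signature covers the whole subtree, the changed second `p` also changes the signature of the enclosing `div`, while the untouched first `p` still matches across documents.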

algorithmic notes

ignoring ID

too many web sites get ids wrong
libxml2 allows duplicate ids
algorithm would ignore changed content

do indexes (ascendant lookahead) need to be implemented?

if we do “phase 3” in weight-order, and recursively match parents, can’t we avoid the “propagate to parent” step of phase 4?

core

write integration test for MODIFY delta

write integration test for DELETE delta

write integration test for MODIFY delta with move

change API to specify HTML or XML, or should we make the user pass in docs already parsed by Nokogiri?

pick a hashing algorithm

ruby hash / md5 / sha1
benchmark? collision rate?
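
rough way to compare the three candidates, as a starting point only. assumptions: `CANDIDATES`, `SAMPLES` and `hash_report` are made-up names, and the sample strings are synthetic rather than real node content, so timings and collision counts from this say little about real documents.

```ruby
require "benchmark"
require "digest"

# The three candidates listed above, keyed by a display label.
CANDIDATES = {
  "String#hash" => ->(s) { s.hash },
  "MD5"         => ->(s) { Digest::MD5.hexdigest(s) },
  "SHA1"        => ->(s) { Digest::SHA1.hexdigest(s) },
}

# Synthetic, all-distinct inputs standing in for serialized node content.
SAMPLES = Array.new(10_000) { |i| "<p id=\"n#{i}\">node #{i}</p>" }

# Returns one row per candidate: wall-clock seconds and number of collisions.
def hash_report(candidates, samples)
  candidates.map do |label, fn|
    digests = nil
    seconds = Benchmark.realtime { digests = samples.map(&fn) }
    # Inputs are distinct, so any repeated digest is a collision.
    { label: label, seconds: seconds, collisions: samples.size - digests.uniq.size }
  end
end

hash_report(CANDIDATES, SAMPLES).each do |row|
  puts format("%-12s %.4fs  collisions: %d", row[:label], row[:seconds], row[:collisions])
end
```

one thing to keep in mind: `String#hash` is seeded per Ruby process, so its values are not stable across runs and can't be persisted, unlike md5 / sha1 digests.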

additional

build an rspec matcher for xml

build a test/unit assertion for xml

try to make the code independent of the tree we’re diffing

think about diffing any tree, e.g. AST, YAML

benchmark suite so we can try different algorithms