Lorax TODO

docs

rdocs

class description notes

  • Signature: calculate and persist signatures and weights for nodes in a single document
  • Match: represents a match between two nodes
  • MatchSet: composed of Signatures and Matches.
  • Matcher: an algorithm that operates on a MatchSet statelessly to generate matches.
  • Generator: generates a DeltaSet from a MatchSet
  • Delta: an atomic change step
  • DeltaSet: an ordered set of Deltas
  • Apply: f(doc1, DeltaSet) -> doc2
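As a rough illustration of the Signature role described above, here is a minimal sketch. It is not Lorax's code: `Node` is a hypothetical stand-in for a parsed document node, SHA1 is only one of the candidate hashes, and the weight formula (1 plus the log of the text length for leaves, 1 plus the sum of child weights for inner nodes) is an assumption made for the example.

```ruby
require "digest/sha1"

# Hypothetical stand-in for a document node (not Nokogiri's or Lorax's API).
Node = Struct.new(:name, :text, :children)

# Computes and stores a signature and a weight for every node in one document.
class Signature
  attr_reader :hashes, :weights

  def initialize(root)
    # Key by object identity so two equal-looking nodes stay distinct entries.
    @hashes  = {}.compare_by_identity
    @weights = {}.compare_by_identity
    visit(root)
  end

  private

  def visit(node)
    if node.children.empty?
      text = node.text.to_s
      @hashes[node]  = Digest::SHA1.hexdigest("#{node.name}\0#{text}")
      # Assumed weighting: longer text makes a heavier leaf.
      @weights[node] = 1 + Math.log(1 + text.length)
    else
      node.children.each { |child| visit(child) }
      child_hashes   = node.children.map { |child| @hashes[child] }.join
      @hashes[node]  = Digest::SHA1.hexdigest("#{node.name}\0#{child_hashes}")
      @weights[node] = 1 + node.children.sum { |child| @weights[child] }
    end
  end
end
```

Two subtrees with the same names, text and child order end up with the same signature, which is what a Matcher would look for; the weight gives it a way to try large subtrees first.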

algorithmic notes

ignoring ID
  • too many web sites get ids wrong
  • libxml2 allows duplicate ids
  • algorithm would ignore changed content
do indexes (ascendant lookahead) need to be implemented?
if we do “phase 3” in weight-order, and recursively match parents, can’t we avoid the “propagate to parent” step of phase 4?
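One reading of the weight-order question above, as a sketch only: visit the first document's nodes heaviest first, pair each with an unmatched node of the same signature in the second document, and immediately climb both parent chains while the parents are unmatched and share a name. `MNode`, `build` and `match_by_weight` are hypothetical names, and signatures and weights are assumed to be precomputed and consistent (equal signatures imply equal shape and weight).

```ruby
# Hypothetical node with precomputed signature and weight.
MNode = Struct.new(:name, :signature, :weight, :parent, :children)

# Helper for the example: creates a node and wires up parent pointers.
def build(name, signature, weight, children = [])
  node = MNode.new(name, signature, weight, nil, children)
  children.each { |child| child.parent = node }
  node
end

def each_node(root, &block)
  block.call(root)
  root.children.each { |child| each_node(child, &block) }
end

# Pairs two subtrees with identical signatures, node by node.
def pair_subtrees(n1, n2, matches, taken)
  matches[n1] = n2
  taken[n2] = true
  n1.children.zip(n2.children) { |c1, c2| pair_subtrees(c1, c2, matches, taken) }
end

def match_by_weight(root1, root2)
  matches = {}.compare_by_identity # doc1 node => doc2 node
  taken   = {}.compare_by_identity # doc2 nodes already matched

  by_signature = Hash.new { |hash, key| hash[key] = [] }
  each_node(root2) { |node| by_signature[node.signature] << node }

  nodes1 = []
  each_node(root1) { |node| nodes1 << node }

  nodes1.sort_by { |node| -node.weight }.each do |n1|
    next if matches.key?(n1)
    n2 = by_signature[n1.signature].find { |candidate| !taken.key?(candidate) }
    next unless n2

    pair_subtrees(n1, n2, matches, taken)

    # Match parents right away instead of in a later propagation pass.
    p1 = n1.parent
    p2 = n2.parent
    while p1 && p2 && !matches.key?(p1) && !taken.key?(p2) && p1.name == p2.name
      matches[p1] = p2
      taken[p2] = true
      p1 = p1.parent
      p2 = p2.parent
    end
  end

  matches
end
```

Whether this really removes the need for a separate propagation step depends on cases the sketch does not cover, such as a parent whose children match nodes under different parents.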

core

write integration test for MODIFY delta

write integration test for DELETE delta

write integration test for MODIFY delta with move

change API to specify HTML or XML, or should we make the user pass in already-parsed Nokogiri docs?

pick a hashing algorithm

  • ruby hash / md5 / sha1
  • benchmark? collision rate?
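A small harness along these lines could help answer the benchmark and collision questions above. It is a sketch under assumptions: `CANDIDATES` and `compare_hashes` are hypothetical names, and the sample strings are made up, so real node content may behave differently.

```ruby
require "digest/md5"
require "digest/sha1"

# The three candidates from the list above, each mapping a String to a digest.
CANDIDATES = {
  "ruby hash" => ->(s) { s.hash },
  "md5"       => ->(s) { Digest::MD5.hexdigest(s) },
  "sha1"      => ->(s) { Digest::SHA1.hexdigest(s) }
}

# Times each candidate over the same inputs and counts collisions
# (distinct inputs that produced the same digest).
def compare_hashes(inputs)
  CANDIDATES.each_with_object({}) do |(name, fn), report|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    digests = inputs.map { |s| fn.call(s) }
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    report[name] = {
      seconds: elapsed,
      collisions: inputs.uniq.size - digests.uniq.size
    }
  end
end

# Made-up sample data standing in for serialized nodes.
samples = Array.new(10_000) { |i| "<p id='#{i}'>paragraph #{i}</p>" }
compare_hashes(samples).each do |name, result|
  puts format("%-10s %.4fs  collisions=%d", name, result[:seconds], result[:collisions])
end
```

One point to weigh alongside speed: `String#hash` is seeded per Ruby process, so its values cannot be persisted and compared across runs, while MD5 and SHA1 digests can.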

additional

build an rspec matcher for xml

build a test/unit assertion for xml

try to make the code independent of the tree we’re diffing

think about diffing any tree, e.g. AST, YAML
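One way to approach tree independence is to have the diffing code depend only on a tiny node protocol and wrap each kind of tree in an adapter. The sketch below assumes a hypothetical protocol of `#label` and `#children`; `DataNode` adapts nested Hash/Array/scalar data of the kind a YAML loader returns, and `subtree_signature` is an illustrative function, not part of Lorax.

```ruby
require "digest/sha1"

# Adapter exposing nested Hash/Array/scalar data through the assumed
# two-method protocol: #label (String) and #children (Array of nodes).
class DataNode
  attr_reader :label, :children

  def initialize(label, value)
    case value
    when Hash
      @label = label
      @children = value.map { |key, child| DataNode.new(key.to_s, child) }
    when Array
      @label = label
      @children = value.map { |child| DataNode.new("item", child) }
    else
      # Scalars become leaves whose label carries the value.
      @label = "#{label}=#{value.inspect}"
      @children = []
    end
  end
end

# Tree-agnostic signature: relies only on #label and #children.
def subtree_signature(node)
  parts = [node.label] + node.children.map { |child| subtree_signature(child) }
  Digest::SHA1.hexdigest(parts.join("\0"))
end

old_config = { "name" => "lorax", "deps" => ["nokogiri"] }
new_config = { "name" => "lorax", "deps" => ["nokogiri", "rspec"] }
same = subtree_signature(DataNode.new("root", old_config)) ==
       subtree_signature(DataNode.new("root", new_config))
puts "configs identical? #{same}"
```

An XML or AST adapter would expose the same two methods, so the signature and matching code would not need to know which kind of tree it is walking.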

benchmark suite so we can try different algorithms