A demonstration how simple and powerful parser combinators are.
Converting Word HTM documents to semantically structured XML includes building up hierarchical structures from flat h1 and p elements.
The problem of transforming flat input to some tree-ish structure smells like a parser is somehow useful.
Informal answer: Something that consumes a sequence of values and builds up some data structure.
A more formal, FP-ish answer:
Every function p
conforming to the signature [input] -> [result remaining-input]
+-------------+
| |--> result | :invalid
input --> | p |
| |--> remaining-input
+-------------+
which returns two values:
:invalid
if it does not recognizeinput
, or otherwise a possibly transformed prefix ofinput
asresult
.- the
remaining-input
that it did not consume.
A parser eventually succeeds on an input if it returns nil (or an empty list) as remaining input.
pred->parser
: Create a parser from a predicatealt
: Accept input if at least one of the given parsers accepts itmany*
: Accept input if the given parser repeatedly accepts valuesmany+
: Likemany*
but expects at least one valueoptional
: Accept input with given parser, or return [nil input] if parser failssequence
: Apply given parsers in order, fail if not all parsers succeedtransform
: Apply function to result of given parserdescend
: Regard first value of input als sub-input and apply given parser to it
;; a helper transformation
(defn into-first
[[x xs]]
(into [x] xs))
;; define some parsers
(def paragraph-parser
(pc/pred->parser #{:p}))
(def section-parser
(pc/transform
into-first
(pc/sequence (pc/pred->parser #{:h1})
(pc/many* paragraph-parser))))
(def tr-parser
(pc/transform into-first
(pc/sequence
(pc/pred->parser #{:tr})
(pc/many+ (pc/pred->parser #{:td})))))
(def table-parser
(pc/transform into-first
(pc/sequence
(pc/pred->parser #{:table})
(pc/many+ (pc/descend tr-parser)))))
(def content-parser
(pc/many+ (pc/alt section-parser paragraph-parser (pc/descend table-parser))))
;; invoke one on an input sequence
(content-parser [:p :p :h1 :p [:table [:tr :td :td] [:tr :td]]])
;= [[:p :p [:h1 :p] [:table [:tr :td :td] [:tr :td]]]
nil]
- Basically everything that a recursive descent parser can recognize. AFAIK it's an LL(1) parser.
- Beware non-termination when combining
many
withoptional
! - If ambiguities exist in the grammar it is hard to predict the results.