Redesign the parser #49

ndmitchell · 2016-02-10T10:56:20Z

The current parser is a String parser that can also plug in ByteString and Text, and then treats them like String, which is inefficient. The plan (which will fix #12, #13, #11 and #5) is to:

Read the HTML5 spec a lot. Write it as a DSL. Have that DSL generate code pretty directly which is high performance. That will either be a deep embedding, or a shallow embedding that might spit out some generated Haskell (I suspect the latter).
Have the primitive parser have specialisations for different types. There should be a strict Text and ByteString one, and one with and without position numbers. These parsers should take a strict buffer, parse what it can, and output a stream of "steps". Something like:

parseable :: IsText a => S -> a -> ([Fragment a], S)

The Fragment would say things like "start a tag", "start an attribute name", "start an attribute value", "here is some text". The S would be the state to continue with more data. Given this interface, you can write a lazy parser which is fast because it is strict in each chunk of the strict data structure.

CC @ChristopherKing42 who was asking about this on another ticket.

The text was updated successfully, but these errors were encountered:

ChristopherKing42 · 2016-02-10T13:28:56Z

Maybe using parsec or attoparsec or something would be a good idea.

ndmitchell · 2016-02-10T13:39:14Z

I'm not adverse to using a parser combinator library, but:

The HTML spec isn't written in those terms, so it needs a DSL on top.
We want to deal with String, ByteString and Text, which might require doing something a bit more special.
The HTML spec involves chopping up a string. It would be nice if given a 1Mb ByteString, we could guarantee most String values in the Tag structure were substrings of that, so we didn't have to reallocate them.
I suspect with whatever the key primitives of our DSL are, writing a parser by hand is trivial, since it won't have backtracking (the main advantage of a parser combinator library).

ChristopherKing42 · 2016-02-10T13:49:59Z

I think parser libraries do support different types (which is why I suggested it). Parser libraries go together quite nicely with DSL's as well. Also, the fastest ones are adverse to back tracking. I'm fine either way though.

ndmitchell · 2016-02-10T14:04:07Z

Yep, me too - I suspect that's the final 10% of the effort, and most of the work is before there.

ndmitchell · 2016-07-02T04:38:55Z

I made some random first stabs at this some time ago, but didn't really get anywhere. Uploading them here mostly as a backup - I can't imagine they will be useful or interesting to anyone other than me, and probably not even me: tagsoup-misc.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redesign the parser #49

Redesign the parser #49

ndmitchell commented Feb 10, 2016

ChristopherKing42 commented Feb 10, 2016

ndmitchell commented Feb 10, 2016

ChristopherKing42 commented Feb 10, 2016

ndmitchell commented Feb 10, 2016

ndmitchell commented Jul 2, 2016

Redesign the parser #49

Redesign the parser #49

Comments

ndmitchell commented Feb 10, 2016

ChristopherKing42 commented Feb 10, 2016

ndmitchell commented Feb 10, 2016

ChristopherKing42 commented Feb 10, 2016

ndmitchell commented Feb 10, 2016

ndmitchell commented Jul 2, 2016