-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redesign the parser #49
Comments
Maybe using parsec or attoparsec or something would be a good idea. |
I'm not adverse to using a parser combinator library, but:
|
I think parser libraries do support different types (which is why I suggested it). Parser libraries go together quite nicely with DSL's as well. Also, the fastest ones are adverse to back tracking. I'm fine either way though. |
Yep, me too - I suspect that's the final 10% of the effort, and most of the work is before there. |
I made some random first stabs at this some time ago, but didn't really get anywhere. Uploading them here mostly as a backup - I can't imagine they will be useful or interesting to anyone other than me, and probably not even me: tagsoup-misc.zip |
The current parser is a String parser that can also plug in ByteString and Text, and then treats them like String, which is inefficient. The plan (which will fix #12, #13, #11 and #5) is to:
Read the HTML5 spec a lot. Write it as a DSL. Have that DSL generate code pretty directly which is high performance. That will either be a deep embedding, or a shallow embedding that might spit out some generated Haskell (I suspect the latter).
Have the primitive parser have specialisations for different types. There should be a strict Text and ByteString one, and one with and without position numbers. These parsers should take a strict buffer, parse what it can, and output a stream of "steps". Something like:
parseable :: IsText a => S -> a -> ([Fragment a], S)
The
Fragment
would say things like "start a tag", "start an attribute name", "start an attribute value", "here is some text". TheS
would be the state to continue with more data. Given this interface, you can write a lazy parser which is fast because it is strict in each chunk of the strict data structure.CC @ChristopherKing42 who was asking about this on another ticket.
The text was updated successfully, but these errors were encountered: