Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign the parser #49

Open
ndmitchell opened this issue Feb 10, 2016 · 5 comments
Open

Redesign the parser #49

ndmitchell opened this issue Feb 10, 2016 · 5 comments

Comments

@ndmitchell
Copy link
Owner

The current parser is a String parser that can also plug in ByteString and Text, and then treats them like String, which is inefficient. The plan (which will fix #12, #13, #11 and #5) is to:

  • Read the HTML5 spec a lot. Write it as a DSL. Have that DSL generate code pretty directly which is high performance. That will either be a deep embedding, or a shallow embedding that might spit out some generated Haskell (I suspect the latter).

  • Have the primitive parser have specialisations for different types. There should be a strict Text and ByteString one, and one with and without position numbers. These parsers should take a strict buffer, parse what it can, and output a stream of "steps". Something like:

    parseable :: IsText a => S -> a -> ([Fragment a], S)

The Fragment would say things like "start a tag", "start an attribute name", "start an attribute value", "here is some text". The S would be the state to continue with more data. Given this interface, you can write a lazy parser which is fast because it is strict in each chunk of the strict data structure.

CC @ChristopherKing42 who was asking about this on another ticket.

@ChristopherKing42
Copy link
Contributor

Maybe using parsec or attoparsec or something would be a good idea.

@ndmitchell
Copy link
Owner Author

I'm not adverse to using a parser combinator library, but:

  • The HTML spec isn't written in those terms, so it needs a DSL on top.
  • We want to deal with String, ByteString and Text, which might require doing something a bit more special.
  • The HTML spec involves chopping up a string. It would be nice if given a 1Mb ByteString, we could guarantee most String values in the Tag structure were substrings of that, so we didn't have to reallocate them.
  • I suspect with whatever the key primitives of our DSL are, writing a parser by hand is trivial, since it won't have backtracking (the main advantage of a parser combinator library).

@ChristopherKing42
Copy link
Contributor

I think parser libraries do support different types (which is why I suggested it). Parser libraries go together quite nicely with DSL's as well. Also, the fastest ones are adverse to back tracking. I'm fine either way though.

@ndmitchell
Copy link
Owner Author

Yep, me too - I suspect that's the final 10% of the effort, and most of the work is before there.

@ndmitchell
Copy link
Owner Author

I made some random first stabs at this some time ago, but didn't really get anywhere. Uploading them here mostly as a backup - I can't imagine they will be useful or interesting to anyone other than me, and probably not even me: tagsoup-misc.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants