Is parsing markdown necessary? #288

Open
websiddu opened this issue Nov 27, 2024 · 1 comment

Comments

@websiddu

So, I have been experimenting with Harper lately. For example, if you pass Markdown content, it is parsed and converted into an AST. Do we need such parsing?

Alternatively, I wrote a function that cleans the markup by replacing it with spaces and then runs the result as plain text. Here is my version: https://github.com/websiddu/harper/blob/master/harper-wasm/src/lib.rs#L21

This implementation is currently live on https://stubby.io/

I'm really not sure whether this is more efficient than building a full syntax tree and then getting the word positions from that. Just sharing an idea, as I thought this could simplify a lot of the code.
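To make the idea concrete, here is a minimal sketch of position-preserving stripping. It is not the code at the link above: `blank_markup` is a hypothetical name, and it only handles `<...>` tags and single-backtick code spans, using a small state machine instead of regular expressions. Each stripped character becomes one space (newlines are kept), so character offsets into the output match offsets into the input.

```rust
/// Replace everything inside `<...>` tags and `` `...` `` code spans with
/// spaces. Newlines are kept, and every input character yields exactly one
/// output character, so character offsets are preserved.
fn blank_markup(source: &str) -> String {
    #[derive(Clone, Copy)]
    enum State {
        Text,
        Tag,
        Code,
    }

    let mut state = State::Text;
    let mut out = String::with_capacity(source.len());

    for c in source.chars() {
        // Decide whether to keep this character and which state follows it.
        let (keep, next) = match (state, c) {
            (State::Text, '<') => (false, State::Tag),
            (State::Text, '`') => (false, State::Code),
            (State::Text, _) => (true, State::Text),
            (State::Tag, '>') => (false, State::Text),
            (State::Tag, _) => (false, State::Tag),
            (State::Code, '`') => (false, State::Text),
            (State::Code, _) => (false, State::Code),
        };
        out.push(if keep || c == '\n' { c } else { ' ' });
        state = next;
    }
    out
}
```

One known limitation of this sketch: a literal `<` in prose (as in `a < b`) blanks everything up to the next `>`, which is the kind of edge case a real Markdown parser handles for you.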

@elijah-potter
Owner

elijah-potter commented Nov 27, 2024

Hey, thanks for reaching out!

Harper's parsing infrastructure is admittedly poorly documented at the moment, so I'll try to explain it enough to answer your question here. Expect a proper guide on it in the future.

> So, I have been experimenting with Harper lately. For example, if you pass Markdown content, it is parsed and converted into an AST. Do we need such parsing?

To directly answer your question: yes, and it takes negligible time. The Markdown library we use is really fast (I think it actually might be the fastest CommonMark implementation out there), so it consumes a trivial percentage of our execution time, while significantly improving Harper's internal document model.

Your implementation, while interesting, is not spec compliant, and recompiling and running so many regular expressions on every call is quite slow. I intend to properly support MDX in the future, but in the meantime you can probably get significantly better results by running your same regex stripping inside the Markdown parser (whose code you can find here). A cheap solution would involve making a copy of that file and pasting your stripping code into it.
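On the recompilation point, here is a rough sketch of building a pattern set once and reusing it on every call, using `std::sync::OnceLock` from the standard library. `CompiledPattern`, `build_patterns`, and `strip` are hypothetical names, and `CompiledPattern` is only a stand-in for a compiled regular expression so that the example stays dependency-free. With the `regex` crate, the same shape would hold a `Vec<Regex>` instead.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for a compiled pattern (e.g. a compiled regex).
struct CompiledPattern {
    needle: &'static str,
}

/// Counts how often the pattern set is built, for demonstration only.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the expensive compilation step.
fn build_patterns() -> Vec<CompiledPattern> {
    BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
    vec![
        CompiledPattern { needle: "<br/>" },
        CompiledPattern { needle: "<hr/>" },
    ]
}

/// Return the shared pattern set, building it on first use only.
fn patterns() -> &'static [CompiledPattern] {
    static PATTERNS: OnceLock<Vec<CompiledPattern>> = OnceLock::new();
    PATTERNS.get_or_init(build_patterns)
}

/// Replace each match with spaces of equal length, reusing the cached set.
fn strip(source: &str) -> String {
    let mut out = source.to_string();
    for pattern in patterns() {
        out = out.replace(pattern.needle, &" ".repeat(pattern.needle.len()));
    }
    out
}
```

However many times `strip` is called, `build_patterns` runs once per process, so the per-call cost is only the matching itself.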

If you would like to parse MDX properly (which would give Harper the best internal document model and therefore significantly better linting), you just have to implement the Parser trait, which can be done by wrapping another existing parser, including one generated by Tree-sitter.
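As an illustration of the wrapping approach, here is a self-contained sketch. The `Parser` trait and `Token` type below are simplified, hypothetical stand-ins defined locally, not Harper's actual definitions (check the harper-core documentation for the real signatures). `PlainWords` and `TagStripping` are likewise made-up names. The wrapper blanks `<...>` tags and then delegates to the inner parser, so the token ranges still index into the original source.

```rust
/// Simplified token: a half-open range of character indices into the source.
#[derive(Debug, PartialEq)]
struct Token {
    start: usize,
    end: usize,
}

/// Hypothetical, simplified parser trait (an assumption, not Harper's own).
trait Parser {
    fn parse(&self, source: &[char]) -> Vec<Token>;
}

/// Inner parser: emits one token per whitespace-separated word.
struct PlainWords;

impl Parser for PlainWords {
    fn parse(&self, source: &[char]) -> Vec<Token> {
        let mut tokens = Vec::new();
        let mut start = None;
        for (i, c) in source.iter().enumerate() {
            match (c.is_whitespace(), start) {
                (false, None) => start = Some(i),
                (true, Some(s)) => {
                    tokens.push(Token { start: s, end: i });
                    start = None;
                }
                _ => {}
            }
        }
        if let Some(s) = start {
            tokens.push(Token { start: s, end: source.len() });
        }
        tokens
    }
}

/// Wrapper: blanks `<...>` tags with spaces, then delegates to `inner`.
struct TagStripping<P: Parser> {
    inner: P,
}

impl<P: Parser> Parser for TagStripping<P> {
    fn parse(&self, source: &[char]) -> Vec<Token> {
        let mut in_tag = false;
        let blanked: Vec<char> = source
            .iter()
            .map(|&c| match c {
                '<' => {
                    in_tag = true;
                    ' '
                }
                '>' if in_tag => {
                    in_tag = false;
                    ' '
                }
                _ if in_tag && c != '\n' => ' ',
                _ => c,
            })
            .collect();
        // Same length as `source`, so the inner parser's ranges stay valid.
        self.inner.parse(&blanked)
    }
}
```

Constructing `TagStripping { inner: PlainWords }` and calling `parse` on the characters of `Hello <b>big</b> world` yields three word tokens whose ranges point at `Hello`, `big`, and `world` in the original text. The inner parser could just as well be a full Markdown parser, which is the point of the wrapping design.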

P.S. I'm so glad you're using Harper for your project. I'm honored. We've got significant JS API improvements on the way, so stay tuned!
