Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Markdown implementation - Or: Should we separate compiler tools? #4613

Closed
straight-shoota opened this issue Jun 23, 2017 · 91 comments · Fixed by #11040
Closed

Improve Markdown implementation - Or: Should we separate compiler tools? #4613

straight-shoota opened this issue Jun 23, 2017 · 91 comments · Fixed by #11040
Assignees
Labels
help wanted This issue is generally accepted and needs someone to pick it up kind:refactor topic:stdlib

Comments

@straight-shoota
Copy link
Member

straight-shoota commented Jun 23, 2017

The Markdown parser and renderer is still very basic and misses important features like for example raw html.

I suggest to implement the CommonMark specification which is very reasonable and the de-facto standard reference for Markdown implementations. There is an entire test suite jgm/CommonMark to validate implementations. Currently Markdown.to_html passes only 137 of 621 (22 %).

It would also be nice if the Markdown parser/renderer can be easily extended to add support for more specialized features like Github Flavoured Markdown (e.g #2217)

@makenowjust
Copy link
Contributor

GitHub Flavored Markdown (GFM) has the spec now.

@straight-shoota
Copy link
Member Author

Yes, that's the CommonMark spec plus a few extensions, so GFM is essentially a superset of CommonMark. These extensions are:

As asterite and ysbadden stated it is probably best if the default implementation does not include extensions.

There is already a Crystal implementation for CommonMark at https://github.com/ujifgc/crmark, though it is a direct port of the JavaScript reference implementation, so it can certainly be optimized in many ways.

@straight-shoota
Copy link
Member Author

straight-shoota commented Jun 25, 2017

Instead of implementing a parser in Crystal we could also use the existing C implementation. It supports custom extensions as well, though I have no idea how to incorporate them.

@bcardiff
Copy link
Member

It would be good to adopt the full CommonMark specs in the stdlib. But I think it will be better to have a crystal native implementation. That way we could toggle extensions like GFM which many probably want.

If the failing CommonMark specs can be grouped and the design of the parser the reference implementation can be ported then we might be in a better shape to improve the markdown implementation as you suggest.

The markdown parser is mainly used to generate the docs from crystal code. And it is slightly used in the playground. That is basically why the current implementation didn't need all markdown features.

@straight-shoota
Copy link
Member Author

What do you mean with grouping the failing specs? The spec file is essentially a literate description (HTML at spec.commonmark.org) with embedded examples from which the test cases are extracted. So the tests are part of the hierarchical document structure. The default test runner for example can be limited to specs from specific sections with parameter --pattern.

I would prefer a Crystal implementation as well. Seems like I was mistaken about extensibility of cmark. There is however an open PR to include extensions. Github have forked cmark to implement their syntax additions.
@ysbaddaden has already created a shard with bindings to cmark: ysbadden/crystal-cmark.
So it would probably be very easy to replace the current implementation with cmark. But that would not give extensibility, it would probably still be difficult even if cmark supported it.

Besides the already mentioned projects, there are also huacnlee/remarkdown, an extension of stdlib's Markdown with support for GFM, and icyleaf/markd a WIP markdown parser in a very early stage.
/cc @icyleaf @huacnlee

@icyleaf
Copy link
Contributor

icyleaf commented Jun 26, 2017

I created markd just to validate how to write a CommonMark and better support extensions. In my other project named wasp is a static site generator. It requires powerful markdown parser for the users.

cmark is not the best solution to wasp, and same to crystal community.

BTW, PL#4496 is another discussion.

@akzhan
Copy link
Contributor

akzhan commented Jun 26, 2017

I'm sure that GFM should be default. Because it will be default everywhere.

@straight-shoota
Copy link
Member Author

What makes you sure of this? Even Github understands GFM as a set of extensions to the standard CommonMark. A default implementation should use the most basic common denominator. It is a better approach to add certain extensions to the basic set when they are necessary, instead of removing them when they are not needed.
However, we should consider including GFM into the stdlib and make it easily accessible (maybe as Markdown::GFM).

@akzhan
Copy link
Contributor

akzhan commented Jun 26, 2017

Just because any modern language/framework documentation browser based on GFM. But Yes, GFM as option in stdlib is OK.

@kazzkiq
Copy link

kazzkiq commented Jun 26, 2017

Don't you guys think that two markdown parsers are a little bit of overkill? I personally would advocate that none should be embedded in Crystal at all. Unless I'm losing some details, I really see no need for markdown parsers when you could simply implement it as a shard and import as needed.

@akzhan
Copy link
Contributor

akzhan commented Jun 26, 2017

@kazzkiq crystal docs depends on Markdown.

@RX14
Copy link
Contributor

RX14 commented Jun 26, 2017

I've said it before and I'll say it again: If the only reason the markdown parser is in the stdlib is so that it can be used in crystal doc then the solution is to develop a method to separate it into a shard and still have the compiler depend on it. This could very well include simply vendoring sourcecode into src/compiler/vendor and treating it as compiler sourcecode.

@bcardiff
Copy link
Member

@straight-shoota in #4613 (comment) i meant that if the failing specs can be grouped by features, then this issue is more actionable / can be tackled down based on that list of features. Maybe that list is just the TOC of http://spec.commonmark.org/0.27/ (now that I've opened the link).

That been said, PRs are welcome to move the crystal stdlib markdown module towards CommonMark.

If the module migrate to a separate shard that is another story. Let's first have a commonmark native implementation we are all happy about.

@straight-shoota
Copy link
Member Author

The current implementation probably needs some architectural changes to properly support all of CommonMark and to make it extensible.
For example, it is recommended to use a two-stage parser for block nodes and inline nodes.

As a side node: documentation flags (#3519) could also be implemented as an extension to the default markdown class and can then be used by the docs generator.

@icyleaf
Copy link
Contributor

icyleaf commented Jul 19, 2017

Now markd is 100% compliant to Commonmark, and pass all specs.

Here is the result of a sample markdown file parse at MacBook Pro Retina 2015 (2.2 GHz):

Crystal Markdown   3.28k (305.29µs) (± 0.92%)       fastest
           Markd 305.36  (  3.27ms) (± 5.52%) 10.73× slower

parse cost time:

preparing input: 1.218ms
block parsing: 1.685ms
inline parsing: 2.187ms
renderering: 1.472ms

note: preparing input is only process the source as String not File.

@straight-shoota
Copy link
Member Author

Wow, that's great! A performance loss is certainly to be expected, but I suppose there is room for optimization...?
What "sample markdown" did you use for this benchmark?

@icyleaf
Copy link
Contributor

icyleaf commented Jul 19, 2017

@straight-shoota Updated the result, i found and used a complete commonmark source as a sample to benchmark. 🤣

@icyleaf
Copy link
Contributor

icyleaf commented Jul 19, 2017

Add ujifgc/crmark, It was best with commonmark support for now.

     crystal markdown   3.06k (327.25µs) (± 1.25%)       fastest
                markd 278.73  (  3.59ms) (± 1.12%) 10.96× slower
crmark in :commonmark 635.85  (  1.57ms) (± 1.43%)  4.81× slower
crmark in :markdownit 118.54  (  8.44ms) (± 4.52%) 25.78× slower

I can't install ysbaddaden/crystal-cmark, it missed the benchmark:

$ shards
Updating https://github.com/icyleaf/markd.git
Updating https://github.com/ujifgc/crmark.git
Updating https://github.com/ysbaddaden/crystal-cmark.git
Using markd (f58ed78fd0cdcc6e9dd274ac9f8696bc778dea84)
Using crmark (457c602725834429cd40544fbcaa505034637c8b)
Installing common_mark (0.1.0)
Postinstall cd ext && make
Failed cd ext && make:
/bin/sh: line 1: cd: ext: No such file or directory

@straight-shoota
Copy link
Member Author

straight-shoota commented Jul 19, 2017

Would you care comparing the performance to crystal-cmark (libcmark bindings)? And maybe some implementations in other languages (cmark and commonmark.js are reference implementations)?

@RX14
Copy link
Contributor

RX14 commented Jul 19, 2017

@icyleaf use branch: master on common_mark, the latest 0.2.0 release isn't tagged and it's trying to use 0.1.0.

@icyleaf
Copy link
Contributor

icyleaf commented Jul 19, 2017

Simple benchmark, each ran 10 times. source code

bm_crystal_builtin average cost 8.6305ms, min 7.899ms, max 9.948ms
bm_crystal_crmark average cost 18.2027ms, min 17.303ms, max 19.282ms
bm_crystal_markd average cost 14.9459ms, min 14.08ms, max 15.783ms
bm_node_commonmarkjs average cost 114.2585ms, min 109.916ms, max 119.234ms

@icyleaf
Copy link
Contributor

icyleaf commented Jul 20, 2017

Updated and add common_mark (thanks @RX14).

     crystal markdown   2.69k (372.18µs) (± 0.93%)       fastest
                markd 257.93  (  3.88ms) (± 1.18%) 10.42× slower
crmark in :commonmark 640.94  (  1.56ms) (± 1.53%)  4.19× slower
crmark in :markdownit 123.85  (  8.07ms) (± 3.67%) 21.69× slower
        crystal-cmark   1.86k (536.57µs) (± 3.64%)  1.44× slower

without built-in markdown

                markd 249.45  (  4.01ms) (± 3.86%)  6.99× slower
crmark in :commonmark 511.05  (  1.96ms) (± 2.72%)  3.41× slower
crmark in :markdownit 111.65  (  8.96ms) (± 2.43%) 15.61× slower
        crystal-cmark   1.74k (573.86µs) (± 3.68%)       fastest

@asterite
Copy link
Member

I'd say we move the Markdown class from the std inside the compiler's source code and make it private to it. If you want to use markdown, use a shard. As @RX14 says, the only reason we have Markdown in the compiler's source code is because we use it for docs. If we later find out we like a shard, the compiler can depend on it.

Thoughts?

@straight-shoota
Copy link
Member Author

The current markdown implementation is insufficient in many ways, even for the job of producing the docs. Therefore I'd like to see an improvement to the markdown employed by the compiler, be it in stdlib or from an external shard (I don't know about the practical application of this).

I'd question if it would be worth sticking around with the current implementation if it gets hidden in the compiler's source. It's not good enough for this purpose and it's unlikely to get improved if nobody uses it except from the compiler.

@asterite
Copy link
Member

@straight-shoota The current std is documented with it and I didn't find it lacking. We could perfectly support a subset of it, like for lists, links and codeblocks.

@straight-shoota
Copy link
Member Author

Yes, for the current stdlib documentation it is sufficient. But I've run into problems several times documenting my own code, mostly cause by missing support for raw HTML. While this shouldn't be needed for most documentation purposes, there are occasions where e.g. a table would help a lot. This is a common feature used in many API documentations.
Plus, the markdown renderer is not only used for the inline API documentation but also to render README.md. For this purpose it feels like a necessity to have full support of Common Mark and don't settle for anything short of that.

@RX14
Copy link
Contributor

RX14 commented Sep 11, 2017

@straight-shoota surely since we all agree it's substandard, surely the best thing to do is to limit it's usage to to docs tool. If someone wants to implement all of commonmark properly and completely, i'm sure we can vendor in that shard and use it in the compiler.

@asterite
Copy link
Member

@j8r markd is already feature complete and passes all specs. There haven't been recent commits that fix stuff in the library, just optimizations (it seems) or new features. So I think moving this to the standard library is good:

  • it's unlikely to change that often
  • it's available to all users without using an extra shard
  • having it in std means it's more likely to be used and noticed and so more bugs and performance optimizations could be solved/added

@straight-shoota
Copy link
Member Author

@j8r I agree that having libraries tied to Crystal's stdlib release schedule can be problematic. But as @asterite said, it's a huge issue regarding Markdown. There are already other libraries in the stdlib (HTTP, SSL for example) that would benefit much more from having individual releases.

In the future we might eventually ending up with the stdlib consisting of essentially a set of standard shards that come installed with crystal by default, but individual projects can use different versions of them (installed in local lib). We don't have that setup now, and I don't think it's urgent before 1.0. But we'd like to have markdown support in stdlib, so the best solution currently is to merge markd.

@j8r
Copy link
Contributor

j8r commented Sep 10, 2019

Ok that may be fine if you consider the markdown implementation very stable. In fact, the main issue I find is to put everything possible (I exaggerate) in the crystal/crystal monorepo.

Can we still keep in its own repo, and include it with subtree (or whatever) in the stdlib (option 3)?
This would be a shame that @icyleaf, the markd creator, won't be able to review PRs and contribute to the stdlib's markdown.
Furthermore, the owners will have to backport changes upstream and downstream.
For the record, Golang and Rust do this division in some extent (not with subtree, though).

@asterite
Copy link
Member

This would be a shame that @icyleaf, the markd creator, won't be able to review PRs and contribute to the stdlib's markdown

Why not?

Furthermore, the owners will have to backport changes upstream and downstream

I imagine markd will be moved to the std and then it wouldn't make sense for markd to continue existing beside the std.

Can we still keep in its own repo, and include it with subtree (or whatever) in the stdlib (option 3)?

This only affects Crystal std/compiler developers, so just a bunch of people compared to the entire Crystal userbase. So I personally don't see this decision as having a lot of importance and it could always be tackled later.

@j8r
Copy link
Contributor

j8r commented Sep 10, 2019

Why not?

Because he won't be the owner of the repo anymore, just another Crystal contributor.

I imagine markd will be moved to the std and then it wouldn't make sense for markd to continue existing beside the std.

Not really. It would also be harder to contribute than if markd lives in its own repo. Review would be longer, CI too, the crystal-lang/crystal have more setup requirements.
Wanting a custom markdown parser to build our own flavor, or implement an existing one, will require to fork the whole Crystal repo (EDIT: or advanced git-fu).

@bcardiff
Copy link
Member

If there is a chance for the markdown module to be available directly in the std-lib I would like to avoid having a release without it. Right now 0.31.0 might come out without a markdown module.
That will cause dependencies to move out of the std lib since there is no deprecation in place.

While we settle how/if do this, should rollback #8115 to avoid reverting a breaking change in a release?

@straight-shoota
Copy link
Member Author

@j8r You're making generally valid points and more separation might be a good path for the future. But currently, we have a monolith repository. And I think this is fine for now and we don't need to change this setup. That would require a lot of effort and simply isn't necessary at the moment.

Besides, your argument only targets markdown, because the markd implementation is currently a separate shard, suggested to be moved into stdlib. It has not been a part of the stdlib before now, but the same is true for to many other standard libraries. The only difference is, they're currently already in stdlib. I don't think it makes sense to go a separate way right now just for markdown.

@bcardiff Yes, we should avoid breaking markdown in 0.31.0. But I suppose, if we settle on @asterite's proposal, Markd could be imported rather quickly. If we target for a release at the beginning of October, this should work out.

@asterite
Copy link
Member

Another way to see it: we can import markd into the standard library and rename it to Markdown. markd can still exist and you can use it instead of that if you really want to. Markdown can be seen as the markdown implementation provided with the std, whose updates follow the Crystal release cycle and contributing to it is "slower" than contributing to markd (which might not be true because here we are many while in markd it's just one). It just happens that right now Markdown has the same implementation as markd, or uses that under the hood.

@asterite
Copy link
Member

I think I won't have time to do this, so if someone can take over I'll be very grateful 🙏

@asterite
Copy link
Member

If someone wants to tackle this, they will need to:

  • copy markd by moving it to the standard library under src, but specifically the code in this PR. Include the specs and put them in spec/src/markdown (I think the specs are very simple and short so we can include them there). Make sure to put the helpers in that same file as private methods.
  • rename Markd to Markdown
  • avoid reopening HTML and overriding decode_entities and others
  • implement the custom doc renderer by somehow subclassing Markd::HTMLRenderer: the crystal docs renderer will highlight code and do some links in code sections
  • remove the old internal Markdown module

@asterite asterite added the help wanted This issue is generally accepted and needs someone to pick it up label Sep 11, 2019
@icyleaf
Copy link
Contributor

icyleaf commented Sep 11, 2019

@j8r It's alright to move markd to crystal-lang organization, I believe this move will make markd and
crystal both better :)

@asterite
Copy link
Member

Actually, let me try once. more today. Yesterday I bumped into a compiler bug that was recently reported and I desisted, but maybe I can find a workaround.

@bew
Copy link
Contributor

bew commented Sep 11, 2019

Which bug was it? (maybe someone can have a try at it while you have fun with markd?)

@asterite
Copy link
Member

Ah, nevermind the bug. I somehow copied the text from #8163 (comment) into the markdown code and it was exploding because of that. No idea how that happened.

Later today I'll submit the markd inside std PR. It won't have the git history, sorry. But the history can be kept in markd's repo.

@asterite
Copy link
Member

I have the code but I won't send the PR. It seems Brian and Juan agreed that it might not be good to have Markdown in the standard library and that we should move crystal docs to an external tool where we could use external shards there.

So for now it might be better to just do nothing here.

@bcardiff
Copy link
Member

We were iterating with @waj on what to do here, our proposal would be:

  • fork icyleaf/markd to crystal-lang/crystal-markdown
    • keep @icyleaf as maintainer of crystal-lang/crystal-markdown
    • rename markd to use the top level Markdown namespace that was available before
    • check that the main api from Crytstal 0.30.0 is honored by crystal-lang/crystal-markdown
  • encourage users to include crystal-lang/crystal-markdown as a dependency if needed in the Crystal 0.31.0 changelog
  • Delay for 0.32.0 the discussion on how to use crystal-lang/crystal-markdown in the compiler
  • Leave the docs tool as is for 0.31.0, with it existing bugs

This way we avoid copying the whole shard, allow @icyleaf to keep iterating, offer migration from Crystal 0.30 to 0.31, aim for reducing the std-lib size and maybe split the compiler.

@asterite
Copy link
Member

fork icyleaf/markd to crystal-lang/crystal-markdown

What's the purpose of that? Why can't markd continue existing as a shard on its own? Why do we need to become its owners?

@bcardiff
Copy link
Member

Avoiding dependencies outside the organization is a safe policy for avoiding them to disappear.
Is easier to keep permissions for the core-team.

@j8r
Copy link
Contributor

j8r commented Sep 12, 2019

An option for the future would be to have the markdown library as a dev dependency (that could be bundled, or not, in the release).
In fact, generating the API docs is optional, and it could even be possible to generate docs with no markdown.
This would prevent the chicken and egg issue @ysbaddaden was pointing out if the library becomes a hard requirement.

@bcardiff
Copy link
Member

In the future the docs tool might even get out of the compiler directly. Yet bundled in the distribution packages. Doing that could also remove the chicken and egg situation of shards used in the compiler if needed.

Again, I want to focus on what the users should use as a replacement of the moved Markdown in order include that in the release 0.31.

We can focus on how to improve the docs with better markdown support later.

@straight-shoota
Copy link
Member Author

@bcardiff If we want to postpone the decision to after 0.31.0, I suggest to hold off on moving markd to crystal-lang/crystal-markdown as well. That's a change that looks good and might be done anyway, but maybe we come to a different solution. I'd rather avoid it until the immediate goal is settled.
Postponing would also mean to revert the removal of stdlib's `Markdown' for 0.31.0.

@bcardiff
Copy link
Member

That would be a full postpone. :-) But I'm fine with it.

While we settle how/if do this, should rollback #8115 to avoid reverting a breaking change in a release?
#4613 (comment)

@asterite
Copy link
Member

asterite commented Sep 13, 2019

I think this is all just a big mistake. Markdown in docs has a very tiny priority. The next big steps are getting parallelism tried out and done well and Windows, not focusing on extracting tools outside of the compiler, which is transparent for users and probably a maintenance burden for us and Crystal distributors.

But improving Markdown is already here without a cost at all (I already sent a PR) and brings an immediate improvement without breaking any API (the main Markdown.to_html method is still there).

If I never continued commenting on this issue this entire "let's wait and extract compiler tools" discussion wouldn't even exist.

But do as you wish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted This issue is generally accepted and needs someone to pick it up kind:refactor topic:stdlib
Projects
None yet