Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TOC header anchor generator doesn't replace dots #385

Closed
olivierlacan opened this issue Jun 5, 2014 · 8 comments · Fixed by #428
Closed

TOC header anchor generator doesn't replace dots #385

olivierlacan opened this issue Jun 5, 2014 · 8 comments · Fixed by #428
Milestone

Comments

@olivierlacan
Copy link

This is sort of a minor aesthetic issue, but generally I feel like any heading text being processed for a TOC header should follow the kind of conversion that ActiveSupport::Inflector#parameterize does since it's going to end up in a URL:

parameterize "## 1. Commit your code in your feature branch."
=> "1-commit-your-code-in-your-feature-branch"

However this is how Redcarpet currently handles the same string:

markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML_TOC, with_toc_data: true)
=> #<Redcarpet::Markdown:0x007fb013172d28 @renderer=#<Redcarpet::Render::HTML_TOC:0x007fb013172d50>>
markdown.render "## 1. Commit your code in your feature branch."
=> "<ul>\n<li>\n<a href=\"#1.-commit-your-code-in-your-feature-branch.\">1. Commit your code in your feature branch.</a>\n</li>\n</ul>\n"

Seems like a simple regex update inside of header_anchor would do the trick. It currently uses: <\\/?[^>]*>

This is the regex that parameterize uses: /[^a-z0-9\-_]+/i

C isn't my strong suit, but I can try to provide a PR for this if it seems sane.

@timcheadle
Copy link

I spent some time looking into this today, and it's a bit more complicated than it seems.

First off, header_anchor() currently only strips spaces. There are many more "unsafe" characters to strip from id= attributes and anchor links. I'm playing with this code as a replacement:

char *header_anchor(struct buf *text)
{
  VALUE str = rb_str_new2(bufcstr(text));
  VALUE tags_regex = rb_reg_new("<\\/?[^>]*>", 10 /* length */, 0);
  VALUE unsafe_char_regex = rb_reg_new("[ $&+,\\/:;=?@\"<>#%{}|\\^~\\[\\]`]+", 31, 0);
  VALUE trailing_dash_regex = rb_reg_new("-\\Z", 3, 0);

  VALUE heading = rb_funcall(str, rb_intern("gsub"), 2, tags_regex, rb_str_new2(""));
  heading = rb_funcall(heading, rb_intern("gsub"), 2, unsafe_char_regex, rb_str_new2("-"));
  heading = rb_funcall(heading, rb_intern("gsub"), 2, trailing_dash_regex, rb_str_new2(""));
  heading = rb_funcall(heading, rb_intern("downcase"), 0);

  return StringValueCStr(heading);
}

With this code, the TOC links render just fine. This markdown:

# How does "this" work?

Becomes this TOS link:

<ul>
  <li>
    <a href="#how-does-this-work">How does "this" work?</a>
  </li>
</ul>

The problem is in the regular header anchors. That same markdown renders as this header (with :with_toc_data passed in):

<h1 id="how-does-quot-this-quot-work">How does &quot;this&quot; work?</h1>

The reason for this is that the *text passed into header_anchor() for header tags has been run through escape_html() before it's run through the regular expressions to strip unsafe characters. Short of including the unescape utilities from Houdini, I'm not well versed enough with this code to fix that issue. Unescaping a string that you escaped earlier seems silly, anyway.

@robin850
Copy link
Collaborator

Hello there,

Thanks for your inputs on this! I agree that the TOC headers are a hot-spot that we should improve for the coming release. At least, you mentioned these different issues:

  • Unescaping of non-alphanumeric or non-common (e.g. asterisks) characters
  • Inconsistency between HTML and HTML_TOC for the header anchors

However, we'd love to rewrite this area of the code without the Ruby C API relying on regular regex.h (see 6d881e8) so there is a lot more work to do here. I've taken the liberty to snug the issue under the 3.2.0 milestone. I will try to address this as soon as possible unless someone wants to send a patch. :-)

@robin850 robin850 added this to the 3.2.0 milestone Jun 15, 2014
@robin850 robin850 modified the milestones: 3.3.0, 3.2.0 Jul 19, 2014
robin850 added a commit that referenced this issue Oct 27, 2014
In order to generate valid anchors, let's remove non-alphanumeric chars
from the output string. The generated strings are along the lines of
strings returned by Active Support's `#parameterize`.

Fixes #385.
@robin850
Copy link
Collaborator

Ok guys this should has been resolved by the aforementioned pull request ; thanks again for reporting ! :-)

@olivierlacan
Copy link
Author

@robin850 Oh that's great, thanks! ❤️

@timcheadle
Copy link

Nice work! My only comment is that maybe the tests should include a double quote in one of the strings, but I have every confidence that it works regardless.

robin850 added a commit that referenced this issue Dec 13, 2014
This commit is a continuation of ac16518 so that grouped stripped chars
are now correctly stripped out (like "::" when referencing a namespaced
object for instance).

In response to a comment in #385.
@robin850
Copy link
Collaborator

@timcheadle : Whoah, thanks for spotting this ! 👍 Actually no, it didn't work but this is now fixed !

@timcheadle
Copy link

❇️

@gjtorikian gjtorikian mentioned this issue Jun 4, 2015
@Runtanawat

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
@olivierlacan @robin850 @timcheadle @Runtanawat and others