Skip to content

Commit

Permalink
Parse legislative list as part of the main conversion
Browse files Browse the repository at this point in the history
Kramdown handles abbreviations and footnotes natively.

Call to action and legislative list are custom Govspeak components built as
extensions to Kramdown. These sections are parsed separately in preprocessing,
they are ignored in any subsequent parsing because Kramdown ignores HTML.

Because of this, we have added custom handling of abbreviations and footnotes
specifically for these components [1]. This custom implementation means that
abbreviations defined anywhere in the document will be applied to call to
action and legislative list components. Otherwise, only abbreviations defined
within these components would be applied.

This custom implementation has been shown to contain bugs through several
Zendesk tickets. The main issues reported are:

1. Acronyms are inserted into the produced HTML, even if undesirable. For
example, a link `<[email protected]>` with the acronym `*[email]: Electronic
mail is a method of transmitting and receiving messages` would produce
`<p>href="mailto:<abbr title="Electronic mail is a method of transmitting and
receiving messages">email</abbr>@example.com"<abbr title="Electronic mail is a
method of transmitting and receiving messages">email</abbr>@example.com</p>`
rather than the expected link.

2. Because the `add_acronym_alt_text` method runs through the text for each
acronym, acronyms can be inserted into other acronyms creating invalid HTML.

If we change tack, and instead allow Kramdown to parse these components as part
of the main conversion to HTML (instead of in preprocessing), Kramdown will
handle abbreviations and footnotes correctly, fixing both of these issues (as
well as some other undocumented issues). 

The legislative list component currently works by overriding Kramdowns `list`
extension [2] and disabling ordered lists [3]. So that we can parse this
component as part of the main conversion, this applies two options:
`parse_block_html` [4], so that markdown inside the `legislative-list-wrapper`
`div` is parsed by Kramdown, and a new custom option of
`ordered_lists_disabled`, so that we can control flow inside our
`Parser::Govuk` class (subclass of the Kramdown parser). It is possible to
override `parse_list` instead of `parse_block_html`, but this means that the
outermost list will not be parsed (by the time `parse_list` is called the
outermost list is already being parsed)

The only difference between this new iteration of the legislative list and the
previous iteration, is that we now wrap the component in a
`legislative-list-wrapper` `div`. We could remove this in postprocessing, but
this requires additional modification of the Kramdown produced HTML which seems
unnecessary.

This also allows us to remove all remaining code which replicates the footnote
and acronym functions of Kramdown.

Some of the tests can likely be removed as we are now testing the functionality
of the library, however, for now these represent that there are no
regressions.

[1]: #285
[2]:
https://github.com/gettalong/kramdown/blob/bd678ecb59f70778fdb3b08bdcd39e2ab7379b45/lib/kramdown/parser/kramdown/list.rb#L54
[3]: #25
[4]: https://kramdown.gettalong.org/quickref.html#html-elements
  • Loading branch information
jkempster34 committed Oct 11, 2023
1 parent 6882d93 commit 5cc3954
Show file tree
Hide file tree
Showing 4 changed files with 258 additions and 264 deletions.
80 changes: 7 additions & 73 deletions lib/govspeak.rb
Original file line number Diff line number Diff line change
Expand Up @@ -63,8 +63,6 @@ def initialize(source, options = {})
sanitize: true,
syntax_highlighter: nil }.merge(options)
@options[:entity_output] = :symbolic
@footnote_definition_html = nil
@acronyms = []
end

def to_html
Expand All @@ -76,16 +74,6 @@ def to_html
kramdown_doc.to_html
end

unless @footnote_definition_html.nil?
regex = /<div class="footnotes".*[<\/div>]/m

if html.scan(regex).empty?
html << @footnote_definition_html
else
html.gsub!(regex, @footnote_definition_html)
end
end

Govspeak::PostProcessor.process(html, self)
end
end
Expand Down Expand Up @@ -125,8 +113,6 @@ def preprocess(source)
source = Govspeak::BlockquoteExtraQuoteRemover.remove(source)
source = remove_forbidden_characters(source)

footnote_definitions(source)

self.class.extensions.each do |_, regexp, block|
source.gsub!(regexp) do
instance_exec(*Regexp.last_match.captures, &block)
Expand All @@ -135,36 +121,6 @@ def preprocess(source)
source
end

def footnote_definitions(source)
is_legislative_list = source.scan(/\$LegislativeList.*?\[\^\d\]*.*?\$EndLegislativeList/m).size.positive?
footnotes = source.scan(/^\s*\[\^(\d+)\]:(.*)/)
@acronyms.concat(source.scan(/(?<=\*)\[(.*)\]:(.*)/))
if is_legislative_list && footnotes.size.positive?
list_items = footnotes.map do |footnote|
number = footnote[0]
text = footnote[1].strip
footnote_definition = Govspeak::Document.new(text).to_html[/(?<=<p>).*(?=<\/p>)/]
footnote_definition = add_acronym_alt_text(footnote_definition)

<<~HTML_SNIPPET
<li id="fn:#{number}" role="doc-endnote">
<p>
#{footnote_definition}<a href="#fnref:#{number}" class="reversefootnote" role="doc-backlink" aria-label="go to where this is referenced">↩</a>
</p>
</li>
HTML_SNIPPET
end

@footnote_definition_html = <<~HTML_CONTAINER
<div class="footnotes" role="doc-endnotes">
<ol>
#{list_items.join.strip}
</ol>
</div>
HTML_CONTAINER
end
end

def remove_forbidden_characters(source)
# These are characters that are not deemed not suitable for
# markup: https://www.w3.org/TR/unicode-xml/#Charlist
Expand Down Expand Up @@ -344,25 +300,13 @@ def render_image(image)
end

extension("legislative list", /#{NEW_PARAGRAPH_LOOKBEHIND}\$LegislativeList\s*$(.*?)\$EndLegislativeList/m) do |body|
Govspeak::KramdownOverrides.with_kramdown_ordered_lists_disabled do
body = add_acronym_alt_text(body.strip)

Kramdown::Document.new(body.strip).to_html.tap do |doc|
doc.gsub!("<ul>", "<ol>")
doc.gsub!("</ul>", "</ol>")
doc.sub!("<ol>", '<ol class="legislative-list">')

footnotes = body.scan(/\[\^(\d+)\]/).flatten

footnotes.each do |footnote|
html = "<sup id=\"fnref:#{footnote}\" role=\"doc-noteref\">" \
"<a href=\"#fn:#{footnote}\" class=\"footnote\" rel=\"footnote\">" \
"[footnote #{footnote}]</a></sup>"

doc.sub!(/(\[\^#{footnote}\])/, html)
end
end
end
# The surrounding div is neccessary to control flow in `parse_block_html` and
# maintain the same functionality as a previous version of this extension.
<<~BODY
{::options parse_block_html=\"true\" ordered_lists_disabled=\"true\" /}
<div class="legislative-list-wrapper">#{body}</div>
{::options parse_block_html=\"false\" ordered_lists_disabled=\"false\" /}
BODY
end

extension("numbered list", /^[ \t]*((s\d+\.\s.*(?:\n|$))+)/) do |body|
Expand Down Expand Up @@ -439,16 +383,6 @@ def kramdown_doc
def encode(text)
HTMLEntities.new.encode(text)
end

def add_acronym_alt_text(html)
return unless html

@acronyms.each do |acronym|
html.gsub!(acronym[0], "<abbr title=\"#{acronym[1].strip}\">#{acronym[0]}</abbr>")
end

html
end
end
end

Expand Down
9 changes: 9 additions & 0 deletions lib/govspeak/post_processor.rb
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,15 @@ def self.extension(title, &block)
end
end

extension("covert legislative list ul to ol") do |document|
document.css(".legislative-list-wrapper").map do |el|
el.inner_html = el.inner_html
.sub("<ul>", "<ol class=\"legislative-list\">")
.gsub("</ul>", "</ol>")
.gsub("<ul>", "<ol>")
end
end

# This "fix" here is tied into the rendering of images as one of the
# pre-processor tasks. As images can be created inside block level elements
# it's possible that their block level elements can be HTML entity escaped
Expand Down
19 changes: 18 additions & 1 deletion lib/kramdown/parser/govuk.rb
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,13 @@ def ==(_other)
DESCRIPTION
simple_array_validator(val, :document_domains, AlwaysEqual.new)
end

define(:ordered_lists_disabled, Boolean, false, <<~DESCRIPTION)
Disables ordered lists
Default: false
Used by: KramdownWithAutomaticExternalLinks
DESCRIPTION
end

module Parser
Expand Down Expand Up @@ -46,7 +53,17 @@ def add_link(element, href, title, alt_text = nil, ial = nil)
def parse_block_html
return false if CUSTOM_INLINE_ELEMENTS.include?(@src[1].downcase)

super
return super unless @options[:ordered_lists_disabled]

# Kramdown loads parsers into the instance from `options` which are scoped to
# the class. Because we are changing these options inside an instance, we must
# reconfigure the parser before and after disbaling ordered lists.
Govspeak::KramdownOverrides.with_kramdown_ordered_lists_disabled do
configure_parser
super
end

configure_parser
end
end
end
Expand Down
Loading

0 comments on commit 5cc3954

Please sign in to comment.