-
-
Notifications
You must be signed in to change notification settings - Fork 7.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Partial fix for fatal crash on use of so-called "paired shortcodes" #7330
Conversation
Related to gohugoio#7322 Related to gohugoio#7106
If you can add a failing test case that this PR fixes it would be easier to understand; I also need that before I can consider merging this. |
I would do as you request, but just let me show you something. With my commit and the following shortcode template: <div class="flex-l mv0 pa3 center">
<article class="center mw7">
<div class="nested-copy-line-height lh-copy f6 nested-links nested-img mid-gray">
{{ .Get 0 }}
</div>
</article>
</div> ...This in
...Produces the following generated output: <pre>
<code> </div>
</article>
</code>
</pre>
<!-- raw HTML omitted -->
<p>testing 1 2 3</p>
<!-- raw HTML omitted -->
<pre>
<code> </div>
</article>
</code>
</pre> Why is this? I have no idea. While this in
...Works as expected, formatting and all, while this:
...And:
...Both produce the following build error:
Without my commit, variations with the slash before the name of the shortcode produce the following build error:
...And variations with the slash before the right angle bracket produce the stack trace described in issue #7322. I'm sure that this is a very simple parsing problem, but it is no less an impenetrable mystery to me. I simply do not have the domain-specific knowledge — of the language or of the project — to even begin to untangle this. I've taken this as far as I can. If you or another guru o' Hugo don't know why these things be like it is, then I guess no one understands what the **** is happening here. |
The formatting question you raise is documented and has nothing to do with the "crash". You cannot copy a fraction of a big site and tell me that "this case fails" when there are millions of similar working examples out there. I'm not doubting that the problem is real, but the devil is in the detail -- we need a failing test to be able to fix this. |
I'm not really sure what you mean. Please clarify:
Is parsing behavior dependent on site and/or theme configuration? Under what conditions does Hugo consider an unhandled (fatal) exception to be an acceptable outcome? I have observed this behavior on the following versions:
The last version to correctly parse the name of a shortcode was Are or are not these messages evidence of parsing error?
Should the ellipsis in Thank you for your time. |
As a user, I like Hugo a lot. As a guru, Hugo's shortcode parsing mechanism is too verbose to be understood by mortal man. To help alleviate this problem, I have conceived the kernel of a parsing mechanism that is thick, solid, tight — as well as a number of other unspecified but highly positive attributes. I begin by defining the pattern of a well-formed shortcode as a regular expression. This pattern is compiled to a finite automaton named In addition to // `scRex` is the regular expression automaton that matches all valid shortcode
// variant forms, including simple, paired, and self-closing. A paired shortcode
// is like a div tag, and usually has inner content, e.g., `<div>alpha bravo
// charlie</div>`; a simple shortcode is like an img tag, e.g., `<img
// src="sauce.jpg" />`; and a self-closing shortcode is like a paired tag made
// singular, with no inner content, e.g., `div i="1" ii="2" iii=3 />.
// The structure of a shortcode has five parts, two of them optional. They are:
// ftype: the opening token of a shortcode; specifies the formatting type of the
// inner content (if any)
// name: the name of the shortcode; corresponds to the template
// posi: positional parameters passed to the template, if any
// keyw: keyword parameters passed to the template, if any
// epytf: the terminating token of the shortcode
var ftype = `(?P<ftype>{{[<%])[ \t]+`
var name = `/?(?P<name>[-\w]+)[ \t]+`
var posi = `(?P<posi>(?:\w+[ \t]+)+)?`
var keyw = strings.Join([]string{
`(?P<keyw>(?:\w+=(?:`,
`\w+`, // 1. Captures word-type unquoted content.
`|'[^'\\]*(?:\\.[^'\\]*)*'`, // 2. Captures singly- and doubly-quoted strings,
`|"[^"\\]*(?:\\.[^"\\]*)*"`, // including escaped quotation marks of their
`)[ \t]+)+)?`, // respective type.
}, "")
var epytf = `(?P<epytf>/?[%>]}})`
var scRex = regexp.MustCompile(ftype + name + posi + keyw + epytf)
var scLooseRex = regexp.MustCompile(ftype + `.*?` + epytf)
// Where `scRex` matches both opening and closing tags, `isClosingRex` matches
// closing tags only. The name pattern differs from the former in that the
// leading slash before the name — the indicator of a closing tag — is
// mandatory. It is used solely to determine the "closing-ness" of a tag.
var nameIsClosingPattern = `/[-\w]+`
var isClosingPattern = ftype + nameIsClosingPattern
var isClosingRex = regexp.MustCompile(isClosingPattern)
// Used to guarantee that a shortcode has the same leading and trailing ftype.
var validFtypes = []string{"{{<.*?>}}", "{{%.*?%}}"}
var validFtypesPattern = strings.Join(validFtypes, "|")
var validFtypeRex = regexp.MustCompile(validFtypesPattern) To discriminate between opening, closing, and self-closing shortcodes: // Where `scRex` matches both opening and closing tags, `isClosingRex` matches
// closing tags only. The name pattern differs from the former in that the
// leading slash before the name — the indicator of a closing tag — is
// mandatory. It is used solely to determine the "closing-ness" of a tag.
var nameIsClosingPattern = `/[-\w]+`
var isClosingPattern = ftype + nameIsClosingPattern
var isClosingRex = regexp.MustCompile(isClosingPattern)
// Matches self-closing shortcodes. In the program logic, self-closing
// shortcodes have a higher priority than "merely" closing shortcodes.
var isSelfClosingPattern = `/>}}`
var isSelfClosingRex = regexp.MustCompile(isSelfClosingPattern) Finally we see the expressions used to extract positional and keyword parameters individually: // Matches a valid positional parameter.
var extractPosiPattern = `(?P<positional>\w+)[ \t]+`
var extractPosiRex = regexp.MustCompile(extractPosiPattern)
// Matches a valid keyword parameter.
// Similar to `keyw`, but captures the key and value as groups, and doesn't need
// to match more than once.
var extractKeyWPattern = strings.Join([]string{
`(?P<key>\w+)=(?P<value>`,
`\w+`,
`|'[^'\\]*(?:\\.[^'\\]*)*'`,
`|"[^"\\]*(?:\\.[^"\\]*)*"`,
`)`,
}, "")
var extractKeyWRex = regexp.MustCompile(extractKeyWPattern) These latter are paired with the following very simple functions: func extractPosiParameters(s string) []string {
// Extracts the positional parameters.
var parameters []string
matches := extractPosiRex.FindAllStringSubmatch(s, -1)
for _, match := range matches {
parameters = append(parameters, strings.TrimSpace(match[0]))
}
return parameters
}
type KeywordParameter struct {
key string
value string
}
func extractKeyWParameters(s string) []KeywordParameter {
// Extracts all key-value pairs present in the keyword parameters and
// returns a convenient slice thereof.
var parameters []KeywordParameter
matches := extractKeyWRex.FindAllStringSubmatch(s, -1)
// fmt.Println("matches:", matches)
for _, match := range matches {
parameters = append(parameters, KeywordParameter{key: match[1], value: match[2]})
}
return parameters
} They do precisely what you would expect them to do. Let us continue to the heart of the matter. Fortune favors the bold. One pass with Since well-formed shortcodes are a subset all identifiable shortcodes, running func parseShortcodes(s string) ([]*Shortcode, error) {
// Parses all shortcodes present in the content string `s`.
var scs []*Shortcode
looseMatches := scLooseRex.FindAllStringIndex(s, -1)
for _, looseMatchIndex := range looseMatches {
start, end := looseMatchIndex[0], looseMatchIndex[1]
looseMatch := s[start:end]
tightMatch := scRex.FindStringSubmatch(looseMatch)
if len(tightMatch) == 0 {
return scs, fmt.Errorf(`ill-formed shortcode "%s"`, looseMatch)
}
if validFtypeRex.MatchString(looseMatch) == false {
return scs, fmt.Errorf(`shortcode "%s" has mismatched formatting type (e.g., '{{< ... %%}}')`, looseMatch)
}
scs = append(scs, extractShortcode(tightMatch, start))
}
return scs, sanityCheckShortcodePairs(scs)
}
func isSelfClosing(name string, epytf string) bool {
// A hacked-together demo function. Will need to be replaced with the
// template-lookup functionality. "Trust the template", the comment says.
// Okay.
hardcoded := []string{
"figure", "gist", "highlight", "instagram", "param",
"ref", "relref", "tweet", "vimeo", "youtube",
}
if isSelfClosingRex.MatchString(epytf) {
return true
}
for _, x := range hardcoded {
if name == x {
return true
}
}
return false
}
type Shortcode struct {
start int
end int
ftype string
name string
posi []string
keyw []KeywordParameter
epytf string
isClosing bool
isSelfClosing bool
}
func extractShortcode(match []string, offset int) *Shortcode {
// Extracts the pertinent properties from the shortcode match.
return &Shortcode{
start: offset,
end: offset + len(match[0]),
ftype: match[1],
name: match[2],
posi: extractPosiParameters(match[3]),
keyw: extractKeyWParameters(match[4]),
epytf: match[5],
isClosing: isClosingRex.MatchString(match[0]),
isSelfClosing: isSelfClosing(match[2], match[5]),
}
} This is what
And here is what is necessary to ensure that all paired shortcodes match up, that they "pair well together": func pop(stack []*Shortcode) (*Shortcode, []*Shortcode) {
i := len(stack) - 1
last := &Shortcode{}
if i >= 0 {
last = stack[i]
stack = stack[:i]
}
return last, stack
}
func sanityCheckShortcodePairs(shortcodes []*Shortcode) error {
var stack []*Shortcode
for _, sc := range shortcodes {
if sc.isSelfClosing == true {
continue
}
if sc.isClosing == true {
last, stac := pop(stack)
if last.isClosing == true || sc.name != last.name || sc.ftype != last.ftype {
return fmt.Errorf(`paired closing shortcode "%s" has no corresponding opening shortcode`, sc.name)
}
stack = stac
} else {
stack = append(stack, sc)
}
}
if len(stack) > 0 {
last, _ := pop(stack)
return fmt.Errorf(`paired opening shortcode "%s" has no corresponding closing shortcode`, last.name)
}
return nil
} Simple and to the point. Spartan-like. Bronze-age glory and power. In all, these 131 lines of code comprise the entirety of the core algorithm. With the possible exception of func renderPage(content string, shortcodes []*Shortcode) string {
// A dummy function to show how the rendering process works.
var parsed string
var pStack []*Shortcode // the stack of all as-yet-unclosed shortcodes
var qStack []*Shortcode // the stack of all shortcodes
for _, sc := range shortcodes {
if sc.isSelfClosing == true {
last, _ := pop(qStack)
parent, _ := pop(pStack)
parsed += formatInnerContent(parent.ftype, content[last.end:sc.start])
parsed += beginTemplate(sc.name, sc.posi, sc.keyw) + endTemplate(sc.name) + "\n"
qStack = append(qStack, sc)
} else if sc.isClosing == true {
last, _ := pop(qStack)
parent, pStac := pop(pStack)
parsed += formatInnerContent(parent.ftype, content[last.end:sc.start])
parsed += endTemplate(parent.name) + "\n"
pStack = pStac
qStack = append(qStack, sc)
} else {
last, _ := pop(qStack)
parent, _ := pop(pStack)
parsed += beginTemplate(sc.name, sc.posi, sc.keyw) + "\n"
parsed += formatInnerContent(parent.ftype, content[last.end:sc.start])
pStack = append(pStack, sc)
qStack = append(qStack, sc)
}
}
return parsed
}
Processing a page consists of iterating through the shortcodes. As shortcodes are encountered, they are pushed to two stacks. The first stack, The inner content is that which occurs in the interval between the previously and currently encountered shortcodes. The content in this interval is formatted with the pertinent properties of the shortcode stored as the last item of And here is everything else: func stripWhitespace(ss []string) []string {
// Strips surrounding whitespace from each string in the slice of strings.
var nSS []string
rex1 := regexp.MustCompile(`\s*(.+)\s*`)
rex2 := regexp.MustCompile(`[^\s]`)
for _, s := range ss {
r := rex1.FindStringSubmatch(s)
if r != nil && rex2.MatchString(r[1]) {
nSS = append(nSS, r[1])
}
}
return nSS
}
func formatMarkdown(s string) string {
// A dummy function to simulate a small subset of markdown formatting.
lines := stripWhitespace(strings.Split(s, "\n"))
for i, line := range lines {
for j := 1; j <= 6; j++ {
if len(line) - 1 < j {
break
}
if line[:j + 1] == strings.Repeat("#", j) + " " {
lines[i] = fmt.Sprintf("<h%d>%s</h%d>", j, line[j + 1:], j)
}
}
}
r := strings.Join(lines, "\n")
if len(r) > 0 {
r += "\n"
}
return r
}
func formatPlain(s string) string {
// A dummy function to simulate plain formatting.
r := strings.Join(stripWhitespace(strings.Split(s, "\n")), "\n")
if len(r) > 0 {
r += "\n"
}
return r
}
func formatInnerContent(ftype string, s string) string {
// A dummy function to simulate handling an arbitrary number of `ftype`s.
if ftype == "{{%" {
return formatMarkdown(s)
}
return formatPlain(s)
}
func beginTemplate(name string, posi []string, keyw []KeywordParameter) string {
// A dummy function to return the first "half" of a template.
params := ""
kv := []string{}
for _, x := range keyw {
kv = append(kv, fmt.Sprintf(`%s=%s`, x.key, x.value))
}
if len(posi) > 0 {
params += " " + strings.Join(posi, " ")
}
if len(kv) > 0 {
params += " " + strings.Join(kv, " ")
}
return "<" + name + params + ">"
}
func endTemplate(name string) string {
// A dummy function to return the second "half" of a template.
return "</" + name + ">"
} Now it is a very simple matter to parse a page, shortcodes and all. With a sample var content = `
{{% section %}}
# What is best in life?
## The open steppe, fleet horse, falcons at your wrist, the wind in your hair.
{{< section >}}
## The open steppe, fleet horse, falcons at your wrist, the wind in your hair.
{{% section %}}
## The open steppe, fleet horse, falcons at your wrist, the wind in your hair.
{{< section >}}
## The open steppe, fleet horse, falcons at your wrist, the wind in your hair.
{{< /section >}}
{{< youtube LyhTxQwSwJU >}}
{{< youtube LyhTxQwSwJU >}}
{{% section %}}
## The open steppe, fleet horse, falcons at your wrist, the wind in your hair.
{{% /section %}}
{{% /section %}}
{{< /section >}}
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec hendrerit
tempor tellus. Donec pretium posuere tellus. Proin quam nisl, tincidunt et,
mattis eget, convallis nec, purus.
Cum sociis natoque penatibus et magnis dis parturient montes, nascetur
ridiculus mus. Nulla posuere. Donec vitae dolor. Nullam tristique diam non
turpis. Cras placerat accumsan nulla.
Nullam rutrum. Nam vestibulum accumsan nisl.
{{% /section %}}
{{% section %}}
# Wrong!!! Conan! What is best in life?
## To crush your enemies, see them driven before you, and hear the lamentations of the women.
Fusce suscipit, wisi nec facilisis facilisis, est dui fermentum leo, quis
tempor ligula erat quis odio. Nunc porta vulputate tellus
Nunc rutrum turpis sed pede. Sed bibendum. Aliquam posuere. Nunc aliquet,
augue nec adipiscing interdum, lacus tellus malesuada massa, quis varius mi
purus non odio.
Pellentesque condimentum, magna ut suscipit hendrerit, ipsum
augue ornare nulla, non luctus diam neque sit amet urna. Curabitur vulputate
vestibulum lorem.
Fusce sagittis, libero non molestie mollis, magna orci ultrices dolor, at
vulputate neque nulla lacinia eros. Sed id ligula quis est convallis tempor.
Curabitur lacinia pulvinar nibh. Nam a sapien.
{{% /section %}}
{{< youtube LyhTxQwSwJU >}}
{{< form-contact sadflkjasdf />}}
{{< footer name="Conan" place="Thuria" year="The Hyperborian Age" />}}
` Simply run the demo func main() {
scs, err := parseShortcodes(content)
if err != nil {
fmt.Println("err:", err)
return
}
fmt.Println("=== === === ===")
fmt.Println("RENDERED CONTENT")
fmt.Println("=== === === ===")
fmt.Println(renderPage(content, scs))
fmt.Println("=== === === ===")
fmt.Println("=== === === ===")
} Et Voilà. It is worth saying again that this code is very tight. This code is very tight. I think it is reasonably idiomatic, but I have spent a total of just a few hours with Go. In closing, this mechanism is nearly an order of magnitude more terse than what Hugo has currently, and several times more comprehensible. I believe you will find it very useful indeed. Please let me know if there is anything you would like to know. |
We still need a failing test for this PR. By that we don't mean in a comment on this PR. We need actual code in a Regarding your miles-long comment, I doubt anyone is going to read that very closely (if at all). If you want to rewrite the shortcode parser, open a separate issue to discuss your proposal. Instead of pasting a bunch of code into a comment, create a new branch for your changes so that your implementation can be easily reviewed, tested, and benchmarked. |
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
See #7329.
Here is my particular use case of "paired shortcodes":
This is a partial fix, because there is a catch. (There is always a catch.) Two, in fact.
I)
Shortcodes defined with
%
may "escape" some of the closing HTML tags in one or more templates. Don't ask me how this happens — I don't know.II)
From:
To:
The closing slash must be moved from before
highlight
(as described in the documentation) to before the right angle bracket. To leave the slash where it is may cause precisely the same fatal error as before.The reasons for these strange phenomena are not clear to me, as I have no prior experience with Go in any capacity whatsoever.
You are welcome to do whatever you want with this code, @bep, but don't tell me that my problems don't real.