
Hitting the max tree depth can leak memory #3098

Closed
stevecheckoway opened this issue Jan 16, 2024 · 0 comments · Fixed by #3100
Assignees
Labels
topic/gumbo Gumbo HTML5 parser topic/memory Segfaults, memory leaks, valgrind testing, etc.

Comments

@stevecheckoway
Contributor

This issue is mostly to keep track of my investigation of this oss-fuzz–found bug.

Minimal test case with `max_tree_depth: 1`:

<html><body>

The issue seems to occur when the tree depth increases beyond the limit due to a parser-inserted node while the current token still needs to be reprocessed in a new insertion mode.

In more detail, the gumbo parser operates on tokens produced by the tokenizer. The standard requires tokens to be reprocessed in a different insertion mode in a variety of circumstances, and in a few of those the parser inserts a node into the tree before reprocessing. If that insertion pushes the depth beyond the limit, the memory allocated for the token is leaked. (I think; I'm still debugging.)
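As a toy model of the hazard (a hypothetical Ruby sketch, not gumbo's actual C code; all names are invented for illustration): the depth check fires after the parser-inserted node but before the token is reprocessed, so ownership of the token's memory never passes to the tree:

```ruby
# Hypothetical, simplified model of the gumbo token loop; all names invented.
# Some tokens force the parser to insert a node (e.g. an implied <head>) and
# then reprocess the same token in a new insertion mode.
Token = Struct.new(:name, :owned_by_tree)

def handle_token(tree, token, max_depth)
  tree[:depth] += 1                  # parser-inserted node deepens the tree
  # Buggy ordering: bail out on depth BEFORE the token is reprocessed,
  # so the token's memory is never transferred to the DOM -- a leak in C.
  return :stopped if tree[:depth] > max_depth
  token.owned_by_tree = true         # reprocessing would adopt the token
  :ok
end

tree = { depth: 1 }                  # <html> already inserted
body = Token.new('body', false)
result = handle_token(tree, body, 1)
# result is :stopped and body.owned_by_tree is still false: the token leaked.
```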

Running this with gumbo debugging turned on (and some additional, uncommitted print statements),

require 'nokogiri'
html = '<html><body>'
doc = Nokogiri::HTML5.parse(html, max_tree_depth: 1)
puts(doc.to_html)

gives the following output:

Parsing <html><body>.
Lexing character '<' (60) in state 0.
Lexing character 'h' (104) in state 5.
Starting new tag.
Lexing character 'h' (104) in state 7.
Lexing character 't' (116) in state 7.
Lexing character 'm' (109) in state 7.
Lexing character 'l' (108) in state 7.
Lexing character '>' (62) in state 7.
Emitted start tag html.
Original text = <html>.
Handling html token @1:1 in state 0.
Adding parse error.
Handling html token @1:1 in state 1.
Inserting <html> element (@0x6000021141e0) from token.
Lexing character '<' (60) in state 0.
Lexing character 'b' (98) in state 5.
Starting new tag.
Lexing character 'b' (98) in state 7.
Lexing character 'o' (111) in state 7.
Lexing character 'd' (100) in state 7.
Lexing character 'y' (121) in state 7.
Lexing character '>' (62) in state 7.
Emitted start tag body.
Original text = <body>.
Handling body token @1:7 in state 2.
Current node: <html>.
Inserting head element (@0x600002114e60) from tag type.
start tag: 0x0
Tree depth limit exceeded.
Finishing parsingPopping head node.
Popping html node.

The key portion showing what's happening is

Handling body token @1:7 in state 2.
Current node: <html>.
Inserting head element (@0x600002114e60) from tag type.

`state 2` here should be insertion mode 2, namely `GUMBO_INSERTION_MODE_BEFORE_HEAD`. The "before head" insertion mode says a <body> token should be handled by inserting a head element and reprocessing in the "in head" insertion mode.

This pushes the tree depth to 2 and we stop parsing at that point. The memory allocated for the <body> token is never transferred to the DOM, so it leaks.

If the example is changed to <html><head>, then the parser handles the <head> token by creating a head element from it directly, and no memory is leaked even though the max tree depth is exceeded.

The solution seems to be to delay checking the max tree depth until the token has been completely processed. Unfortunately, that fix doesn't actually make the oss-fuzz test case pass: it causes an assert to fire, which I think indicates another bug, but I haven't tracked that down yet either.
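The ordering change can be sketched like this (again a hypothetical Ruby model, not the actual C patch): finish handling the token first, then check the depth, so the token's memory has an owner either way:

```ruby
# Hypothetical model of the fix: check max_tree_depth only AFTER the token
# has been completely handled, so its memory is always owned by the tree.
Token = Struct.new(:name, :owned_by_tree)

def handle_token(tree, token, max_depth)
  tree[:depth] += 1                  # parser-inserted node deepens the tree
  token.owned_by_tree = true         # reprocess: the tree adopts the token
  # Deferred check: still stop parsing at the limit, but only once the
  # token has been fully consumed.
  tree[:depth] > max_depth ? :stopped : :ok
end

tree = { depth: 1 }                  # <html> already inserted
body = Token.new('body', false)
result = handle_token(tree, body, 1)
# Parsing still stops at the limit, but the token is owned and can be freed.
```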

The oss-fuzz test case contains </tr$<a> as the final token.

Lexing character '<' (60) in state 0.
Lexing character '/' (47) in state 5.
Lexing character 't' (116) in state 6.
Starting new tag.
Lexing character 't' (116) in state 7.
Lexing character 'r' (114) in state 7.
Lexing character '$' (36) in state 7.
Lexing character '<' (60) in state 7.
Lexing character 'a' (97) in state 7.
Lexing character '>' (62) in state 7.
Emitted end tag .
Original text = </tr$<a>.
Handling  token @1:2888 in state 9.
Current node: <tr>.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Setting frameset_ok to false.
Reconstructing elements from 333 on tr parent.
Reconstructed a element at 333.
Flushing text node buffer of s>R>o	.
end token: 0x600003106020 151
Tree depth limit exceeded.
Finishing parsingPopping a node.
...

I haven't managed to minimize this test case yet, but the Reconstructing elements... and Reconstructed a element... lines involve inserting new nodes into the tree, and I think that's what's pushing it over the limit.
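The relevant point is that the "reconstruct the active formatting elements" step can insert several nodes for a single token. A toy sketch (hypothetical names, not gumbo's implementation) of how one token can overshoot the limit by more than one level:

```ruby
# Hypothetical sketch: reconstructing active formatting elements inserts one
# node per still-active formatting entry, so handling a single character
# token can deepen the tree by several levels before any text is flushed.
def reconstruct(tree, formatting_entries, max_depth)
  inserted = 0
  formatting_entries.each do |tag|
    tree[:depth] += 1        # each reconstructed element deepens the tree
    inserted += 1
    break if tree[:depth] > max_depth
  end
  inserted
end

tree = { depth: 1 }
inserted = reconstruct(tree, %w[a b i], 2)  # e.g. <a>, <b>, <i> still active
# The depth exceeds max_depth mid-token, before the token is fully handled.
```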

I'd like to understand this a bit better before I'm confident that delaying the end of the parse until the token is completely handled fixes this issue (and also figure out what is causing the assertion to fail on the oss-fuzz test case).

@stevecheckoway stevecheckoway self-assigned this Jan 16, 2024
stevecheckoway added a commit to stevecheckoway/nokogiri that referenced this issue Jan 16, 2024
When a token causes a node to be added to the DOM which increases the
depth of the DOM to exceed the `max_tree_depth` _and_ the token needs to
be reprocessed, memory is leaked. By delaying breaking out of the loop
until after the token has been completely handled, this appears to fix
the leak.

Fixes sparklemotion#3098
@flavorjones flavorjones added topic/memory Segfaults, memory leaks, valgrind testing, etc. topic/gumbo Gumbo HTML5 parser labels Jan 16, 2024
stevecheckoway added a commit that referenced this issue Jan 17, 2024
When a token causes a node to be added to the DOM which increases the
depth of the DOM to exceed the `max_tree_depth` _and_ the token needs to
be reprocessed, memory is leaked. By delaying breaking out of the loop
until after the token has been completely handled, this appears to fix
the leak.

Fixes #3098