
Hitting the max tree depth can leak memory #3098

Closed
stevecheckoway opened this issue Jan 16, 2024 · 0 comments · Fixed by #3100
Assignees
Labels
topic/gumbo Gumbo HTML5 parser topic/memory Segfaults, memory leaks, valgrind testing, etc.

Comments

@stevecheckoway
Contributor

This issue is mostly to keep track of my investigation of this oss-fuzz–found bug.

Minimal test case with `max_tree_depth: 1`:

<html><body>

The issue seems to occur when the tree depth increases beyond the limit due to a parser-inserted node while the current token still needs to be reprocessed in a new insertion mode.

In more detail, the gumbo parser operates on tokens produced by the tokenizer. The standard requires tokens to be reprocessed in a different insertion mode in a variety of circumstances, and in a few of those the parser inserts a node into the tree before reprocessing. If that insertion pushes the depth beyond the limit, the memory allocated for the token is leaked. (I think; I'm still debugging.)
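As a toy model of the hazard (a hypothetical Ruby sketch, not gumbo's actual C code; all names are invented for illustration): the depth check fires after the parser-inserted node but before the token is reprocessed, so ownership of the token's memory never passes to the tree:

```ruby
# Hypothetical, simplified model of the gumbo token loop; all names invented.
# Some tokens force the parser to insert a node (e.g. an implied <head>) and
# then reprocess the same token in a new insertion mode.
Token = Struct.new(:name, :owned_by_tree)

def handle_token(tree, token, max_depth)
  tree[:depth] += 1                  # parser-inserted node deepens the tree
  # Buggy ordering: bail out on depth BEFORE the token is reprocessed,
  # so the token's memory is never transferred to the DOM -- a leak in C.
  return :stopped if tree[:depth] > max_depth
  token.owned_by_tree = true         # reprocessing would adopt the token
  :ok
end

tree = { depth: 1 }                  # <html> already inserted
body = Token.new('body', false)
result = handle_token(tree, body, 1)
# result is :stopped and body.owned_by_tree is still false: the token leaked.
```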

Running this with gumbo debugging turned on (and some additional, uncommitted print statements),

require 'nokogiri'
html = '<html><body>'
doc = Nokogiri::HTML5.parse(html, max_tree_depth: 1)
puts(doc.to_html)

gives the following output:

Parsing <html><body>.
Lexing character '<' (60) in state 0.
Lexing character 'h' (104) in state 5.
Starting new tag.
Lexing character 'h' (104) in state 7.
Lexing character 't' (116) in state 7.
Lexing character 'm' (109) in state 7.
Lexing character 'l' (108) in state 7.
Lexing character '>' (62) in state 7.
Emitted start tag html.
Original text = <html>.
Handling html token @1:1 in state 0.
Adding parse error.
Handling html token @1:1 in state 1.
Inserting <html> element (@0x6000021141e0) from token.
Lexing character '<' (60) in state 0.
Lexing character 'b' (98) in state 5.
Starting new tag.
Lexing character 'b' (98) in state 7.
Lexing character 'o' (111) in state 7.
Lexing character 'd' (100) in state 7.
Lexing character 'y' (121) in state 7.
Lexing character '>' (62) in state 7.
Emitted start tag body.
Original text = <body>.
Handling body token @1:7 in state 2.
Current node: <html>.
Inserting head element (@0x600002114e60) from tag type.
start tag: 0x0
Tree depth limit exceeded.
Finishing parsingPopping head node.
Popping html node.

The key portion showing what's happening is

Handling body token @1:7 in state 2.
Current node: <html>.
Inserting head element (@0x600002114e60) from tag type.

`state 2` here should be insertion mode 2, namely `GUMBO_INSERTION_MODE_BEFORE_HEAD`. The "before head" insertion mode says a <body> token should be handled by inserting a head element and reprocessing in the "in head" insertion mode.

This pushes the tree depth to 2 and we stop parsing at that point. The memory allocated for the <body> token is never transferred to the DOM, so it leaks.

If the example is changed to <html><head>, then the parser handles the <head> token by creating a head element from it directly, and no memory is leaked even though the max tree depth is exceeded.

The solution seems to be to delay checking the max tree depth until the token has been completely processed. Unfortunately, that fix doesn't actually make the oss-fuzz test case pass: it causes an assert to fire, which I think indicates another bug, but I haven't tracked that down yet either.
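The ordering change can be sketched like this (again a hypothetical Ruby model, not the actual C patch): finish handling the token first, then check the depth, so the token's memory has an owner either way:

```ruby
# Hypothetical model of the fix: check max_tree_depth only AFTER the token
# has been completely handled, so its memory is always owned by the tree.
Token = Struct.new(:name, :owned_by_tree)

def handle_token(tree, token, max_depth)
  tree[:depth] += 1                  # parser-inserted node deepens the tree
  token.owned_by_tree = true         # reprocess: the tree adopts the token
  # Deferred check: still stop parsing at the limit, but only once the
  # token has been fully consumed.
  tree[:depth] > max_depth ? :stopped : :ok
end

tree = { depth: 1 }                  # <html> already inserted
body = Token.new('body', false)
result = handle_token(tree, body, 1)
# Parsing still stops at the limit, but the token is owned and can be freed.
```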

The oss-fuzz test case contains </tr$<a> as the final token.

Lexing character '<' (60) in state 0.
Lexing character '/' (47) in state 5.
Lexing character 't' (116) in state 6.
Starting new tag.
Lexing character 't' (116) in state 7.
Lexing character 'r' (114) in state 7.
Lexing character '$' (36) in state 7.
Lexing character '<' (60) in state 7.
Lexing character 'a' (97) in state 7.
Lexing character '>' (62) in state 7.
Emitted end tag .
Original text = </tr$<a>.
Handling  token @1:2888 in state 9.
Current node: <tr>.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Setting frameset_ok to false.
Reconstructing elements from 333 on tr parent.
Reconstructed a element at 333.
Flushing text node buffer of s>R>o	.
end token: 0x600003106020 151
Tree depth limit exceeded.
Finishing parsingPopping a node.
...

I haven't managed to minimize this test case yet, but the Reconstructing elements... and Reconstructed a element... lines involve inserting new nodes into the tree, and I think that's what's pushing it over the limit.
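The relevant point is that the "reconstruct the active formatting elements" step can insert several nodes for a single token. A toy sketch (hypothetical names, not gumbo's implementation) of how one token can overshoot the limit by more than one level:

```ruby
# Hypothetical sketch: reconstructing active formatting elements inserts one
# node per still-active formatting entry, so handling a single character
# token can deepen the tree by several levels before any text is flushed.
def reconstruct(tree, formatting_entries, max_depth)
  inserted = 0
  formatting_entries.each do |tag|
    tree[:depth] += 1        # each reconstructed element deepens the tree
    inserted += 1
    break if tree[:depth] > max_depth
  end
  inserted
end

tree = { depth: 1 }
inserted = reconstruct(tree, %w[a b i], 2)  # e.g. <a>, <b>, <i> still active
# The depth exceeds max_depth mid-token, before the token is fully handled.
```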

I'd like to understand this a bit better before I'm confident that delaying the end of the parse until the token is completely handled fixes this issue (and also figure out what is causing the assertion to fail on the oss-fuzz test case).

@stevecheckoway stevecheckoway self-assigned this Jan 16, 2024
stevecheckoway added a commit to stevecheckoway/nokogiri that referenced this issue Jan 16, 2024
When a token causes a node to be added to the DOM which increases the
depth of the DOM to exceed the `max_tree_depth` _and_ the token needs to
be reprocessed, memory is leaked. By delaying breaking out of the loop
until after the token has been completely handled, this appears to fix
the leak.

Fixes sparklemotion#3098
@flavorjones flavorjones added topic/memory Segfaults, memory leaks, valgrind testing, etc. topic/gumbo Gumbo HTML5 parser labels Jan 16, 2024
stevecheckoway added a commit that referenced this issue Jan 17, 2024
When a token causes a node to be added to the DOM which increases the
depth of the DOM to exceed the `max_tree_depth` _and_ the token needs to
be reprocessed, memory is leaked. By delaying breaking out of the loop
until after the token has been completely handled, this appears to fix
the leak.

Fixes #3098