Inline nodes can reference text data of parent block #309

nwellnhof · 2019-09-15T16:43:48Z

When parsing inline nodes, some pieces of text are kept in the parent block's content buffer and referenced from inline children as non-allocated cmark_chunks pointing directly into the parent's buffer. If the parent is freed, these pointers become invalid. This can lead to memory corruption, for example when moving inline nodes to another tree and deleting the old parent.
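To make the hazard concrete, here is a minimal, self-contained sketch. The `chunk` struct and the `chunk_borrow`, `chunk_copy` and `chunk_release` helpers are hypothetical stand-ins written for this illustration, not cmark's actual internals. The sketch only contrasts a view that points into the parent's buffer with a private copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a chunk; NOT the real cmark definition. */
typedef struct {
  char *data;
  size_t len;
  int owned; /* nonzero if data was allocated for this chunk */
} chunk;

/* Borrow: point into someone else's buffer, no allocation. */
static chunk chunk_borrow(char *buf, size_t pos, size_t len) {
  chunk c = { buf + pos, len, 0 };
  return c;
}

/* Copy: allocate a private, zero-terminated copy of the bytes. */
static chunk chunk_copy(const char *buf, size_t pos, size_t len) {
  chunk c;
  c.data = malloc(len + 1);
  assert(c.data != NULL);
  memcpy(c.data, buf + pos, len);
  c.data[len] = '\0';
  c.len = len;
  c.owned = 1;
  return c;
}

/* Free the chunk's memory only if the chunk owns it. */
static void chunk_release(chunk *c) {
  if (c->owned)
    free(c->data);
  c->data = NULL;
  c->len = 0;
  c->owned = 0;
}

/* Returns 1 if the copied chunk is still readable after the parent
 * buffer has been freed. The borrowed chunk is deliberately never
 * touched after the free, because its pointer is dangling by then. */
static int copy_survives_parent_free(void) {
  char *parent = malloc(12);
  assert(parent != NULL);
  memcpy(parent, "hello world", 12);

  chunk borrowed = chunk_borrow(parent, 6, 5);
  chunk copied = chunk_copy(parent, 6, 5);

  /* Before the free, the borrowed chunk aliases the parent's buffer. */
  int aliases = (borrowed.data == parent + 6);

  free(parent); /* borrowed.data is invalid from here on */

  int ok = aliases && copied.len == 5 &&
           memcmp(copied.data, "world", 5) == 0;
  chunk_release(&copied);
  return ok;
}
```

Reading through the borrowed chunk after `free(parent)` would be undefined behavior, which is the situation described above when an inline node outlives its old parent.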

I'd suggest copying all text data to inline children (i.e. using chunk_clone instead of chunk_dup). Then we could also think about removing the content field for block nodes.

jgm · 2019-09-15T19:31:20Z

Sounds good. Do you want to submit a PR?

jgm · 2019-09-15T19:32:07Z

I wonder what performance impact this would have on normal use. (More allocations.)

Use zero-terminated C strings and a separate length field instead of cmark_chunks. Literal inline text will now be copied from the parent block's content buffer, resulting in a slight overhead. The node struct never references memory of other nodes now, fixing commonmark#309. Node accessors don't have to check for delayed creation of C strings.

nwellnhof · 2020-01-19T13:54:47Z

Here's a branch exploring the idea: https://github.com/nwellnhof/cmark/commits/rework-node-struct

The additional allocations cause about 10-15% overhead on my machine. Some other improvements bring this down to 5-10%. Note that the slowdown should only be visible with the built-in renderers. Parsing and iterating all literals using the public API should be faster than before.

I really like some of the simplifications in the branch. But if we want to avoid the slowdown, another approach is to clone literals when an inline node is unlinked.
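The alternative mentioned here, cloning literals only when an inline node is unlinked, could look roughly like the following sketch. The `lit_node` struct, its fields and `lit_node_unlink` are assumptions made up for this example; the branch itself does not take this route.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node carrying a literal; names are illustrative only. */
typedef struct lit_node {
  char *data;              /* literal bytes, possibly borrowed */
  size_t len;
  int owned;               /* nonzero once data is a private copy */
  struct lit_node *parent;
} lit_node;

/* Detach a node from its parent. If the literal still points into the
 * parent's buffer, replace it with a private zero-terminated copy
 * first. Returns 0 on allocation failure, 1 otherwise. */
static int lit_node_unlink(lit_node *n) {
  if (!n->owned && n->data != NULL) {
    char *copy = malloc(n->len + 1);
    if (copy == NULL)
      return 0;
    memcpy(copy, n->data, n->len);
    copy[n->len] = '\0';
    n->data = copy;
    n->owned = 1;
  }
  n->parent = NULL;
  return 1;
}

/* Self-check: the child's literal survives freeing the parent buffer. */
static int unlink_then_free_parent_ok(void) {
  char *buf = malloc(8);
  assert(buf != NULL);
  memcpy(buf, "abc def", 8);

  lit_node parent = { buf, 7, 1, NULL };
  lit_node child = { buf + 4, 3, 0, &parent };

  if (!lit_node_unlink(&child))
    return 0;
  free(buf); /* the old parent's content goes away */

  int ok = child.owned && child.parent == NULL &&
           strcmp(child.data, "def") == 0;
  free(child.data);
  return ok;
}
```

In this sketch the common parse-and-render path pays nothing extra, and the copy happens only for nodes that actually leave their tree. The price is an ownership flag that every accessor and destructor has to respect.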

jgm · 2020-01-19T16:34:15Z

Simpler is good. I'm for it, even if there's a small slowdown.

jgm · 2020-01-19T16:34:45Z

Can you submit your branch as a PR?

Use zero-terminated C strings and a separate length field instead of cmark_chunks. Literal inline text will now be copied from the parent block's content buffer, slowing the benchmark down by 10-15%. The node struct never references memory of other nodes now, fixing commonmark#309. Node accessors don't have to check for delayed creation of C strings, so parsing and iterating all literals using the public API should actually be faster than before.
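As a rough illustration of the layout this commit message describes, the sketch below stores a literal as an owned, zero-terminated buffer plus a separate length. The `text_node` struct and the `text_node_*` functions are hypothetical names chosen for this example and do not reproduce the actual cmark node struct.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical inline node; field names are illustrative only. */
typedef struct {
  unsigned char *data; /* owned, zero-terminated copy of the literal */
  size_t len;          /* length in bytes, excluding the terminator */
} text_node;

/* Copy len bytes from src (e.g. a slice of the parent block's content
 * buffer) into the node. Returns 0 on allocation failure. */
static int text_node_set(text_node *n, const unsigned char *src, size_t len) {
  unsigned char *copy = malloc(len + 1);
  if (copy == NULL)
    return 0;
  if (len > 0)
    memcpy(copy, src, len);
  copy[len] = '\0';
  free(n->data); /* release any previous literal */
  n->data = copy;
  n->len = len;
  return 1;
}

/* Accessor: the string is already zero-terminated, so there is no
 * lazily created C string to check for. */
static const char *text_node_literal(const text_node *n) {
  return n->data ? (const char *)n->data : "";
}

static void text_node_clear(text_node *n) {
  free(n->data);
  n->data = NULL;
  n->len = 0;
}

/* Self-check: copy a slice out of a larger buffer and read it back. */
static int text_node_roundtrip_ok(void) {
  const unsigned char buf[] = "some *emphasized* text";
  text_node n = { NULL, 0 };
  if (!text_node_set(&n, buf + 6, 10))
    return 0;
  int ok = n.len == 10 && strcmp(text_node_literal(&n), "emphasized") == 0;
  text_node_clear(&n);
  return ok && strcmp(text_node_literal(&n), "") == 0;
}
```

The extra `malloc` and `memcpy` per literal is where the overhead mentioned above comes from. In exchange, the node never points into memory owned by another node.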

jgm · 2020-01-23T16:30:09Z

Here's what I measured in benchmarks (make bench):

before this change: 1.33 mean
after: 1.54 mean

That's about 16%. Were you getting different measurements for the performance impact?

jgm · 2020-01-23T16:31:01Z

Even with the impact, I think the change is a good idea, but it's more than I'd hoped.

nwellnhof · 2020-01-23T17:55:42Z

On Linux, I get 1.21 before and 1.29 after.

nwellnhof · 2020-01-23T18:44:58Z

Under MinGW: 2.22 before, 2.31 after

nwellnhof · 2020-01-23T19:22:06Z

On my MacBook, the results of each run vary a lot. But the slowdown doesn't seem higher than 10%.

jgm · 2020-01-24T05:36:46Z

Strange. My results are pretty consistent with what I reported on my old MacBook Pro.
On Linux I see 0.169 old, 0.180 new, about 6%. That's reassuring.
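For comparing the figures reported in this thread, the relative slowdown can be computed as (after - before) / before. The helper below is a small sketch written for this purpose; `slowdown_percent` is a made-up name, and the inputs are the `make bench` means quoted in the comments above.

```c
#include <assert.h>
#include <stdio.h>

/* Relative slowdown in percent: (after - before) / before * 100. */
static double slowdown_percent(double before, double after) {
  assert(before > 0.0);
  return (after - before) / before * 100.0;
}

/* Prints the slowdown for each pair of means quoted in the thread. */
static void print_slowdowns(void) {
  printf("jgm, first report:  %.1f%%\n", slowdown_percent(1.33, 1.54));   /* 15.8% */
  printf("nwellnhof, Linux:   %.1f%%\n", slowdown_percent(1.21, 1.29));   /* 6.6% */
  printf("nwellnhof, MinGW:   %.1f%%\n", slowdown_percent(2.22, 2.31));   /* 4.1% */
  printf("jgm, Linux:         %.1f%%\n", slowdown_percent(0.169, 0.180)); /* 6.5% */
}
```

The spread between machines suggests the cost of the extra allocations depends heavily on the platform's allocator.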

nwellnhof mentioned this issue Jan 19, 2020

Rework node struct #326

Merged

jgm closed this as completed in #326 Jan 23, 2020
