Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline nodes can reference text data of parent block #309

Closed
nwellnhof opened this issue Sep 15, 2019 · 11 comments · Fixed by #326
Closed

Inline nodes can reference text data of parent block #309

nwellnhof opened this issue Sep 15, 2019 · 11 comments · Fixed by #326

Comments

@nwellnhof
Copy link
Contributor

When parsing inline nodes, some pieces of text are kept in the parent block's content buffer and referenced from inline children as non-allocated cmark_chunk pointing directly into the parent's buffer. If the parent is freed, these pointers become invalid. This can lead to memory corruption. for example when moving inline nodes to another tree and deleting the old parent.

I'd suggest to copy all text data to inline children (i. e. use chunk_clone instead of chunk_dup). Then we could also think about removing the content field for block nodes.

@jgm
Copy link
Member

jgm commented Sep 15, 2019

Sounds good. Do you want to submit a PR?

@jgm
Copy link
Member

jgm commented Sep 15, 2019

I wonder what performance impact this would have on normal use. (More allocations.)

nwellnhof added a commit to nwellnhof/cmark that referenced this issue Jan 19, 2020
Use zero-terminated C strings and a separate length field instead of
cmark_chunks. Literal inline text will now be copied from the parent
block's content buffer, resulting in a slight overhead. The node struct
never references memory of other nodes now, fixing commonmark#309. Node accessors
don't have to check for delayed creation of C strings.
nwellnhof added a commit to nwellnhof/cmark that referenced this issue Jan 19, 2020
Use zero-terminated C strings and a separate length field instead of
cmark_chunks. Literal inline text will now be copied from the parent
block's content buffer, resulting in a slight overhead. The node struct
never references memory of other nodes now, fixing commonmark#309. Node accessors
don't have to check for delayed creation of C strings.
@nwellnhof
Copy link
Contributor Author

Here's a branch exploring the idea: https://github.com/nwellnhof/cmark/commits/rework-node-struct

The additional allocations cause about 10-15% overhead on my machine. Some other improvements bring this down to 5-10%. Note that the slowdown should only be visible with the built-in renderers. Parsing and iterating all literals using the public API should be faster than before.

I really like some of the simplifications in the branch. But if we want to avoid the slowdown, another approach is to clone literals when an inline node is unlinked.

@jgm
Copy link
Member

jgm commented Jan 19, 2020

Simpler is good. I'm for it, even if there's a small slowdown.

@jgm
Copy link
Member

jgm commented Jan 19, 2020

Can you submit your branch as a PR?

nwellnhof added a commit to nwellnhof/cmark that referenced this issue Jan 23, 2020
Use zero-terminated C strings and a separate length field instead of
cmark_chunks. Literal inline text will now be copied from the parent
block's content buffer, slowing the benchmark down by 10-15%.

The node struct never references memory of other nodes now, fixing commonmark#309.
Node accessors don't have to check for delayed creation of C strings,
so parsing and iterating all literals using the public API should
actually be faster than before.
@jgm jgm closed this as completed in #326 Jan 23, 2020
jgm pushed a commit that referenced this issue Jan 23, 2020
Use zero-terminated C strings and a separate length field instead of
cmark_chunks. Literal inline text will now be copied from the parent
block's content buffer, slowing the benchmark down by 10-15%.

The node struct never references memory of other nodes now, fixing #309.
Node accessors don't have to check for delayed creation of C strings,
so parsing and iterating all literals using the public API should
actually be faster than before.
@jgm
Copy link
Member

jgm commented Jan 23, 2020

Here's what I measured in benchmarks (make bench):

before this change: 1.33 mean
after: 1.54 mean

That's about 18% -- were you getting different measurements for the performance impact?

@jgm
Copy link
Member

jgm commented Jan 23, 2020

Even with the impact, I think the change is a good idea, but it's more than I'd hoped.

@nwellnhof
Copy link
Contributor Author

nwellnhof commented Jan 23, 2020

On Linux, I get 1.21 before and 1.29 after.

@nwellnhof
Copy link
Contributor Author

Under MinGW: 2.22 before, 2.31 after

@nwellnhof
Copy link
Contributor Author

On my MacBook, the results of each run vary a lot. But the slowdown doesn't seem higher than 10%.

@jgm
Copy link
Member

jgm commented Jan 24, 2020

Strange. My results are pretty consistent with what I reported on my old macbook pro.
On linux I see 0.169 old, 0.180 new, 6%. That's reassuring.

ferdnyc pushed a commit to ferdnyc/cmark that referenced this issue Oct 19, 2023
Use zero-terminated C strings and a separate length field instead of
cmark_chunks. Literal inline text will now be copied from the parent
block's content buffer, slowing the benchmark down by 10-15%.

The node struct never references memory of other nodes now, fixing commonmark#309.
Node accessors don't have to check for delayed creation of C strings,
so parsing and iterating all literals using the public API should
actually be faster than before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants