Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cnode improvements and fixes #8

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

rinpatch
Copy link
Contributor

  • Refactor tree building to not be recursive, so it doesn't segfault
    because of running out of stack space on really big files
  • Fix Cnode crashing on <!----> due to myhtml_node_text returning a
    null pointer
  • Fix Cnode benchmarks

Cnode benchmarks on master:

benchmark name                iterations   average time
wikipedia_hyperlink.html 97k         200   7817.54 µs/op
w3c_html5.html 131k                  100   13872.61 µs/op
github_trending_js.html 341k          50   30911.76 µs/op

This branch:

benchmark name                iterations   average time
wikipedia_hyperlink.html 97k         500   7045.92 µs/op
w3c_html5.html 131k                  200   10438.43 µs/op
github_trending_js.html 341k         100   23686.21 µs/op

Apologies for stuffing all of this into one commit, I can split it into
separate patches if you don't wish some of it merged.

- Refactor tree building to not be recursive, so it doesn't segfault
because of running out of stack space on really big files
- Fix Cnode crashing on `<!---->` due to myhtml_node_text returning a
null pointer
- Fix Cnode benchmarks

Cnode benchmarks on master:
```
benchmark name                iterations   average time
wikipedia_hyperlink.html 97k         200   7817.54 µs/op
w3c_html5.html 131k                  100   13872.61 µs/op
github_trending_js.html 341k          50   30911.76 µs/op
```

This branch:

```
benchmark name                iterations   average time
wikipedia_hyperlink.html 97k         500   7045.92 µs/op
w3c_html5.html 131k                  200   10438.43 µs/op
github_trending_js.html 341k         100   23686.21 µs/op
```
c_src/myhtml_worker.c Show resolved Hide resolved
c_src/myhtml_worker.c Show resolved Hide resolved
if (tag_ns != MyHTML_NAMESPACE_HTML)
{
// tag_ns_name_ptr is unmodifyable, copy it in our tag_ns_buffer to make it modifyable.
tag_ns_buffer = malloc(tag_ns_len);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're in a new scope, so you can use a VLA. You also want to add a second byte for the nul terminator.

char tag_ns_buffer[tag_ns_len + tag_name_len + 2];
snprintf(tag_ns_buffer, sizeof tag_ns_buffer, "%s:%s", tag_ns_name_ptr, tag_name);
lowercase(tag_ns_buffer);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also want to add a second byte for the nul terminator.

Erlang binary strings are not null-terminated. Still had to add a second byte because snprintf always inserts a null-byte in the end, please tell if there is a way to make it not do that/if I should just use multiple strcpys that were there before

c_src/myhtml_worker.c Outdated Show resolved Hide resolved
c_src/myhtml_worker.c Outdated Show resolved Hide resolved
@Overbryd
Copy link
Owner

Hello everybody,

thank you for submitting the patches and engaging in the conversation.

I will review those up until the end of this week.

In general, I am very happy to see those improvements, and yes, my C is sadly not where it should be.
Great to see the refactoring therefore.

@kaniini
Copy link

kaniini commented Oct 24, 2019

would you be interested in further refactoring?

c_src/myhtml_worker.c Outdated Show resolved Hide resolved
@kaniini
Copy link

kaniini commented Oct 24, 2019

while we are here, we should replace all the strcpy and such with safe functions.

@rinpatch rinpatch force-pushed the cnode-improvements-and-fixes branch from 7a6372a to 395d292 Compare October 24, 2019 10:34
@rinpatch
Copy link
Contributor Author

Done

@rinpatch
Copy link
Contributor Author

would you be interested in further refactoring?

Not sure if this is directed at me or at @Overbryd , but I am up.

@Overbryd
Copy link
Owner

@rinpatch highly appreciate the efforts. Would you be up for a call sometime this or next week?

You can reach me at my Email address for that purpose, I would like to go over the changes and do a full review once you are done. I think its faster and we can have a better discussion that way.

Other people are invited to join the call too of course.

@rinpatch
Copy link
Contributor Author

To be honest, calls stress me out, I would much rather communicate over text. You can reach me over IMs if you want faster discussion though, I am @rin:matrix.heldscal.la on matrix and rinpatch on freenode

Copy link
Owner

@Overbryd Overbryd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully endorse the main two changes here:

  1. using a stack to build the tree, rather than recursive function calls

I now understand this should be safer and faster by resizing the stack for 30 ETERMS at once.

  1. Prevent empty comments from blowing up

I kindly ask for one cosmetic change, to split headers/implementation from tstack.h.

ETERM* tag;
ETERM* attrs;
tstack stack;
tstack_init(&stack, 30);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this change with a fellow programmer of mine, and now after following through and understanding, I think its wayyy smarter to use a stack here, than my previous implementation.

Makes perfect sense and should be much safer.

size_t size;
} tstack;

void tstack_init(tstack *stack, size_t initial_size) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the implementation of each function to c_src/tstack.c and keep the headers in here?

@@ -1,4 +1,8 @@
defmodule Myhtmlex.SafeTest do
use MyhtmlexSharedTests, module: Myhtmlex.Safe

test "doesn't segfault when <!----> is encountered" do
assert {"html", _attrs, _children} = Myhtmlex.decode("<div> <!----> </div>")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important fix, thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants