
Memory leak #525

Closed
nerg4l opened this issue Nov 27, 2023 · 8 comments

Comments

@nerg4l
Contributor

nerg4l commented Nov 27, 2023

I was trying to fetch oEmbed information in bulk, but I ran into multiple memory issues.

\HtmlParser\Parser

First, libxml_use_internal_errors(true) uses an internal buffer which can fill up when libxml_clear_errors() is never called. The top-voted "User Contributed Note" on https://www.php.net/manual/en/function.libxml-use-internal-errors.php mentions this:

When using this function, be sure to clear your internal error buffer. If you don't, and you are using this in a long running process, you may find that all your memory is used up.

https://github.com/oscarotero/html-parser/blob/0c5b619bdc7ac061f06a667d913e2af708ee3231/src/Parser.php#L77-L87

I fixed that by parsing in batches and calling libxml_clear_errors() between parses.
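The workaround can be sketched roughly like this (a minimal standalone example, not the actual html-parser code; the `parseBatch` helper is hypothetical):

```php
<?php
// With libxml_use_internal_errors(true), every parse error is appended to
// an internal buffer. In a long-running loop that buffer grows without
// bound unless it is drained with libxml_clear_errors().

function parseBatch(array $htmlDocuments): array
{
    $previous = libxml_use_internal_errors(true);
    $parsed = [];

    foreach ($htmlDocuments as $html) {
        $document = new DOMDocument();
        $document->loadHTML($html);
        $parsed[] = $document;

        // Drain the error buffer after each document so it cannot grow.
        libxml_clear_errors();
    }

    libxml_use_internal_errors($previous);

    return $parsed;
}
```

The key point is simply that libxml_clear_errors() runs once per parsed document rather than once at the end (or never).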

\Embed\Http\CurlDispatcher

I wasn't able to completely verify the source of this leak. However, the error complains about stream_get_contents, and profiling does show that this method uses the most memory.

I replaced \Embed\Http\CurlDispatcher with \GuzzleHttp\Client and memory usage fell from 300+ MB to only 84 MB.

@oscarotero
Owner

oscarotero commented Nov 28, 2023

Hi @nerg4l
Thanks for this!

Do you want to work on a PR to fix the Html parser?

And about CurlDispatcher, I'm not sure why this can happen. The $this->body property is a stream stored in memory:
https://github.com/oscarotero/Embed/blob/master/src/Http/CurlDispatcher.php#L202

Maybe, after passing the stream to the response, it should be closed and cleared here?
https://github.com/oscarotero/Embed/blob/master/src/Http/CurlDispatcher.php#L139
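The idea above could look roughly like this. This is a hypothetical standalone sketch (the `Dispatcher` class and `getResponseBody` method are invented for illustration, not the real CurlDispatcher API):

```php
<?php
// cURL writes the response body into an in-memory temp stream. If the
// dispatcher keeps that stream around after the response is built, the
// buffer stays alive between requests. Closing it and dropping the
// reference lets the memory be reclaimed.

final class Dispatcher
{
    /** @var resource|null */
    private $body;

    public function __construct(string $content)
    {
        // Simulate cURL filling the in-memory body stream.
        $this->body = fopen('php://temp', 'r+');
        fwrite($this->body, $content);
        rewind($this->body);
    }

    public function getResponseBody(): string
    {
        $content = stream_get_contents($this->body);

        // Close and clear the stream once it has been handed over,
        // so its buffer can be collected between requests.
        fclose($this->body);
        $this->body = null;

        return $content;
    }
}
```

Whether this actually fixes the leak depends on where the last reference to the stream lives, which is exactly what the comments below try to narrow down.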

@nerg4l
Contributor Author

nerg4l commented Nov 29, 2023

Hello,

I will open a PR later today to fix the parser.

About the other leak, I'm not sure the issue is with the stream function. I tried calling fclose on the body attribute, but it did not help. I will run more tests to pinpoint the exact cause.

@nerg4l
Contributor Author

nerg4l commented Nov 29, 2023

I'm not sure if #527 solves the problem, but it definitely reduces the occurrence.

@nerg4l
Contributor Author

nerg4l commented Nov 30, 2023

I just made this change 51b1aea
Can you test the memory usage?

#527 (comment)

Let's continue the conversation here.

Performance after the recent changes:

$embed = new Embed();
for ($i = 0; $i < 300; $i++) {
    $embed->getMulti('https://oscarotero.com', 'https://oscarotero.com');
}
memory_get_peak_usage(); // 11951856 B
// Time: 01:01.863, Memory: 12.00 MB

With Guzzle client:

$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
for ($i = 0; $i < 300; $i++) {
    $embed->getMulti('https://oscarotero.com', 'https://oscarotero.com');
}
memory_get_peak_usage(); // 13529336 B
// Time: 00:15.833, Memory: 16.00 MB

Calling gc_collect_cycles() once in a while can reduce peak memory usage to about 8.00 MB.

Would you accept a PR adding a getClientFactory method, falling back to CurlClient only when necessary? That time difference is quite huge. In #527 I was testing Embed::get(), and the above is for Embed::getMulti().

@oscarotero
Owner

oscarotero commented Nov 30, 2023

Would you accept a PR for adding a getClientFactory method and only fall back to CurlClient when it is necessary? That time difference is quite huge.

Okay, but I can't understand this difference. Guzzle uses cURL under the hood, so I don't see why this happens. Could it be that Guzzle has a cache to avoid performing the same query twice? In your tests, you're making all the requests to the same URL. The Embed implementation consumes 4 MB less memory; the difference is only in the time.

@nerg4l
Contributor Author

nerg4l commented Nov 30, 2023

Tested with 300 YouTube links:

$embed = new Embed();
memory_get_peak_usage(); // 105359848
// Time: 04:26.202, Memory: 114.00 MB

$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
// PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 778240 bytes) in /.../oscarotero/Embed/vendor/oscarotero/html-parser/src/Parser.php on line 42

I had to add gc_collect_cycles() to avoid memory exhaustion with Guzzle. The memory figures are less meaningful due to the forced GC.

$embed = new Embed();
// Time: 04:56.609, Memory: 42.00 MB

$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
// Time: 03:08.485, Memory: 34.00 MB

It looks like you were partially correct, and now Guzzle's memory usage is the problematic one.

Code:

$n = count($this->links);
gc_collect_cycles();
for ($i = 0; $i < $n; $i += 2) {
    $embed->getMulti($this->links[$i], $this->links[$i + 1]);
    if ($i % 20 === 0) {
        gc_collect_cycles(); // only for the second (Guzzle) case
    }
}

@oscarotero
Owner

Great, so I guess the default CurlDispatcher is good enough, right?
Let me know if you want to do more tests before releasing a new version.

@nerg4l
Contributor Author

nerg4l commented Nov 30, 2023

I think it's good to go. I will open a new issue if needed.

nerg4l closed this as completed on Nov 30, 2023