Memory leak #64
Comments
Sorry, I could not reproduce this.
@AnatoliiVorobiov, could you provide your HTML file so that the source of the error can be found more easily? I am also using this module in my project, and there are other people like me who are likely to run into the same issue. Thanks.
This should be resolved since v3.1.1.
I have had a problem parsing 1000 HTML documents, one at a time (removing the reference to the previous one); Node.js throws a 'heap out of memory' error.
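As a rough sketch of the pattern being described (parsing the documents sequentially and dropping the reference to the previous result before parsing the next one) — the `loadNextDocument` and `handleDocument` names below are placeholders, not code from the original report:

```js
const { parse } = require('node-html-parser');

let root = null;
for (let i = 0; i < 1000; i++) {
  const html = loadNextDocument(i); // placeholder for however the HTML is loaded
  root = parse(html);               // reassigning drops the reference to the previous AST
  handleDocument(root);             // placeholder for the per-document work
}
root = null;                        // release the last AST as well
```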
In what case?
@mpcabete Sorry to hear you're having issues. I can't think of anywhere in the codebase where such a leak would be happening, but I'd be happy to take a look. Can you provide a snippet of code or a repo that will reproduce the issue?
I also get this error, and I don't know if the problem is in my code. For instance, parsing one million '<foo></foo>' strings succeeds, but adding just one child element, i.e. one million '<foo><bar/></foo>' strings, crashes. I also tried increasing the V8 heap size.

Steps to reproduce:

```js
// Linux 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
// node v16.15.0
// node v18.12.1
// npm 8.5.5
// [email protected]
var { parse } = require('node-html-parser');

// This succeeds:
var a = [];
for (var i = 0; i < 10**6; i++) a.push(parse('<foo></foo>'));

// This crashes:
var a = [];
for (var i = 0; i < 10**6; i++) a.push(parse('<foo><bar/></foo>'));
```
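To put rough numbers on the difference between the two loops, one option is to sample process.memoryUsage() around a smaller batch. A minimal sketch, assuming a batch of 10**4 documents and an optional `--expose-gc` run for stabler numbers (both are my own choices, not from the original report):

```js
const { parse } = require('node-html-parser');

function measure(html, count) {
  global.gc && global.gc();                   // only available when run with `node --expose-gc`
  const before = process.memoryUsage().heapUsed;
  const kept = [];
  for (let i = 0; i < count; i++) kept.push(parse(html));
  global.gc && global.gc();
  const after = process.memoryUsage().heapUsed;
  console.log(html, ((after - before) / count).toFixed(0), 'bytes per document (approx)');
  return kept;                                // returned so the ASTs stay alive while measuring
}

measure('<foo></foo>', 10 ** 4);
measure('<foo><bar/></foo>', 10 ** 4);
```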
A leak is where something which should be garbage collected is retained. In your case, you add each result to an array, so the memory would not be freed. In terms of memory usage, you're effectively parsing 1 million HTML documents, which creates a full AST for each. This would be multiple nodes that are class instances and would certainly take more memory than a single string.

Generally speaking, in cases like needing to perform actions on millions of records, you're best suited to process one item or a reasonable chunk of items per thread and proceed to the next item/chunk when finished, so you're not filling up memory needlessly.

It could be worth running a profiler to investigate memory leaks, but leaks are rare and I haven't seen any indication of them. As for bloat, that could be worth investigating. I'd need to see some numbers showing memory size per node plus base document AST size. If it seems off, I can check it out. Unfortunately, I'm insanely backlogged and I only volunteer on this when I'm able, so I'd need specific and compelling data.
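A single-threaded sketch of that chunked approach: parse one chunk of pages, keep only the small values you need, and let each chunk's ASTs become garbage before moving on. The chunkSize, the title extraction, and the processChunk/processAllPages names are illustrative assumptions, not part of node-html-parser.

```js
const { parse } = require('node-html-parser');

// Parse one chunk of pages and keep only the small values we need,
// so each chunk's ASTs can be garbage collected before the next chunk.
function processChunk(htmlPages) {
  return htmlPages.map((html) => {
    const root = parse(html);
    const title = root.querySelector('title');
    return title ? title.text : null;
  });
}

async function processAllPages(htmlPages, chunkSize = 100) {
  const results = [];
  for (let i = 0; i < htmlPages.length; i += chunkSize) {
    results.push(...processChunk(htmlPages.slice(i, i + chunkSize)));
    // Yield to the event loop between chunks so other work can run.
    await new Promise((resolve) => setImmediate(resolve));
  }
  return results;
}
```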
I parsed many different pages and tried to get all the links from them. I collect all acquired URLs into an array, but over time I got a 'heap out of memory' error. I made a memory dump and discovered that your library returns sliced strings in some cases: there are situations where I store small strings, but they stay linked to the large strings they were sliced from, which causes the out-of-memory error over time. I was using the 'getAttribute' function. I recommend adding a note to the docs so that users can avoid this situation in the future, or creating an additional function that returns a deep copy.
Here is code that can reproduce it; any HTML with links can be used:
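The reproduction code itself was not preserved in this thread. As a rough sketch of the pattern described above, together with the suggested deep-copy workaround (forcing a standalone copy of each attribute value so it no longer references the full page source; the Buffer round-trip is a common V8 idiom, not an API of this library):

```js
const { parse } = require('node-html-parser');

// Collect all link URLs from one page. As described in the report above,
// getAttribute() may return a string that is still linked to the large
// source string; copying it through a Buffer forces a standalone allocation.
function collectLinks(html) {
  const urls = [];
  for (const a of parse(html).querySelectorAll('a')) {
    const href = a.getAttribute('href');
    if (href) urls.push(Buffer.from(href, 'utf8').toString('utf8'));
  }
  return urls;
}
```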