Best way to avoid (sliced string) ? #711

Closed
alexdima opened this issue Jul 7, 2017 · 5 comments

@alexdima

alexdima commented Jul 7, 2017

  • Node.js Version: 7.4.0
  • v8 Version: 5.6.326.50
  • OS: macOS Sierra
  • Scope (install, code, runtime, meta, other?): basic string manipulation
  • Module (and version) (if relevant): Buffer

Hi, I'm working on VS Code (based on Electron), and I'm looking into improving our memory usage when dealing with large files in microsoft/vscode#30180.

Our buffer implementation is basically using an array of lines. I am aware of the advantages and disadvantages of that, but I would still like to push it to its limits. Our file reading involves reading chunks and pushing those through iconv-lite to handle file encoding. Long story short, we have a bunch of ~64KB strings that we need to split into lines.

The fastest way (that doesn't involve a native C++ node module) I've found so far is a simple str.split(/\r\n|\r|\n/). This works very well, but it ends up creating a (sliced string) for each line, all of which point to the parent chunk. When dealing with files of 3 million lines, these objects add up, and eliminating them can save a few tens of MB of memory.

Our current workaround to rid ourselves of the (sliced string) is here:

var lines = largeStr.split(/\r\n|\r|\n/);
for (var i = 0, len = lines.length; i < len; i++) {
    lines[i] = Buffer.from(lines[i]).toString();
}

I don't know if the above takes advantage of string interning or if it is the most efficient way to do this short of writing a native node module.

Do you have any idea? Thank you.
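
The workaround above can be wrapped in a small helper so that the copy happens only where lines are built. This is a minimal sketch: the name `splitLinesDetached` is hypothetical (it is not taken from VS Code), and whether each returned line is really a standalone string depends on V8 internals that plain JavaScript cannot observe.

```javascript
// Hypothetical helper: split a chunk into lines and round-trip each line
// through a Buffer, exactly as the workaround in this comment does.
function splitLinesDetached(chunk) {
  const lines = chunk.split(/\r\n|\r|\n/);
  for (let i = 0; i < lines.length; i++) {
    // Buffer.from() encodes the line as UTF-8 bytes; toString() decodes
    // those bytes into a newly created string.
    lines[i] = Buffer.from(lines[i]).toString();
  }
  return lines;
}

const lines = splitLinesDetached("first\r\nsecond\rthird\nfourth");
console.log(lines.length, lines);
```

One caveat with this approach: the UTF-8 round-trip replaces unpaired surrogates with U+FFFD, so the output is only guaranteed to equal the input for well-formed strings.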

@bnoordhuis
Member

Not sure I understand what your question is. A sliced string itself is a small object, it's a pointer to the parent string + offset and length. In that respect you shouldn't worry about seeing them show up in heap snapshots.

Slices do, however, prevent the parent string from being reclaimed by the garbage collector. If that is your concern, see if node --nostring_slices makes a difference.

@alexdima
Author

alexdima commented Jul 7, 2017

@bnoordhuis
Thank you for the --nostring_slices tip.

However, I don't want to disable the usage of sliced strings in the entirety of VS Code (for 99% of the code base, I fully agree, they are indeed ignorable from a memory usage point of view).

But I would like to avoid them in a specific place, when constructing a file in VS Code, so I would need a localized solution.

As with any small number, when multiplied by a large number, it yields impressive results. Avoiding sliced strings saves 36MB for a file with more than 3 million lines.

[Screenshot attached: screen shot 2017-07-07 at 11 02 17]

I was wondering if there is something more efficient than Buffer.from(lines[i]).toString(), as this might end up copying the memory twice?

Thank you!
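
A rough way to compare the two approaches is to measure the heap before and after building the line array. The sketch below is an assumption-laden illustration, not the measurement behind the 36MB figure: `heapDelta` is a hypothetical helper, the input is synthetic, and the numbers are noisy unless Node is started with --expose-gc so that a collection can be forced before each reading.

```javascript
// Hypothetical helper: run `build` and report how much heapUsed changed.
function heapDelta(build) {
  if (globalThis.gc) globalThis.gc(); // only defined when run with --expose-gc
  const before = process.memoryUsage().heapUsed;
  const result = build();
  if (globalThis.gc) globalThis.gc();
  const after = process.memoryUsage().heapUsed;
  // Return the result as well so the caller keeps it alive.
  return { result, bytes: after - before };
}

// Synthetic chunk standing in for a decoded file chunk.
const chunk = "some line of text\n".repeat(50000);

const plain = heapDelta(() => chunk.split(/\r\n|\r|\n/));
const copied = heapDelta(() =>
  chunk.split(/\r\n|\r|\n/).map((line) => Buffer.from(line).toString())
);

console.log("split only (bytes):  ", plain.bytes);
console.log("split + copy (bytes):", copied.bytes);
```

A heap snapshot remains the more reliable tool for this kind of comparison, since it shows the string representations directly.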

@bnoordhuis
Member

Right, I see. There are a number of operations that flatten strings - array.join(), string.charAt(), etc. - but only under specific circumstances and the specifics change over time.

node --allow_natives_syntax gives you access to the %FlattenString(s) intrinsic, but that's an implementation detail. Buffer.from(s).toString() is probably the most stable approach in the long term.

@addaleax
Member

addaleax commented Jul 7, 2017

There’s also https://github.com/davidmarkclements/flatstr. It relies on a side effect of Number(s): its argument gets flattened :)
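
The Number(s) trick mentioned here can be sketched as follows. The function name `flatten` is hypothetical and this is not the flatstr package's source; it only illustrates the idea of calling Number(s) for its side effect. The thread does not confirm that this detaches a sliced string from its parent, so it would need to be measured before replacing the Buffer round-trip.

```javascript
// Hypothetical helper illustrating the Number(s) side effect.
function flatten(s) {
  Number(s); // result is discarded; the call is made only for its effect on s
  return s;
}

// Build a string by repeated concatenation, the case flattening targets.
let rope = "";
for (let i = 0; i < 1000; i++) {
  rope += "chunk" + i + ";";
}

const flat = flatten(rope);
console.log(flat === rope); // true: the string value is unchanged
```

Because the internal representation is invisible to ordinary JavaScript, any benefit has to be confirmed with a heap snapshot or a benchmark rather than by inspecting the returned value.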

@alexdima
Author

alexdima commented Jul 7, 2017

Thank you! ❤️
