Using Large Pages (2M) in Node for Performance #16198
Comments
Possibly related: #11077 (fragmentation in large pages)
@joyeecheung this would not be transparent huge pages but explicit mapping.
@suresh-srinivas Yes, but if V8 integrates better with large pages (using …)
I don't think there is anything actionable right now. Node.js doesn't mmap memory itself; that's done by V8 and glibc on behalf of Node.js. Node.js could mmap some memory directly; for allocations that are released again on the same thread that would be a win. (Having said that: I experimented with that approach a few years ago and the results were inconclusive. YMMV, benchmark carefully.) With regard to V8, it is currently hard-coded to allocate in multiples of the page size up to 512 kB (that limit applies to executable memory in particular). Quite a bit of work would have to be done to remove the limit, and I'm not sure it would be well received because of the security implications mentioned in #11077.
@joyeecheung yes, V8 could use large pages for its JITted code pages. What I was thinking about was the node binary and all the DSOs it links against.
@bnoordhuis thanks. This issue is about remapping the `.text` segment.
@suresh-srinivas - that looks interesting to me, and it does no harm to experiment and see how it goes. These are r-xp sections that are rarely de-allocated or unmapped from the process, so could their presence in large pages reduce page misses by a large margin? Here is what the map looks like on a booted node; as you can see, node itself is the predominant code, followed by libstd. Do you propose to remap node's own pages alone, or everything?
Do you have PoC code that we can try integrating and running against some benchmarks?
@gireeshpunathil thanks. I was initially planning to remap node's own `.text` segment.
I'm +1 on at least having a PR we can use to experiment with the impact of this. It certainly does make sense; we would just need some solid benchmark results to show it is worth the effort.
I have an initial prototype working (with libhugetlbfs).
Not desired, it's LGPL (and it's not something we'd want as an external dependency.) |
Sorry this took so long. I got a chunk of time when it was snowing here in Portland, and I have completed an implementation under Linux that programmatically maps a subset of the Node.js text segment to 2M pages. It demonstrates a 4-5% performance improvement on a React-SSR workload, along with a reduction in ITLB misses and Front End bottleneck. More work is needed to programmatically choose between Explicit Huge Pages and Anonymous Huge Pages, and to check whether a sufficient number are present. @Trott @addaleax I just read your Medium article; thanks for all the work you do to help contributors to the Node project. I could use a mentor or two to help get this in. @uttampawar has kindly code-reviewed it and I incorporated most of his suggestions. @gireeshpunathil I now have a PoC, so if you want to try it out, let me know. @joyeecheung let me know if you want to try it as well. The algorithm is quite simple but the implementation was a bit tricky!
@suresh-srinivas - thanks. Please raise a PR if it is finalized and ready for review, or a PR with a [WIP] tag if it is in a half-baked state (for which design-level changes are anticipated).
ping @suresh-srinivas |
@gireeshpunathil thanks for checking in. The development is complete and the code is ready to send as a PR. @uttampawar and I are measuring performance and should be done by the weekend.
@joyeecheung @jasnell @gireeshpunathil @Trott @addaleax |
@suresh-srinivas should this remain open or was it addressed by #22079? |
Yes, this is addressed by PR #22079.
Across a couple of workloads (Node-DC-EIS and Ghost), I noticed that practically all the page walks are for 4K pages.
Here is a specific example from Node-DC-EIS (normalized per transaction) on a Xeon Platinum 8180 server.
This results in about 16% of cycles stalled in the CPU Front End performing page walks, per the TMAM methodology.
Several runtimes (the Java JVM, PHP, HHVM) have support for large pages. They allocate the hot static code segments and/or dynamic JIT code segments in large 2M pages. There is typically a several-percent performance improvement, depending on how many of the stall cycles go to page walks.
I wanted to have a discussion about what the community thinks of this. I would also be interested in seeing some more data from other workloads. The following perf commands are an easy way to get this data for your workload:

```
perf stat -e cpu/event=0x85,umask=0xe,name=itlb_misses_walk_completed/ -- sleep 30
perf stat -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed_4k/ -- sleep 30
perf stat -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_completed_2m_4m/ -- sleep 30
```
A simple implementation would start with mapping all the `.text` segment code into large pages (this would be about 20 lines of code on Linux), and it would work reasonably well on modern CPUs. On older CPUs (such as Sandy Bridge), which have only a one-level 2M TLB, this is not efficient; a more efficient implementation would map only the hot `.text` segment to large pages.