Process startup time on ARM #2138
also /cc @wolfeidau
I'd recommend starting by getting times for every phase of startup. @bnoordhuis, would that be possible using perf with the sampling interval set very low? Other than that, I can't think of anything besides patching the code to print the times.
Have you tried something as simple as running it in /dev/shm? Pretty much rsync the project over and give it a whirl. NOTE: this uses shared memory, so you probably want to do it on a newer ARM device like the Pi 2, which has more RAM.
@wolfeidau's suggestion is good. @trevnorris @rvagg I'd be interested to know what
@wolfeidau doesn't seem like disk i/o is a limiting factor on a RPi2, armv7:

```
/dev/shm/iojs -p process.versions  0.93s user 0.08s system 99% cpu 1.019 total
/dev/shm/iojs -p process.versions  1.00s user 0.07s system 99% cpu 1.074 total
/dev/shm/iojs -p process.versions  1.01s user 0.02s system 99% cpu 1.038 total
/dev/shm/iojs -p process.versions  0.96s user 0.02s system 99% cpu 0.988 total
/dev/shm/iojs -p process.versions  0.85s user 0.08s system 99% cpu 0.935 total
/dev/shm/iojs -p process.versions  1.00s user 0.03s system 99% cpu 1.040 total
/dev/shm/iojs -p process.versions  1.02s user 0.05s system 99% cpu 1.075 total

/usr/local/bin/iojs -p process.versions  1.00s user 0.05s system 99% cpu 1.051 total
/usr/local/bin/iojs -p process.versions  0.93s user 0.04s system 99% cpu 0.976 total
/usr/local/bin/iojs -p process.versions  1.01s user 0.06s system 99% cpu 1.080 total
/usr/local/bin/iojs -p process.versions  0.98s user 0.06s system 99% cpu 1.045 total
/usr/local/bin/iojs -p process.versions  0.96s user 0.04s system 99% cpu 1.004 total
/usr/local/bin/iojs -p process.versions  0.90s user 0.06s system 98% cpu 0.970 total
/usr/local/bin/iojs -p process.versions  0.99s user 0.06s system 99% cpu 1.052 total
```

```
$ dd if=/dev/zero of=test bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 44.3668 s, 9.2 MB/s
$ reboot  # to avoid caching
$ dd if=test of=/dev/null bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 21.2704 s, 19.3 MB/s
```
What about clustering some Pis? Just an idea, I don't have any experience with that.
The latest optimisation I've attempted is mounting the whole Jenkins build directory via NFS (from a shared SSD). It's not as dramatic as I'd hoped, a marginal speed increase perhaps, though overall test-run times do seem to have come down a little.

One hold-up, though, is that we're now getting consistent failures on Pi 2 builds (not Pi 1 B+, interestingly!) for both test-fs-readfilesync-pipe-large.js and test-fs-readfile-pipe-large.js, which sound NFS-related, e.g.: https://jenkins-iojs.nodesource.com/job/iojs+pr+arm/nodes=pi2-raspbian-wheezy/124/console. I'm either going to have to figure that out or revert the NFS introduction on pi2. Any help there would be appreciated.

@mathiask88 the next optimisation we have underway is actually doing that, in a simple form: making test.py able to (1) skip X initial tests and (2) run only every Yth test. Then we can cluster 2 or 3 Pis together and have them all build, but each run only every second or third test. @orangemocha has the initial work for this on his TODO list.

But given that the greatest cost is this process startup time, I'm hoping someone can help work that out.
I do this:
but
pretty sure that failure was because of
But I still get no samples for
Playing with
This will probably help: #2248
@rvagg any idea if this has improved at all?
I think we've had a minor improvement via some recent changes, but it's still too slow, and it's doing so much more on ARM than on x86.
With the latest commits that re-enable some previously broken V8 ARM optimizations, is there any improvement?
The non-scientific assessment is that process startup seems to have improved somewhat on ARM; test-run times are coming down with all of the various ways we are attacking this.
Awesome work. By the way, do we know how much of the slowness comes from V8, libuv or Node itself? I wonder what the ratio is there. (The strace discrepancy points in the direction of libuv- or V8-level issues, does it not?)
That's a good question @ronkorving. I would think V8 costs more to start up than libuv because of the amount of memory it tries to provision for itself ahead of time. But remember there's always high overhead in process startup; even pure C forking carries a big overhead with all the extra work we're asking of the kernel.
@rvagg If you can give me access to an ARM machine, I'll try some profiling; I may need root access. Just a hunch, but I speculate that dynamic linker overhead, OpenSSL entropy-pool seeding and V8 snapshot deserialization are the biggest cost centers.
@bnoordhuis
@rvagg I can't connect. The address is pingable and the port is open, but nothing is sent when I connect with
@bnoordhuis try again; also, maybe install screen/tmux on the pi if/when you get in, just in case it fails again. Last time I did this for someone it was flaky as well and I'm not sure why.
@rvagg Thanks, I'm done.
Interestingly enough, it's none of the above: the largest cost centers for a null program turn out to be parsing and code generation... our bootstrap code comes to 210 kB now. I can think of ways of minimizing that - mostly by splitting up src/node.js and some of the files in lib/ into smaller files and only loading what is necessary - but that's a fair amount of work. For the record, below are all the scripts that are loaded for an empty program:
lib/assert.js is probably the lowest-hanging fruit; it's only used by lib/module.js and lib/timers.js.
Only skimmed the thread so far, but very possibly this might be a good place to apply my V8 extras work, by precompiling code into the snapshot: https://bit.ly/v8-extras
I think that could work. Modules are basically IIFEs and don't execute code until first use. I played around with moving code around and splitting up files, but the results are only so-so for the non-trivial case: #2714
@nodejs/build are talking about how to improve test time on the ARM machines and we're tackling it on multiple fronts. However, by far the biggest challenge we're facing with the test suite is node process startup time, and this isn't something we can do much about from our end, so I'm wondering if others here (@bnoordhuis, @trevnorris, others?) can offer suggestions about if and how we might be able to improve things. On a Pi1 B+ (ARMv6):
Which explains why we can't get below ~2.5h for build+test (we've managed to squeeze the build part of that down to ~12m with aggressive caching strategies).
Then this:
vs this on an x64 machine:
Which seems like a huge discrepancy to me, even considering the architecture differences.
I can't recall the details of what happened with snapshots on ARM, but they are currently enabled by default, even on ARM:
Anyone want to take a look or suggest ways to investigate this further?