Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

16% startup regression after including GraalPy #8553

Closed
JaroslavTulach opened this issue Dec 15, 2023 · 19 comments
Closed

16% startup regression after including GraalPy #8553

JaroslavTulach opened this issue Dec 15, 2023 · 19 comments
Assignees
Labels
--low-performance -compiler p-high Should be completed in the next sprint

Comments

@JaroslavTulach
Copy link
Member

The stdlib startup benchmarks indicate startup time regression for:

  • Startup_empty_startup - 16%
  • Startup_hello_world_startup - 12%

The regression happened between Dec 1 and Dec 4 with the only two integrations being:

Given the #8432 is only a test change, the slowdown is probably related to #8351. Why? Python isn't used during the startup at all - e.g. there is no reason for it to slow startup down!

@JaroslavTulach JaroslavTulach changed the title 16% startup regression after including Python libraries 16% startup regression after including GraalPy Dec 15, 2023
@JaroslavTulach JaroslavTulach moved this from ❓New to 📤 Backlog in Issues Board Dec 20, 2023
@Akirathan Akirathan added the p-high Should be completed in the next sprint label Dec 21, 2023
@Akirathan
Copy link
Member

Akirathan commented Dec 27, 2023

The additional startup time seems to be caused by the JVM boot module layer initialization. To proof that, I have designed an experiment at https://gist.github.com/Akirathan/9c30e10ec525eb09c91c3c94e2c36f40 that repeatedly invokes java --module-path <component-directory> -m org.enso.runtime/EngineBootLoader --help (i.e. enso --help) that does not do any polyglot stuff, just loads a bunch of classes and prints help. There are some comments in the experiment explaining the results.

In other words, if we exclude graalpy-language-23.1.0.jar and its dependencies from --module-path, the startup time is 852 ms, if we include graalpy, the startup time is 1178 ms.

Gist
GraalPy startup experiment. GitHub Gist: instantly share code, notes, and snippets.

@Akirathan Akirathan moved this from 📤 Backlog to ⚙️ Design in Issues Board Dec 27, 2023
@Akirathan
Copy link
Member

Akirathan commented Dec 27, 2023

As noted in https://gist.github.com/Akirathan/9c30e10ec525eb09c91c3c94e2c36f40?permalink_comment_id=4808928#gistcomment-4808928, the total size of all the GraalPy Jars is 201 MB. All the GraalPy Jars that are additionally included are:

['bcpkix-jdk18on-1.76.jar', 'bcprov-jdk18on-1.76.jar', 'bcutil-jdk18on-1.76.jar', 'insight-tool-23.1.0.jar', 'llvm-api-23.1.0.jar', 'python-language-23.1.0.jar', 'python-resources-23.1.0.jar', 'truffle-nfi-23.1.0.jar', 'truffle-nfi-libffi-23.1.0.jar', 'xz-1.9.jar']

How to inspect what does JVM do with these Jars during boot? Please help @JaroslavTulach.

Gist
GraalPy startup experiment. GitHub Gist: instantly share code, notes, and snippets.

@enso-bot
Copy link

enso-bot bot commented Dec 28, 2023

Pavel Marek reports a new STANDUP for yesterday (2023-12-27):

Progress: - Experimenting and measuring JVM startup

Gist
GraalPy startup experiment. GitHub Gist: instantly share code, notes, and snippets.

@enso-bot
Copy link

enso-bot bot commented Dec 28, 2023

Pavel Marek reports a new STANDUP for today (2023-12-28):

Progress: - Format also Java sources in runtime/bench - #8638

@JaroslavTulach
Copy link
Member Author

How to inspect what does JVM do with these Jars during boot?

Please use --profiling-path option.

@JaroslavTulach
Copy link
Member Author

What happens during the startup? I am not sure, but one of the hypothesis is: Python unpacks its resources. If so, then the solution would be somewhere along this copyResources method and the properties it mentions polyglot.engine.resourcePath & co.

@Akirathan
Copy link
Member

Akirathan commented Dec 29, 2023

How to inspect what does JVM do with these Jars during boot?

Please use --profiling-path option.

--profiling-path is only usable when you actually run some Enso code. If you just run enso --help (or enso --version), the profiler will not even start, resulting in an empty profiling file. The thing is that the slowdown is in the JVM initialization, without any org.graalvm.polyglot.Context.

I have a strong suspicion that the slowdown happens before the main method is executed.

Experiments with -Xlog

I have decided to do additional experiments with a combination of -Xlog flags (https://docs.oracle.com/en/java/javase/21/docs/specs/man/java.html#enable-logging-with-the-jvm-unified-logging-framework) (All these experiments run just enso --help):

-Xlog:class+load=trace to look into all the loaded classes. There doesn't seem to be a significant difference between how many classes are loaded: 3011 vs 3116. No classes from any graalpy module nor its dependencies are loaded, obviously.

-Xlog:module+load=trace to look into the modules loading. Number of loaded modules differ, obviously. The duration of loading python-language, or python-resources, or any other graalpy-related module is negligible. The log files generated with something like:

java --module-path <comp-dir> -Xlog:module+load=trace:file=small
-modules.txt:uptime,timenanos,level,tags -m org.enso.runtime/org.ensoEngineRunnerBootLoader --help

are small-modules.txt and big-modules.txt.

In these log files, we can see that the JVM loads java.base module as the very first module and that there is a delay before the second module is loaded. This delay is different for small-modules and big-modules. In small-modules.txt, the uptime of the second loaded module is 0.333s, which in big-modules.txt, the uptime of the second loaded module is 0.555s. So what happens after java.base module is loaded and before the second module is loaded, how do we find that out? The difference is 220 ms!

@Akirathan
Copy link
Member

What happens during the startup? I am not sure, but one of the hypothesis is: Python unpacks its resources. If so, then the solution would be somewhere along this copyResources method and the properties it mentions polyglot.engine.resourcePath & co.

As mentioned in my previous comment, I am running these experiments with just enso --help. So no org.graalvm.polyglot.Context gets initialized. So no EnsoLanguage, neither PythonLanguage classes get loaded. The slowdown happens somewhere in JVM internals during boot, before main method is executed.

@JaroslavTulach
Copy link
Member Author

I am running these experiments with just enso --help.

The real benchmarks can be executed as:

./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/bin/enso --run test/Benchmarks/ Startup

If you want to run exactly what "empty_startup" benchmark runs, then do:

./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/bin/enso --run test/Benchmarks/src/Startup/Empty_World.enso

if you add --profiling-path slow.npss to the above command, you'll see what the JVM does. I am attaching two .npss files FastAndSlow.zip for your convenience. They differ by 400ms. Now the question is why and where do they differ?

@jdunkerley jdunkerley moved this from ⚙️ Design to 🔧 Implementation in Issues Board Jan 2, 2024
@JaroslavTulach
Copy link
Member Author

The slow.npss file seems to spend 100ms in JAR verification. Module constructor seems to be slower:

obrazek

This is probably caused by having some signed JAR in the system. The JDK tries to verify it and slows things down.

@Akirathan
Copy link
Member

The slow.npss file seems to spend 100ms in JAR verification. Module constructor seems to be slower:

obrazek

This is probably caused by having some signed JAR in the system. The JDK tries to verify it and slows things down.

I can confirm this observation. The signed JARs are bcpkix-jdk18on-1.76.jar, bcprov-jdk18on-1.76.jar and bcutil-jdk18on-1.76.jar. These JARs were added as separate modules in #8351. Moreover, org.enso.runtime module has declared dependency in

requires org.bouncycastle.provider;
because we use it for simple SHA1 digest computation in org.enso.interpreter.caches.ModuleCache. We already have a task to remove this dependency - #8362.

I have also tried to manually remove signatures from these Jar archives, but failed to do so. I have tried to remove checksum entries from META-INF/MANIFEST as well as *.DSA and *.SF files from META-INF, but enso still failed on ModuleInitializationError.

Previously (before #8351), the bouncycastle modules used to be assembled in runtime.jar fat jar, so there was no signature checking by the JVM.

@enso-bot
Copy link

enso-bot bot commented Jan 2, 2024

Pavel Marek reports a new STANDUP for today (2024-01-02):

Progress: - The cause of the slowdown is most-likely the introduction of signed Jar archives (org.bouncycastle).

@Akirathan Akirathan linked a pull request Jan 3, 2024 that will close this issue
3 tasks
@github-project-automation github-project-automation bot moved this from 🔧 Implementation to 🟢 Accepted in Issues Board Jan 4, 2024
@Akirathan Akirathan removed a link to a pull request Jan 4, 2024
3 tasks
@Akirathan Akirathan reopened this Jan 4, 2024
@github-project-automation github-project-automation bot moved this from 🟢 Accepted to ❓New in Issues Board Jan 4, 2024
@Akirathan Akirathan moved this from ❓New to ⚙️ Design in Issues Board Jan 4, 2024
@Akirathan
Copy link
Member

org.bouncycastle dependencies were removed in #8664, but did not provide any benchmarks improvements.

@enso-bot
Copy link

enso-bot bot commented Jan 4, 2024

Pavel Marek reports a new STANDUP for today (2024-01-04):

Progress: - Investigating another problems with benchmarks in #8668

@JaroslavTulach
Copy link
Member Author

if you add --profiling-path slow.npss to the above command, you'll see what the JVM does. I am attaching two .npss files FastAndSlow.zip for your convenience. They differ by 400ms. Now the question is why and where do they differ?

Here is a 3rd ZIP with t.npss file. It runs in 1800ms - e.g. just in the middle of FastAndSlow.zip. Looks like the #8664 fix helped to give us 50% of the time back. We still need to find the next 50%.

@JaroslavTulach
Copy link
Member Author

@Akirathan
Copy link
Member

Another startup initialization issue revealed: #8690. It's not related to GraalPy though.

@Akirathan
Copy link
Member

org.bouncycastle dependencies were removed in #8664, but did not provide any benchmarks improvements.

Edit: Removing org.bouncycastle helped the startup by 10%. Let's close this issue and continue with #8690.

@github-project-automation github-project-automation bot moved this from ⚙️ Design to 🟢 Accepted in Issues Board Jan 5, 2024
@enso-bot
Copy link

enso-bot bot commented Jan 5, 2024

Pavel Marek reports a new STANDUP for today (2024-01-05):

Progress: - language-server/test still reassembles runtime fat jar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
--low-performance -compiler p-high Should be completed in the next sprint
Projects
Archived in project
Development

No branches or pull requests

2 participants