Unstable Performance Among Some Java Test Implementations #5612

msmith-techempower opened this issue Apr 17, 2020 · 10 comments

@msmith-techempower commented Apr 17, 2020

I was troubleshooting what I believed to be a performance degradation in Gemini (and spent a lot of time doing so) before realizing that the problem is not in Gemini proper. This issue lays out all the information we have gathered.

For those unfamiliar, it is my pleasure to introduce the Framework Timeline which graphs the continuous benchmark results over time. This tool is great for illustrating the arguments that I will be laying out. This link is to the plaintext results for gemini.

The following is an annotated graph from gemini's Framework Timeline:

[annotated graph: gemini's plaintext Framework Timeline]

  1. ServerCentral hardware/network - everything is relatively stable
    0a. Dockerify #3292 was merged and the project was officially Dockerified
  2. There are several of these dips on the graph, but the graph is a combination of all environments, so these are actually the Azure runs, which are more modestly provisioned than the Citrine environment
  3. Migrated out of ServerCentral and started running continuously on Citrine on prem.
  4. Starts with a big dip, which is Azure, then is relatively stable but much lower than 2. After going through emails and chat messages, we believe this is due to applying the Spectre/Meltdown kernel patches (a quick mitigation-status check is sketched after this list).
  5. Ubuntu 16 is replaced with CentOS 7.6.1810 (Core), and I forgot to apply the Spectre/Meltdown kernel patches (side story: I was trying to get upgraded networking hardware working that ended up being unusable, so I was busy and had a great excuse)
  6. Unclear what this is - it does not appear to be low enough to be Azure runs and it aligns with some later bullets, so I'll discuss below.
  7. Our best guess is that this is a dip from Java 11 - Update Docker images to the jdk variant #4850, which changed the base image of many Java test implementations. The timing lines up almost exactly, though it is a bit of a mystery why moving from openjdk-11.0.3-jre-slim to openjdk-11.0.3-jdk-slim would have a performance impact. I also found an email chain wherein @nbrady-techempower confirmed that he once again applied the Spectre/Meltdown patches and an iptables rule from this.
  8. Nov 8, 2019 - last continuous run on CentOS - we then brought down the machines and began installing Ubuntu 18 LTS
  9. Nov 20, 2019 - first continuous run on Ubuntu LTS with Spectre/Meltdown kernel patches applied but not this iptables rule
  10. This is the high-water mark for gemini on Citrine (Ubuntu) - roughly 1.2M plaintext RPS
  11. This is the low-water mark for gemini on Citrine (Ubuntu) - roughly 700K plaintext RPS
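
As an aside (not from the runs themselves, just a standard Linux check): to confirm whether the Spectre/Meltdown mitigations mentioned in points 4, 5, 7, and 9 are actually active on a given host, the kernel exposes the status under /sys:

```bash
# Standard check, not specific to the benchmark environment:
# prints one line per known CPU vulnerability with its mitigation status.
grep . /sys/devices/system/cpu/vulnerabilities/*
```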

The following shows the data table for Servlet frameworks written in Java for Round 18, published July 9, 2019, which falls between numbers 6 and 7 on the above graph.

[Round 18 data table for Java Servlet frameworks]

Compare that with the data table for the same test implementations from the run completed on April 1, 2020, which is the last graphed day (as of this writing) on gemini's Framework Timeline.

[April 1, 2020 data table for the same Java Servlet test implementations]

This shows degradation across the board for Java applications, but some are impacted more than others.

For comparison, the following is servlet's plaintext Framework Timeline:

[servlet's plaintext Framework Timeline]

  1. The same dip we believe is due to the base Java image being changed in Java 11 - Update Docker images to the jdk variant #4850
  2. Nov 20, 2019 - first continuous run on Ubuntu LTS with the Spectre/Meltdown kernel patches applied; as I indicate with the horizontal line, it has been "relatively" stable, though the data tables above do show some degradation
  3. Azure runs

We merged some updates to Gemini today, including updating the Java base image to openjdk-11.0.7-slim, which should be the same as openjdk-11.0.7-jdk-slim. So, if there was some weirdness with openjdk-11.0.3-jdk-slim from #4850, the next run should show improved plaintext numbers for Gemini.
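
As a sanity check on the "should be the same" assumption (not something we have run yet, just a quick way to verify it), the two 11.0.7 tags can be compared directly, assuming both are pullable from Docker Hub:

```bash
# Assumption: both tags exist on Docker Hub. If they resolve to the same
# image ID, openjdk:11.0.7-slim and openjdk:11.0.7-jdk-slim are identical.
docker pull openjdk:11.0.7-slim
docker pull openjdk:11.0.7-jdk-slim
docker image inspect --format '{{.Id}}' openjdk:11.0.7-slim openjdk:11.0.7-jdk-slim
```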

However, that may be unrelated, so other tests I will probably do in the next hour or two:

- [ ] Downgrade tapestry to openjdk:11.0.3-jre-stretch, which was the version prior to #4850 (a rough sketch of these image swaps follows this list)
- [ ] Upgrade wicket to openjdk:11.0.7-slim, which would eliminate any question if both gemini and wicket improve
- [x] Verify that openjdk:11.0.3-jre-stretch and openjdk:11.0.3-jdk-stretch have the same underlying JRE (see below)
- [x] Verify that gemini's plaintext test is not leaking connections (see below)
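
For reference, a rough sketch of the first two items, assuming each test implementation's Dockerfile starts FROM an openjdk base image; the file paths below are illustrative, not necessarily the repo's actual layout:

```bash
# Illustrative only: swap the base image in each framework's Dockerfile.
# The paths and file names are assumptions, not the repo's actual layout.
sed -i 's|^FROM openjdk:.*|FROM openjdk:11.0.3-jre-stretch|' frameworks/Java/tapestry/tapestry.dockerfile
sed -i 's|^FROM openjdk:.*|FROM openjdk:11.0.7-slim|' frameworks/Java/wicket/wicket.dockerfile
```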

@msmith-techempower commented:

It turns out that openjdk:11.0.3-jre-stretch and openjdk:11.0.3-jdk-stretch are not the same underlying JRE (thanks to @nbrady-techempower for finding these):

openjdk:11.0.3-jre-stretch:
[screenshot of the JRE shipped in openjdk:11.0.3-jre-stretch]

openjdk:11.0.3-jdk-stretch:
[screenshot of the JRE shipped in openjdk:11.0.3-jdk-stretch]
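
For anyone who wants to reproduce the comparison locally (assuming the 11.0.3 tags are still pullable), printing the Java runtime each image ships is enough to see the difference:

```bash
# Print the exact Java runtime each base image ships.
docker run --rm openjdk:11.0.3-jre-stretch java -version
docker run --rm openjdk:11.0.3-jdk-stretch java -version
```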

@msmith-techempower commented:

gemini is not leaking connections in its plaintext test (thanks to @michaelhixson for finding this):

Repro steps:

1. `tfb --test gemini --mode debug`
2. `docker ps | grep gemini` to find the container id
3. `docker exec -it <container-id> bash`
4. Inside the gemini-mysql container, run `watch 'ss -tan | wc -l'` to continuously print the total number of connections
5. From another bash session on the host, run `docker run --rm techempower/tfb.wrk wrk -H 'Host: host.docker.internal' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 512 --timeout 8 -t 8 "http://host.docker.internal:8080/update?queries=20"`
6. In the terminal running the watch command, watch the number of connections climb continuously
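
A possible refinement of step 4 (an assumption on my part, not part of the original repro): restrict the count to established connections on MySQL's port, so wrk's own HTTP connections don't inflate the number:

```bash
# Count only ESTABLISHED TCP connections involving port 3306 (MySQL).
# Like the original command, the ss header line adds one to the count.
watch "ss -tn state established '( sport = :3306 or dport = :3306 )' | wc -l"
```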

@joanhey commented Apr 20, 2020

The same pattern exists in PHP and nginx, and perhaps in more languages.

https://tfb-status.techempower.com/timeline/php/plaintext
https://tfb-status.techempower.com/timeline/nginx/plaintext

And there is a big drop on June 18, 2019.
I thought it was due to the CVE-2019-1147x patches (but those only apply to Microsoft systems).

The Framework Timeline is a really good tool 👋. It would be even better with annotation marks for the big environment changes in the benchmark.

@joanhey commented Apr 20, 2020

I have been investigating a strange problem for some time.
After checking the Timeline, curiously, it also starts on June 18, 2019.

The problem

In the last runs, kumbiaphp-raw is slower than kumbiaphp with the ORM. That does not make any sense, and I think it also affects plain PHP.

| Fortunes Test | Round 18 | Current runs |
| --- | ---: | ---: |
| PHP | 129,288 | 95,832 |
| Kumbiaphp raw | 90,377 | 73,245 |
| Kumbiaphp orm | 76,710 | 73,752 |

https://tfb-status.techempower.com/timeline/php/fortune
https://tfb-status.techempower.com/timeline/kumbiaphp-raw/fortune
https://tfb-status.techempower.com/timeline/kumbiaphp/fortune

It should be impossible for the raw version to be slower than the ORM version, yet that is the case in all the runs after June 18.

I was thinking it was a bad PHP stack config, but after reading this issue I think it may be a problem with the benchmark stack.
I'll investigate this problem further.

@msmith-techempower commented:

@joanhey Below is the graph for Kumbiaphp, for reference, and it does indeed show that dip on June 18, 2019. Curiously, it seems to recover on Nov 20, 2019.
[kumbiaphp Framework Timeline graph]

@msmith-techempower commented:

I have edited the original post to indicate that on Jun 18, 2019, @nbrady-techempower applied the Spectre/Meltdown kernel patches, and we believe that those account for the dip.

@joanhey commented Apr 20, 2020

Yes, it recovers on Nov 20, like plain PHP, but I can't understand the reason.
There were no changes in the nginx config or PHP code, and no new minor versions (PHP 7.3.x or nginx).
In Jan 2020 we moved to PHP 7.4, and we can see a small rise.

Curiously, nginx alone drops on Nov 20, 2019.

@msmith-techempower commented:

I believe we have an answer to that now.

Nov 20 is when we switched back from CentOS to Ubuntu, and we did not apply [this iptables rule](https://news.ycombinator.com/item?id=20205566), which was previously applied on the CentOS install.

The dip from Jun 18 to Nov 20 appears to be directly related to that particular rule being in place.
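
For completeness, and purely as an assumption on my part: if the rule in question is the usual raw-table NOTRACK (conntrack bypass) rule used on high connection-rate benchmark hosts, you can check whether anything like it is currently loaded with:

```bash
# Assumption: the linked rule is a raw-table NOTRACK (conntrack bypass) rule.
# List the raw table and count NOTRACK entries; 0 means none are applied.
sudo iptables -t raw -S
sudo iptables -t raw -S | grep -c NOTRACK
```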

@joanhey commented Apr 20, 2020

I think there should be a timeline with all these changes in one place.

A chronological history of the environment changes on a web page.

@NateBrady23 commented:

TechEmpower/tfb-status#21 Yes, I want that.
