
Functional Sanity JDK10 Linux s390x tests suddenly take 7 hours #2329

Closed
AdamBrousseau opened this issue Jul 5, 2018 · 27 comments

@AdamBrousseau
Contributor

AdamBrousseau commented Jul 5, 2018

First observed on June 28 in OMR build 574
Test before: https://ci.eclipse.org/openj9/job/Test-Sanity-JDK10-linux_390-64_cmprssptrs/200/
Test after: https://ci.eclipse.org/openj9/job/Test-Sanity-JDK10-linux_390-64_cmprssptrs/201/
Typical build time:
Compile test material: 10min
Sanity functional tests: 1.5hrs
After regression:
Compile test material: 1hr
Sanity functional tests: 6hrs

Diff between builds 573 and 574
OpenJ9:
693fe84...be52aeb
No OMR diff between builds
PRs merged

(Crossing off the PRs that have been ruled out)
Also affects PR builds
https://ci.eclipse.org/openj9/job/PullRequest-Sanity-JDK10-linux_390-64_cmprssptrs-OpenJ9/

@AdamBrousseau
Contributor Author

Running a revert of #2245 here

@llxia
Contributor

llxia commented Jul 5, 2018

It looks like all of the cmdline tests are affected. For example:

cmdLineTester_XcheckJNI_0: changed from 8mins to 50mins
cmdLineTester_SCURLClassLoaderNPTests_SE100_1: changed from 9mins to 48mins
cmdLineTester_SCURLClassLoaderTests_1: changed from 9mins to 49mins

@DanHeidinga
Member

Were any changes made to the machine configuration? Either by (re)running the ansible scripts or even at the machine provider level?

Does rerunning Jenkins build 200 have the same good perf it had before?

@llxia
Contributor

llxia commented Jul 5, 2018

I think the SDK and tests are fine; it is a machine configuration issue. I reran cmdLineTester_XcheckJNI_0 on a lab machine with one of the "bad" SDKs (https://ci.eclipse.org/openj9/job/Build-JDK10-linux_390-64_cmprssptrs/246/artifact/OpenJ9-JDK10-linux_390-64_cmprssptrs-201805070321.tar.gz), and the test execution time was normal (~8mins).

@AdamBrousseau
Contributor Author

I disabled the PR build until we can fix this.

@llxia
Contributor

llxia commented Jul 5, 2018

Further tested a full sanity.functional build, which used the latest SDK: https://ci.eclipse.org/openj9/job/Build-JDK10-linux_390-64_cmprssptrs/247/artifact/OpenJ9-JDK10-linux_390-64_cmprssptrs-201805071103.tar.gz

The build took only 1hr39mins to complete.

@smlambert
Contributor

Related to Dan's question, is there a log of configuration activity (given the smaller set of people with access to the machines, this should be easier to accomplish) or an ansible schedule that can shine a light on this?

If not, it would be good to institute one, putting as much transparency on machine-layer changes as possible.

@llxia
Contributor

llxia commented Jul 5, 2018

fyi @jdekonin

@AdamBrousseau
Contributor Author

Rebuilt the last "good" levels here
Tested here
Rolled back the default gcc version (which we upgraded last week) on ub16-390-1; running a test with the same sdk here

@AdamBrousseau
Contributor Author

Definitely not a code change issue.
It also doesn't seem to be related to gcc, as the rollback of gcc followed by a build & test didn't change the perf.
@jdekonin is going to look at the logs to see what else was updated with the gcc7 install and the apt upgrade.

@jdekonin
Contributor

jdekonin commented Jul 9, 2018

I haven't been able to successfully reboot with the old kernel. zLinux doesn't use grub; it uses zipl as its bootloader. I've followed the basic instructions, but the machine just will not reboot with another kernel specified. At least, not through the reboot command that sudo has access to, which reboots the instance in under 10sec. I think this needs to be rebooted from the OpenStack host.

@mstoodle @AdamBrousseau do either of you recall how this can be done on our zLinux machines?

@mstoodle
Contributor

@joransiu helped get these machines; maybe he has the requisite abilities?

@pshipton
Member

pshipton commented Jul 10, 2018

I expect this problem is an aspect of the problem being discussed in #1888: slow startup related to Java 9 and later setting -Xmx to 25% of the physical memory on the machine by default, vs Java 8, which uses a default of 512MB.
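For anyone wanting to verify that default on a given machine, a minimal sketch (the class name MaxHeap is just illustrative): run it with no options to see the ergonomic default, then again with an explicit -Xmx to compare.

```java
// MaxHeap.java -- prints the JVM's effective maximum heap size.
// Compare: `java MaxHeap` (ergonomic default) vs `java -Xmx512m MaxHeap`.
public class MaxHeap {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Effective max heap: %d MB%n", maxBytes / (1024 * 1024));
    }
}
```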

@charliegracie
Contributor

@pshipton that change for Java 9 has existed for months, so I doubt it is actually the cause here. It may be related if something in the kernel changed that causes the port library to exhibit the same behaviour as the other issues. The new Linux kernel is likely causing a few different problems here, so let's make sure we figure out all of them.

@pshipton
Member

@jdekonin mentioned creating an internal machine with the same kernel level which doesn't exhibit the same slowness, so it's not necessarily the kernel change that caused the slowdown.

The bottom line seems to be that the machines changed and caused JVM memory allocation to get really slow. While perhaps we could figure out what changed and revert the machines (which is problematic at this time), we should improve the memory allocation to avoid others hitting the same issue.

@AdamBrousseau
Contributor Author

FWIW, the internal machine we created to test this (where the sdk runs fine) is Ubuntu 16.04.4 with kernel 4.4.0-130-generic.
The OpenJ9 Jenkins zLinux machines are 16.04.4 with kernel 4.4.0-128-generic.

@pshipton pshipton added this to the Release 0.10.0 milestone Jul 13, 2018
@pshipton pshipton removed this from the Release 0.10.0 milestone Jul 17, 2018
@pshipton pshipton added this to the Release 0.9.0 milestone Jul 17, 2018
@pshipton
Member

One of the problems is fixed by eclipse-omr/omr#2743; however, there is still a problem outstanding. The QUICK memory allocation algorithm can fail to find a suitable candidate, and it then falls back to a brute-force search, which also won't find any suitable memory and can be very slow.
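To make the failure mode concrete, here is a minimal generic sketch of the two-phase search described above (the names quickSearch/bruteForceSearch/tryReserve are hypothetical, not OMR's actual code): when no suitable memory exists anywhere, the quick pass fails fast, but the fallback still walks the entire range before giving up, which is where the slowness comes from.

```java
import java.util.List;

// Illustrative sketch of the two-phase allocation search described above;
// names and structure are hypothetical, not OMR's implementation.
final class AllocSketch {
    // QUICK phase: probe a small set of likely candidate addresses.
    static long quickSearch(List<Long> candidateAddrs, long size) {
        for (long addr : candidateAddrs) {
            if (tryReserve(addr, size)) return addr;
        }
        return -1; // no candidate worked
    }

    // Fallback: brute-force walk of the whole range in fixed steps.
    // When no suitable memory exists anywhere, this still visits every
    // step before failing -- the source of the observed slowness.
    static long bruteForceSearch(long rangeStart, long rangeEnd, long size, long step) {
        for (long addr = rangeStart; addr + size <= rangeEnd; addr += step) {
            if (tryReserve(addr, size)) return addr;
        }
        return -1;
    }

    static long allocate(List<Long> candidates, long start, long end, long size, long step) {
        long addr = quickSearch(candidates, size);
        return (addr != -1) ? addr : bruteForceSearch(start, end, size, step);
    }

    // Stand-in for an OS reservation attempt (e.g. mmap at a hinted address).
    static boolean tryReserve(long addr, long size) {
        return false; // placeholder: always fails, modelling the "no suitable memory" case
    }
}
```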

@pshipton
Member

pshipton commented Jul 25, 2018

@jdekonin
Contributor

That looks promising, as compiling the test material only took 6 mins instead of the recent 1hr plus. Testing appears to be going quickly as well.

@pshipton
Member

The whole build took about 1.5 hours.

@DanHeidinga
Member

This is a great result! I'll admit I was skeptical this would address the regression so I'm very pleased to see it resolved.

Thanks to everyone for all the work tracking this down!

@keithc-ca
Contributor

> I disabled the PR build until we can fix this.
@AdamBrousseau can you please re-enable the PR build?

@AdamBrousseau
Contributor Author

Done. I assume this can be closed now.

@keithc-ca
Contributor

Thanks, Adam.

@pshipton
Member

For the record, eclipse-openj9/openj9-omr#12 merged eclipse-omr/omr#2796 to the v0.9.0-release branch.
