Restore failed on OCP Power9 for Java 21 #540
Comments
FYI, this is being discussed in the private instanton Slack channel with the Java/Semeru team. Depending on the discussion, this might get moved elsewhere: https://ibm-cloud.slack.com/archives/C03MR7EC3NG/p1715103170419049
We tested several scenarios to determine where the problem comes from.
1. Take the checkpoint and restore it on a P9 image; if successful, perform a restore on an OCP-P9 cluster.
The restore was successful on the P9 VM but failed on the OCP-P9 cluster.
2. Take the checkpoint and restore it on a P10 image; if successful, perform a restore on an OCP-P9 cluster.
The checkpoint on the P10 machine initially failed with this error.
For the regression scenario of taking a checkpoint on a P10 image and restoring it, the InstantOn team asked us to try the getsebool virt_sandbox_use_netlink command and, if that returns 0, to run the command below. We ran it to get the checkpoint working on a P10 machine.
This time the restore was successful on the P10 image, but the restore on the OCP-P9 image failed, with a different error.
ebuy-instanton-1-76df794d95-zfl99-app.log
3. Change the application that the checkpoint is performed on and restore on an OCP-P9 cluster.
This time we deployed a different application to see whether the error was related to the app. The restore was successful on the OCP-P9 cluster this time around.
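The SELinux boolean check mentioned in scenario 2 could be scripted roughly as follows. This is only a sketch: it assumes the usual `getsebool` output format (`name --> on` or `name --> off`) and reads the thread's "returns 0" as the boolean being off. The function names are made up for illustration, and the command that was actually run to change the setting is not shown in this thread, so it is not reproduced here.

```python
import subprocess


def parse_getsebool(output: str) -> dict[str, bool]:
    """Parse `getsebool` output lines of the form 'name --> on|off'."""
    result = {}
    for line in output.splitlines():
        name, sep, state = line.partition("-->")
        if not sep:
            continue  # not a boolean report line
        # Only the first token after the arrow is the current state.
        result[name.strip()] = state.split()[:1] == ["on"]
    return result


def netlink_boolean_enabled() -> bool:
    """Query the host; requires the SELinux userspace tools to be installed."""
    proc = subprocess.run(
        ["getsebool", "virt_sandbox_use_netlink"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_getsebool(proc.stdout).get("virt_sandbox_use_netlink", False)
```

Keeping the parsing separate from the subprocess call makes it possible to check the logic on a machine that does not have SELinux installed.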
Old checkpoint images that were taken using WLO 24.0.0.3 are restoring fine on the OCP-P 4.15 cluster for eBuy.
A new image was built using the 24.0.0.3 driver. The restore was successful on the on-prem VM but failed on the OCP instance.
On Power Linux platforms, OpenJ9 loads a special file called systemcfg. On systems that have SELinux set to Enforcing mode, access to this file is blocked. This can be seen in the audit log:
This problem is happening because the checkpoint was taken on a system where access to this file was not blocked (Ubuntu 22.04, SELinux either not installed, disabled, or set to Permissive mode) and the restore was attempted on a system where access is blocked (OCP 4.15, SELinux set to Enforcing mode).
The older checkpoint image taken with the old 24.0.0.3 driver can still be restored on OCP because the JVM did not load the systemcfg file. That checkpoint was taken on a RHEL 9 system, not an Ubuntu system, and that system likely had SELinux set to Enforcing mode, which prevented us from loading systemcfg in the first place. That checkpoint can be restored on OCP without any problem because we don't try to reload systemcfg. This is also why taking a new checkpoint on Ubuntu with the old 24.0.0.3 driver still fails.
If we set SELinux to Enforcing mode on Ubuntu, or take the checkpoint on a RHEL 9 machine, it will likely restore fine on OCP. Alternatively, disabling SELinux or setting it to Permissive mode on OCP allows the Ubuntu checkpoint to restore (I tested this scenario).
My conclusion is that this is not a regression, just a combination of checkpoint host and restore host that was not previously tested. All of the previous builds should show the same problem. Since the systemcfg file is not necessary we can probably just avoid loading it in the first place if
After setting SELinux to Permissive mode, we ran three scenarios to verify that things worked properly.
Customers won't want to disable SELinux on their OCP cluster because that isn't best practice. What would be the recommended approach for a customer when building a checkpoint image?
If users use SELinux Enforcing on the checkpoint side they can also keep it on the restore side. On Ubuntu, SELinux is typically not used or is used in Permissive mode, so in this case my advice would be to either set up SELinux Enforcing on Ubuntu, or build checkpoint images on RHEL, where SELinux Enforcing is the default. We can improve this particular case in future JVM releases, but this kind of issue will probably come up again for other reasons.
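As a rough illustration of the advice above, a build pipeline could record the SELinux mode of the checkpoint host and flag the combination that failed in this thread. This is a minimal sketch, not part of Liberty or OpenJ9: the function names are hypothetical, and it assumes selinuxfs is mounted at `/sys/fs/selinux`, where the `enforce` file reads `1` in Enforcing mode and `0` in Permissive mode. Inside a container that mount may be absent even when the host is Enforcing, so the check is best run on the host itself.

```python
from pathlib import Path


def selinux_mode(selinuxfs: str = "/sys/fs/selinux") -> str:
    """Return 'Enforcing', 'Permissive', or 'Disabled' for this host."""
    enforce = Path(selinuxfs) / "enforce"
    if not enforce.exists():
        # No selinuxfs mount: SELinux is disabled or not installed.
        return "Disabled"
    return "Enforcing" if enforce.read_text().strip() == "1" else "Permissive"


def restore_at_risk(checkpoint_mode: str, restore_mode: str) -> bool:
    """True for the host pairing reported as failing in this thread:
    checkpoint taken without Enforcing, restore attempted under Enforcing."""
    return checkpoint_mode != "Enforcing" and restore_mode == "Enforcing"
```

For example, `restore_at_risk(selinux_mode(), "Enforcing")` evaluated on the checkpoint host would warn before an image is shipped to an Enforcing OCP cluster.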
Thanks @ymanton! We still need to figure out why restore is failing on OCP when the images are built on a P10 machine, which runs RHEL and seems to have SELinux enabled. Do we know why those are failing? For example, 24003 images built earlier versus built now. Do you think we are somehow disabling SELinux when we take the checkpoint again on 24003 on the RHEL-based P10 machine?
The problem that's seen when we take a checkpoint on a P10 RHEL machine and fail to restore on OCP is actually a crash late in the restore process. The evidence can be seen in the kernel log:
The above are Java threads that CRIU attempted to restore, but they hit errors in the kernel when CRIU attempted to run them and the process was terminated.
Thanks for the clarification @ymanton. Would we classify this as an application issue or a kernel issue?
Unlikely to be a kernel bug, probably a problem in CRIU and the way we use it. We've seen problems like this before and fixed or worked around them in CRIU, the JVM, or Liberty. I expect something like that here, once I figure out exactly what's going wrong and why.
The crashes seen when checkpointing on P10 and restoring on OCP (machines are P9) are caused by P10-specific libraries being loaded at checkpoint time. These libraries will cause crashes when the process is restored on a P9 machine. This problem is already fixed in Liberty images; see ci.docker/releases/latest/kernel-slim/helpers/build/checkpoint.sh, lines 9 to 11 in 1089dce.
However it looks like there is a custom
The location of the crashes seems to be in the signal handler itself, so that's why we don't get javacore output or core files; instead the thread crashes, which causes a SIGILL, which crashes again during signal handling, which causes another SIGILL to be sent, and so on, eventually overflowing the stack and causing a crash in the kernel signal handling code. If we update the patched
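One way to check whether a JVM has CPU-specific libraries mapped before a checkpoint is taken is to scan its memory map. The sketch below is illustrative only: the thread does not name the P10-specific libraries, so it assumes they are loaded from a glibc `glibc-hwcaps/power10` directory, and the function name is hypothetical.

```python
def hwcaps_libraries(maps_text: str, level: str = "power10") -> list[str]:
    """Return mapped file paths under a glibc-hwcaps/<level>/ directory.

    maps_text is the content of /proc/<pid>/maps, whose lines have the form
    'address perms offset dev inode pathname'.
    """
    marker = f"/glibc-hwcaps/{level}/"
    found = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5]
        if marker in path and path not in found:
            found.append(path)
    return found
```

Running it against the text of `/proc/<pid>/maps` for the JVM process on the checkpoint host and getting a non-empty list would suggest that the image may not restore on an older processor.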
A serviceability defect, OpenLiberty/open-liberty#29240, was opened to provide better error messages for restore on SELinux.
Java Defect: eclipse-openj9/openj9#19511
Hello team, we're on Liberty 24.0.0.5 and Java 21. We did a checkpoint on an EBC P9 VM and a restore on an OCP P9 VM, which failed with the error below.
The lscpu from OCP Power9
The result of uname -r on both machines
To get more trace we set the CRIU logging level to 4 and I included the output of the log file.
ebuy-instanton-1-597864cdcd-fk8lz-app.log