Restore failed on OCP Power9 for Java 21 #540

abdulmateen-1 · 2024-05-07T20:38:50Z

Java Defect: eclipse-openj9/openj9#19511

Hello team, we're on Liberty 24.0.0.5 and Java 21. We did a checkpoint on an EBC P9 VM and did a restore on an OCP P9 VM which failed failed with the error below.

Found mounted TLS certificates, generating keystore
CWWKE0964E: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. The server did not launch because checkpoint restore recovery is disabled.
Warn (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
pie: 1023: Error (criu/pie/restorer.c:1676): Can't restore 0x78f367cf0000 mapping with 0xfffffffffffffff3
pie: 1023: Error (criu/pie/restorer.c:2102): Restorer fail 1023
Error (criu/cr-restore.c:2547): Restoring FAILED.

The lscpu from OCP Power9

Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           8
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7

The result of uname -r on both machines

OCP
sh-4.4# uname -r
5.14.0-284.59.1.el9_2.ppc64le

P9VM
5.15.0-97-generic

To get more trace we set the CRIU logging level to 4 and I included the output of the log file.
ebuy-instanton-1-597864cdcd-fk8lz-app.log

The text was updated successfully, but these errors were encountered:

leochr · 2024-05-08T13:37:16Z

FYI, this is being discussed in the private instanton Slack channel with Java/Semeru team. Depending on the discussion, this might get moved elsewhere: https://ibm-cloud.slack.com/archives/C03MR7EC3NG/p1715103170419049

abdulmateen-1 · 2024-05-10T19:57:36Z

We tested different Test case scenarios to properly determine the problem is from

1. Take the checkpoint and restore on a P9 image if successful, perform a restore on an OCP-P9 cluster

The restore was successful on the P9 VM machine but failed on the OCP-P9 cluster.

(00.059541) pie: 1025: 	mmap(0x7a1183240000 -> 0x7a1183250000, 0x3 0x12 398)
(00.059548) pie: 1025: 	mmap(0x7a1183250000 -> 0x7a11832b0000, 0x7 0x12 399)
(00.059554) pie: 1025: 	mmap(0x7a11832b0000 -> 0x7a11832c0000, 0x3 0x12 399)
(00.059560) pie: 1025: 	mmap(0x7a11832c0000 -> 0x7a11832d0000, 0x3 0x12 399)
(00.059567) pie: 1025: 	mmap(0x7a11832d0000 -> 0x7a11833a0000, 0x3 0x32 -1)
(00.059573) pie: 1025: 	mmap(0x7a11833a0000 -> 0x7a11833b0000, 0x0 0x32 -1)
(00.059578) pie: 1025: 	mmap(0x7a11833b0000 -> 0x7a11833f0000, 0x3 0x32 -1)
(00.059583) pie: 1025: 	mmap(0x7a11833f0000 -> 0x7a1183400000, 0x1 0x11 400)
(00.059717) pie: 1025: Error (criu/pie/restorer.c:1676): Can't restore 0x7a11833f0000 mapping with 0xfffffffffffffff3
(00.059728) pie: 1025: Error (criu/pie/restorer.c:2102): Restorer fail 1025
(00.059965) Error (criu/cr-restore.c:2547): Restoring FAILED.

restore.log
ebuy-instanton-1-76df794d95-bxf4l-app.log

2. Take the checkpoint and restore on a P10 image if sucessfuly, perform a restore on an OCP-P9 cluster

The checkpoint on the P10 machine initially failed with this error.

[ERROR   ] CWWKE0701E: bundle com.ibm.ws.tx.embeddable.jakarta:1.0.89.cl240620240507-2000 (455)[com.ibm.tx.jta.embeddable.impl.EmbeddableTMHelper(314)] : The shutdown method has thrown an exception java.lang.NullPointerException: Cannot invoke "com.ibm.tx.jta.impl.TxRecoveryAgentImpl.stop(boolean)" because "com.ibm.tx.jta.util.TxTMHelper._recoveryAgent" is null
	at com.ibm.tx.jta.util.TxTMHelper.shutdown(TxTMHelper.java:545)
	at [internal classes]
  
[AUDIT   ] CWWKE0036I: The server defaultServer stopped after 8.602 seconds.
CWWKE0962E: The server checkpoint request failed. The following output is from the CRIU /logs/checkpoint/checkpoint.log file that contains details on why the checkpoint failed.
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn  (compel/src/lib/infect.c:133): Unable to interrupt task: 1117 (Operation not permitted)
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Error (criu/sockets.c:436): sockets: Sockets (family 1 proto 0) are not collected
Error (criu/sk-unix.c:351): unix: Unix socket 1351736 not found
Error (criu/cr-dump.c:1669): Dump files (pid: 1023) failed with -1
Error (criu/cr-dump.c:2093): Dumping FAILED.
CWWKE0963E: The server checkpoint request failed because netlink system calls were unsuccessful. If SELinux is enabled in enforcing mode, netlink system calls might be blocked by the SELinux "virt_sandbox_use_netlink" policy setting. Either disable SELinux or enable the netlink system calls with the "setsebool virt_sandbox_use_netlink 1" command.

Following the request by InstantOn team, On a regression scenario to taking checkpoint on a P10 image and restore: Try the getsebool virt_sandbox_use_netlink command, if that returns 0
do setsebool, virt_sandbox_use_netlink 1
and try the checkpoint again.

We ran the below command to get it working on a P10 machine.

getsebool virt_sandbox_use_netlink command, if that returns 0 
do setsebool virt_sandbox_use_netlink 1 and try the checkpoint again

this time the restore was successful on the P10 image, but the restore on OCP-P9 image failed but with a different error.

(00.143171) 1023 (native) is going to execute the syscall 91, required is 172
(00.143190) 1023 was trapped
(00.143200) 1023 (native) is going to execute the syscall 172, required is 172
(00.143225) 1023 was stopped
(00.143262) 1023 was trapped
(00.143274) 1023 (native) is going to execute the syscall 91, required is 91
(00.143335) 1023 was stopped
(00.143352) Run late stage hook from criu master for external devices
(00.143359) restore late stage hook for external plugin failed
(00.143367) Run late stage hook from criu master for external devices
(00.143374) restore late stage hook for external plugin failed
(00.143381) Running pre-resume scripts
(00.146642) Error (criu/cr-restore.c:2146): Unable to detach 1077: No such process
(00.146674) Error (criu/cr-restore.c:2520): Killing processes because of failure on restore.
The Network was unlocked so some data or a connection may have been lost.
(00.147040) Error (criu/cr-restore.c:2547): Restoring FAILED.

ebuy-instanton-1-76df794d95-zfl99-app.log

3. Change the application the checkpoint is performed on and restore on an OCP-P9 cluster.

This time we tried to deploy a different application, to see if the error had to do with the app. The restore was successful on the OCP-P9 cluster this time around.
ebuy-instanton-1-76df794d95-bxf4l-app.log

mtamboli · 2024-05-13T19:07:30Z

Old checkpoint images which were taken using WLO 24.0.0.3 are getting restored fine on OCP -P 4.15 cluster for eBuy.

abdulmateen-1 · 2024-05-13T20:14:51Z

A new image was built using 24.0.0.3 driver. The restore was successful on the on-prem VM but failed on the OCP instance.
ebuy-instanton-1-7b64849466-nqnn2-app.log

ymanton · 2024-05-16T13:22:55Z

On Power Linux platforms OpenJ9 loads a special file called /proc/powerpc/systemcfg. This is a virtual file provided by the kernel and it contains detailed info about the system. We use some of the info to provide faster time keeping functions. Loading of this file is optional; if we fail to load it we gracefully fall back.

On systems that have SELinux set to Enforcing mode access to this file is blocked. This can be seen in the audit log /var/log/audit/audit.log:

type=AVC msg=audit(1715833184.627:1597): avc:  denied  { map } for  pid=1316985 comm="java" path="/proc/powerpc/systemcfg" dev="proc" ino=4026531945 scontext=system_u:system_r:container_t:s0:c30,c433 tcontext=system_u:object_r:proc_t:s0 tclass=file permissive=0

This problem is happening because the checkpoint was taken on a system where access to this file was not blocked (Ubuntu 22.04, SELinux either not installed, disabled, or set to Permissive mode) and the restore was attempted on a system where access is blocked (OCP 4.15, SELinux set to Enforcing mode).

The older checkpoint image taken with the old 24.0.0.3 driver that can still be restored on OCP works because the JVM did not load the systemcfg file. That checkpoint was taken on a RHEL 9 system, not an Ubuntu system, and that system likely had SELinux set to Enforcing mode, which prevented us from loading systemcfg in the first place. That checkpoint can be restored on OCP without any problem because we don't try to reload systemcfg.

This is also why taking a new checkpoint on Ubuntu with the old 24.0.0.3 driver still fails. If we set SELinux to Enforcing mode on Ubuntu, or take the checkpoint on a RHEL 9 machine, it will likely restore fine on OCP. Alternatively, disabling SELinux or setting it to Permissive mode on OCP allows the Ubuntu checkpoint to restore (I tested this scenario).

My conclusion is that this is not a regression, just a combination of checkpoint host and restore host that was not previously tested. All of the previous builds should show the same problem.

Since the systemcfg file is not necessary we can probably just avoid loading it in the first place if -XX:+EnableCRIUSupport is used and provide a better experience.

abdulmateen-1 · 2024-05-17T01:12:32Z

After setting SELinux to permissive option, we did perform three scenarios to verify things worked properly

The 24.0.0.5 liberty we opened the case for with checkpoint taken on a Ubuntu EBC Power9 VM. The restore was successfully completed.
A new image built with 24.0.03 driver on a RHEL Power10 machine with netlink system enabled. The restore successfully completed
An old image built with 24.0.0.3 that's being verified to restore successfully on an OCP Power cluster. We tried this image to verify if it's a regression issue. The image also restored successfully.

As customers wouldn't want to disable SELinux on their OCP cluster because that isn't best practice what would be the best practice here for a customer to do when building a checkpoint image.

ymanton · 2024-05-17T15:59:12Z

As customers wouldn't want to disable SELinux on their OCP cluster because that isn't best practice what would be the best practice here for a customer to do when building a checkpoint image.

If users use SELinux Enforcing on the checkpoint side they can also keep it on the restore side. On Ubuntu SELinux is typically not used or is used in Permissive mode, so in this case my advice would be to either set up SELinux Enforcing on Ubuntu, or build checkpoint images on RHEL, where SELinux Enforcing is the default.

We can improve this particular case in future JVM releases, but this kind of issue will probably come up again for other reasons.

mtamboli · 2024-05-17T16:53:33Z

Thanks @ymanton! We still need to figure out why restore is failing on OCP when the images are built on P10 machine which is RHEL and seems to SELInux enabled. Do we know why those are failing? For example, 24003 images built earlier vs build now. Do you think we are disabling SELInux somehow when we took checkpoint agin on 24003 on RHEL based P10 machine?

ymanton · 2024-06-04T16:35:28Z

The problem that's seen when we take a checkpoint on a P10 RHEL machine and fail to restore on OCP is actually a crash late in the restore process. The evidence can be seen in the kernel log:

[1207617.180242] Attach API wait[3883768]: bad frame in handle_rt_signal64: 00000000f27d5730 nip 00007fff90a3506c lr 00007fff8bfca234
[1207617.180248] Common-Cleaner[3883760]: bad frame in handle_rt_signal64: 00000000e778bd07 nip 00007fff90059720 lr 00007fff90ce0464
[1207617.180251] Attach API wait[3883768]: bad frame in handle_rt_signal64: 00000000f27d5730 nip 00007fff90a3506c lr 00007fff8bfca234
[1207617.185052] systemd-coredump[3883805]: Failed to get COMM: No such process

The above are Java threads that CRIU attempted to restore, but they hit errors in the kernel when CRIU attempted to run them and the process was terminated.

abdulmateen-1 · 2024-06-05T14:29:32Z

Thanks for the clarification @ymanton. Would we classify this as an application issue or a kernel issue?

ymanton · 2024-06-06T16:02:33Z

Unlikely to be a kernel bug, probably a problem in CRIU and the way we use it. We've seen problems like this before and fixed or worked around them in CRIU, the JVM, or Liberty. I expect something like that here, once I figure out exactly what's going wrong and why.

ymanton · 2024-07-15T16:11:50Z

The crashes seen when checkpointing on P10 and restoring on OCP (machines are P9) are caused by P10-specific libraries being loaded at checkpoint time. These libraries will cause crashes when the process is restored on a P9 machine.

This problem is already fixed in Liberty images; the checkpoint.sh script will prevent any processor-specific libraries from being loaded:

ci.docker/releases/latest/kernel-slim/helpers/build/checkpoint.sh

Lines 9 to 11 in 1089dce

    
           if [ -e /etc/instanton.ld.so.cache ]; then 
        
               cp /etc/instanton.ld.so.cache /etc/ld.so.cache 
        
           fi

However it looks like there is a custom checkpoint.sh being used in the testing environment that does not have this fix. The script looks like this:

bash-5.1$ cat -n /opt/ol/helpers/build/checkpoint.sh
     1  #!/bin/bash
     2
     3  echo "USING THE PATCHED checkpoint.sh"
     4
     5  # hack to bump up the pid by 2000
     6  for i in {1..1000}
     7  do
     8      pidplus.sh
     9  done
    10
    11  echo "Performing checkpoint --at=$1"
    12  /opt/ol/wlp/bin/server checkpoint defaultServer --at=$1
    13
    14  rc=$?
    15  if [ $rc -eq 0 ]; then
    16      # Find all directories in logs/ and output/ that the current user has read/write/execute permissions for
    17      # and give the same permissions to the group.
    18      find -L /logs /output -type d -readable -writable -executable -exec chmod g+rwx {} \;
    19
    20      # Find all files in logs/ and output/ that the current user has read/write permissions for
    21      # and give the same permissions to the group.
    22      find -L /logs /output -type f -readable -writable -exec chmod g+rw {} \;
    23  fi
    24
    25  exit $rc

The location of the crashes seems to be in the signal handler itself, so that's why we don't get javacore output or core files; instead the thread crashes, which causes a SIGILL, which crashes again during signal handling, which causes another SIGILL to be sent, and so on, eventually overflowing the stack and causing a crash in the kernel signal handling code.

If we update the patched checkpoint.sh to include the above code, or just get rid of it altogether if it doesn't have a purpose, this should resolve the issue.

vijaysun-omr · 2024-07-15T16:33:05Z

fyi @tam512 and @tjwatson since we have discussed this issue briefly on the InstantOn scrum calls

tam512 · 2024-08-14T18:26:42Z

a serviceability defect OpenLiberty/open-liberty#29240 was opened to have a better error messages for restore on SELinux

mtamboli added bug SVT Tests/Defectsby SVT labels May 13, 2024

ymanton mentioned this issue May 17, 2024

Restoring a checkpoint fails on Power if SELinux is in Enforcing mode on restore but not at checkpoint time eclipse-openj9/openj9#19511

Open

leochr added the InstantOn label Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restore failed on OCP Power9 for Java 21 #540

Restore failed on OCP Power9 for Java 21 #540

abdulmateen-1 commented May 7, 2024 •

edited by mtamboli

Loading

leochr commented May 8, 2024

abdulmateen-1 commented May 10, 2024

mtamboli commented May 13, 2024

abdulmateen-1 commented May 13, 2024

ymanton commented May 16, 2024

abdulmateen-1 commented May 17, 2024

ymanton commented May 17, 2024

mtamboli commented May 17, 2024

ymanton commented Jun 4, 2024

abdulmateen-1 commented Jun 5, 2024

ymanton commented Jun 6, 2024

ymanton commented Jul 15, 2024

vijaysun-omr commented Jul 15, 2024

tam512 commented Aug 14, 2024

Restore failed on OCP Power9 for Java 21 #540

Restore failed on OCP Power9 for Java 21 #540

Comments

abdulmateen-1 commented May 7, 2024 • edited by mtamboli Loading

leochr commented May 8, 2024

abdulmateen-1 commented May 10, 2024

1. Take the checkpoint and restore on a P9 image if successful, perform a restore on an OCP-P9 cluster

2. Take the checkpoint and restore on a P10 image if sucessfuly, perform a restore on an OCP-P9 cluster

3. Change the application the checkpoint is performed on and restore on an OCP-P9 cluster.

mtamboli commented May 13, 2024

abdulmateen-1 commented May 13, 2024

ymanton commented May 16, 2024

abdulmateen-1 commented May 17, 2024

ymanton commented May 17, 2024

mtamboli commented May 17, 2024

ymanton commented Jun 4, 2024

abdulmateen-1 commented Jun 5, 2024

ymanton commented Jun 6, 2024

ymanton commented Jul 15, 2024

vijaysun-omr commented Jul 15, 2024

tam512 commented Aug 14, 2024

abdulmateen-1 commented May 7, 2024 •

edited by mtamboli

Loading