Kubernetes checkpoint-restore process failed #2366
Please try with cgroup v1. Restoring containers with Kubernetes currently does not always work with cgroup v2. We are waiting for opencontainers/runc#3546 to be part of a runc release. The missing … You can do …
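For reference, a quick way to check which cgroup version a node is running (a generic check, independent of this issue):

```bash
# Prints the filesystem type mounted at /sys/fs/cgroup:
#   cgroup2fs -> cgroup v2 (unified hierarchy), tmpfs -> cgroup v1
stat -fc %T /sys/fs/cgroup/
```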
Thanks for your help, it works!
I have another question. Recently I tried to restore my restored image; it successfully runs as a new pod on the local node but fails on another node. The kubelet event I see is `Normal Pulled 6s (x3 over 18s) kubelet Container image "docker.io/wuch100519/migrations-24311a:dump" already present on machine`.
I tried rebuilding a new image and using a local image instead of pulling from Docker Hub, but it does not work.
What CRI-O currently does not do is pull missing images during restore. So you need to pull the base image on the destination node manually first.
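In practice that means something like the following on the destination node, before creating the restored pod (the image name is just an example):

```bash
# Pre-pull the image the checkpoint was created from:
crictl pull docker.io/library/nginx:latest
# Confirm it is now in the local store:
crictl images | grep nginx
```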
Thanks very much, it helped me greatly!
When I try again on another machine, I get the following log; I don't know what happened.
Is this on a cgroup v1 or v2 system?
I am sure it is on a v1 system, I have run …
Please attach the CRIU log during the checkpointing.
OK. I finished the checkpoint by running `curl -X POST "https://166.111.xxx.190:10250/checkpoint/default/migrations-319a/migration-319a" --insecure --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key`.
That is not the log file during the checkpointing.
How can I get it?
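For anyone else looking for the checkpoint-time CRIU log: a hedged sketch, assuming CRI-O's default storage locations (the same userdata paths quoted later in this thread):

```bash
# CRI-O keeps per-container CRIU logs in the container's userdata directory:
sudo find /run/containers/storage/overlay-containers \
  \( -name dump.log -o -name restore.log \)
```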
Hello, @adrianreber. I have an issue here. I would now like to checkpoint the postgresql pod as an experiment. Checkpointing the pod on the specific node works. But when I built an image based on the tar file, pushed it to Docker Hub, and then deployed a pod on a new node using that image, I found an error.
That is a known regression in CRI-O. You need to manually pull the image the checkpoint is based on.
What do you mean by manually pulling the image? I guess I ssh into the worker node I want to migrate the pod to, and then use the crictl pull command to pull that image from Docker Hub. But in my implementation I want the whole checkpoint and restore process to happen automatically (I wrote a checkpoint CRD), so I need kubectl apply -f to let CRI-O pull the image for me. Is there any alternative I can use to avoid this issue?
I am not aware of an alternative. As I said, that used to work and currently is not working.
Ok, got it. Thank you.
Hello @adrianreber, I would like to bring up the unresolved issue again. I have now found the kubelet logs on the worker node related to this error. Here are snippets of the logs; checkpoint-new23:latest is the checkpointed image I pulled manually. The kubelet says that /var/lib/containers/storage/overlay/89655a67ca2d1e6cbd28a2cd61a8077cb443ab960d54209a00161fea9afa480d/merged/spec.dump doesn't exist. Could that be the reason for the CreateContainerError (image not found), rather than the known CRI-O regression? I also found that none of the folders under /var/lib/containers/storage/overlay have a spec.dump file; is that normal?
@tonyliu666 Can you push your checkpoint image to a registry? Then I can take a look.
Yes, I just stored this image in CRI-O's container storage. Here is the information.
Can you push the image to a public registry? I would like to download it and inspect it locally.
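For reference, the forensic-checkpointing blog post linked in the description builds and pushes the checkpoint image roughly as follows (the archive path, container name, and registry user are placeholders):

```bash
# Build an OCI image from scratch that contains only the checkpoint archive:
newcontainer=$(buildah from scratch)
buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod>_<namespace>-<container>-<timestamp>.tar /
# This annotation tells CRI-O which container the checkpoint belongs to:
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container> $newcontainer
buildah commit $newcontainer checkpoint-image:latest
buildah rm $newcontainer
# Push it to a public registry so it can be pulled on other nodes and inspected:
buildah push localhost/checkpoint-image:latest docker.io/<user>/checkpoint-image:latest
```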
Sure, here is the link for the image: https://hub.docker.com/r/tonyliu666/checkpoint-new23.
@tonyliu666 The checkpoint seems to be created correctly and it does contain spec.dump.
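If it helps others verify their own archives, a minimal check (spec.dump is the file the kubelet complained about earlier in this thread):

```bash
# Unpack the checkpoint archive and list its top level; a valid checkpoint
# should contain spec.dump alongside the CRIU image data:
mkdir -p /tmp/ck && tar xf checkpoint-<pod>_<namespace>-<container>-<timestamp>.tar -C /tmp/ck
ls /tmp/ck
```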
@rst0git, I dumped all the logs from CRI-O on the node the pod is deployed on. Because GitHub only supports 25 MB file uploads, I can't attach the full log file here; I fetched the snippets of the logs related to this image.
This is something different, I think. Now it says that CRIU failed and more information can be found in the CRIU log file.
The tonyliu666/checkpoint-new23 image causes the RuntimeError in CRI-O. I also tested another pod, which runs Python Flask, and found it in the CreateContainerError state, reporting an error like:
And the logs in CRI-O look like:
I've also pushed the Python Flask image to Docker Hub: https://hub.docker.com/repository/docker/tonyliu666/flask.
The image tonyliu666/flask does not seem to contain the expected checkpoint files.
How about this one: https://hub.docker.com/repository/docker/tonyliu666/httpd/general? I checked it using the checkpointctl command and I think it contains the files we want, but I still get the CreateContainerError. The output I checked from the checkpointctl command:
Displaying container checkpoint data from blobs/sha256/ac85816af204246b50a457ea5f7787ee86163205f70304493e234b3c6ad106bc
[checkpointctl summary table omitted]
and I extracted the files from the tar file:
The CRI-O outputs are something like these:
and the logs from the kubelet:
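For readers following along: the "Displaying container checkpoint data" table quoted above is what checkpointctl prints when pointed at the archive, e.g.:

```bash
# Inspect a checkpoint archive (here, the layer blob of the extracted image)
# without restoring it:
checkpointctl show blobs/sha256/ac85816af204246b50a457ea5f7787ee86163205f70304493e234b3c6ad106bc
```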
I don't know how you created those images, but if the content is in the …
Originally I used Buildah to build the images, but failed to run some pods (like postgresql) from the checkpointed image due to CreateContainerError. So I built the image programmatically (following your other post, opencontainers/image-spec#962), but got the same result. Now I have built the image the Buildah way and successfully run the new pod on the other node. Here is the link for the image: https://hub.docker.com/repository/docker/tonyliu666/nginx/general. I then extracted the image to a folder, and the structure of the folder looks like:
The checkpointed content is in one of the subfolders of the blobs/ directory. I would like to confirm where the checkpointed contents should be located, because my program uses the same structure.
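A hedged sketch of how to inspect that layout locally (skopeo and the :latest tag are assumptions; the layer digest is a placeholder):

```bash
# Copy the image into a local OCI layout and look at where the checkpoint lives:
skopeo copy docker://docker.io/tonyliu666/nginx:latest oci:/tmp/ckimg
ls /tmp/ckimg                                  # oci-layout, index.json, blobs/
# The checkpoint tar is stored as a layer blob; listing it should show spec.dump:
tar tf /tmp/ckimg/blobs/sha256/<layer-digest> | head
```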
@adrianreber I reinstalled the k8s cluster at version v1.30.0, with CRI-O version 1.30.3 and CRIU version 3.16.1. Then I tried to do the same thing (use the buildah command to build the image) but got a RunContainerError when deploying the new pod on the same node as the original pod. The errors are:
The image link is here: https://hub.docker.com/r/tonyliu666/checkpoint-image/tags |
We need to see the content of restore.log.
I cd'd into the /run/containers/storage/overlay-containers/httpd/userdata folder, but I can't find restore.log.
As Adrian mentioned above, the missing … You can use …
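One thing worth double-checking here (a sketch, assuming CRI-O defaults): the overlay-containers directory is usually keyed by container ID rather than by container name:

```bash
# Look up the container ID, then check its userdata directory for the CRIU logs:
CID=$(sudo crictl ps -a --name httpd -q | head -n1)
sudo ls /run/containers/storage/overlay-containers/"$CID"/userdata/
```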
Looks like you are using Ubuntu's broken CRIU 3.16.1. Please use a newer version of CRIU.
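To confirm which CRIU binary is actually in use and that it passes its own sanity checks:

```bash
criu --version
# CRIU's built-in self-test of the kernel features it needs:
sudo criu check
```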
I changed to a newer version of CRIU, and deploying a checkpointed pod on the original node now succeeds. But deploying the new pod on the other node still fails; the error is still CreateContainerError with an "image not known" log.
kubelet logs:
The image link: https://hub.docker.com/repository/docker/tonyliu666/httpd/general
@adrianreber, I found that the node which restores the pod fetches the original image (e.g. docker.io/library/nginx) instead of the checkpointed image. The logs on the node which restores the pod:
And the image listed on the original node:
However, the checkpointed image has an image ID like:
I have made sure the destination node can successfully pull the checkpointed image from Docker Hub. But when running the pod, it creates the pod using the original image.
After checking why this CreateContainerError happens, I found the reason: if the destination node doesn't have the original image (e.g. docker.io/library/redis), creating the pod can fail due to the missing rootFSImageRef. After pulling the original image, everything works.
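So, until the regression is fixed, the work-around on the destination node looks like this (redis is the example from the comment above):

```bash
# Make sure both the original base image and the checkpoint image are present:
crictl pull docker.io/library/redis
crictl images | grep -E 'redis|checkpoint'
```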
@adrianreber, one more problem: the postgres pod is still facing the RunContainerError. I have attached the criu.log here.
Description
The checkpoint tar file is produced correctly, but when trying to restore the pod on another node, it fails with the error msg="criu failed: type NOTIFY errno 0". I followed the steps in https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/ and got stuck on the final step, which deploys the checkpointed pod on the new node.
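For context, that final step from the blog post looks roughly like this (pod name, container name, image, and node are placeholders):

```bash
# Create a pod from the checkpoint image, pinned to the destination node:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: restored-pod
spec:
  nodeName: <destination-node>   # schedule directly onto the target node
  containers:
  - name: <container-name>       # the blog post reuses the original container name
    image: docker.io/<user>/checkpoint-image:latest
EOF
```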
Describe the results you received:
The restore fails with msg="criu failed: type NOTIFY errno 0", as described above.
Describe the results you expected:
Successfully running the new pod on another node.
Additional information you deem important (e.g. issue happens only occasionally):
Running a k8s cluster on Vagrant VMs; CRI-O version 1.28.0 and CRIU version 3.19.
CRIU logs and information:
dump.log
By the way, I can't see /run/containers/storage/overlay-containers/counters/userdata/restore.log on the node running the checkpointed pod.