Questions about hooks #10

Open
tjwatson opened this issue Apr 26, 2021 · 4 comments

Comments

@tjwatson

While experimenting with CRIU on a relatively complex application (in my case Open Liberty, https://github.com/OpenLiberty/open-liberty), I have developed some doubts about the hooks for CheckpointRestore.

My main doubt concerns the requirement to serialize on the dump operation and deserialize on the restore operation. If a dump saves the complete process and all of its state in the image, why do we also need to save the state of the hooks separately through serialization? Would the hook objects not already be available once we restore, so that we can call the restore side straight away when the process is resumed?

For Open Liberty I will likely need to introduce my own hook SPI/API [1] anyway, to give subsystems within Liberty the flexibility to participate in the prepare/restore operations. My current thinking is to not have the hooks serialized at all. The general flow is:

  1. Gather up all the hook implementations into a snapshot
  2. Invoke prepare on all the hooks
  3. Perform the CRIU dump (I currently modified the JavaCriuJar to terminate the process on dump)
  4. Use the criu command to restore. I would like to avoid invoking the JVM to restore, to avoid the overhead of firing up a Java process.
  5. The process picks up from where it was frozen. The thread that invoked the dump resumes and still has a reference to the snapshot of hooks used during the prepare phase. That thread then immediately calls restore on the same snapshot of hooks that it called prepare on.
  6. Each hook will have its own state that is persisted with the dumped image. On restore, from the hook POV, it thinks it is the same object and will be able to restore anything it was responsible for in prepare.
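
The flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the JavaCriuJar or Open Liberty API: `FlowHook`, `CheckpointFlow`, and `CheckpointFlowDemo` are hypothetical names, and the CRIU dump is replaced by a plain `Runnable` stand-in, so nothing is actually frozen or restored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hook contract for illustration; not the JavaCriuJar or Open Liberty API.
interface FlowHook {
    void prepare();
    void restore();
}

final class CheckpointFlow {
    private final List<FlowHook> registered = new ArrayList<>();

    void register(FlowHook hook) {
        registered.add(hook);
    }

    // 'dump' is a stand-in for the native CRIU dump call, which is not modeled here.
    void checkpoint(Runnable dump) {
        // Step 1: take a snapshot so later registrations cannot change the set.
        List<FlowHook> snapshot = new ArrayList<>(registered);
        // Step 2: prepare every hook in the snapshot.
        for (FlowHook hook : snapshot) {
            hook.prepare();
        }
        // Steps 3 and 4: the process would be frozen here and later resumed by criu.
        dump.run();
        // Step 5: the same thread continues and restores the very same hook objects.
        for (FlowHook hook : snapshot) {
            hook.restore();
        }
    }
}

final class CheckpointFlowDemo {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        CheckpointFlow flow = new CheckpointFlow();
        for (String name : new String[] {"a", "b"}) {
            flow.register(new FlowHook() {
                public void prepare() { log.add("prepare:" + name); }
                public void restore() { log.add("restore:" + name); }
            });
        }
        flow.checkpoint(() -> log.add("dump"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the snapshot list lives on the stack and heap of the dumping thread, no separate serialization step appears anywhere in this sketch; whether that holds for a real CRIU restore is exactly the question raised in this issue.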

My reason for opening this issue is to discuss the strategy for how the hooks work in JavaCriuJar to make sure I am not overlooking something in my strategy that I described above for Open Liberty.

If my understanding is correct, we could decide to push a similar strategy down into the JavaCriuJar library. The difficulty I would have with that is that I need to separate the CheckpointRestore/Hook API from the implementation, so that the implementation can be installed on top while lower levels of Open Liberty participate and implement the prepare/restore hooks without requiring the actual implementation of CheckpointRestore to be available at runtime. One solution is to split JavaCriuJar into two JARs: 1) one for the API and 2) one for the implementation. On the other hand, I am also fine with Open Liberty having its own hook API and only using JavaCriuJar to invoke the criu lib calls that perform the dump.
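
One way the API/implementation split could look is sketched below. All names here (`ApiHook`, `CheckpointEngine`, `HookRegistry`) are hypothetical and do not exist in JavaCriuJar. The sketch assumes the implementation JAR would publish its engine through the standard `java.util.ServiceLoader` mechanism, which is only one of several possible wiring choices.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.ServiceLoader;

// --- Types that would live in the hypothetical API JAR ---

// Hook contract that lower layers implement; it has no dependency on CRIU.
interface ApiHook {
    void prepare();
    void restore();
}

// Contract that the hypothetical implementation JAR would provide.
interface CheckpointEngine {
    void dump(String imageDir);
}

// Registry usable even when no engine implementation is installed.
final class HookRegistry {
    private final List<ApiHook> hooks = new ArrayList<>();

    synchronized void register(ApiHook hook) {
        hooks.add(hook);
    }

    synchronized List<ApiHook> snapshot() {
        return new ArrayList<>(hooks);
    }

    // Returns the first engine found on the class path, or null if none is installed.
    static CheckpointEngine findEngine() {
        Iterator<CheckpointEngine> it = ServiceLoader.load(CheckpointEngine.class).iterator();
        return it.hasNext() ? it.next() : null;
    }
}
```

With this shape, lower layers compile and run against the API types only, and hook registration keeps working when `findEngine()` returns null because the implementation JAR is absent.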

[1] https://github.com/tjwatson/open-liberty/blob/criu/dev/com.ibm.ws.kernel.boot.core/src/io/openliberty/checkpoint/spi/SnapshotHook.java

@chflood
Collaborator

chflood commented Apr 26, 2021

Do you foresee needing to impose an order on the hooks?

I'm happy to discuss a better option than serialization.

@tjwatson
Author

tjwatson commented Apr 26, 2021

I would like to clear up some confusion on my part about the management of:

org.checkpoint.CheckpointRestore.restoreHooks

  1. In CheckpointRestore.saveTheWorld(String) each element of restoreHooks is serialized to a file JavaRestoreHooks.txt
  2. Then CheckpointRestore.saveTheWorldNative(String) is called. Will the state of the process in the saved image contain a restoreHooks list that is already fully populated?
  3. At this point, say the Java application continues and eventually exits.
  4. Later a new process is started and invokes the method CheckpointRestore.restoreTheWorld(String). This should restore the state, which I think should restore all the objects of the Java process saved in step 2. At this point, would we not have a restoreHooks list fully populated with the hooks that were saved to the CRIU image?
  5. Then the code deserializes the serialized hooks from step 1. Will this result in duplicates being added to restoreHooks: the ones saved as part of the CRIU image and the ones saved during object serialization by the JavaCriuJar code?
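
The duplication concern in step 5 can be illustrated with plain Java serialization. This is only a simulation under a loud assumption: it assumes the in-memory list survives the checkpoint fully populated, which is the open question here and is not confirmed behavior of JavaCriuJar. `DemoHook` and `DuplicateHookDemo` are hypothetical names, and a byte array stands in for JavaRestoreHooks.txt.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical serializable hook used only for this simulation.
final class DemoHook implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;

    DemoHook(String name) {
        this.name = name;
    }
}

final class DuplicateHookDemo {
    static List<DemoHook> simulate() {
        List<DemoHook> restoreHooks = new ArrayList<>();
        restoreHooks.add(new DemoHook("datasource"));
        try {
            // Step 1: serialize each hook (a byte array stands in for the file).
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                for (DemoHook hook : restoreHooks) {
                    out.writeObject(hook);
                }
            }
            // Steps 2 to 4 (ASSUMPTION): the image preserves restoreHooks as-is,
            // so the list is still populated when execution resumes.
            int saved = restoreHooks.size();
            // Step 5: deserialize and add, as the restore path is described above.
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                for (int i = 0; i < saved; i++) {
                    restoreHooks.add((DemoHook) in.readObject());
                }
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return restoreHooks;
    }

    public static void main(String[] args) {
        System.out.println(simulate().size());
    }
}
```

Under that assumption the list ends up with two distinct objects that carry the same name, so each hook's restore logic would run twice.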

I will admit that my grasp of how the restore side works, when it transforms the running Java instance (the one running the CheckpointRestore.restoreTheWorld(String) method) into the restored process, is pretty sketchy. My initial reaction is that it would be far more reliable to invoke the criu restore command directly and depend on execution picking right back up at the point where the process was saved. The restore then happens right there, against the existing hooks that were present at the time of the save.

As for order, I was planning for that to be controlled from the Open Liberty side, in the registration of the hooks on our side. As of now I don't have good examples where hook order matters, but I will not be surprised if it comes up. This does bring up a good point, though: if the same hook object does both the prepare and the restore, then it may make sense to run the restore in the reverse of the prepare order.
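
A minimal sketch of restoring in the reverse of the prepare order, assuming the same hook objects handle both phases. The `ReverseRestoreDemo` class and its nested `Hook` interface are hypothetical, and the dump is again a `Runnable` stand-in rather than a real CRIU call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ReverseRestoreDemo {
    // Hypothetical hook contract for illustration only.
    interface Hook {
        void prepare();
        void restore();
    }

    // Prepares hooks in list order, then restores them last-in, first-out.
    static void checkpoint(List<Hook> hooks, Runnable dump) {
        Deque<Hook> prepared = new ArrayDeque<>();
        for (Hook hook : hooks) {
            hook.prepare();
            prepared.push(hook); // the most recently prepared hook sits on top
        }
        dump.run(); // stand-in for the CRIU dump and later resume
        while (!prepared.isEmpty()) {
            prepared.pop().restore();
        }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        List<Hook> hooks = new ArrayList<>();
        for (String name : new String[] {"network", "cache"}) {
            hooks.add(new Hook() {
                public void prepare() { log.add("prepare:" + name); }
                public void restore() { log.add("restore:" + name); }
            });
        }
        checkpoint(hooks, () -> log.add("dump"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The stack mirrors the usual acquire/release pattern: whatever was quiesced last is brought back first, so a hook that depends on an earlier one finds it in the same state it saw during prepare.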

@vijaysun-omr

I think one scenario in which order of hooks may matter is if we had a distinction between "application" hooks and "JVM" hooks, e.g. we might want to compact the Java heap and release unneeded memory to the OS in the JVM hook as one of the last actions before taking a snapshot to keep its size small.
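
One possible way to express the "application" versus "JVM" distinction is to tag each hook with a phase and sort by it before prepare. The sketch below is an assumption about how that could be modeled; `Phase`, `PhasedHook`, and `PhasedHooksDemo` are hypothetical names, and the hooks carry only a name because the point is the ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PhasedHooksDemo {
    // Declaration order doubles as prepare order: application hooks first, JVM hooks last.
    enum Phase { APPLICATION, JVM }

    // Hypothetical hook descriptor; a real hook would also carry prepare/restore logic.
    static final class PhasedHook {
        final String name;
        final Phase phase;

        PhasedHook(String name, Phase phase) {
            this.name = name;
            this.phase = phase;
        }
    }

    // Returns hook names in prepare order. List.sort is stable, so hooks
    // within the same phase keep their registration order.
    static List<String> prepareOrder(List<PhasedHook> registered) {
        List<PhasedHook> sorted = new ArrayList<>(registered);
        sorted.sort(Comparator.comparing((PhasedHook hook) -> hook.phase));
        List<String> names = new ArrayList<>();
        for (PhasedHook hook : sorted) {
            names.add(hook.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<PhasedHook> hooks = new ArrayList<>();
        hooks.add(new PhasedHook("heap-compaction", Phase.JVM));
        hooks.add(new PhasedHook("sessions", Phase.APPLICATION));
        hooks.add(new PhasedHook("connections", Phase.APPLICATION));
        System.out.println(prepareOrder(hooks));
    }
}
```

Here a heap-compaction hook registered first would still run after the application hooks, which matches the goal of shrinking the heap as one of the last actions before the snapshot.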

@chflood
Collaborator

chflood commented Apr 28, 2021 via email
