Make Quacc compatible with Parsl's checkpointing capabilities #2521
Can one of the admins verify this patch? |
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2521      +/-   ##
==========================================
- Coverage   97.39%   97.32%   -0.08%
==========================================
  Files          85       85
  Lines        3538     3550     +12
==========================================
+ Hits         3446     3455      +9
- Misses         92       95      +3
|
@tomdemeyere: Thank you for your contribution here! This looks quite nice. I do not have any particular comments other than that I would be pleased to merge this. Regarding testing, I guess you could run one of the |
@Andrew-S-Rosen I need to think about tests a little. It would be nice if we could check that this is working (grep something in the parsl
This PR introduces restarts (for one workflow engine) for jobs that have completed. So if you run, say, a workflow that is a series of singlepoints and you get timed out, the workflow will not recompute the already-completed job(s) on restart: nice! However, the job(s) that failed or did not complete will rerun from the beginning (no DFT-side restart is possible), potentially wasting resources. I believe this problem is not entirely fixable by workflow engines, since quacc has a say in directory management. It would be nice to fix this issue at some point; what do you think? |
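The behaviour described above (completed jobs are skipped on restart, unfinished ones are recomputed) can be illustrated with a toy, standard-library-only sketch. This is not Parsl's implementation or API: `run_with_checkpoint`, `singlepoint`, and the pickle file layout are hypothetical stand-ins chosen only to show the idea of a persistent cache keyed on a hash of the function name and its arguments.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def run_with_checkpoint(func, args, checkpoint_file, log):
    """Return a checkpointed result for (func, args) if present, else compute and save it."""
    cache = pickle.loads(checkpoint_file.read_bytes()) if checkpoint_file.exists() else {}
    # The key depends only on the function name and its arguments.
    key = hashlib.md5(pickle.dumps((func.__name__, args))).hexdigest()
    if key in cache:
        log.append(f"skipped {func.__name__}{args}")
        return cache[key]
    result = func(*args)  # a real job could be killed here by a walltime limit
    cache[key] = result
    checkpoint_file.write_bytes(pickle.dumps(cache))
    log.append(f"computed {func.__name__}{args}")
    return result


def singlepoint(structure_id):
    # Hypothetical stand-in for an expensive calculation.
    return structure_id * 2


checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.pkl"

# First submission: only structures 1 and 2 finish before the "timeout".
first_run = []
for i in (1, 2):
    run_with_checkpoint(singlepoint, (i,), checkpoint, first_run)

# Restart: the full workflow is resubmitted, but finished jobs come from the cache.
second_run = []
results = [run_with_checkpoint(singlepoint, (i,), checkpoint, second_run) for i in (1, 2, 3)]
```

On the second pass, only the third singlepoint is actually computed. The sketch also shows the limitation raised in the comment: a job that was interrupted midway leaves nothing in the cache, so it starts again from scratch.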
Sort of, yes. Basically, even if the workflow engine supports native restarts (as many do), it will rerun the job from the beginning in a new directory. But if you were doing a geometry optimization and it timed out at 1000 steps, this means that you will have to start again rather than pick up where you left off and do step 1001. This is because the naive restart can be implemented entirely on the workflow engine side (where it belongs), but the workflow engine knows nothing about the science. It does not know how to restart a DFT calculation, which might involve shuttling files around for instance (e.g. in VASP, move |
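To make the distinction concrete, here is a toy sketch of a calculation-side restart, as opposed to an engine-side rerun. It is not quacc, VASP, or any workflow engine's code: `optimize`, the `state.json` file, and the halving "step" are hypothetical placeholders. The only point is that resuming requires the job to find its previous working directory and the state stored in it.

```python
import json
import tempfile
from pathlib import Path


def optimize(workdir, max_steps, budget):
    """Run at most `budget` steps in this session, resuming from state.json if it exists."""
    state_file = Path(workdir) / "state.json"
    if state_file.exists():
        state = json.loads(state_file.read_text())  # pick up where we left off
    else:
        state = {"step": 0, "x": 10.0}  # fresh start
    steps_done = 0
    while state["step"] < max_steps and steps_done < budget:
        state["x"] *= 0.5  # stand-in for one optimization step
        state["step"] += 1
        steps_done += 1
        state_file.write_text(json.dumps(state))  # persist after every step
    return state, steps_done


same_dir = tempfile.mkdtemp()
first, n_first = optimize(same_dir, max_steps=5, budget=3)     # "times out" after 3 steps
resumed, n_resumed = optimize(same_dir, max_steps=5, budget=3)  # same directory: continues

# An engine-side rerun in a new directory has no state to read and starts over.
rerun, n_rerun = optimize(tempfile.mkdtemp(), max_steps=5, budget=3)
```

When the second call reuses the same directory, it only needs the remaining steps. The rerun in a fresh directory repeats the first steps, which is the wasted work discussed in this thread.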
To me as well... I think I have understood your requirements and expectations on this matter through our various interactions about it, and I might come up with an idea at some point (I hope). I often come back to the restart aspect because I think it is an important matter that is often completely neglected in community codes. It would be nice to provide a solution so that users and groups can focus entirely on the science without having to worry about things like this. |
Open to suggestions!
Agreed. It's a pretty annoying thing to have to deal with... |
The least invasive way I can see might be to perform a similar kind of hashing based on job parameters to replace the current naming from
The challenge will be jobs that take non-trivial types as parameters (phonopy, ...). |
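A minimal sketch of the idea, assuming job parameters can be serialized to JSON: the directory name is derived from a hash of the job name and its parameters, so a restarted job lands in the same place. `job_dirname`, the `quacc-` prefix, and the example parameters are hypothetical and are not taken from the quacc code base.

```python
import hashlib
import json


def job_dirname(job_name, parameters):
    """Derive a deterministic directory name from the job name and its parameters."""
    # sort_keys makes the name independent of keyword order;
    # default=repr is a crude fallback for values JSON cannot serialize.
    payload = json.dumps(
        {"job": job_name, "parameters": parameters}, sort_keys=True, default=repr
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return f"quacc-{job_name}-{digest}"


a = job_dirname("static", {"xc": "pbe", "kpts": [2, 2, 2]})
b = job_dirname("static", {"kpts": [2, 2, 2], "xc": "pbe"})  # same parameters, other order
c = job_dirname("static", {"xc": "pbe", "kpts": [4, 4, 4]})  # different parameters


class Opaque:
    """Stand-in for a non-trivial parameter type with no stable representation."""


first, second = Opaque(), Opaque()
# The default repr embeds a memory address, so the resulting names are not reproducible.
unstable = job_dirname("phonons", {"obj": first}) != job_dirname("phonons", {"obj": second})
```

Here `a` and `b` coincide while `c` differs. The `Opaque` case shows why non-trivial parameter types are the hard part: without a stable, content-based encoding for such objects, the hash changes between runs and the restart cannot find its directory.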
@Andrew-S-Rosen Here are the tests, including proper parsl restarting tests. |
Very nice! Thank you, @tomdemeyere!! |
Summary of Changes
This PR aims to allow users to use the Parsl checkpointing features for simple jobs that do not require complex parameter types (most of them, I believe). To this end:
get_atoms_id_parsl, which returns the hash computed by _encode_atoms (previously named get_atoms_id, which still exists as well) in bytes form.
quacc.__init__.
docs.user.misc.restarts.
I am not sure how to deal with testing yet. From what I have seen on my HPC, this seems to work OK.
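The summary above describes a function that returns a structure hash as bytes. The toy sketch below shows why such a function is useful for memoization: a type-dispatched identifier lets a cache key be built from the content of an argument rather than from its identity. Everything here is an assumption made for illustration: `FakeAtoms`, `toy_encode_atoms`, `toy_atoms_id_bytes`, and this local `id_for_memo` are stand-ins built on `functools.singledispatch`, not the quacc or Parsl functions themselves (Parsl's memoizer, as far as I understand it, exposes a similar dispatch hook).

```python
import hashlib
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class FakeAtoms:
    """Hypothetical stand-in for an atomic structure object."""
    symbols: tuple
    positions: tuple


def toy_encode_atoms(atoms):
    # Content-based text encoding of the structure (illustrative only).
    return f"{atoms.symbols}|{atoms.positions}"


def toy_atoms_id_bytes(atoms):
    # Hash of the encoding, returned as bytes so it can be fed into a cache key.
    return hashlib.md5(toy_encode_atoms(atoms).encode()).hexdigest().encode()


@singledispatch
def id_for_memo(obj):
    # Local toy dispatcher: unknown types cannot be memoized.
    raise TypeError(f"no memoization id registered for {type(obj).__name__}")


@id_for_memo.register(FakeAtoms)
def _(obj):
    return toy_atoms_id_bytes(obj)


water_a = FakeAtoms(("O", "H", "H"), ((0.0, 0.0, 0.0), (0.0, 0.8, 0.6), (0.0, -0.8, 0.6)))
water_b = FakeAtoms(("O", "H", "H"), ((0.0, 0.0, 0.0), (0.0, 0.8, 0.6), (0.0, -0.8, 0.6)))
stretched = FakeAtoms(("O", "H", "H"), ((0.0, 0.0, 0.0), (0.0, 0.9, 0.6), (0.0, -0.9, 0.6)))

same_id = id_for_memo(water_a) == id_for_memo(water_b)      # equal content, distinct objects
different_id = id_for_memo(water_a) != id_for_memo(stretched)
```

Two separately constructed but identical structures map to the same identifier, so a checkpointed result can be found again after a restart, while a modified structure gets a new one.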