
Make Quacc compatible with Parsl's checkpointing capabilities #2521

Merged · 8 commits · Nov 7, 2024

Conversation

tomdemeyere
Contributor

Summary of Changes

This PR aims to let users take advantage of Parsl's checkpointing features for simple jobs that do not require complex parameter types (most of them, I believe). To that end:

  • Added a function get_atoms_id_parsl, which returns the hash computed by _encode_atoms (previously named get_atoms_id, which still exists as well) in bytes form.
  • This function is automatically registered with Parsl if Parsl is selected as the workflow engine; this is done in quacc.__init__ (see the sketch after this list).
  • Added comprehensive documentation about the feature in docs.user.misc.restarts.
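For context, a minimal sketch of how such a hasher can be registered with Parsl's singledispatch-based memoizer is shown below. The decorator-style registration and the quacc.atoms.core import path are assumptions for illustration; only get_atoms_id_parsl, get_atoms_id, and _encode_atoms are named in this PR.

```python
# Sketch only: register an ASE Atoms hasher with Parsl's memoizer so that
# Atoms-valued arguments can be memoized/checkpointed. Import path assumed.
from ase import Atoms
from parsl.dataflow.memoization import id_for_memo

from quacc.atoms.core import get_atoms_id  # path assumed for illustration


@id_for_memo.register(Atoms)
def get_atoms_id_parsl(atoms: Atoms, output_ref: bool = False) -> bytes:
    # Parsl requires memoization hashes in bytes form, hence encode() on the
    # existing string hash of the Atoms object.
    return get_atoms_id(atoms).encode()
```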

I am not sure how to deal with testing yet. From what I have seen on my HPC, this seems to work OK.

Requirements

Note: If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.

@buildbot-princeton
Collaborator

Can one of the admins verify this patch?


codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.32%. Comparing base (cf9e8db) to head (e9525f2).
Report is 13 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2521      +/-   ##
==========================================
- Coverage   97.39%   97.32%   -0.08%     
==========================================
  Files          85       85              
  Lines        3538     3550      +12     
==========================================
+ Hits         3446     3455       +9     
- Misses         92       95       +3     
```


@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Oct 31, 2024

@tomdemeyere: Thank you for your contribution here! This looks quite nice. I do not have any particular comments other than that I would be pleased to merge this. Regarding testing, I guess you could run one of the tests/parsl test functions twice over? But in the short-term, a test to make sure that get_atoms_id_parsl is covered is probably sufficient. This should be trivial to implement. I can merge thereafter if you agree.
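One way such a coverage test might look (hypothetical sketch; the import path and assertions are assumptions, not the PR's actual test):

```python
# Hypothetical coverage test for get_atoms_id_parsl; import path assumed.
from ase.build import bulk

from quacc.atoms.core import get_atoms_id_parsl  # path assumed


def test_get_atoms_id_parsl_returns_bytes():
    atoms = bulk("Cu")
    parsl_hash = get_atoms_id_parsl(atoms)
    # Parsl's memoizer requires a bytes hash, and it should be deterministic.
    assert isinstance(parsl_hash, bytes)
    assert parsl_hash == get_atoms_id_parsl(atoms)
```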

@tomdemeyere
Contributor Author

tomdemeyere commented Nov 1, 2024

@Andrew-S-Rosen I need to think about the tests a little; it would be nice if we could check that the restarts actually happen (e.g., grep for something in the Parsl runinfo ...).

This PR introduces restarts (for one workflow engine) for jobs that have already completed. So if you, say, run a workflow that is a series of singlepoints and you get timed out, the workflow will not recompute the already-completed job(s) on restart: nice!

However, the job(s) that failed and did not complete will rerun from the beginning (no DFT-side restart possible), potentially wasting resources. This problem is not entirely fixable by workflow engines alone (I believe), since quacc has a say in directory management. It would be nice to fix this issue at some point; what do you think?
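For context, resubmitting a Parsl run with checkpointing enabled looks roughly like the following generic sketch (placeholder executor settings, not quacc-specific; apps must also have caching enabled for checkpoints to be consulted):

```python
# Generic Parsl sketch: write a checkpoint as each task finishes and, on a
# resubmission, load all previously written checkpoints so completed jobs
# are not recomputed.
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.utils import get_all_checkpoints

config = Config(
    executors=[HighThroughputExecutor()],     # placeholder executor settings
    checkpoint_mode="task_exit",              # checkpoint after each task
    checkpoint_files=get_all_checkpoints(),   # reuse checkpoints from earlier runs
)
parsl.load(config)
```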

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Nov 1, 2024

However, the job(s) that failed and did not complete will rerun from the beginning (no DFT-side restart possible), potentially wasting resources. This problem is not entirely fixable by workflow engines alone (I believe), since quacc has a say in directory management. It would be nice to fix this issue at some point; what do you think?

Sort of, yes. Basically, even if the workflow engine supports native restarts (as many do), it will rerun the job from the beginning in a new directory. But if you were doing a geometry optimization and it timed out at 1000 steps, this means you have to start again rather than pick up where you left off at step 1001. The naive restart can be implemented entirely on the workflow engine side (where it belongs), but the workflow engine knows nothing about the science. It does not know how to restart a DFT calculation, which might involve shuttling files around (e.g., in VASP, moving CONTCAR to POSCAR) and can depend on the underlying method. What is the best way to solve this? It is not immediately clear to me...
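As an illustration of the kind of science-aware step the workflow engine cannot perform on its own, the VASP case above amounts to something like this (purely illustrative helper, not a quacc or Parsl API):

```python
# Illustrative only: the file shuffling a VASP geometry-optimization restart
# needs, which a generic workflow engine knows nothing about.
import shutil
from pathlib import Path


def prepare_vasp_restart(prev_dir: str, new_dir: str) -> None:
    """Copy the previous CONTCAR to the new run's POSCAR so the optimization
    resumes from the last completed ionic step."""
    prev, new = Path(prev_dir), Path(new_dir)
    new.mkdir(parents=True, exist_ok=True)
    contcar = prev / "CONTCAR"
    if contcar.exists() and contcar.stat().st_size > 0:
        shutil.copy(contcar, new / "POSCAR")
    # Wavefunctions/charge density could also be carried over (method dependent).
    for name in ("WAVECAR", "CHGCAR"):
        if (prev / name).exists():
            shutil.copy(prev / name, new / name)
```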

@tomdemeyere
Contributor Author

tomdemeyere commented Nov 1, 2024

It is not immediately clear to me...

To me as well... I think I have understood your various requirements and expectations on this matter through our interactions about it, and I might come up with an idea at some point (I hope).

I keep coming back to the restart aspect because I think it is an important matter that is often completely neglected by community codes. It would be nice to provide a solution so that users and groups can focus entirely on the science without having to worry about things like this.

@Andrew-S-Rosen
Member

To me as well... I think I have understood your various requirements and expectations on this matter through our interactions about it, and I might come up with an idea at some point (I hope).

Open to suggestions!

I keep coming back to the restart aspect because I think it is an important matter that is often completely neglected by community codes. It would be nice to provide a solution so that users and groups can focus entirely on the science without having to worry about things like this.

Agreed. It's a pretty annoying thing to have to deal with...

@tomdemeyere
Contributor Author

Open to suggestions!

The least invasive way I see might be to perform a similar kind of hashing based on the job parameters, replacing the current tmp-quacc-time-etc... naming with tmp-quacc-hash, so that the same calculation always uses the same temporary folder (sketched at the end of this comment). Results dirs would then still be unique. This has multiple good points:

  • Directory management is the only requirement being solved here: restarts are then tackled where they belong, e.g., in calculators, Custodian, etc.
  • People don't have to modify their script for restarts; in fact, they should not. They just run the same script again, and the workflow engine + quacc will make sure to restart exactly where needed.

The challenge will be for jobs that take non-trivial parameter types (phonopy, ...).
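A rough sketch of that deterministic-directory idea, assuming the job parameters are JSON-serializable (the helper name and encoding are illustrative, not an existing quacc API):

```python
# Illustrative sketch: derive a stable scratch-directory name from the job name
# and its parameters so identical submissions reuse the same tmp folder.
import hashlib
import json


def deterministic_tmpdir(job_name: str, parameters: dict) -> str:
    payload = json.dumps(
        {"job": job_name, "params": parameters}, sort_keys=True, default=str
    )
    digest = hashlib.md5(payload.encode()).hexdigest()
    return f"tmp-quacc-{digest}"


# Two identical submissions resolve to the same directory:
print(deterministic_tmpdir("relax_job", {"kpts": [4, 4, 4], "encut": 520}))
```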

@tomdemeyere
Contributor Author

@Andrew-S-Rosen Here are the tests, including proper Parsl restart tests.

@Andrew-S-Rosen
Member

Very nice! Thank you, @tomdemeyere!!

@Andrew-S-Rosen Andrew-S-Rosen merged commit 003ca07 into Quantum-Accelerators:main Nov 7, 2024
20 checks passed