
Make Quacc compatible with Parsl's checkpointing capabilities #2521

Merged · 8 commits · Nov 7, 2024

Conversation

tomdemeyere
Contributor

Summary of Changes

This PR aims to let users take advantage of Parsl's checkpointing features for simple jobs that do not require complex parameter types (most of them, I believe). To that end:

  • Added a function get_atoms_id_parsl, which returns the hash computed by _encode_atoms (previously named get_atoms_id, which still exists as well) in bytes form.
  • This function is automatically registered with Parsl if Parsl is selected as the workflow engine; this is done in quacc.__init__ (see the sketch after this list).
  • Added comprehensive documentation about the feature in docs.user.misc.restarts.
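For context, a minimal sketch of how such a hasher can be registered with Parsl's singledispatch-based memoizer is shown below. The decorator-style registration and the quacc.atoms.core import path are assumptions for illustration; only get_atoms_id_parsl, get_atoms_id, and _encode_atoms are named in this PR.

```python
# Sketch only: register an ASE Atoms hasher with Parsl's memoizer so that
# Atoms-valued arguments can be memoized/checkpointed. Import path assumed.
from ase import Atoms
from parsl.dataflow.memoization import id_for_memo

from quacc.atoms.core import get_atoms_id  # path assumed for illustration


@id_for_memo.register(Atoms)
def get_atoms_id_parsl(atoms: Atoms, output_ref: bool = False) -> bytes:
    # Parsl requires memoization hashes in bytes form, hence encode() on the
    # existing string hash of the Atoms object.
    return get_atoms_id(atoms).encode()
```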

I am not sure how to deal with testing yet. From what I have seen on my HPC, this seems to work OK.

Requirements

Note: If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.

@buildbot-princeton
Collaborator

Can one of the admins verify this patch?


codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.32%. Comparing base (cf9e8db) to head (e9525f2).
Report is 13 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2521      +/-   ##
==========================================
- Coverage   97.39%   97.32%   -0.08%     
==========================================
  Files          85       85              
  Lines        3538     3550      +12     
==========================================
+ Hits         3446     3455       +9     
- Misses         92       95       +3     
```


@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Oct 31, 2024

@tomdemeyere: Thank you for your contribution here! This looks quite nice. I do not have any particular comments other than that I would be pleased to merge this. Regarding testing, I guess you could run one of the tests/parsl test functions twice over? But in the short-term, a test to make sure that get_atoms_id_parsl is covered is probably sufficient. This should be trivial to implement. I can merge thereafter if you agree.
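One way such a coverage test might look (hypothetical sketch; the import path and assertions are assumptions, not the PR's actual test):

```python
# Hypothetical coverage test for get_atoms_id_parsl; import path assumed.
from ase.build import bulk

from quacc.atoms.core import get_atoms_id_parsl  # path assumed


def test_get_atoms_id_parsl_returns_bytes():
    atoms = bulk("Cu")
    parsl_hash = get_atoms_id_parsl(atoms)
    # Parsl's memoizer requires a bytes hash, and it should be deterministic.
    assert isinstance(parsl_hash, bytes)
    assert parsl_hash == get_atoms_id_parsl(atoms)
```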

@tomdemeyere
Contributor Author

tomdemeyere commented Nov 1, 2024

@Andrew-S-Rosen I need to think about the tests a little; it would be nice if we could check that the restarts actually happen (e.g., grep for something in the Parsl runinfo ...).

This PR introduces restarts (for one workflow engine) for jobs that have already completed. So if you, say, run a workflow that is a series of singlepoints and you get timed out, the workflow will not recompute the already-completed job(s) on restart: nice!

However, the job(s) that failed and did not complete will rerun from the beginning (no DFT-side restart possible), potentially wasting resources. This problem is not entirely fixable by workflow engines alone (I believe), since quacc has a say in directory management. It would be nice to fix this issue at some point; what do you think?
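For context, resubmitting a Parsl run with checkpointing enabled looks roughly like the following generic sketch (placeholder executor settings, not quacc-specific; apps must also have caching enabled for checkpoints to be consulted):

```python
# Generic Parsl sketch: write a checkpoint as each task finishes and, on a
# resubmission, load all previously written checkpoints so completed jobs
# are not recomputed.
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.utils import get_all_checkpoints

config = Config(
    executors=[HighThroughputExecutor()],     # placeholder executor settings
    checkpoint_mode="task_exit",              # checkpoint after each task
    checkpoint_files=get_all_checkpoints(),   # reuse checkpoints from earlier runs
)
parsl.load(config)
```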

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Nov 1, 2024

However, the job(s) that failed and did not complete will rerun from the beginning (no DFT-side restart possible), potentially wasting resources. This problem is not entirely fixable by workflow engines alone (I believe), since quacc has a say in directory management. It would be nice to fix this issue at some point; what do you think?

Sort of, yes. Basically, even if the workflow engine supports native restarts (as many do), it will rerun the job from the beginning in a new directory. But if you were doing a geometry optimization and it timed out at 1000 steps, this means you have to start again rather than pick up where you left off at step 1001. The naive restart can be implemented entirely on the workflow engine side (where it belongs), but the workflow engine knows nothing about the science. It does not know how to restart a DFT calculation, which might involve shuttling files around (e.g., in VASP, moving CONTCAR to POSCAR) and can depend on the underlying method. What is the best way to solve this? It is not immediately clear to me...
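As an illustration of the kind of science-aware step the workflow engine cannot perform on its own, the VASP case above amounts to something like this (purely illustrative helper, not a quacc or Parsl API):

```python
# Illustrative only: the file shuffling a VASP geometry-optimization restart
# needs, which a generic workflow engine knows nothing about.
import shutil
from pathlib import Path


def prepare_vasp_restart(prev_dir: str, new_dir: str) -> None:
    """Copy the previous CONTCAR to the new run's POSCAR so the optimization
    resumes from the last completed ionic step."""
    prev, new = Path(prev_dir), Path(new_dir)
    new.mkdir(parents=True, exist_ok=True)
    contcar = prev / "CONTCAR"
    if contcar.exists() and contcar.stat().st_size > 0:
        shutil.copy(contcar, new / "POSCAR")
    # Wavefunctions/charge density could also be carried over (method dependent).
    for name in ("WAVECAR", "CHGCAR"):
        if (prev / name).exists():
            shutil.copy(prev / name, new / name)
```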

@tomdemeyere
Contributor Author

tomdemeyere commented Nov 1, 2024

It is not immediately clear to me...

To me as well... I think I have understood your various requirements and expectations on this matter through our interactions about it, and I might come up with an idea at some point (I hope).

I keep coming back to the restart aspect because I think it is an important matter that is often completely neglected by community codes. It would be nice to provide a solution so that users and groups can focus entirely on the science without having to worry about things like this.

@Andrew-S-Rosen
Member

To me as well... I think I have understood your various requirements and expectations on this matter through our interactions about it, and I might come up with an idea at some point (I hope).

Open to suggestions!

I keep coming back to the restart aspect because I think it is an important matter that is often completely neglected by community codes. It would be nice to provide a solution so that users and groups can focus entirely on the science without having to worry about things like this.

Agreed. It's a pretty annoying thing to have to deal with...

@tomdemeyere
Contributor Author

Open to suggestions!

The least invasive way I see might be to perform a similar kind of hashing based on the job parameters, replacing the current tmp-quacc-time-etc... naming with tmp-quacc-hash, so that the same calculation always uses the same temporary folder (sketched at the end of this comment). Results dirs would then still be unique. This has multiple good points:

  • Directory management is the only requirement being solved here: restarts are then tackled where they belong, e.g., in calculators, Custodian, etc.
  • People don't have to modify their script for restarts; in fact, they should not. They just run the same script again, and the workflow engine + quacc will make sure to restart exactly where needed.

The challenge will be for jobs that take non-trivial parameter types (phonopy, ...).
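A rough sketch of that deterministic-directory idea, assuming the job parameters are JSON-serializable (the helper name and encoding are illustrative, not an existing quacc API):

```python
# Illustrative sketch: derive a stable scratch-directory name from the job name
# and its parameters so identical submissions reuse the same tmp folder.
import hashlib
import json


def deterministic_tmpdir(job_name: str, parameters: dict) -> str:
    payload = json.dumps(
        {"job": job_name, "params": parameters}, sort_keys=True, default=str
    )
    digest = hashlib.md5(payload.encode()).hexdigest()
    return f"tmp-quacc-{digest}"


# Two identical submissions resolve to the same directory:
print(deterministic_tmpdir("relax_job", {"kpts": [4, 4, 4], "encut": 520}))
```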

@tomdemeyere
Contributor Author

@Andrew-S-Rosen Here are the tests, including proper Parsl restart tests.

@Andrew-S-Rosen
Member

Very nice! Thank you, @tomdemeyere!!

@Andrew-S-Rosen Andrew-S-Rosen merged commit 003ca07 into Quantum-Accelerators:main Nov 7, 2024
20 checks passed