Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New option to store gbw file instead of fetching it by default #72

Draft
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

danielhollas
Copy link
Collaborator

@danielhollas danielhollas commented Apr 17, 2023

ORCA stores molecular wavefunction in a binary GBW file, which can then be used to restart a calculation or used as a wavefunction guess. Currently, in OrcaCalculation we always fetch this file as part of the retrieved folder. However, this is not ideal since these files can get very big and it is not possible to delete them once the workflow finishes since they become part of the AiiDA provenance.

In this PR, we instead introduce a new keyword store_gbw as part of input parameters. This is by default false. If the user sets it to true, the gbw file is attached to the calcjob outputs as SinglefileData node. This has the advantage that it can then be easily passed into subsequent workflows.

Closes #68

@danielhollas danielhollas added enhancement New feature or request question Further information is requested labels Apr 17, 2023
@danielhollas danielhollas self-assigned this Apr 17, 2023
@codecov-commenter
Copy link

codecov-commenter commented Apr 17, 2023

Codecov Report

Merging #72 (d09c5cc) into develop (e267e8b) will decrease coverage by 0.10%.
The diff coverage is 63.63%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Impacted file tree graph

@@             Coverage Diff             @@
##           develop      #72      +/-   ##
===========================================
- Coverage    57.33%   57.24%   -0.10%     
===========================================
  Files           15       15              
  Lines         1589     1595       +6     
===========================================
+ Hits           911      913       +2     
- Misses         678      682       +4     
Impacted Files Coverage Δ
aiida_orca/parsers/__init__.py 83.92% <40.00%> (-4.97%) ⬇️
aiida_orca/calculations/orca_orca.py 96.05% <83.33%> (-1.17%) ⬇️

@danielhollas
Copy link
Collaborator Author

@yakutovicha @mbercx would you mind taking a look at this PR? I am trying to figure out if the design here makes sense, and whether it is something close to what other AiiDA plugins (aiida-gaussian, aiida-quantumespresso) are doing to handle large wavefunction files. Thanks!

@mbercx
Copy link

mbercx commented May 9, 2023

Just off the top of my head: have you looked into stashing?

https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/calculations/usage.html?highlight=stash#stashing-on-the-remote

This is what I use for e.g. the charge density in Quantum ESPRESSO (not in the AiiDAlab QEapp, but just my work in general).

@danielhollas
Copy link
Collaborator Author

@mbercx thanks! I saw stashing in the docs, but it seemed a bit too complicated and I was lazy to look into it deeper. :-D
This is definitely something I could use. I guess the question here is whether the aiida-orca plugin should offer a more streamlined way to do this (potentially common) operation. I also need to look into how one can actually pass in the stash folder.

@mbercx
Copy link

mbercx commented May 12, 2023

It's fairly simple to use, and comes with every calculation job through the options. There is only one stash_mode supported at the moment (COPY), so you can ignore that. So for the calculation job inputs:

calcjob_inputs['metadata']['options']['stash'] = {
    'source_list': ['restart_file1', 'restart_file2'],
    'target_base': '/path/to/stashing/folder',
}

You can stash anywhere you want, i.e. sometimes I use it to store charge densities in project storage for later use. If you just need them temporarily for restarts though, you can also just take a default stash folder on the scratch space:

target_base = Path(calcjob_code.computer.get_workdir(), 'stash').as_posix()

However, this only really makes sense in case you still want to clean the other files that you don't restart from, or these files somehow interfere with the running of ORCA. Else you could just also restart from the remote_folder output (I think this is general?). Note that the RemoteStashFolderData will simply become an output node with the remote_stash link:

❯ verdi process show <PK>
[...]
Outputs          PK  Type
-------------  ----  ---------------------
remote_folder  2516  RemoteData
remote_stash   2517  RemoteStashFolderData
retrieved      2518  FolderData

Regarding re-using the RemoteStashFolderData, I've been working on using a stashed folder to restart epw.x calculations from one, see aiidateam/aiida-quantumespresso#918

In this case I'm adding it as a valid_type for my parent_folder_epw input:

        spec.input('parent_folder_epw', required=False, valid_type=(orm.RemoteData, orm.RemoteStashFolderData),
                   help='folder that contains all files required to restart an `EpwCalculation`')

Link to code

(You could also add the SinglefileData as a valid type, I suppose 😄)

There is a slight difference in syntax to get the remote path (i.e. target_basepath) compared to the RemoteData type though:

            if isinstance(parent_folder_epw, orm.RemoteStashFolderData):
                epw_path = Path(parent_folder_epw.target_basepath)
            else:
                epw_path = Path(parent_folder_epw.get_remote_path())

Link to code

You can then add its path + contents to the CalcInfo:

        calcinfo = datastructures.CalcInfo()
        calcinfo.codes_info = [codeinfo]
        calcinfo.local_copy_list = local_copy_list
        calcinfo.remote_copy_list = remote_copy_list
        calcinfo.remote_symlink_list = remote_symlink_list

❗️Warning: be careful when doing multiple restarts from the same Remote(StashFolder)Data with remote_symlink_link. If it can change the file this can influence results or make calculations fail.

@danielhollas danielhollas changed the title WIP: Do not store gbw file by default New option to store gbw file instead of fetching it by default May 16, 2023
@danielhollas
Copy link
Collaborator Author

danielhollas commented May 16, 2023

Hi @mbercx. Thank you for a really detailed writeup ❤️ , that clears things up quite a bit.
I think for my use case just using RemoteData folder for restart seems to be the more natural thing to do. So instead of the approach here, we would have a new input like parent_calc, which would be in line with both aiida-quantumespresso and aiida-gaussian. I will open a new PR when I get to it.

https://github.com/nanotech-empa/aiida-gaussian/blob/e953c3497c06c1857e0751a8b5b6897eb167de95/aiida_gaussian/calculations/gaussian.py#L201

One thing that needs care in the case of ORCA is the fact that we cannot use the exact same filename for the restart file, otherwise the ORCA calculation crashes (I guess ORCA overwrites the wavefunction file at the beginning before it loads the WF guess from it). So we need to rename it or put it in a subfolder. That needs to be done anyway since as you point out, we probably want to use symlinks so that we do not copy these GBW files around, and we need to make sure that ORCA does not modify them.

I am keeping this open for now.

@yakutovicha
Copy link

I had a look at the discussion, and I agree with @mbercx that stashing is probably the right thing here.

I avoided supporting pulling the wavefunctions in all plugins I maintain. It escalates memory use.

Somewhat related comment.
In the nanoribbon app on AiiDAlab, we pull the cube files. The result: one user did about ~100 such simulations, and those data consume half of the entire memory of the AiiDAlab server. So now we need to redesign the app 🤷 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked enhancement New feature or request question Further information is requested
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not always fetch gbw files to local folder
4 participants