docs: remove note on disk space for caching #5534
The new repository implementation includes automatic de-duplication of identical files. Re-running a cached calculation should therefore not store additional copies of the results in the repository, and disk space usage should not increase beyond what is needed to store metadata for the new calculation nodes, data nodes, and links in the database.
I wouldn't remove this; I would just adjust the text. Even though content in the file repository is now deduplicated, this is just for the
Thanks for the comment, @sphuber. Well, the gist of the sentence is certainly no longer correct and needs to change. In my view, it would also be fine to delete it, since the duplication of metadata at the DB level is unlikely to substantially impact disk usage, and there is value in users not spending time worrying about things that are unlikely to affect them. Given that this is under "topics", however, where people go to learn how things work, I'm also fine with providing a more detailed explanation here.
Thanks @ltalirz. Since it is possible to provide different storage implementations, I would keep it general, with an explicit note that the default implementation deduplicates files automatically.
#. While caching saves unnecessary computations, it does not directly prevent duplication of data: the cached calculation and its output nodes are duplicated.
In practice, however, AiiDA's file repository implementation will detect that any files associated with these nodes are already present and simply point to those, reducing duplication to metadata stored at the database level.
#. While caching saves unnecessary computations, it does not necessarily prevent duplication of data: the cached calculation and its output nodes are duplicated in the storage.
Whether the duplicated nodes actually increase the _size_ of the storage depends on the storage implementation, which may implement automatic deduplication mechanisms to save space.
This is the case for the default storage implementation `psql_dos`, which automatically detects files that already exist and does not store them again.
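To make the distinction in the suggestion above concrete, here is a minimal sketch of content-based deduplication. It is not AiiDA's code or API: `ToyStorage`, `store_node`, and the record fields are hypothetical names, and hashing with SHA-256 is an assumption chosen for illustration. It only shows the general idea that every node gets its own metadata record, while identical file content is stored once.

```python
import hashlib


class ToyStorage:
    """Hypothetical stand-in for a storage backend; not AiiDA's API."""

    def __init__(self):
        self.objects = {}  # content hash -> bytes (plays the role of the file repository)
        self.nodes = []    # one metadata record per node (plays the role of the database)

    def store_node(self, label, content):
        # Key the file by a hash of its content, so identical files share one entry.
        key = hashlib.sha256(content).hexdigest()
        self.objects.setdefault(key, content)
        # A new metadata record is always created, even if the file already exists.
        self.nodes.append({"label": label, "object_key": key})
        return key


storage = ToyStorage()
original_key = storage.store_node("original output", b"total energy: -1.0\n")
cached_key = storage.store_node("output cloned from cache", b"total energy: -1.0\n")

print(len(storage.nodes))    # 2: the node metadata is duplicated
print(len(storage.objects))  # 1: the identical file is stored only once
```

A storage backend without such a mechanism would hold two copies of the file here, which is why the suggested wording keeps the general statement and mentions the default only as one case.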
Thanks, I wanted to phrase it slightly differently, but I've tried to incorporate your points.
Co-authored-by: Sebastiaan Huber <[email protected]>