Allow lazy loading of DataContainers #367

Merged
merged 19 commits into from
Aug 23, 2021

Conversation

pmrv
Contributor

@pmrv pmrv commented Jul 7, 2021

With this change, data containers (and potentially other data types) can be loaded "on demand" from HDF5 instead of all at once. In my simple example below, this cuts the load time by roughly 90%.

The way it works is that from_hdf does not load the values of the data container right away. Instead, only the hdf object and the path are saved in a "stub" object, which is then loaded transparently on item access.
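The stub idea can be sketched without HDF5 at all. The following is only an illustration under stated assumptions, not the code from this PR: `Stub`, `LazyDict`, `from_storage`, and `CountingStorage` are hypothetical names, and a plain dict subclass stands in for the HDF file so that reads can be counted.

```python
class CountingStorage(dict):
    """Stand-in for an HDF file (assumption): counts how often a value is read."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return super().__getitem__(key)


class Stub:
    """Remembers where a value lives and loads it only when asked."""

    def __init__(self, storage, path):
        self._storage = storage  # stand-in for the hdf object
        self._path = path        # location of the value inside the storage

    def load(self):
        return self._storage[self._path]


class LazyDict:
    """Container whose values stay stubs until they are first accessed."""

    def __init__(self):
        self._store = {}

    def __setitem__(self, key, value):
        self._store[key] = value

    def __getitem__(self, key):
        value = self._store[key]
        if isinstance(value, Stub):
            # First access: load the real value and cache it in place.
            value = value.load()
            self._store[key] = value
        return value

    @classmethod
    def from_storage(cls, storage):
        # Only the paths are recorded here; no values are read yet.
        container = cls()
        for path in storage:
            container[path] = Stub(storage, path)
        return container
```

In this sketch, building the container touches no values, the first `lazy["a"]` triggers exactly one read, and later accesses hit the cached value.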

Tests and a design spec will follow once we have discussed this.

This should help with #364 and inspect mode in general. My vision is that expensive-to-load objects should use data containers for all storage and can then be load()ed with lazy=True instead of inspect()ed.

Here's an example that can be run with ipython -i:

# coding: utf-8
from pyiron_base import Project
import pyiron_base.generic.datacontainer as D
import numpy as np

d = D.DataContainer()
pr = Project('asdf')
h = pr.create_hdf(pr.path, 'test')

K = 'abcdefghijklmnop'
for c in K:
    dd = d.get(c, create=True)
    for k in K:  # separate name so the outer loop variable is not shadowed
        dd[k] = np.random.rand(1024)

d.to_hdf(h, 'lazy')
# ~2s
get_ipython().run_line_magic('time', "d_normal = h['lazy'].to_object()")
# ~150ms
get_ipython().run_line_magic('time', "d_lazy = h['lazy'].to_object(lazy=True)")

assert (d_normal.a.a == d_lazy.a.a).all()

@pmrv pmrv added the enhancement New feature or request label Jul 7, 2021
@jan-janssen
Member

I still have a bit of trouble understanding the full implications of this change. For example, for a structure object stored in the HDF5 file, are the positions loaded at job.structure, or only when I access job.structure.positions? So I guess we should discuss this in more detail tomorrow.

@pmrv
Contributor Author

pmrv commented Jul 8, 2021

Currently they would be loaded when job.structure is accessed, assuming we had the following setup:

class Job:
  ...
  def from_hdf(self, hdf, group_name):
    self._structure = HDFStub(hdf, group_name + '/structure')
  ...
  @property
  def structure(self):
    if isinstance(self._structure, HDFStub):
      self._structure = self._structure.load() # all HDF access for this structure occurs here
    return self._structure

In the current setup this change only affects how DataContainer is read from HDF, but HDFStub could easily be used to delay loading of other types as well.
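To make the timing concrete, here is a self-contained variant of the setup above that runs without pyiron. `FakeStub` and the `loader` callable are hypothetical stand-ins for `HDFStub` and the HDF access, and `from_hdf` takes the loader directly instead of an hdf object and group name. It is a sketch of the pattern, not the actual job code.

```python
class FakeStub:
    """Hypothetical stand-in for HDFStub: wraps a callable that does the loading."""

    def __init__(self, loader):
        self._loader = loader

    def load(self):
        return self._loader()


class Job:
    def __init__(self):
        self._structure = None

    def from_hdf(self, loader):
        # Nothing is read here; only the stub is stored.
        self._structure = FakeStub(loader)

    @property
    def structure(self):
        if isinstance(self._structure, FakeStub):
            # The whole structure, positions included, is loaded on first access.
            self._structure = self._structure.load()
        return self._structure
```

With this setup the loader runs once, on the first `job.structure` access, and not again when `job.structure["positions"]` is read afterwards.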

However, HDFStub already allows customizing this with the register method. If the Atoms class supported reading itself lazily from HDF (which would have to be implemented separately), then you could do something like

HDFStub.register('Atoms', lambda hdf, group: Atoms.lazy_from_hdf(hdf, group))

So I guess we should discuss this in more detail tomorrow.

Yeah, will prepare a small demo.

@pmrv
Contributor Author

pmrv commented Jul 9, 2021

The only thing left from my side is to decide whether lazy loading should be enabled by default.

@stale

stale bot commented Jul 23, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 23, 2021
@pmrv pmrv removed the stale label Jul 24, 2021
@stale

stale bot commented Aug 7, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 7, 2021
Instead of subclassing for each new lazily loadable type, types may
register themselves with the HDF5Stub class via a simple callback.  The
class then checks the 'NAME' field in the HDF group at loading time
against the values provided in the register calls and uses the matching
callback.

This allows for the same amount of customization, but has the advantage
that you can wrap every HDF5 group in HDF5Stub without checking which
subclass is necessary.
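A rough sketch of such a registry is shown below. Everything in it is an assumption made for illustration: the class name `StubRegistry`, the dict standing in for an HDF group, and the fallback of copying the group when no callback is registered. Only the idea of dispatching on the 'NAME' field through `register` comes from the description above.

```python
class StubRegistry:
    """Hypothetical stub that dispatches on the 'NAME' field of a group."""

    _loaders = {}  # maps a type name to its loading callback

    @classmethod
    def register(cls, type_name, load):
        cls._loaders[type_name] = load

    def __init__(self, group):
        self._group = group  # a plain dict stands in for the HDF group

    def load(self):
        name = self._group["NAME"]
        # Fall back to copying the group if no callback was registered
        # (this default is an assumption of the sketch).
        load = self._loaders.get(name, lambda group: dict(group))
        return load(self._group)


# A type opts in to custom loading with a single call.
StubRegistry.register("Atoms", lambda group: ("lazy-atoms", group["positions"]))
```

Because the dispatch happens inside `load`, any group can be wrapped in the same stub class without the caller knowing which type is stored in it.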
@pmrv
Contributor Author

pmrv commented Aug 11, 2021

I rebased to fix the merge conflicts and will merge with lazy loading enabled by default, if no one objects.

Member

@liamhuber liamhuber left a comment

Codacy actually said something meaningful! Otherwise lgtm.

self._group_name = group_name

@classmethod
def register(cls, type, load):
Member

Finally, codacy complains that this overrides a built-in type... do you know what it's talking about? In PyCharm I don't get any complaint, and I was expecting one like the warning you get when you try to use "id" or "dict" as a variable name.

Contributor Author

The argument name type shadows a builtin. type is the metaclass of all classes and can also be called to create new classes (iirc). Since I'm not using the builtin inside this short method, I think it's ok, but I can also come up with a different name.
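For reference, a small example of what the warning is about. The function names `describe` and `describe_renamed` are made up for this illustration and are not part of the PR.

```python
import builtins


def describe(type, value):
    # Inside this function `type` is the string argument, so the builtin
    # is only reachable through the builtins module.
    return f"{type}: {builtins.type(value).__name__}"


def describe_renamed(type_name, value):
    # With a different argument name the builtin can be used directly.
    return f"{type_name}: {type(value).__name__}"
```

The shadowing is local to the function body; outside of `describe`, `type` still refers to the builtin.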

@pmrv pmrv merged commit d758bca into master Aug 23, 2021
@delete-merged-branch delete-merged-branch bot deleted the hdfstub branch August 23, 2021 09:37
Labels
enhancement New feature or request

3 participants