HEP003: Hickle Compact Expand protocol #145

Closed
hernot opened this issue Nov 26, 2020 · 1 comment

Comments

@hernot
Contributor

hernot commented Nov 26, 2020

Abstract:

The proposed extension makes it possible to represent hierarchies of custom objects by a reasonable structure within a hickle file without the need to write and register a full set of loaders. Hickle files can then be used to pass data created in Python to programs written in other programming languages and to other systems without requiring them to embed a Python interpreter.

Motivation:

Hickle provides reasonable loader modules for the standard library and for common libraries such as numpy and scipy. In the default setup, custom objects are serialized into a byte string by calling pickle.dumps. This byte string can only be decoded by Python using pickle.loads, as is done when a hickle file is loaded through hickle.load. As long as files are only used for storage and to share data among Python-based programs and systems which embed a Python interpreter instance, this poses no problem. When the data also has to be read by programs written in other languages which do not embed a Python interpreter, and embedding one is not an option, then the pickle strings remain an inaccessible block of bytes.

With the proposed "compact expand protocol" this kind of data can be made accessible on demand by mapping it to data structures for which hickle provides a dedicated loader. At the same time it allows file sizes to be kept reasonably low.

Specification:

A Python object which supports the proposed protocol has to define the special methods __compact__ and __expand__.

class SupportsHickleCompactExpand:

    # ... class definition

    def __compact__(self):
        """
        Return a compacted representation of the object to be stored within
        the hickle file. The returned structure can be an object of any type
        for which hickle provides a dedicated loader, either for the type
        itself or for one of its base classes.
        If this method returns None, the object is passed on to pickle.dumps,
        regardless of whether a dedicated loader exists for any of its base
        classes.
        """
        return dict()  # just an example; can be a list, numpy array, etc.

    def __expand__(self, compact):
        """
        Restore the content and structure of the object represented by compact.
        """

All objects like SupportsHickleCompactExpand are handled by the special b'compact' loader. On storing, this loader calls the __compact__ method of the object. If this method returns None, the object is converted into a pickle byte string.
In any other case the loader calls the load_loader function to obtain the appropriate loader for storing the return value of the __compact__ method. After the corresponding h5py.Dataset or h5py.Group has been created by this loader, a reference to the corresponding py_obj_type and base_type is stored in the additional compact_type attribute.

On loading, the compact loader first reads the compact_type attribute to find the loader to be used for properly restoring the compacted representation of the object. The restored representation is then passed to the __expand__ method of a new instance of the compacted object.

If the object was pickled because its __compact__ method returned None, no compact_type attribute is created. Consequently, on load any compacted object representation lacking a compact_type attribute is assumed to be a block of bytes to be passed on to pickle.loads.
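
The store and load flow described above can be sketched without hickle or h5py. This is only an illustration under stated assumptions: `Point`, `compact_store` and `compact_load` are hypothetical names, a plain dict with `data` and `attrs` keys stands in for an h5py.Dataset or h5py.Group, and the content of `compact_type` is simplified to a tuple.

```python
import pickle


class Point:
    """Hypothetical custom class supporting the proposed protocol."""

    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y

    def __compact__(self):
        # Map the object to a plain dict, a type with a dedicated loader.
        return {"x": self.x, "y": self.y}

    def __expand__(self, compact):
        self.x = compact["x"]
        self.y = compact["y"]


def compact_store(obj):
    """Stand-in for the storing half of the b'compact' loader."""
    compact = obj.__compact__()
    if compact is None:
        # Object opted out: store a pickle string, no compact_type attribute.
        return {"data": pickle.dumps(obj), "attrs": {}}
    # Remember which class to re-create and how its data was represented.
    attrs = {"compact_type": (type(obj), type(compact).__name__)}
    return {"data": compact, "attrs": attrs}


def compact_load(node):
    """Stand-in for the loading half of the b'compact' loader."""
    compact_type = node["attrs"].get("compact_type")
    if compact_type is None:
        # No compact_type attribute: the data is a pickle string.
        return pickle.loads(node["data"])
    py_obj_type, _base_type = compact_type
    obj = py_obj_type.__new__(py_obj_type)  # new instance, __init__ not called
    obj.__expand__(node["data"])
    return obj


restored = compact_load(compact_store(Point(1.5, -2.0)))
print(restored.x, restored.y)
```

Creating the new instance through `__new__` is one possible reading of "a new instance of the compacted object"; the proposal does not say whether `__init__` would be called.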

In order to keep changes to the core hickle mechanics small, the b'compact' loader is by default disabled for storing data and enabled for loading data. The loader has to be enabled for each object individually by calling the enable_compact_expand method, passing it any number of objects which shall be handled by the b'compact' loader when they are saved to a hickle file. After passing any number of objects to disable_compact_expand, the default handler for these objects is restored. Loading of compacted objects, on the other hand, stays enabled in any case.
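
A minimal sketch of such an opt-in switch, assuming that the "objects" passed to enable_compact_expand and disable_compact_expand are classes and that the state is kept in a simple module-level set. The helper `uses_compact_loader` and the class `Mesh` are hypothetical and only serve the illustration.

```python
# Classes currently routed to the b'compact' loader when dumping.
_compact_enabled = set()


def enable_compact_expand(*classes):
    """Route the given classes through the compact loader on dump."""
    for cls in classes:
        if not (hasattr(cls, "__compact__") and hasattr(cls, "__expand__")):
            raise TypeError(f"{cls!r} does not define __compact__ and __expand__")
        _compact_enabled.add(cls)


def disable_compact_expand(*classes):
    """Restore the default handler for the given classes."""
    for cls in classes:
        _compact_enabled.discard(cls)


def uses_compact_loader(obj):
    """Hypothetical check a dump routine could perform for each object."""
    return type(obj) in _compact_enabled


class Mesh:
    """Hypothetical class implementing both protocol methods."""

    def __compact__(self):
        return []

    def __expand__(self, compact):
        pass


enable_compact_expand(Mesh)
print(uses_compact_loader(Mesh()))
disable_compact_expand(Mesh)
print(uses_compact_loader(Mesh()))
```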

Rationale

The HDF5 file format is designed for storing large continuous blocks of data efficiently, organizing them in a reasonable structure and representing their dependencies and relations. It is not designed to efficiently store the complex object structures found in memory.
The Python copy protocol with its special methods __getstate__, __setstate__, __reduce__ and __reduce_ex__, together with the pickle.dump and pickle.load functions, is specifically designed for serializing these structures optimally with the least effort. Naively storing the tuple returned by __reduce__ or __reduce_ex__ is the most obvious approach and was proposed in issue #125. As tests revealed, doing so on complex structures adds several gigabytes to the size of the resulting HDF5 file, just for storing all the metadata required to properly represent the structure and every tiny detail of data which in memory used just a few tens of megabytes.
As a first step, the __getstate__ and __setstate__ methods could be rewritten to properly compact and restore the structure of the object to be stored in the HDF5 file. These two methods are part of the copy protocol, which was designed because Python objects possess neither an implicit nor an explicit copy constructor; they are created by calling the object's __init__ method. To copy an object, copy.copy or copy.deepcopy must be called, and both rely upon __getstate__ and __setstate__. Using these methods for compaction would cause the object to be compacted and expanded whenever it is duplicated, which amounts to a highly inefficient deep copy even where, for example, a shallow copy would be sufficient. Furthermore, the pickle string would also contain the compacted version of the object, which is not necessary because pickle is capable of efficiently representing the object structure anyway.
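
The side effect on copying can be observed with the standard library alone. In the following sketch the class `Compacting` is a hypothetical example that reuses __getstate__ and __setstate__ for compaction; even a shallow copy.copy then runs both methods.

```python
import copy


class Compacting:
    """Hypothetical class that (ab)uses the copy protocol for compaction."""

    calls = []  # records which protocol methods were invoked

    def __init__(self, values):
        self.values = list(values)

    def __getstate__(self):
        Compacting.calls.append("compact")
        return {"values": tuple(self.values)}

    def __setstate__(self, state):
        Compacting.calls.append("expand")
        self.values = list(state["values"])


original = Compacting([1, 2, 3])
duplicate = copy.copy(original)

print(Compacting.calls)                      # ['compact', 'expand']
print(duplicate.values is original.values)   # False
```

A default shallow copy would have shared the `values` list between both instances; here the data is rebuilt on every copy, which is the overhead the paragraph above refers to.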

For objects provided by publicly available libraries such as numpy, a dedicated loader module residing in the hickle/loaders directory is the best option. Such a module can define loaders for multiple objects provided by the library and can make use of all other loader functions and containers available for transforming the object into an appropriate group and dataset structure. For custom objects and libraries which are specific to a single project or to local development, writing a loader module is not a realistic or feasible option, especially during development and testing, when the objects and their structure still change heavily.

In this case the proposed approach provides more flexibility and allows the developer to concentrate on the design and structure of the objects first. Once that is done, the __compact__ and __expand__ methods can be updated to handle the structure properly. This can happen without interfering with the copy protocol at all. Furthermore, objects derived from objects for which a dedicated loader exists can be forced to be pickled; this only requires that the __compact__ method returns None. If these objects shall again be handled by the loader for one of their base classes, it is sufficient to disable the compact expand protocol for that specific object, which would not be possible when (ab)using the copy protocol, which serves a different purpose.

Alternatives

From version 4.x on, the rules for storing individual Python objects are defined by the dump_fcn and load_fcn provided by a dedicated loader module or through an explicit call to the hickle.lookup.register_class function. The hickle_5_RC #149 extends this by adding a container_class for restoring a py_object represented by an h5py.Group.

The register_class function is part of the hickle core machinery and thus may be affected by changes to the inner workings of hickle and/or to the structure used to store the object data in the HDF5 file. Examples of such disruptive changes are the introduction of PyContainer classes used to restore a py_object represented by an h5py.Group in hickle_5_RC #149 and the full separation of the hickle core machinery and loaders completed in hickle_5_RC #149.

Consequently, the dump_fcn and load_fcn of all loaders, including loaders explicitly registered by calling hickle.lookup.register_class directly, have to be adapted in order to properly support newer hickle versions, and multiple load_fcn functions have to be provided as legacy loaders for older hickle versions.

hickle.lookup.register_class has to be called explicitly to add the dump_fcn, load_fcn and container_class (hickle_5_RC #149) for individual classes before the first call to hickle.dump and/or hickle.load used to dump or restore the custom class. If either hickle.lookup.register_class is not called or the custom class is missing, hickle will store the corresponding pickle string or abort restoring by raising an exception. Attempting to at least recover the data by storing it in numpy and Python primitives may not be possible in all cases, unless the dump_fcn consistently maps to Python, numpy and other primitives for which appropriate loaders are included in the hickle package, as defined for the compact-expand protocol. In this case the !missing! attribute could be attached to the h5py.Group or h5py.Dataset by the dump_fcn function to indicate the py_obj_type and base_type to be used by the hickle.load function if no appropriate load_fcn and/or container_class is available.

Defining a loader module which allows hickle to automatically load the dump_fcn, load_fcn and container_class is currently not possible in hickle 4.x but could be introduced by hickle_5_RC. For this, the hickle.lookup.load_loader function needs to check whether the package or module path in which the custom class is defined contains a hickle_loaders directory with a load_<package>.py or load_<module>.py file defining the loader for the custom classes.

Regardless of whether they are registered explicitly or implicitly through a loader module stored in the hickle.loaders directory or in a hickle_loaders directory in the package or module directory, the dump_fcn and load_fcn functions and the container_class can be kept separate from the custom class definition and only need to be loaded when the custom class shall be dumped or loaded using hickle. The __compact__ and __expand__ methods, in contrast, have to be defined along with all other class methods.

Feature loaders introduced by hickle_5_RC allow controlling whether the loader is used to dump the custom class, using the dump_fcn, load_fcn and container_class registered through a loader module or an explicit call to hickle.lookup.register_class, or whether the class shall be stored as a pickle string instead. On loading, hickle will automatically apply the optional loader if it was applied during the hickle.dump call, as is done for the compact-expand loader option.

Open Issues:

Can something similar be achieved by directly importing and calling register_class (question #150)?
How should the case be handled in which the class is not registered or not available before the call to hickle.load?

Conclusion

The proposed functionality can be achieved by a single call to hickle.lookup.register_class. The proposed mimicry of the Python copy protocol would introduce unnecessary and undesired interdependencies between hickle and the packages providing class and type declarations for objects to be dumped using hickle. A clean separation between the production code of the package, single module or application and custom hickle loaders can be achieved by having the hickle.lookup.load_loader function try to load the load_<package> loader modules from a dedicated hickle_loaders directory found at the following locations, in addition to the hickle/loaders directory:

  • The package root path at which a specific multi-module Python package is installed or loaded from.
    For example: <python_installation_path>/dist_packages/<package>/hickle_loaders/load_<package>.py
  • The module installation directory for single-module Python modules.
    For example: <python_installation_path>/dist_packages/hickle_loaders/load_<module>.py or
    <program_path>/hickle_loaders/load_<submodule>.py
  • The directory in which the currently executed __main__ script is stored.
    For example: <program_path>/hickle_loaders/load_<mainscript>.py
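
The first two lookup locations can be derived from the import machinery of the standard library. The following is a rough sketch, not the hickle implementation: the function `candidate_loader_paths` is a hypothetical name, and it only computes where a loader module would be expected, without checking that the file exists or importing it.

```python
import importlib.util
from pathlib import Path


def candidate_loader_paths(module_name):
    """Return the paths at which load_<name>.py would be looked up."""
    top_level = module_name.split(".")[0]
    spec = importlib.util.find_spec(top_level)
    if spec is None:
        return []  # module or package is not installed
    filename = f"load_{top_level}.py"
    if spec.submodule_search_locations:
        # Package: <package>/hickle_loaders/load_<package>.py
        return [
            Path(location) / "hickle_loaders" / filename
            for location in spec.submodule_search_locations
        ]
    if not spec.has_location or spec.origin is None:
        return []  # built-in or frozen module without a source directory
    # Single module: <module directory>/hickle_loaders/load_<module>.py
    return [Path(spec.origin).parent / "hickle_loaders" / filename]


for path in candidate_loader_paths("json.decoder"):
    print(path)
```

The location of the __main__ script is left out here because a script path is not always available, for example in an interactive session.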

The temporary disabling of a dump_fcn that is broken during ongoing modification of the handled classes and types can be achieved by unconditionally raising the hickle.NotHickleable exception at its first line. The hickle.hickle._dump function automatically calls the hickle.lookup.create_pickled_dataset function when encountering this kind of exception and ensures that on load the object is restored through pickle.load.
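
The fallback mechanism can be illustrated without importing hickle. In this sketch `NotHickleable` is a locally defined stand-in for hickle.NotHickleable, `dump_with_fallback` is a hypothetical simplification of what hickle.hickle._dump and hickle.lookup.create_pickled_dataset do together, and fractions.Fraction merely plays the role of a class whose loader is under rework.

```python
import pickle
from fractions import Fraction


class NotHickleable(Exception):
    """Local stand-in for the hickle.NotHickleable exception."""


def dump_fraction(py_obj):
    """Hypothetical dump_fcn that is temporarily disabled."""
    raise NotHickleable("loader disabled while the class is being reworked")
    # Unreachable until the raise above is removed again:
    return {"numerator": py_obj.numerator, "denominator": py_obj.denominator}


def dump_with_fallback(py_obj, dump_fcn):
    """Try the dedicated dump_fcn and fall back to a pickle string."""
    try:
        return {"kind": "structured", "data": dump_fcn(py_obj)}
    except NotHickleable:
        return {"kind": "pickle", "data": pickle.dumps(py_obj)}


def load_with_fallback(node):
    """Restore pickled nodes through pickle, pass other nodes through."""
    if node["kind"] == "pickle":
        return pickle.loads(node["data"])
    return node["data"]


node = dump_with_fallback(Fraction(3, 4), dump_fraction)
print(node["kind"], load_with_fallback(node))
```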

Finally, the hickle.lookup.register_class based custom loader approach additionally allows implementing a generic drop-in data recovery loader for missing custom load_fcn functions and PyContainer classes.
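
The idea of such a recovery loader can be sketched as follows. Everything here is an assumption made for illustration: the node layout (a dict with `attrs` and `items`), the `type` attribute key, the function `load_or_recover` and the exact shape of `RecoveredGroup` are not taken from hickle.

```python
class RecoveredGroup(dict):
    """dict that keeps the attributes of the group it was recovered from."""

    def __init__(self, *args, attrs=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.attrs = dict(attrs or {})


def load_or_recover(node, loaders):
    """Use the registered load_fcn if there is one, else recover the raw data.

    node    -- {'attrs': {...}, 'items': {...}}, a stand-in for an h5py.Group
    loaders -- mapping of stored type name to load_fcn
    """
    load_fcn = loaders.get(node["attrs"].get("type"))
    if load_fcn is None:
        # Custom loader or class is missing: hand back the plain data.
        return RecoveredGroup(node["items"], attrs=node["attrs"])
    return load_fcn(node["items"])


node = {"attrs": {"type": "MissingClass"}, "items": {"a": 1, "b": 2}}
recovered = load_or_recover(node, loaders={})
print(type(recovered).__name__, dict(recovered), recovered.attrs)
```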

Decision

Consequently this proposal is withdrawn and closed in favour of the more integral and generic hickle.lookup.register_class based approach.

References:

[1] issue #125
[2] Python copy https://docs.python.org/3.6/library/copy.html
[3] Python pickle https://docs.python.org/3.6/library/pickle.html#pickling-class-instances
[4] https://docs.h5py.org/en/stable/index.html
[5] issue #150

Precondition

H4EP001 #138 implemented and merged into dev (met by hickle_5_RC branch, pull request #149)

optional:

H4EP002 #139 implemented and merged into dev (met by hickle_5_RC branch, pull request #149)

History:

26/11/2020 first proposal
14/02/2021 Updated open issues
16/02/2021 Added alternative hickle.lookup.register_class
02/03/2021 Conclusion and decision to withdraw

hernot added a commit to hernot/hickle that referenced this issue Jan 18, 2021
In the current release custom objects are simply converted into a binary
pickle string. This commit implements HEP003 issue telegraphic#145. Objects
which support the compact expand protocol have to be registered with the
'compact_expand' filter using the 'hickle.register_comact_expand' method.
The filter itself is activated by setting the 'compact_expand' key in the
options dict passed to hickle.dump to True. It is also activated when the
'OPTIONS_COMPACT_EXPAND' attribute is encountered in the 'h_root_group'
attributes.

The LoaderManager based approach allows adding further optional loader
sets. For example, when loading a hickle 4.0.X file, the corresponding
loader set is added implicitly to ensure that 'DictItem' and other
helper types specific to hickle 4.0.X are properly recognized and the
corresponding data is properly restored. Except for those provided by
hickle core ('compact_expand', 'hickle-4.0'), only optional loaders
listed by 'optional_loaders', exported by hickle.loaders.__init__.py,
are considered valid.

A class_register table entry can be assigned to a specific optional
loader by specifying the loader name as its 7th item. Any other entry
which has fewer than 7 items or whose 7th item reads None is included in
the set of global loaders.
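
The rule for the 7th item can be written down in a few lines. This is only a sketch of the rule as stated in the commit message: `split_class_register` is a hypothetical helper, and the first six items of each entry are opaque placeholders because their meaning is not given here.

```python
def split_class_register(class_register):
    """Split entries into global loaders and per-option loader sets."""
    global_loaders = []
    optional_loaders = {}
    for entry in class_register:
        # The 7th item (index 6), if present, names the optional loader.
        option = entry[6] if len(entry) >= 7 else None
        if option is None:
            global_loaders.append(entry)
        else:
            optional_loaders.setdefault(option, []).append(entry)
    return global_loaders, optional_loaders


# Placeholder entries; only their length and 7th item matter here.
table = [
    ("a", 1, 2, 3, 4, 5),
    ("b", 1, 2, 3, 4, 5, None),
    ("c", 1, 2, 3, 4, 5, "hickle-4.0"),
]
global_set, optional_sets = split_class_register(table)
print(len(global_set), sorted(optional_sets))
```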
hernot added a commit to hernot/hickle that referenced this issue Feb 17, 2021
hernot added a commit to hernot/hickle that referenced this issue Feb 19, 2021
@hernot hernot mentioned this issue Feb 24, 2021
hernot added a commit to hernot/hickle that referenced this issue Mar 1, 2021
…ives

In the current release custom objects are simply converted into a binary
pickle string. This commit implements the register_class based alternative
of HEP003 issue telegraphic#145, featuring custom loader modules.

In hickle 4.x custom loader functions can be added to hickle by
explicitly calling hickle.lookup.register_class before calling
hickle.dump and hickle.load. HEP003 proposes an alternative approach,
the hickle specific compact_expand protocol mimicking the Python copy
protocol. It was found that this mimicry does not provide any
benefit compared to the hickle.lookup.register_class based approach.
Even worse, hickle users would have to litter their class definitions
with hickle-only loader methods called __compact__ and __expand__ and
in addition would have to register their class for compact expand and
activate the compact_expand loader option to activate the use of these
two methods.

This commit implements support for hickle.lookup.register_class package
and program loader modules. In addition to the hickle.loaders directory,
load_<package>.py modules may also be stored along with the Python
base package, module or program main script. In order to keep the
directory structure clean, load_<package>.py modules must be stored
within the special hickle_loaders subdirectory at that level.

Example package loader:
-----------------------

dist_packages/
  +- ...
  +- <my_package>/
  |   +- __init__.py
  |   +- sub_module1.py
  |   +- ...
  |   +- sub_moduleN.py
  |   +- hickle_loaders/
  |   |   +- load_<my_package>.py
  +- ...

Example single module loader:
-----------------------------

dist_packages/
  +- ...
  +- <my_module>.py
  +- ...
  +- hickle_loaders/
  |   +- ...
  |   +- load_<my_module>.py
  |   +- ...
  +- ...

Example program main (package) loader:
--------------------------------------

bin/
  +- ...
  +- <my_single_file_program>.py
  +- hickle_loaders/
  |   +- ...
  |   +- load_<my_single_file_program>.py
  |   +- ...
  +- ...
  +- <my_program>/
  |   +- ...
  |   +- <main>.py
  |   +- ...
  |   +- hickle_loaders/
  |   |   +- ...
  |   |   +- load_<main>.py
  |   |   +- ...
  |   +- ...
  +- ...

LoaderManager:
==============

The LoaderManager based approach allows adding further optional loader
sets. For example, when loading a hickle 4.0.X file, the corresponding
loader set is added implicitly to ensure that 'DictItem' and other
helper types specific to hickle 4.0.X are properly recognized and the
corresponding data is properly restored. Except for legacy loaders
provided by hickle core (currently 'hickle-4.0'), only optional loaders
listed by 'optional_loaders', exported by hickle.loaders.__init__.py,
are considered valid.

A class_register table entry can be assigned to a specific optional
loader by specifying the loader name as its 7th item. Any other entry
which has fewer than 7 items or whose 7th item reads None is included in
the set of global loaders.

hernot added a commit to hernot/hickle that referenced this issue Mar 2, 2021
Fallback Loader recovering data:
--------------------------------
Implements the special AttemptsRecoverCustom types, which are used as a
replacement when Python objects are missing or are incompatible with the
stored data. The affected data is loaded as RecoveredGroup (dict type) or
RecoveredDataset (numpy.ndarray type) objects. Attached to either is the
attrs attribute as found on the corresponding h5py.Group or h5py.Dataset
in the hickle file.

@hernot hernot closed this as completed Mar 2, 2021

hernot commented Mar 2, 2021

Just some final numbers. My test data, about 400 MB to 500 MB in uncompressed pickle or hickle V3 format, shrinks to about 250 MB to 300 MB in the hickle_5_RC file format when using properly crafted custom dump_fcn functions and PyContainer classes with compression enabled. A whole hickle.dump(...) - hickle.load(...) round trip, dumping the data to a file and reloading it from the file again, takes approx. 3 minutes in Python 3.6 on a recent standard workplace computer.
