HEP003: Hickle Compact Expand protocol #145
In the current release custom objects are simply converted into a binary pickle string. This commit implements the HEP003 issue telegraphic#145. Objects which support the compact expand protocol have to be registered with the 'compact_expand' filter using the 'hickle.register_compact_expand' method. The filter itself is activated by setting the 'compact_expand' key in the options dict passed to hickle.dump to True. It is also activated when the 'OPTIONS_COMPACT_EXPAND' attribute is encountered among the 'h_root_group' attributes. The LoaderManager based approach allows adding further optional loader sets. For example, when loading a hickle 4.0.X file the corresponding loader set is implicitly added to ensure 'DictItem' and other helper types specific to hickle 4.0.X are properly recognized and the corresponding data is properly restored. Only optional loaders listed by the 'optional_loaders' exported by hickle.loaders.__init__.py, except those provided by hickle core ('compact_expand', 'hickle-4.0'), are considered valid. A class_register table entry can be assigned to a specific optional loader by specifying the loader name as its 7th item. Any other entry which has less than 7 items or whose 7th item reads None is included in the set of global loaders.
…ives In the current release custom objects are simply converted into a binary pickle string. This commit implements the HEP003 issue telegraphic#145 register_class based alternative featuring custom loader modules. In hickle 4.x custom loader functions can be added to hickle by explicitly calling hickle.lookup.register_class before calling hickle.dump and hickle.load. In HEP003 an alternative approach, the hickle specific compact_expand protocol mimicking the Python copy protocol, is proposed. It was found that this mimicry does not provide any benefit compared to the hickle.lookup.register_class based approach. Even worse, hickle users have to litter their class definitions with hickle-only loader methods called __compact__ and __expand__, and in addition have to register their class for compact expand and activate the compact_expand loader option to activate the use of these two methods. This commit implements hickle.lookup.register_class package and program loader module support. In addition to the hickle.loaders directory, load_<package>.py modules may also be stored along with the Python base package, module or program main script. In order to keep the directory structure clean, load_<package>.py modules must be stored within the special hickle_loaders subdirectory at that level.

Example package loader:
-----------------------
dist_packages/
+- ...
+- <my_package>/
|  +- __init__.py
|  +- sub_module1.py
|  +- ...
|  +- sub_moduleN.py
|  +- hickle_loaders/
|     +- load_<my_package>.py
+- ...

Example single module loader:
dist_packages/
+- ...
+- <my_module>.py
+- ...
+- hickle_loaders/
|  +- ...
|  +- load_<my_module>.py
|  +- ...
+- ...

Example program main (package) loader:
bin/
+- ...
+- <my_single_file_program>.py
+- hickle_loaders/
|  +- ...
|  +- load_<my_single_file_program>.py
|  +- ...
+- ...
+- <my_program>/
|  +- ...
|  +- <main>.py
|  +- ...
|  +- hickle_loaders/
|  |  +- ...
|  |  +- load_<main>.py
|  |  +- ...
|  +- ...
+- ...

LoaderManager:
==============
The LoaderManager based approach allows adding further optional loader sets. For example, when loading a hickle 4.0.X file the corresponding loader set is implicitly added to ensure 'DictItem' and other helper types specific to hickle 4.0.X are properly recognized and the corresponding data is properly restored. Only optional loaders listed by the 'optional_loaders' exported by hickle.loaders.__init__.py, except legacy loaders provided by hickle core (currently 'hickle-4.0'), are considered valid. A class_register table entry can be assigned to a specific optional loader by specifying the loader name as its 7th item. Any other entry which has less than 7 items or whose 7th item reads None is included in the set of global loaders. @hernot
…ives In the current release custom objects are simply converted into a binary pickle string. This commit implements the HEP003 issue telegraphic#145 register_class based alternative featuring custom loader modules. In hickle 4.x custom loader functions can be added to hickle by explicitly calling hickle.lookup.register_class before calling hickle.dump and hickle.load. In HEP003 an alternative approach, the hickle specific compact_expand protocol mimicking the Python copy protocol, is proposed. It was found that this mimicry does not provide any benefit compared to the hickle.lookup.register_class based approach. Even worse, hickle users have to litter their class definitions with hickle-only loader methods called __compact__ and __expand__, and in addition have to register their class for compact expand and activate the compact_expand loader option to activate the use of these two methods. This commit implements hickle.lookup.register_class package and program loader module support. In addition to the hickle.loaders directory, load_<package>.py modules may also be stored along with the Python base package, module or program main script. In order to keep the directory structure clean, load_<package>.py modules must be stored within the special hickle_loaders subdirectory at that level.

Example package loader:
-----------------------
dist_packages/
+- ...
+- <my_package>/
|  +- __init__.py
|  +- sub_module1.py
|  +- ...
|  +- sub_moduleN.py
|  +- hickle_loaders/
|     +- load_<my_package>.py
+- ...

Example single module loader:
dist_packages/
+- ...
+- <my_module>.py
+- ...
+- hickle_loaders/
|  +- ...
|  +- load_<my_module>.py
|  +- ...
+- ...

Example program main (package) loader:
bin/
+- ...
+- <my_single_file_program>.py
+- hickle_loaders/
|  +- ...
|  +- load_<my_single_file_program>.py
|  +- ...
+- ...
+- <my_program>/
|  +- ...
|  +- <main>.py
|  +- ...
|  +- hickle_loaders/
|  |  +- ...
|  |  +- load_<main>.py
|  |  +- ...
|  +- ...
+- ...

Fallback Loader recovering data:
--------------------------------
Implements special AttemptsRecoverCustom types used as replacements when Python objects are missing or are incompatible with the stored data. The affected data is loaded as RecoveredGroup (dict type) or RecoveredDataset (numpy.ndarray type) objects. Attached to either is the attrs attribute as found on the corresponding h5py.Group and h5py.Dataset in the hickle file.

LoaderManager:
==============
The LoaderManager based approach allows adding further optional loader sets. For example, when loading a hickle 4.0.X file the corresponding loader set is implicitly added to ensure 'DictItem' and other helper types specific to hickle 4.0.X are properly recognized and the corresponding data is properly restored. Only optional loaders listed by the 'optional_loaders' exported by hickle.loaders.__init__.py, except legacy loaders provided by hickle core (currently 'hickle-4.0'), are considered valid. A class_register table entry can be assigned to a specific optional loader by specifying the loader name as its 7th item. Any other entry which has less than 7 items or whose 7th item reads None is included in the set of global loaders. @hernot
Just some final numbers. My test data, worth about 400 MB to 500 MB in uncompressed pickle or hickle V3 format, results using properly crafted custom |
Abstract:
With the proposed extension it becomes possible to represent hierarchies of custom objects by a reasonable structure within a hickle file without the need to write and register a full set of loaders. This allows hickle files to be used to exchange data created in Python with programs written in other languages and running on other systems, without requiring them to embed a Python interpreter.
Motivation:
Hickle provides reasonable loader modules for the standard library and for common libraries like numpy, scipy etc. Custom objects, in the default setup, are serialized into a byte string by calling the `pickle.dumps` function. This byte string can only be decoded by Python, using the `pickle.loads` function, as done when loading a hickle file through `hickle.load`. As long as files are only used for storage and to share data among Python based programs and systems which embed a Python interpreter instance, this poses no problem. When the data shall also be read by programs written in other languages which do not embed a Python interpreter, and doing so is not an option, the pickle strings remain an inaccessible block of bytes.

With the proposed "compact expand protocol" this kind of data can be made accessible on demand, by mapping it to data structures for which hickle provides a dedicated loader. At the same time it allows file sizes to be kept reasonably low.
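To make the default behaviour concrete, the following minimal sketch (the `MyObject` class and the file name are purely illustrative) dumps a custom object for which no loader is registered; the object ends up inside the HDF5 file as an opaque pickle byte string that only Python can decode.

```python
import hickle
import h5py

class MyObject:
    """A custom class hickle has no dedicated loader for."""
    def __init__(self, values):
        self.values = values

# With no loader registered, hickle falls back to pickle.dumps for MyObject.
hickle.dump(MyObject([1, 2, 3]), 'example.hkl')

# Inspecting the file with h5py shows the stored nodes; the custom object
# is just a block of pickled bytes, inaccessible to non-Python readers.
with h5py.File('example.hkl', 'r') as f:
    f.visit(print)
```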
Specification:
A Python object which supports the proposed protocol has to define the special methods `__compact__` and `__expand__`. All objects like `SupportsHickleCompactExpand` are handled by the special `b'compact'` loader. On storing, this loader calls the `__compact__` method of the object. In case this method returns `None` the object is converted into a pickle byte string. In any other case the loader will call the `load_loader` function to obtain the appropriate loader for storing the return value of the `__compact__` method. After the corresponding `h5py.Dataset` or `h5py.Group` has been created by this loader, a reference to the corresponding `py_obj_type` and `base_type` is stored in the additional `compact_type` attribute.

On loading, the compact loader first reads the `compact_type` attribute to find the loader to be used for properly restoring the compacted representation of the object. This loader is then used to call the `__expand__` method of a new instance of the compacted object. In case the object was pickled as a result of the `__compact__` method returning `None`, no `compact_type` attribute will be created. Consequently, on load, any compacted object representation lacking a `compact_type` attribute will be assumed to be a block of bytes to be passed on to `pickle.loads`.
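As an illustration, a class supporting the proposed protocol could look like the sketch below. The class, its attributes and the exact `__expand__` signature are assumptions made for the sake of the example; only the existence of `__compact__` and `__expand__` is prescribed by this proposal.

```python
import numpy as np

class Trajectory:
    """Hypothetical custom object holding a name and an array of sample points."""

    def __init__(self, name="", points=()):
        self.name = name
        self.points = np.asarray(points)

    def __compact__(self):
        # Map the object onto structures hickle already has loaders for
        # (dict, str, numpy.ndarray). Returning None instead would cause
        # the object to be stored as a plain pickle byte string.
        return {'name': self.name, 'points': self.points}

    def __expand__(self, compacted):
        # Called on a new instance with the restored compact representation;
        # rebuild the full object state from it.
        self.name = compacted['name']
        self.points = np.asarray(compacted['points'])
```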
In order to keep changes to the core hickle mechanics small, the `b'compact'` loader is by default disabled for storing data and enabled for loading data. The loader has to be enabled for each object individually by calling the `enable_compact_expand` method, passing it any number of objects which shall be handled by the `b'compact'` loader when saving them to a hickle file. After passing any number of objects to `disable_compact_expand`, the default handler for the specific object is restored again. Loading of compacted objects, on the other hand, stays enabled in any case.
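A usage sketch of the proposed switches, assuming `enable_compact_expand` and `disable_compact_expand` are exposed at the top level of the hickle package as described above (this reflects the proposal as written, not a released hickle API):

```python
import hickle

# Opt the Trajectory class from the previous sketch in: its instances are
# now handled by the b'compact' loader when dumping.
hickle.enable_compact_expand(Trajectory)
hickle.dump(Trajectory('orbit', [[0.0, 1.0], [0.5, 0.9]]), 'trajectory.hkl')

# Restore the default handling (dedicated loader or pickle string) again.
hickle.disable_compact_expand(Trajectory)

# Loading compacted objects stays enabled regardless of the switches above.
restored = hickle.load('trajectory.hkl')
```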
Rationale
The HDF5 file format is designed for storing large continuous blocks of data in an efficient manner, organizing them in a reasonable structure and representing their dependencies and relations. It is not designed to efficiently store complex object structures as found in memory.
The Python copy protocol with its special methods `__getstate__`, `__setstate__`, `__reduce__`, `__reduce_ex__` and the `pickle.dump` and `pickle.load` functions are specifically designed for serializing these structures optimally and with the least effort. Naively storing the tuple returned by `__reduce__` or `__reduce_ex__` is the most obvious approach and was proposed in issue #125. As tests revealed, doing so on complex structures adds several gigabytes to the resulting HDF5 formatted file just for storing all the metadata required to properly represent the structure and each tiny detail of data which in memory used just a few tens of megabytes.
In a first step the `__getstate__` and `__setstate__` methods could be rewritten to properly compact the structure of the object to be stored to the HDF5 file and to restore it again. These two methods are part of the copy protocol, which was designed because Python objects possess neither an implicit nor an explicit copy constructor; they are created by calling the object's `__init__` method. To copy an object the `copy.copy` and `copy.deepcopy` functions must be called, which rely upon `__getstate__` and `__setstate__`. Using them would cause the object to be compacted and expanded whenever it is duplicated, which amounts to a highly inefficient deep copy in cases where, for example, a shallow copy would be sufficient. Further, the pickle string would also contain the compacted version of the object, which is not necessary as pickle is in any case capable of efficiently representing the object structure.
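The interference can be seen in a small sketch: once `__getstate__`/`__setstate__` are rewritten to produce an HDF5-friendly compacted state, every `copy.deepcopy` and every `pickle.dumps` call pays for that conversion as well (the class and its attributes below are illustrative).

```python
import copy
import pickle
import numpy as np

class Compactable:
    def __init__(self, points=()):
        self.points = np.asarray(points)

    def __getstate__(self):
        # "Compaction" intended only for hickle dumping ...
        print("compacting state")
        return {'points': self.points.tolist()}

    def __setstate__(self, state):
        # ... and the matching expansion intended only for hickle loading.
        print("expanding state")
        self.points = np.asarray(state['points'])

obj = Compactable([[0.0, 1.0], [2.0, 3.0]])

copy.deepcopy(obj)   # triggers __getstate__ and __setstate__ on the copy
pickle.dumps(obj)    # the pickle string now carries the compacted state too
```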
For objects which are provided by publicly available libraries like numpy etc., a dedicated loader module residing in the hickle/loaders directory is the best option. Such a module can define loaders for multiple objects provided by the library and can make use of all other loader functions and containers available for transforming the object into an appropriate group and dataset structure. For custom objects and libraries which are specific to a single project or to local development, writing a loader module is no realistic and feasible option, especially during development and testing where the objects and their structure still change heavily.

In this case the proposed approach provides more flexibility and allows concentrating on the design and structure of the objects first. When that is done, the `__compact__` and `__expand__` methods can be updated to handle the structure properly. This can happen without interfering with the copy protocol at all. Further, objects derived from objects for which a dedicated loader exists can be forced to be pickled. This only requires that their `__compact__` method returns `None`. In case these objects shall again be handled by the loader for one of their base classes, it is sufficient to disable the compact expand protocol for that specific object, which would not be possible when (ab)using the copy protocol for a purpose different from the one it aims at.

Alternatives
From version 4.x on, the rules for how to store the individual Python objects are defined by the `dump_fcn` and `load_fcn` provided by a dedicated loader module or through an explicit call to the `hickle.lookup.register_class` function. The hickle_5_RC #149 extends this by adding a `container_class` for restoring a `py_object` represented by an `h5py.Group`.

The `register_class` function is part of the hickle core machinery and thus may be affected by changes of the inner workings of hickle and/or the structure used to store the object data in the HDF5 file. Such a disrupting change is caused by the introduction of `PyContainer` classes used to restore a `py_object` represented by an `h5py.Group` in hickle_5_RC #149, or by the full separation of hickle core machinery and loaders completed in hickle_5_RC #149. Consequently the `dump_fcn` and `load_fcn` of all loaders, including loaders explicitly registered through calling `hickle.lookup.register_class` directly, have to be adapted in order to properly support newer hickle versions, and multiple `load_fcn` functions have to be provided for legacy loaders of older hickle versions.
The `hickle.lookup.register_class` function has to be explicitly called to add the `dump_fcn`, `load_fcn` and `container_class` (hickle_5_RC #149) for individual classes before the first call to `hickle.dump` and/or `hickle.load` used to dump or restore the custom class. In case either `hickle.lookup.register_class` is not called or the custom class is missing, hickle will store the corresponding pickle string or abort restoring by raising an Exception. Attempting to at least recover the data by storing it in numpy and Python primitives may not be possible in all cases, unless, as defined for the compact-expand protocol, the `dump_fcn` consistently maps to Python, numpy and other primitives for which appropriate loaders are included within the hickle package. In this case a `!missing!` attribute could be attached to the `h5py.Group` or `h5py.Dataset` by the `dump_fcn` function to indicate the `py_obj_type` and `base_type` to be used by the `hickle.load` function if no appropriate `load_fcn` and/or `container_class` are available.
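For illustration, an explicit registration could look roughly like the sketch below. The `Trajectory` class and the helper functions are hypothetical, and the exact argument lists of `register_class` and of the dump/load functions differ between hickle 4.x and the hickle_5_RC branch, so treat names and argument order as assumptions rather than a definitive API.

```python
import hickle
import hickle.lookup

class Trajectory:
    def __init__(self, name="", points=()):
        self.name = name
        self.points = points

def create_trajectory_dataset(py_obj, h_group, name, **kwargs):
    # Map the object onto an h5py group containing only structures
    # hickle core already understands.
    group = h_group.create_group(name)
    group.attrs['traj_name'] = py_obj.name
    group.create_dataset('points', data=py_obj.points)
    return group, ()

def load_trajectory_dataset(h_node, base_type, py_obj_type):
    return py_obj_type(h_node.attrs['traj_name'], h_node['points'][()])

# Assumed call pattern: (class, base_type string, dump_fcn, load_fcn, ...).
hickle.lookup.register_class(
    Trajectory, b'trajectory',
    create_trajectory_dataset, load_trajectory_dataset)

hickle.dump(Trajectory('orbit', [[0.0, 1.0]]), 'trajectory.hkl')
restored = hickle.load('trajectory.hkl')
```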
Defining a loader module which allows hickle to automatically load `dump_fcn`, `load_fcn` and `container_class` is currently not possible in hickle 4.x but could be introduced by hickle_5_RC. Thereby the `hickle.lookup.load_loader` function needs to check whether the package or module path the custom class is defined in contains a `hickle_loaders` directory with a `load_<package>.py` or `load_<module>.py` file defining the loaders for the custom classes.
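Such a loader module could, under the conventions discussed here, look roughly as follows. The file location, the function signatures and in particular the column layout of the `class_register` table are assumptions; the exact format is defined by the hickle version in use (the commits referenced above describe a variant with an optional 7th column naming an optional loader set).

```python
# <my_package>/hickle_loaders/load_<my_package>.py -- hypothetical loader module
from my_package import Trajectory  # assumed custom class


def create_trajectory_dataset(py_obj, h_group, name, **kwargs):
    # Same idea as in the explicit registration sketch above.
    group = h_group.create_group(name)
    group.attrs['traj_name'] = py_obj.name
    group.create_dataset('points', data=py_obj.points)
    return group, ()


def load_trajectory_dataset(h_node, base_type, py_obj_type):
    return py_obj_type(h_node.attrs['traj_name'], h_node['points'][()])


# One row per supported class; entries without an optional-loader column
# belong to the set of global loaders.
class_register = [
    (Trajectory, b'trajectory', create_trajectory_dataset, load_trajectory_dataset),
]
exclude_register = []
```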
Independent of whether they are registered explicitly or implicitly through a loader module stored in the `hickle.loaders` directory or in a `hickle_loaders` directory inside the package or module directory, the `dump_fcn` and `load_fcn` functions and the `container_class` can be kept separate from the custom class definition and only need to be loaded when the custom class shall be dumped or loaded using hickle. The `__compact__` and `__expand__` methods, in contrast, have to be defined along with all other class methods.

Feature loaders introduced by hickle_5_RC allow controlling whether a loader is used to dump the custom class or whether it shall be stored as a pickle string instead of using the `dump_fcn`, `load_fcn` and `container_class` registered through a loader module or an explicit call to `hickle.lookup.register_class`. On loading, hickle will automatically apply an optional loader if it was applied during the `hickle.dump` call, as done for the compact-expand loader option.

Open Issues:
Can something similar be reached by directly importing and calling `register_class` (question #150)?
How should the case of a class not being registered or not being available before the call to `hickle.load` be handled?

Conclusion
The proposed functionality can be achieved by a single call to `hickle.lookup.register_class`. The proposed Python copy protocol mimicry would introduce unnecessary and undesired interdependencies between hickle and the packages providing the class and type declarations for the objects to be dumped using hickle. A clean separation between the production code of the package, single module or application and the custom hickle loaders can be achieved through the `hickle.lookup.load_loader` function trying to load the `load_<pakage>` loader modules from a dedicated `hickle_loaders` directory found at the following locations in addition to the `hickle/loaders` directory:

For example: `<python_installation_path>/dist_packages/<package>/hickle_loaders/load_<package>.py`
For example: `<python_installation_path>/dist_packages/hickle_loaders/load_<module>.py` or `<program_path>/hickle_loaders/load_<submodule>.py`
For example, in the directory where the `__main__` script is stored: `<program_path>/hickle_loaders/load_<mainscript>.py`
Temporarily disabling a `dump_fcn` which is broken during ongoing modification of the handled classes and types can be achieved by unconditionally raising a `hickle.NotHickleable` exception at its first line. The `hickle.hickle._dump` function automatically calls the `hickle.lookup.create_pickled_dataset` function when encountering this kind of exception and ensures that on load the object is restored through `pickle.load`.
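A sketch of this escape hatch, assuming the `hickle.NotHickleable` exception and the fallback to `hickle.lookup.create_pickled_dataset` behave as described in this conclusion (neither is part of a released hickle API):

```python
import hickle

def create_trajectory_dataset(py_obj, h_group, name, **kwargs):
    # Temporarily knock out this dump function while the handled class is
    # being reworked: raising at the first line makes hickle's _dump fall
    # back to create_pickled_dataset, so the object is stored as a pickle
    # string and later restored through pickle.load.
    raise hickle.NotHickleable("Trajectory layout under construction")
    # ... the original dump implementation below stays untouched ...
```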
Finally, the `hickle.lookup.register_class` based custom loader approach in addition allows implementing a generic drop-in data recovery loader for missing custom `load_fcn` functions and `PyContainer` classes.

Decision
Consequently this proposal is withdrawn and closed in favour of the more integral and generic `hickle.lookup.register_class` based approach.

References:
[1] issue #125
[2] Python copy https://docs.python.org/3.6/library/copy.html
[3] Python pickle https://docs.python.org/3.6/library/pickle.html#pickling-class-instances
[4] https://docs.h5py.org/en/stable/index.html
[5] issue #150
Precondition
H4EP001 #138 implemented and merged into dev (met by hickle_5_RC branch pull request #149)
optional:
H4EP002 #139 implemented and merged into dev (met by hickle_5_RC branch pull request #149)
History:
26/11/2020 first proposal
14/02/2021 Updated open issues
16/02/2021 Added alternative `hickle.lookup.register_class`
02/03/2021 Conclusion and decision to withdraw