Hickle 5 rc #149
Conversation
With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling the hickle core machinery from object-specific code included all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementation of hickle extension proposal H4EP001 (telegraphic#135), in which the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer-based and mixed loaders is specified. In addition to the proposed extension, this implementation includes the following extensions of hickle 4.0.0 and H4EP001:

H4EP001:
========
The PyContainer interface includes a filter method which allows loaders, when data is loaded, to adjust, suppress, or insert additional data subitems of h5py.Group objects. To accomplish the temporary modification of h5py.Group and h5py.Dataset objects when the file is opened in read-only mode, the H5NodeFilterProxy class is provided. This class stores all temporary modifications while the original h5py.Group and h5py.Dataset objects stay unchanged.

hickle 4.0.0 / 4.0.1:
=====================
Strings and arrays of bytes are stored as Python bytearrays and not as variable-sized strings and bytes. The benefit is that hdf5 filters and hdf5 compression filters can be applied to Python bytearrays. The downside is that the data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings.

numpy masked arrays are now stored as an h5py.Group containing a dedicated dataset each for data and mask.

scipy.sparse matrices are now stored as an h5py.Group containing the datasets data, indices, indptr and shape.

Dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NoneType keys are converted to name strings; for all other keys a key-value-pair group is created containing the key and value as its subitems. String and bytes keys which contain slashes are converted into key-value pairs instead of converting slashes to backslashes. The distinction from hickle 4.0.0 string and bytes keys with converted slashes is made by enclosing the string value within double quotes instead of the single quotes produced by the Python repr function or the !r and %r string format specifiers. Consequently, on load, all string keys which are enclosed in single quotes are subjected to slash conversion while any others are used as is.

h5py.Group and h5py.Dataset objects whose 'base_type' attribute refers to 'pickle' automatically get assigned object as their py_obj_type on load; the related 'type' attribute is ignored. h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to contain a pickle string and thus implicitly get assigned the 'pickle' base type. Consequently, on dump, the 'base_type' and 'type' attributes are omitted for all h5py.Dataset objects which contain pickle strings, as their values would be 'pickle' and object respectively.

Other stuff:
============
- Full separation between hickle core and loaders
- Distinct unit tests for individual loaders and hickle core
- Cleanup of no longer required functions and classes
- Simplification of recursion on dump and load through a self-contained loader interface
- Capable of loading hickle 4.0.x files which do not yet support the PyContainer concept beyond list, tuple, dict and set; includes extended tests for loading hickle 4.0.x files
- Contains a fix for the lambda py_obj_type issue on numpy arrays with single non-list/tuple object content. Python 3.8 refuses to unpickle the lambda function string; this was observed while finalizing the pull request. The fixes are only activated when a 4.0.x file is to be loaded.
- Exceptions thrown by load now include the triggering exception, including its stack trace, for better localization of errors in debugging and error reporting.
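The dict-key naming rule above can be sketched as a small helper. This is a hypothetical illustration of the rule as described (function name and exact formatting are not hickle's actual implementation): simple scalar keys become quoted name strings, while keys containing slashes or non-scalar keys fall back to a key-value-pair group.

```python
# Hypothetical sketch of the dict-key naming rule described above;
# names and exact rules in hickle's actual loaders differ.
def key_to_name(key):
    """Return a group/dataset name for `key`, or None if a
    key-value-pair group is needed instead."""
    if isinstance(key, (str, bytes)):
        text = key.decode("utf8", "backslashreplace") if isinstance(key, bytes) else key
        if "/" in text:            # a slash would split the HDF5 path
            return None            # -> store as key-value-pair group
        return '"{}"'.format(text)  # double quotes mark a converted key
    if isinstance(key, (int, float, complex, bool)) or key is None:
        return repr(key)
    return None                    # any other key type -> key-value-pair group
```

A key like `"a/b"` therefore yields a key-value-pair group instead of a mangled name, which is what distinguishes this scheme from the 4.0.0 slash-to-backslash conversion.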
What is commit d1760ee doing in here?
Shit, definitely some rebase error; not sure how to get it out again. Shouldn't be there. Too late for today.
Also, given that this would be
It does, and the differences are small enough that they can easily be handled by stacking the few specialized loader functions on top of the default loader functions, like compact_expand, which is also implemented as an optional loader that is stacked on top of all others on request. Further, I fear that the issue I observed only when running tests in Python 3.6 and 3.7 in connection with h5py 3 would not vanish by pulling hickle 4 files into dedicated legacy loading. And this way, users who have not yet had time to convert their files from hickle 3 to 4 can still do so using a matured hickle 5. Thus, unless there are severe obstacles and issues against it, I would prefer to keep support as implemented ;-)
It is not like v5 will release any time soon, as many, many changes have to be made to this before it can be merged into
I didn't expect it to be, and as long as I'm able to keep in sync with your changes on dev it should be quite easy for me to take care of the amendments necessary to reach a releasable state. Just let me know where you have doubts, where you think improvement is necessary, or where things have to be pushed further. I'm open to any input and hints. The only thing I want to avoid is that sooner or later there exist two incompatible versions of hickle, the official one and mine; even if I start to use mine for now, all findings will be reported here and go back into the branch.
Thanks @hernot, looking forward to this coming to fruition.
Looks like things change these days. Not sure which is better:
h5py version limited to <3.x according to issue telegraphic#143.
…and type)

Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager dictionary-type object. For storing object instance references it is used as a Python dict object which, when dumping, stores the references to the py_obj and its related node using id(py_obj) as the key. On load, the id of the h_node is used as the key for storing the to-be-shared reference of the restored object. Additional references to the same object are represented by h5py.Datasets with their dtype set to ref_dtype. They are created by assigning an h5py.Reference object, as returned by the h5py.Dataset.ref or h5py.Group.ref attribute. These datasets are resolved by the filter iterator method of the ExpandReferenceContainer class and returned as sub_items of the reference dataset.

Type Memoisation:
=================
The 'type' attribute of all nodes, except datasets which contain pickle strings or expose ref_dtype as their dtype, now contains a reference to the appropriate py_obj_type entry in the global 'hickle_types_table'. This table hosts the datasets representing all py_obj_types and base_types encountered by hickle.dump, each exactly once. Each py_obj_type is represented by a numbered dataset containing the corresponding pickle string. The base_types are represented by empty datasets whose names are the names of the base_types as defined by the class_register tables of the loaders. No entry is stored for (object, b'pickle') or (hickle.ReferenceType, b'!node-reference'), as these can be resolved implicitly on load. The 'base_type' attribute of a py_obj_type entry refers to the base_type used to encode it and required to properly restore it again from the hickle file. The entries in the 'hickle_types_table' are managed by the ReferenceManager.store_type and ReferenceManager.resolve_type methods. The latter also takes care of properly distinguishing pickle datasets from reference datasets and of resolving hickle 4.0.X dict_item groups.

The ReferenceManager is implemented as a context manager and thus can and shall be used within a with statement to ensure proper cleanup. Each file has its own ReferenceManager instance, therefore different data can be dumped to distinct files which are open in parallel. The basic management of managers is provided by the BaseManager base class, which can be used to build further managers, for example to allow loaders to be activated only when specific feature flags are passed to the hickle.dump method or encountered by hickle.load in the file attributes. The BaseManager class has to be subclassed.

Other changes:
==============
- lookup.register_class and class_register tables have an additional memoise flag indicating whether py_obj shall be remembered for representing and resolving multiple references to it, or whether it shall be dumped and restored every time it is encountered
- lookup.hickle_types table entries include the memoise flag as their third entry
- lookup.load_loader: the tuple returned in addition to py_obj_type includes the memoise flag
- hickle.load: whether to use the load_fn stored in lookup.hkl_types_table or a PyContainer object stored in hkl_container_dict is decided upon the is_container flag returned by ReferenceManager.resolve_type, instead of checking whether the processed node is of type h5py.Group
- dtype of string and bytes datasets is now set to 'S1' instead of 'u8' and the shape is set to (1, len)
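The dump-side memoisation idea above can be sketched in a few lines. This is an illustrative sketch only (the class name MemoSketch and the node path strings are invented for the example; hickle's ReferenceManager does more, including the load-side table and HDF5 reference datasets): a dict keyed by id(py_obj) lets a second encounter of the same object be dumped as a reference to the already-written node.

```python
# Minimal sketch of dump-side memoisation as described above.
# MemoSketch and the path strings are illustrative, not hickle's API.
class MemoSketch(dict):
    def remember(self, py_obj, node):
        # store (py_obj, node): keeping a strong reference to py_obj
        # guarantees its id() is not reused while dumping
        self[id(py_obj)] = (py_obj, node)

    def shared_node_for(self, py_obj):
        entry = self.get(id(py_obj))
        return entry[1] if entry is not None else None

memo = MemoSketch()
shared = [1, 2, 3]
memo.remember(shared, "/data/list0")          # first encounter creates the node
node = memo.shared_node_for(shared)           # later encounters reuse it
```

On load the roles are mirrored: the h_node's id is the key and the restored Python object is the value, so reference datasets can be resolved to already-restored objects.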
Just to let you know, I am currently amending commit 940710c to finally get rid of the copy protocol mimicry, as it turns out that it is not necessary to achieve what is intended by H4EP003 #145 (see latest edits of #145). As soon as I am finished with the amendment I will take care of the rest, for which I would appreciate advice on whether to issue limited versions on win32 or whether to drop win32 support for the proposed hickle 5 release candidate.
As 32-bit Windows versions are still widely used, I think support for them should not be dropped.
Will check how the win32 issue can be solved with least effort and maximum gain.
…ives

In the current release custom objects are simply converted into a binary pickle string. This commit implements the H4EP003 issue telegraphic#145 register_class-based alternative featuring custom loader modules. In hickle 4.x custom loader functions can be added to hickle by explicitly calling hickle.lookup.register_class before calling hickle.dump and hickle.load. In H4EP003 an alternative approach was proposed, the hickle-specific compact_expand protocol mimicking the Python copy protocol. It was found that this mimicry does not provide any benefit compared to the hickle.lookup.register_class-based approach. Even worse, hickle users have to litter their class definitions with hickle-only loader methods called __compact__ and __expand__, and in addition have to register their class for compact_expand and activate the compact_expand loader option to activate the use of these two methods.

This commit implements hickle.lookup.register_class package and program loader module support. load_<package>.py modules may, in addition to the hickle.loaders directory, also be stored along with the Python base package, module or program main script. In order to keep the directory structure clean, load_<package>.py modules must be stored within the special hickle_loaders subdirectory at that level.

Example package loader:
-----------------------
dist_packages/
+- ...
+- <my_package>/
|  +- __init__.py
|  +- sub_module1.py
|  +- ...
|  +- sub_moduleN.py
|  +- hickle_loaders/
|     +- load_<my_package>.py
+- ...

Example single module loader:
-----------------------------
dist_packages/
+- ...
+- <my_module>.py
+- ...
+- hickle_loaders/
|  +- ...
|  +- load_<my_module>.py
|  +- ...
+- ...

Example program main (package) loader:
--------------------------------------
bin/
+- ...
+- <my_single_file_program>.py
+- hickle_loaders/
|  +- ...
|  +- load_<my_single_file_program>.py
|  +- ...
+- ...
+- <my_program>/
|  +- ...
|  +- <main>.py
|  +- ...
|  +- hickle_loaders/
|  |  +- ...
|  |  +- load_<main>.py
|  |  +- ...
|  +- ...
+- ...

Fallback loader recovering data:
--------------------------------
Implements special AttemptsRecoverCustom types used as replacements when Python objects are missing or incompatible with the stored data. The affected data is loaded as RecoveredGroup (dict type) or RecoveredDataset (numpy.ndarray type) objects. Attached to either is the attrs attribute as found on the corresponding h5py.Group or h5py.Dataset in the hickle file.

LoaderManager:
==============
The LoaderManager-based approach allows adding further optional loader sets. For example, when loading a hickle 4.0.X file the corresponding loader set is added implicitly to ensure 'DictItem' and other helper types specific to hickle 4.0.X are properly recognized and the corresponding data is properly restored. Only optional loaders (except legacy loaders provided by hickle core, currently 'hickle-4.0') listed in the 'optional_loaders' exported by hickle.loaders.__init__.py are considered valid. A class_register table entry can be assigned to a specific optional loader by specifying the loader name as its 7th item. Any other entry which has fewer than 7 items, or whose 7th item reads None, is included in the set of global loaders.

@hernot
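The directory layouts above all follow one convention: a load_<name>.py inside a hickle_loaders/ subdirectory next to the package, module, or script it serves. A small sketch of how such a companion path could be derived from a module's __file__ (the function name is hypothetical, not hickle's actual lookup code):

```python
# Hypothetical sketch of the loader path convention shown in the
# directory trees above; not hickle's actual discovery code.
from pathlib import Path

def loader_path_for(module_file):
    """Given a module's __file__, return the expected companion loader path."""
    p = Path(module_file)
    if p.name == "__init__.py":
        # package: the loader is named after the package directory
        # and lives inside the package itself
        name, base = p.parent.name, p.parent
    else:
        # plain module or script: the loader lives in a sibling
        # hickle_loaders/ directory
        name, base = p.stem, p.parent
    return base / "hickle_loaders" / f"load_{name}.py"
```

For `/x/my_package/__init__.py` this yields `/x/my_package/hickle_loaders/load_my_package.py`, matching the first tree above.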
Hickle 4.0.x accepts file-like objects but uses them only as a model for creating a new file, discarding the file handler passed in. If the file is accessed later on using the same file handler, this results in an IOError or alike being raised. The proposed fix does not close the file but checks that it is opened at least readable if mode is 'r', or readable and writable in case mode reads 'w', 'r+', 'w+', 'x', 'x+' or 'a'. Note: the passed-in file or file-like object has to be closed by the caller.
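The mode check described above could look roughly like this. This is an assumed sketch of the logic (function name and error messages are illustrative, not the actual fix): validate the passed-in object instead of recreating the file, and leave closing it to the caller.

```python
# Sketch of the file-object mode check described above; assumed
# logic, not hickle's exact implementation.
import io

def check_file_object(f, mode):
    """Verify a passed-in file-like object is usable for `mode`;
    the caller remains responsible for closing it."""
    if mode == "r":
        if not f.readable():
            raise ValueError("file must be opened readable for mode 'r'")
    elif mode in ("w", "r+", "w+", "x", "x+", "a"):
        if not (f.readable() and f.writable()):
            raise ValueError("file must be opened readable and writable")
    else:
        raise ValueError(f"unknown mode {mode!r}")
    return f
```

An `io.BytesIO()` passes for every mode since it is both readable and writable, while a file opened with `open(path, 'rb')` would be rejected for 'w'.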
Compression keywords safety tests:
==================================
In issue telegraphic#140 it was reported that some loaders crash when 'compression', 'chunks' and related h5py keyword arguments are specified. By running pytest a second time, thereby specifying the custom --enable-compression parameter, all tests are rerun with kwargs={'compression':'gzip', 'compression_opts':6}. All compression-sensitive tests, especially all 'test_XX_*.py::*' unit test functions, must include the 'compression_kwargs' parameter in their signature to receive the actual kwargs list to be passed to all 'create_fcn' functions defined by a loader module. In case a test function misses passing on 'compression_kwargs' as keyword arguments ('**kwargs') to 'hickle.dump', 'hickle._dump', or any dump_method listed in the 'class_register' table of a loader module or specified directly in 'LoaderManager.register_class', an AssertionError exception is thrown indicating the name of the test function, the line in which the affected function is called and any function which it calls. Tests which either test compression-related issues explicitly or do not call any of the dump functions may be marked accordingly using the 'pytest.mark.no_compression' marker to explicitly exclude the test function from compression testing.

Tox virtual env manager support:
================================
Adds support for the virtualenv manager tox. Tox simplifies local testing of compatibility with multiple Python versions before pushing to GitHub and creating a pull request. Travis and Appveyor integration still has to be tested and verified.
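The forwarding requirement above is the heart of the check: the test's compression_kwargs must travel all the way down to the dataset creation call. A stripped-down illustration (all names here are invented for the sketch; the real harness hooks hickle's dump path and h5py):

```python
# Illustrative sketch, not hickle's actual test harness: the
# compression kwargs handed to a test must reach the (faked here)
# h5py create_dataset call unchanged.
captured = {}

def fake_create_dataset(name, data, **kwargs):
    # stands in for h5py.Group.create_dataset in this sketch
    captured.update(kwargs)
    return (name, data)

def dump_sketch(data, compression_kwargs=None, **kwargs):
    # a well-behaved dump forwards the compression kwargs downstream
    kwargs.update(compression_kwargs or {})
    return fake_create_dataset("data", data, **kwargs)

dump_sketch([1, 2], compression_kwargs={"compression": "gzip", "compression_opts": 6})
```

A test that accepted compression_kwargs but silently dropped it would leave `captured` empty, which is exactly the situation the AssertionError described above is meant to flag.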
The current release causes problems when trying to dump and load data when h5py 3.X or newer is installed. In that case loading will fail with errors. This is caused by:

- a change in how strings are stored in attributes. Before 3.x they were stored as byte strings. Encoding had to happen manually before storing to and after reading from the attribute. In h5py 3 encoding is handled by attrs.__setitem__ and attrs.__getitem__. The latter simply assumes that any byte string stored as an attribute value must represent a Python 'str' object, which either was written by Python 2, which used plain C strings, or is the encoded version of a Python 3 utf8 string and thus has to be decoded from utf8. Only in case the byte string contains characters which are not valid in a utf8-encoded string is the content returned as a 'bytes' object.

- a change in how utf8 'str' objects are stored. They are now consistently stored as var_dtype objects where the metadata of the dtype indicates the resulting Python type. In h5py 2 this didn't matter; therefore strings stored in datasets using h5py 2 are returned with an opaque object dtype for which decoding to utf8 has to be enforced if the py_obj_type, as indicated by the 'type' attribute, is 'str'.

- a change in how datasets for which no explicit link exists in the file, but to which a strong Python reference to the h5py.Dataset or h5py.Group object exists, are handled when a reference to them is resolved. In h5py 2 a ValueError is raised; in h5py 3.X an anonymous Dataset or Group is returned. Consequently, when simulating stale references encountered by hickle.load during test runs, all Python references to the corresponding datasets and groups have to be dropped to see that the reference is stale. The alternative would be to check the name of the referred-to Dataset: if it is None the dataset would be anonymous. Resolving the name on non-anonymous Datasets and Groups, which always represents the full path name and not just the local name, is a quite expensive and costly process in terms of runtime and resources and shall be avoided.

This commit provides fixes for these changes severely affecting proper operation of hickle. String values of attributes are compared against 'bytes' and 'str' type variants of constant strings. Lookup tables like key_base_type, hkl_types and hkl_container_types include the key for an entry as both 'str' and 'bytes' type keys. A featured optional load_fcn is provided fixing the difference in how the dtype of utf8 strings is computed by h5py 2 and h5py 3. Test functions are updated accordingly to properly mock stale reference conditions and changes in how strings are stored.
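The h5py 3 attribute decoding behaviour described in the first bullet can be shown in a simplified sketch (this is a rough model of the behaviour as described above, not h5py's actual code; consult the h5py docs for the real rules):

```python
# Simplified model of the h5py 3 attribute read behaviour described
# above: byte strings are assumed to be utf8-encoded 'str' values
# unless decoding fails, in which case raw bytes come back.
def decode_attr_sketch(value):
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            return value        # invalid utf8: hand back the raw bytes
    return value
```

This is why code written for h5py 2, which compared attribute values against bytes constants, breaks under h5py 3: the same stored attribute now comes back as 'str', hence the fix of comparing against both 'bytes' and 'str' variants.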
Ok, fixes done. The numpy dependency is still not changed because it is a bit affected by the issue current dill versions have with dumping

After testing, the following options are feasible and reasonable for hickle. Need to know which route to go:
… adding dill weed
…cleanup. Without package, module and script loader support is broken
…properly and stable under any circumstances?
… collector, try forcing collection after file flushing and before closing
…ves the problem too
On GitHub Actions windows runners it was observed for Python 3.7 32-bit and 64-bit and Python 3.8 64-bit that h5py.copy aborted with a PermissionError exception when copying hdf5 files which are opened for writing. On the remaining three windows runners, as well as the linux and mac runners, copying was possible without any issue. Adapted the test_ReferenceManager_context and test_ReferenceManager tests to close the file accessed through the h5_data fixture before calling shutil.copy. Fixed spurious 'Unrecognized typecode' ValueError exceptions occurring during h5py.File.close. These are caused by the Python garbage collector kicking in while h5py.close is in a critical section. By disabling garbage collection before the call to h5py.File.close and re-enabling it afterwards, this can be solved for all Python versions. This error could be observed most consistently on windows 32-bit running Python 3.7 with h5py 2.10.0.
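The disable-GC-around-close workaround described above can be captured in a small context manager. This is a sketch of the idea only (the gc_paused helper and DummyFile are invented for the illustration; the real fix wraps h5py.File.close):

```python
# Sketch of the workaround described above: suspend the garbage
# collector around a close() call so it cannot run inside the
# library's critical section. gc_paused/DummyFile are illustrative.
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

class DummyFile:
    closed = False
    def close(self):
        # in the real fix this would be h5py.File.close
        assert not gc.isenabled()   # GC must be off in here
        self.closed = True

f = DummyFile()
with gc_paused():
    f.close()
```

Restoring the previous state in `finally` (rather than unconditionally calling gc.enable()) keeps the helper safe to nest and exception-safe.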
Hi @hernot -- catching up on this now. I think a larger issue is that Numpy dtypes should be handled by create_np_dtype. Any idea why this isn't happening?
@telegraphic If you look at

Dill does not make use of this table, so it is not able to properly handle this case. I had a discussion last weekend with the numpy and dill projects, with the result that the dill project has added a workaround for exactly this case which will be available in the next release. Checking the workaround, it is not the only one the dill project uses to handle special cases like the new way numpy encodes the classes of its

In the light of the now official possibility to add custom loaders to hickle outside official

This, on the other hand, will sooner or later limit the outreach and use-cases of hickle, which, as far as I understand, was and is intended by you @telegraphic and @1313e to be as compatible with standard pickle as possible and feasible. Therefore users have the ability to add their own loaders for their very local and specific classes and cases. If they are not picklable by pickle, then hickle should not handle them either. But if they are properly handled by standard pickle, I would expect hickle to handle them too, and I would be surprised if it didn't just because dill refuses to pickle the class object of my fancy object.

Hope I could shed some light on the core of the issue and what breaks when changing the requirements for numpy from
I think with all that, we should drop
Thanks @hernot for the explanation! If

I think pickling with protocol=4 is a good option? (https://www.python.org/dev/peps/pep-3154/) (Assuming this plays nicely with numpy @hernot?)
@telegraphic Sorry, I misread your question at first. Normally pickle will anyway select the highest version available in the given Python version. In Python 3.5, 3.6 and 3.7 the default is 3 and the highest is 4, whereas from 3.8 on the default is 4 and the highest is 5.
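The protocol numbers above are easy to check against any interpreter; a protocol=4 payload round-trips fine, and the module constants show what the running Python defaults to:

```python
# Quick check of the protocol versions discussed above. Protocol 4
# (PEP 3154) is available from Python 3.4 on; the default varies by
# interpreter version (3 on 3.5-3.7, 4 or higher on 3.8+).
import pickle

data = {"a": [1, 2, 3]}
blob = pickle.dumps(data, protocol=4)   # pin protocol 4 explicitly
restored = pickle.loads(blob)
# DEFAULT_PROTOCOL <= HIGHEST_PROTOCOL always holds
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

Pinning the protocol explicitly, as above, is what makes files written on a newer Python readable on an older one.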
…irements updated to allow removal at any time from requirements.txt and requirements32.txt, while it stays a requirement for testing. Removed the explicit six requirement as no loader is yet available for it and it is likely meanwhile installed anyway by other packages in an appropriate version.
Yeah, I am fine with this merging.
@hernot double checking you're done with the dill cleanout and ready for the big merge?
Yes, I'm done.
Ok, here we go: as agreed when discussing PR #138, here is the fully assembled hickle-5-RC branch. In case you prefer to review directly within my repo, feel free to close this PR again. I'm fine with whatever you prefer.
Not sure if all commits reported below are really visible from within this branch or whether the log below is an excerpt of the git reflog.
I also expect appveyor and travis to complain a bit due to the tox-related changes; I would be surprised if things worked on the first try. Anyway, fingers crossed.
The reason for the Python 3.5 failure is known (I can't test it here any more, lacking Python 3.5). The fix is easy, just one change in one line; I will do it if support for Python 3.5 shall be kept beyond hickle 4. No problem at all.
With the astropy warnings I would need your help, as we do not use Astropy here, so I have no clue how to fix them.