Implementation of Container and mixed loaders (H4EP001) #138
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #138 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 9 9
Lines 592 654 +62
=========================================
+ Hits 592 654 +62
Continue to review full report at Codecov.
I'll wait for some input from @1313e; I only note two small things:
raise FileError("HDF5-file does not have the proper attributes!")
I'm thinking of trying to get my hands on these two issues in the next days while we are waiting for input from @1313e. Not sure if I will manage or fail. Another thing: I have checked the two other pull requests #136 and #137. They are basically the same, two minor adaptations of
#136 and #137 exist, because I specifically asked for 2 PRs (I still have to merge them). I have just finished an incredibly busy period, so I now finally have time to go and take a look at this.
@1313e: So shall I wait with the amendments for the uncovered line and the implementation of the not-yet-included verification for proper loading of 4.0.0 files too, or would you prefer that these amendments are done before reviewing?
There you go, now those PRs are merged.
I do propose the following approach
Am I missing anything that I should additionally consider and cover?
You first want to merge
OK, I think the rebase worked. @1313e thank you very much for the reminder.
Think I managed it. Full coverage and proper loading of hickle 4.0.0 and 4.0.1 files. Added a file created with master and a pickled version of the same data, which the test uses to verify that the data was properly restored. Hope I did not miss any tricky obstacle. Real-world use will reveal it.
@hernot Having fun over there? ;)
Not any more; one last squashing and things are ready. I just had to crawl down a tight rabbit hole to find the one nasty error occurring in Python 3.8. And as I do not have a native Python 3.8 installation available here, I have to abuse Travis a bit for this. The error seemed to be caused by the fact that
which is not used any more in the new container-based scheme, and further according to the Python doc. I will clean up the code, squash the intermediate commits, and repush one final time when done.
OK, final cleanup and squashing done. NOTE: phew!
Well, I better do some reviewing then huh?
Take your time; the next steps will definitely be smaller than the switch to the loader-based design in 4.0.0 and the extension of the PyContainer concept initiated in 4.0.0. Further, a big part is covered by the (hopefully) rather thorough unit tests of the individual modules (helpers, lookup, load_builtins, load_numpy, load_scipy, load_astropy, hickle), in addition to the tests already present. These are also the reason why I stumbled over the compression issue related to scalars and strings, as reported in #140, and implemented a proposed fix. And some other things which fitted for demonstration purposes (confess, #125).
Force-pushed 2e46fec to 5d63b64
Hi, just a small ping on how the plans stand.
Force-pushed bf12c6a to e4bc715
Force-pushed e4bc715 to c681971
@telegraphic Thoughts?
@telegraphic I know you are quite busy, but nevertheless, is it possible to decide on this in a timely manner? With each fix @1313e implements, rebasing this pull request and all dependent branches for upcoming pull requests becomes more complicated and tedious.
Force-pushed c681971 to 661a2b8
With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling the hickle core machinery from object specifics included all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementation of hickle extension proposal H4EP001 (telegraphic#135). In this proposal the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer-based and mixed loaders is specified. In addition to the proposed extension, this implementation includes the following extensions to hickle 4.0.0 and H4EP001:

H4EP001:
========
The PyContainer interface includes a filter method which allows loaders, when data is loaded, to adjust, suppress, or insert additional data subitems of h5py.Group objects. To accomplish the temporary modification of h5py.Group and h5py.Dataset objects when the file is opened in read-only mode, the H5NodeFilterProxy class is provided. This class stores all temporary modifications while the original h5py.Group and h5py.Dataset objects stay unchanged.

hickle 4.0.0 / 4.0.1:
=====================
Strings and arrays of bytes are stored as Python bytearrays and not as variable-sized strings and bytes. The benefit is that hdf5 filters and hdf5 compression filters can be applied to Python bytearrays. The downside is that the data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings.

numpy masked arrays are now stored as an h5py.Group containing a dedicated dataset each for data and mask.

scipy.sparse matrices are now stored as an h5py.Group containing the datasets data, indices, indptr and shape.

Dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NoneType keys are converted to name strings; for all other keys a key-value-pair group is created containing the key and value as its subitems. String and bytes keys which contain slashes are converted into key-value pairs instead of converting slashes to backslashes. The distinction from hickle 4.0.0 string and bytes keys with converted slashes is made by enclosing the string value within double quotes instead of single quotes as produced by the Python repr function or the !r and %r string format specifiers. Consequently, on load, all string keys which are enclosed in single quotes will be subjected to slash conversion, while any others will be used as they are.

h5py.Group and h5py.Dataset objects whose 'base_type' refers to 'pickle' automatically get assigned object as their py_obj_type on load. The related 'type' attribute is ignored. h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to contain a pickle string and thus implicitly get assigned the 'pickle' base type. Thus, on dump, the 'base_type' and 'type' attributes are omitted for all h5py.Dataset objects which contain pickle strings, as their values are 'pickle' and object respectively.

Other stuff:
============
Full separation between hickle core and loaders.
Distinct unit tests for individual loaders and hickle core.
Cleanup of no-longer-required functions and classes.
Simplification of recursion on dump and load through a self-contained loader interface.
Capable of loading hickle 4.0.x files which do not yet support the PyContainer concept beyond list, tuple, dict and set; includes extended tests for loading hickle 4.0.x files.
Contains a fix for the lambda py_obj_type issue on numpy arrays with single non-list/tuple object content; Python 3.8 refuses to unpickle the lambda function string. This was observed while finalizing the pull request. The fixes are only activated when a 4.0.x file is to be loaded.
Exceptions thrown by load now include the triggering exception, with its stacktrace, for better localization of the error in debugging and error reporting.
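The dictionary-key rule described in the commit message can be illustrated with a small sketch. The helper name and the exact quoting details are hypothetical illustrations of the described behaviour, not hickle's actual code:

```python
# Sketch of the dict-key naming rule: scalar keys of simple types become
# group/dataset names, everything else (and any key containing a slash,
# which is illegal in HDF5 names) falls back to an explicit
# key-value-pair subgroup.  (Hypothetical helper, not hickle's code.)

SIMPLE_KEY_TYPES = (str, bytes, int, float, complex, bool, type(None))

def key_to_node_name(key):
    """Return a name string for simple keys, or None to signal that a
    key-value-pair subgroup must be created instead."""
    if not isinstance(key, SIMPLE_KEY_TYPES):
        return None  # e.g. tuple keys -> key-value-pair group
    if isinstance(key, (str, bytes)):
        slash = b'/' if isinstance(key, bytes) else '/'
        if slash in key:
            return None  # slashes are not converted, a pair group is used
    if isinstance(key, str):
        # double quotes mark new-style string keys, distinguishing them
        # from hickle 4.0.0 single-quoted repr() output
        return '"{}"'.format(key)
    return repr(key)

print(key_to_node_name("plain"))   # '"plain"'
print(key_to_node_name("a/b"))     # None -> key-value-pair group
print(key_to_node_name(42))        # '42'
print(key_to_node_name((1, 2)))    # None
```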
Related to issue telegraphic#83, making astropy/scipy optional dependencies. One can now install e.g. hickle[astropy] to add astropy support. Uses pkg_resources.require('hickle[astropy]') to check, and only loads the module if no error is raised.
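The extras check described here can be sketched roughly like this (illustrative only; the function name is hypothetical and hickle's actual guard may differ):

```python
# Sketch: probe whether the 'astropy' extra of hickle is satisfied via
# pkg_resources, and only register the optional loader if it is.
import pkg_resources

def astropy_available():
    """Return True if the hickle[astropy] extra can be satisfied."""
    try:
        pkg_resources.require('hickle[astropy]')
        return True
    except (pkg_resources.DistributionNotFound,
            pkg_resources.VersionConflict,
            pkg_resources.UnknownExtra):
        return False

if astropy_available():
    pass  # the astropy loader module would be imported/registered here
```

The same pattern applies to the scipy extra; newer setuptools deprecates pkg_resources in favour of importlib.metadata, but the idea is identical.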
h5py version limited to <3.x according to issue telegraphic#143
Force-pushed 661a2b8 to d4ef711
Hi @hernot and @1313e -- I'm taking a dive into the PR this week. I'm currently on paternity leave so have been even slower to respond than usual (first child so tough learning curve!). For now, let me say @hernot I really appreciate all the effort and thought you've clearly put into this. Apologies for my tardiness!
@telegraphic no worry, everything is set and prepared here, so I'm just waiting for your go.
Making some notes here as I go through:

Compare file structure 4.0.4 and 4.1.0

```python
# Make a basic test file with 4.1.0 and 4.0.4
a = {'a': 1, 'b': 'hello', 'c': [1, 2, 3]}
b = np.array([1, 2, 3, 5])
c = (a, b)
hkl.dump(c, 'hkl_410.hkl')
# (rinse and repeat for hkl_404.hkl with 4.0.4)
```

Compare the two file structures: the difference in file format means 4.0.4 will not be able to load 4.1.0; it fails with an error.

Check 4.1.0 can load 4.0.4 still

Yup (at least for this basic test file).
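A structural comparison like the one above can also be scripted with h5py. A minimal sketch, assuming the two dump files from the snippet above exist and using hickle's `base_type` attribute name:

```python
# Walk an HDF5 file and collect (path, node kind, base_type attribute)
# for every node, so two dumps can be diffed structurally.
import h5py

def tree(path):
    """Return a list of (name, kind, base_type) tuples for all nodes."""
    nodes = []
    def visit(name, obj):
        kind = 'group' if isinstance(obj, h5py.Group) else 'dataset'
        nodes.append((name, kind, obj.attrs.get('base_type')))
    with h5py.File(path, 'r') as f:
        f.visititems(visit)
    return nodes

# e.g. print only the structural differences between the two dumps:
# for n in tree('hkl_404.hkl'):
#     print(n)
# for n in tree('hkl_410.hkl'):
#     print(n)
```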
Hey @hernot and @1313e, overall I am happy for this to be merged, as it is a precondition on H4EP002 and H4EP003.

What changed and was it worth it?

Firstly, I note that this is a pretty major underlying change to how hickle works under the hood. Even though there are a lot of changes to the loading/dumping logic, changes to the actual file structure are minor. I like the changes to how dictionaries are dumped without extraneous attributes. Part of the motivation for the changes in H4EP001 was issue #125, to do with the use of pickle. So, the big question: are the major changes in H4EP001 worth it? At first glance, the end functionality to the user has not changed!

A quick case study

To weigh this up, I found it useful to look at how the scipy sparse matrix support changed -- see diff here. Sparse matrices are stored in the hdf5 file as three datasets -- which need to be recombined into a single sparse array when loaded by hickle. Where this previously required special-case handling, it is now done by a

class SparseMatrixContainer(PyContainer):

which I think is a slightly better design pattern. However there is some more complexity in register_class:

```python
def register_class(myclass_type, hkl_str, dump_function=None, load_function=None, container_class=None):
    """ Register a new hickle class.

    Parameters:
    -----------
        myclass_type type(class): type of class
        hkl_str (str): String to write to HDF5 file to describe class
        dump_function (function def): function to write data to HDF5
        load_function (function def): function to load data from HDF5
        container_class (class def): proxy class to load data from HDF5

    Raises:
    -------
        TypeError:
            myclass_type represents a py_object the loader for which is to
            be provided by hickle.lookup and hickle.hickle module only
    """
```

My conclusions

These changes add complexity, but they do promise to make some things easier in the future. I think, @hernot and @1313e, we can now agree that some data structures in Python are not easy to map to HDF5 optimally without some ugly code... Overall I am supportive of merging this given H4EP002 and H4EP003, which extend H4EP001 with more tangible improvements. The improvement to the file structure when dumping dictionaries is also nice. My apologies once again for the latency on the review. Thanks @hernot and @1313e for your patience.

Small request: @hernot in the future, can you pretty please make smaller commits instead of one large one so it's easier to review? (I know this is difficult when refactoring, but it would be very helpful!)
I am still very sceptical of the usefulness of this PR.
@1313e @telegraphic that is why I wrote this note in one of my comments above. I'm pretty fine if you consider this and all the already prepared following pull requests, which will introduce further, even bigger changes to the file format rendering them major and not just minor changes, something to spare for a hickle >= 5.0 release instead of just bumping the version 4 minor. Thinking about it, that might anyway be the wiser option. @1313e I'm pretty fine to first assemble all prepared pieces (#138, #139, #145, and clean-up and finalization) in a hickle 5 RC-1 proposal branch in my fork. Thereby it is very important to me that I do this in full agreement and coordination with you two, @telegraphic and @1313e. So may I suggest that I create a Hickle-5-RC branch in my forked repo, add there all prepared pull requests as commits, and post here when it is ready for discussion, either in continuation of this discussion or as part of reviewing the new branch together.
Yes, that's fine with me.
Also, the compression thing I already fixed.
I would have appreciated also getting the OK from @telegraphic before closing.
Just following up: I see @1313e's point that major changes should come with functionality improvements, and I see you've opened a new PR for a v5 bump -- all sounds good to me. I am more ambivalent about the code changes, so @1313e when we have RC5 ready I'll be relying on you to identify areas that you flag for reversion / request changes.
Sounds good to me 👍
At first: @1313e, with this pull request I want to express how much I appreciate the really great work you did for hickle 4.0.0, implementing the first step towards dedicated loaders.
Second: the reason why I'm pushing so hard for the implementation of H4EP001.
The research conducted by the research group I'm establishing and leading is split into two tracks: a methodological one dealing with the improvement and development of new algorithms and methods for clinical procedures in diagnostics and treatment, and a second one concerned with clinical research utilizing the tools based upon the methods and algorithms provided by the first track.

In the first track Python, numpy, scipy etc. are the primary tools for working on the algorithms and investigating new procedures and algorithmic approaches. The work in the second track is primarily conducted by clinicians. Therefore the tools provided for their research and studies have to be thoroughly tested and validated. This validation, at least the part which can be automated through unit tests, utilizes test data, including intermediate data and results obtained and provided by the Python programs and scripts developed alongside the underlying algorithms.

As the clinical tools are implemented in compiled languages which support true multi-threading, the data passed on has to be stored in a file format readable outside Python, ruling out pickle strings. Therefore jsonpickle was used to dump the data. Meanwhile the amount of data has grown so large that JSON files, even if compressed using zip, gzip or other compression schemes, are not feasible any more. NPY and NPZ files, which were the next choice, mandate a dependency upon the numpy library. Just for conducting unit tests, a self-contained file format for which only the corresponding library, without any further dependencies, has to be included would be the better choice.
And this is the point where the hdf5 libraries and hickle come into play. I do consider both the best and most suitable option I have found so far. And the current limitation that objects without a dedicated loader are stored as pickle strings can be solved by supporting the Python copy protocol, which I hereby offer to contribute to hickle.
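What "supporting the Python copy protocol" could mean here, as a minimal sketch (the class `Sample` is a hypothetical illustration, not hickle code):

```python
# The copy protocol lets an object describe itself as a reconstructor
# callable plus an argument tuple via __reduce__.  A dumper could store
# these pieces as ordinary HDF5-mappable data instead of an opaque
# pickle string, and a loader could rebuild the object from them.

class Sample:
    def __init__(self, a, b):
        self.a, self.b = a, b
    def __reduce__(self):
        # (callable, args) -- both parts are storable without pickling
        # the whole object as a blob
        return (Sample, (self.a, self.b))

obj = Sample(1, [2, 3])
reconstructor, args = obj.__reduce__()
# a loader can later rebuild the object from the stored pieces:
rebuilt = reconstructor(*args)
print(rebuilt.a, rebuilt.b)  # 1 [2, 3]
```

In the general case `__reduce_ex__` may also return state, list items, and dict items, which would each map naturally to subitems of an h5py.Group.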
Third, the content of this pull request: the implementation of container-based and mixed loaders as proposed by hickle extension proposal H4EP001 (#135). For details see the commit message and the proposal #135.
Finally, I do recommend not putting this into an official release yet. Some extended tests using a real dataset, compiled for testing and validating software tools and components developed for use in the clinical track, showed that an important part is missing to keep file sizes at a reasonable level. Especially type strings and pickle strings for class and function objects currently take up most of the file space, letting dumped files quickly grow into the GB range even with hdf5 file compression activated, where the plain pickle stream requires just 400 MB of space. Therefore I do recommend additionally implementing memoization (H4EP002, #139) first, before considering the resulting code base ready for release.
PS: loading of hickle 4.0.0 files should still be possible out of the box. Due to the lack of an appropriate test file, no test is included to verify this.