
Hickle 5 rc #149

Merged
merged 31 commits into telegraphic:dev-v5 on Dec 16, 2021
Conversation

hernot
Contributor

@hernot hernot commented Jan 18, 2021

Ok, here we go. As agreed while discussing PR #138, this is the fully assembled hickle-5-RC branch. In case you prefer to review directly within my repo, feel free to close this PR again; I'm fine with whatever you prefer.
I'm not sure whether all commits reported below are really visible from within this branch or whether the log below is an excerpt of the git reflog.

I also expect AppVeyor and Travis to complain a bit due to the tox-related changes; I would be surprised if things worked on the first try. Anyway, fingers crossed.

The reason for the Python 3.5 failure is known (I can't test it here any more, lacking a Python 3.5 installation). The fix is a single change in one line, which I will make if support for Python 3.5 is to be kept beyond hickle 4. No problem at all.

With the astropy warnings I would need your help, as we do not use astropy here, so I have no clue how to fix them.

With hickle 4.0.0 the code for dumping and loading dedicated objects
such as scalar values or numpy arrays was moved to dedicated loader
modules. This first step of disentangling the hickle core machinery from
object-specific code covered all objects and structures which were mappable
to h5py.Dataset objects.

This commit provides an implementation of hickle extension proposal
H4EP001 (telegraphic#135). In this proposal the extension of the loader
concept introduced by hickle 4.0.0 towards generic PyContainer based and
mixed loaders is specified.

In addition to the proposed extension, this implementation includes the
following extensions of hickle 4.0.0 and H4EP001:

H4EP001:
========
    The PyContainer interface includes a filter method which allows loaders,
    when data is loaded, to adjust, suppress, or insert additional data subitems
    of h5py.Group objects. In order to accomplish the temporary modification
    of h5py.Group and h5py.Dataset objects when the file is opened in read-only
    mode, the H5NodeFilterProxy class is provided. This class stores all
    temporary modifications while the original h5py.Group and h5py.Dataset
    objects stay unchanged.
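
    For orientation, a container based loader might look roughly like the
    following sketch; the method names follow the description above, while the
    exact signatures, the import path and the MyObject class are assumptions
    for illustration, not the final interface:

        # illustrative sketch only; import path and signatures are assumed
        from hickle.helpers import PyContainer

        class MyObjectContainer(PyContainer):

            def filter(self, h_parent):
                # adjust, suppress or insert sub-items before they are loaded
                for name, item in h_parent.items():
                    if not name.startswith('_ignored_'):
                        yield name, item

            def append(self, name, item, h5_attrs):
                # collect the already restored sub-items
                self._content.append(item)

            def convert(self):
                # assemble and return the final Python object
                return MyObject(self._content)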

hickle 4.0.0 / 4.0.1:
=====================
    Strings and arrays of bytes are stored as Python bytearrays and not as
    variable-sized strings and bytes. The benefit is that hdf5 filters
    and hdf5 compression filters can be applied to Python bytearrays.
    The downside is that the data is stored as bytes of int8 datatype.
    This change affects native Python string scalars as well as numpy
    arrays containing strings.

    numpy masked arrays are now stored as an h5py.Group containing a
    dedicated dataset each for data and mask.

    scipy.sparse matrices are now stored as an h5py.Group containing
    the datasets data, indices, indptr and shape.

    Dictionary keys are now used as names for h5py.Dataset and
    h5py.Group objects.

    Only string, bytes, int, float, complex, bool and NoneType keys are
    converted to name strings; for all other keys a key-value-pair group
    is created containing the key and value as its subitems.

    String and bytes keys which contain slashes are converted into key-value
    pairs instead of converting the slashes to backslashes.
    The distinction from hickle 4.0.0 string and bytes keys with converted
    slashes is made by enclosing the string value within double quotes
    instead of the single quotes produced by the Python repr function or the
    !r or %r string format specifiers. Consequently, on load all string keys
    which are enclosed in single quotes are subjected to slash conversion
    while any others are used as is.
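
    As a rough illustration of the key handling described above (the file
    name and values are arbitrary; the exact on-disk layout is hickle's):

        import hickle

        data = {
            'plain': 1,           # simple string key -> used directly as node name
            42: 'answer',         # int key -> converted to a name string
            ('a', 'b'): 3.0,      # tuple key -> stored as a key-value-pair group
            'with/slash': 'x',    # slash in key -> stored as a key-value pair
        }
        hickle.dump(data, 'keys_demo.hkl')
        assert hickle.load('keys_demo.hkl') == data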

    h5py.Group and h5py.Dataset objects whose 'base_type' refers to 'pickle'
    automatically get assigned object as their py_object_type on load.
    The related 'type' attribute is ignored. h5py.Dataset objects which do
    not expose a 'base_type' attribute are assumed to contain a pickle string
    and thus implicitly get assigned the 'pickle' base type. Consequently, on
    dump the 'base_type' and 'type' attributes are omitted for all h5py.Dataset
    objects which contain pickle strings, as their values would be 'pickle' and
    object respectively.

Other stuff:
============
    Full separation between hickle core and loaders.

    Distinct unit tests for individual loaders and hickle core.

    Cleanup of functions and classes which are no longer required.

    Simplification of recursion on dump and load through a self-contained
    loader interface.

    Hickle is capable of loading hickle 4.0.x files, which do not yet
    support the PyContainer concept beyond list, tuple, dict and set;
    extended tests for loading hickle 4.0.x files are included.

    Contains a fix for the lambda py_obj_type issue on numpy arrays with
    a single non-list/tuple object content. Python 3.8 refuses to
    unpickle the lambda function string. This was observed while finalizing
    the pull request. The fixes are only activated when a 4.0.x file is to
    be loaded.

    The exception thrown by load now includes the exception that triggered
    it, including the stack trace, for better localization of the error
    during debugging and error reporting.
@1313e 1313e self-requested a review January 19, 2021 00:15
@1313e
Collaborator

1313e commented Jan 19, 2021

What is commit d1760ee doing in here?
That is a commit from over 2 years ago and has no relevance at all to the current state of hickle.

@hernot
Contributor Author

hernot commented Jan 19, 2021

Shit, definitely some rebase error; not sure how to get it out again. It shouldn't be there. Too late for today.

@1313e
Collaborator

1313e commented Jan 19, 2021

Also, given that this would be hickle v5, it is not necessary that the base code can handle v4 files.
Support for v4 files can be added in the same way as v4 currently handles v3 files.

@hernot
Contributor Author

hernot commented Jan 19, 2021

It does, and the differences are small enough that they can easily be handled by stacking the few specialized loader functions on top of the default loader functions, like the compact-expand loader, which is also implemented as an optional loader that is stacked on top of all others on request. Further, I fear that the issue I observed only when running tests on Python 3.6 and 3.7 in connection with h5py 3 would not vanish by pulling hickle 4 files into dedicated legacy loading. And this way users who have not yet had time to convert their files from hickle 3 to 4 can still do so using a matured hickle 5. Thus, unless severe obstacles and issues speak against it, I would prefer to keep support as implemented ;-)

@1313e
Collaborator

1313e commented Jan 20, 2021

It is not like v5 will be released any time soon, as many, many changes have to be made to this before it can be merged into master.

@hernot
Contributor Author

hernot commented Jan 20, 2021

I didn't expect it to be. As long as I'm able to keep in sync with your changes on dev, it should be quite easy for me to take care of the amendments necessary to reach a releasable state. Just let me know where you have doubts, where you think improvement is necessary, or where things have to be pushed further than done. I'm open to any input and hints. The only thing I want to avoid is that sooner or later two incompatible versions of hickle exist, the official one and mine; even if I start using mine for now, all findings will be reported here and go back into this branch.

@telegraphic
Owner

Thanks @hernot, looking forward to this coming to fruition.

@hernot
Contributor Author

hernot commented Feb 17, 2021

Looks like things are changing these days.
Numpy 1.20 introduces changes which prevent numpy.dtype classes like numpy.int16 from being pickled, and the solution will not be released before numpy 1.21 (for details see the commit message).
The latest hdf5.dll (1.12) for Windows is only available in 64 bit, no longer in 32 bit, which would be a plausible explanation why AppVeyor 32-bit runs fail from Python 3.6 on while the 64-bit runs finish properly. For Python 3.5 it seems as if h5py still recommends hdf5 1.10 or 1.8, for which 32-bit Windows DLLs are available.

Not sure which is better:

  • drop Windows 32-bit support
  • add a dedicated dependency line limiting h5py to 2.10 for Windows 32 bit while allowing all others to use a recent h5py version

@hernot hernot changed the base branch from dev to v3 February 19, 2021 19:15
@hernot hernot changed the base branch from v3 to dev February 19, 2021 19:16

    h5py version limited to <3.x according to issue telegraphic#143
…and type)

Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager dictionary-type
object. For storing object instance references it is used as a python dict
which, when dumping, stores the references to the py_obj and the related
node using id(py_obj) as key. On load the id of the h_node is used as key
for storing the to-be-shared reference of the restored object.

Additional references to the same object are represented by
h5py.Datasets with their dtype set to ref_dtype. They are created
by assigning an h5py.Reference object as returned by the h5py.Dataset.ref or
h5py.Group.ref attribute. These datasets are resolved by the filter
iterator method of the ExpandReferenceContainer class and returned as
sub_items of the reference dataset.
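
In effect, multiple references to one and the same object are stored only once
and are restored as one shared object. A minimal check of that behaviour
(file name arbitrary):

    import hickle

    shared = {'payload': list(range(5))}
    data = [shared, shared]              # two references to the same object
    hickle.dump(data, 'memo_demo.hkl')

    restored = hickle.load('memo_demo.hkl')
    assert restored[0] is restored[1]    # both entries resolve to one object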

Type Memoisation
================
The 'type' attribute of all nodes, except datasets which contain pickle
strings or expose ref_dtype as their dtype, now contains a reference
to the appropriate py_obj_type entry in the global 'hickle_types_table'.
This table hosts the datasets representing all py_obj_types and
base_types encountered by hickle.dump, each stored once.

Each py_obj_type is represented by a numbered dataset containing the
corresponding pickle string. The base_types are represented by empty
datasets the name of which is the name of the base_type as defined
by the class_register table of the loaders. No entry is stored for the
(object, b'pickle') and (hickle.ReferenceType, b'!node-reference') pairs,
as these can be resolved implicitly on load.
The 'base_type' attribute of a py_obj_type entry refers to the
base_type used to encode it and required to properly restore it again
from the hickle file.

The entries in the 'hickle_types_table' are managed by the
ReferenceManager.store_type and ReferenceManager.resolve_type methods.
The latter also takes care of properly distinguishing pickle
datasets from reference datasets and of resolving hickle 4.0.X dict_item
groups.

The ReferenceManager is implemented as a context manager and thus can and
shall be used within a with statement to ensure proper cleanup. Each
file has its own ReferenceManager instance, therefore different data can
be dumped to distinct files which are open in parallel. The basic
management of managers is provided by the BaseManager base class, which can
be used to build further managers, for example to allow loaders to be
activated only when specific feature flags are passed to the hickle.dump
method or encountered by hickle.load in the file attributes. The BaseManager
class has to be subclassed.

Other changes:
==============
 - lookup.register_class and the class_register tables have an additional
   memoise flag indicating whether a py_obj shall be remembered for
   representing and resolving multiple references to it, or whether it shall
   be dumped and restored every time it is encountered

 - lookup.hickle_types table entries include the memoise flag as third entry

 - lookup.load_loader: the tuple returned in addition to py_obj_type
   includes the memoise flag

 - hickle.load: whether to use the load_fn stored in
   lookup.hkl_types_table or a PyContainer object stored in
   hkl_container_dict is decided upon the is_container flag returned
   by ReferenceManager.resolve_type instead of checking whether the
   processed node is of type h5py.Group

 - dtype of string and bytes datasets is now set to 'S1' instead of 'u8'
   and shape is set to (1, len)
@hernot
Contributor Author

hernot commented Feb 24, 2021

Just to let you know, I'm currently amending commit 940710c to finally get rid of the copy-protocol mimicry, as it turns out that it is not necessary to achieve what is intended by H4EP003 #145 (see the latest edits of #145). As soon as I am finished with the amendment I will take care of the rest, for which I would appreciate advice on whether to limit versions for win32 or whether to drop win32 support in the proposed hickle 5 release candidate.

@1313e
Collaborator

1313e commented Feb 24, 2021

As 32-bit Windows versions are still widely used, I think support for them should not be dropped.
However, if h5py no longer supports it, we might have to drop it as well.

@hernot
Contributor Author

hernot commented Mar 1, 2021

  • Back to the surface.
  • Amendment done
  • Python copy protocol mimicry (__compact__ and __expand__) gone
  • custom package/module and program loaders enabled:
    • auto loading of dump_fcn, load_fcn and PyContainer (for details see the commit message implementing H4EP003)
    • special loader for recovering data where the package/module is not installed and thus the loader is missing

Will check how the win32 issue can be solved with least effort and maximum gain.

hernot added 5 commits March 2, 2021 07:12
…ives

In the current release custom objects are simply converted into a binary
pickle string. This commit implements the H4EP003 (issue telegraphic#145)
register_class based alternative featuring custom loader modules.

In hickle 4.x custom loader functions can be added to hickle by
explicitly calling hickle.lookup.register_class before calling
hickle.dump and hickle.load. In H4EP003 an alternative approach, the
hickle specific compact_expand protocol mimicking the python copy protocol,
is proposed. It was found that this mimicry does not provide any
benefit compared to the hickle.lookup.register_class based approach.
Even worse, hickle users have to litter their class definitions with
hickle-only loader methods called __compact__ and __expand__, and
in addition have to register their class for compact_expand and activate
the compact_expand loader option to enable the use of these two
methods.

This commit implements hickle.lookup.register_class package and program
loader module support. In addition, load_<package>.py modules may, besides
the hickle.loaders directory, also be stored along with the python base
package, module or program main script. In order to keep the directory
structure clean, load_<package>.py modules must be stored within the
special hickle_loaders subdirectory at that level.

Example package loader:
-----------------------

dist_packages/
  +- ...
  +- <my_package>/
  |   +- __init__.py
  |   +- sub_module1.py
  |   +- ...
  |   +- sub_moduleN.py
  |   +- hickle_loaders/
  |   |   +- load_<my_package>.py
  +- ...

Example single module loader:
-----------------------------

distpackages/
  +- ...
  +- <my_module>.py
  +- ...
  +- hickle_loaders/
  |   +- ...
  |   +- load_<my_module>.py
  |   +- ...
  +- ...

Example program main (package) loader:
--------------------------------------

bin/
  +- ...
  +- <my_single_file_program>.py
  +- hickle_loaders/
  |   +- ...
  |   +- load_<my_single_file_program>.py
  |   +- ...
  +- ...
  +- <my_program>/
  |   +- ...
  |   +- <main>.py
  |   +- ...
  |   +- hickle_loaders/
  |   |   +- ...
  |   |   +- load_<main>.py
  |   |   +- ...
  |   +- ...
  +- ...
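
Such a load_<my_package>.py module could, very roughly, look like the sketch
below. The function signatures and the class_register entry layout are
assumptions for illustration only (MyClass and its values attribute are
placeholders); the surrounding text describes the authoritative layout:

    # hickle_loaders/load_<my_package>.py -- illustrative sketch only
    from my_package import MyClass

    def create_myclass_dataset(py_obj, h_group, name, **kwargs):
        # kwargs forwards h5py keywords such as compression settings
        d_set = h_group.create_dataset(name, data=py_obj.values, **kwargs)
        return d_set, ()

    def load_myclass(h_node, base_type, py_obj_type):
        return py_obj_type(h_node[()])

    class_register = [
        # (py_obj_type, base_type, dump_fcn, load_fcn, PyContainer, memoise)
        (MyClass, b'MyClass', create_myclass_dataset, load_myclass, None, True),
    ]
    exclude_register = []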

Fallback Loader recovering data:
--------------------------------
Implements special AttemptsRecoverCustom types used as replacement when the
corresponding Python objects are missing or are incompatible with the stored
data. The affected data is loaded as RecoveredGroup (dict type) or
RecoveredDataset (numpy.ndarray type) objects. Attached to either is the
attrs attribute as found on the corresponding h5py.Group or h5py.Dataset in
the hickle file.

LoaderManager:
==============

The LoaderManager based approach allows adding further optional loader
sets. For example, when loading a hickle 4.0.X file the corresponding
loader set is implicitly added to ensure 'DictItem' and other
helper types specific to hickle 4.0.X are properly recognized and the
corresponding data is properly restored. Only optional loaders which are
listed in the 'optional_loaders' exported by hickle.loaders.__init__.py are
considered valid, with the exception of legacy loaders provided by hickle
core (currently 'hickle-4.0').

A class_register table entry can be assigned to a specific optional
loader by specifying the loader name as its 7th item. Any entry
which has fewer than 7 items or whose 7th item reads None is included in
the set of global loaders.

@hernot
Hickle 4.0.x accepts file-like objects but uses them only as a model for
creating a new file, discarding the file handler passed in. If the file
is accessed later on using the same file handler this results in an
IOError or alike being raised.

The proposed fix does not close the file but checks that it is opened at
least readable if mode is 'r', or readable and writable in case mode reads
'w', 'r+', 'w+', 'x', 'x+' or 'a'.

Note: the passed-in file or file-like object has to be closed by the caller.
Compression keywords safety tests:
==================================

In issue telegraphic#140 it was reported that some loaders crash
when 'compression', 'chunks' and related h5py keyword arguments are
specified. By running pytest a second time and thereby specifying
the custom --enable-compression parameter, all tests are rerun with

   kwargs={'compression':'gzip', 'compression_opts':6}

All compression sensitive tests, especially all 'test_XX_*.py::*' unit test
functions, must include the 'compression_kwargs' parameter in their
signature to receive the actual kwargs to be passed to all 'create_fcn'
functions defined by the loader modules. In case a test function fails to
pass on 'compression_kwargs' as keyword arguments ('**kwargs') to
   'hickle.dump',
   'hickle._dump',
or any dump_method listed in the 'class_register' table of a loader module or
specified directly in a 'LoaderManager.register_class' call, an AssertionError
exception is thrown indicating the name of the test function and the line
in which the affected function, or any function it calls, is called.
Tests which either test compression related issues explicitly or do not
call any of the dump functions may be marked with the
   'pytest.mark.no_compression'
marker to explicitly exclude the test function from compression testing.
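
A test following this convention would look roughly like the sketch below
(the test, the dumped object and the file handling are made up; the
'compression_kwargs' parameter and the marker are the ones described above):

    import pytest
    import hickle

    def test_dump_my_type(compression_kwargs, tmp_path):
        # forward the injected kwargs to every dump call so that the
        # --enable-compression rerun actually exercises compression
        filename = str(tmp_path / 'data.hkl')
        hickle.dump({'a': 1}, filename, **compression_kwargs)
        assert hickle.load(filename) == {'a': 1}

    @pytest.mark.no_compression
    def test_not_compression_related():
        # explicitly excluded from the compression rerun
        assert True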

Tox virtual env manager support:
================================
Adds support for the virtualenv manager tox. Tox simplifies local testing of
compatibility with multiple python versions before pushing to github and
creating a pull request. Travis and Appveyor integration still has to be
tested and verified.

The current release causes problems when trying to dump and load data
when h5py 3.X or newer is installed. In that case loading
will fail with errors. This is caused by
 - a change in how strings are stored in attributes (illustrated by the
   short sketch following this commit message).
   Before 3.x they were stored as byte strings. Encoding had to happen
   manually before storing to and after reading from the attribute. In
   h5py 3 encoding is handled by attrs.__setitem__ and attrs.__getitem__.
   The latter simply assumes that any byte string stored as attribute
   value represents a python 'str' object which either was written
   by python 2, which used plain C strings, or is the encoded version of
   a Python 3 utf8 string and thus has to be decoded as utf8. Only in
   case the byte string contains characters which are not valid in a utf8
   encoded string is the content returned as a 'bytes' object.
 - a change in how utf8 'str' objects are stored. They are now
   consistently stored as var_dtype objects where the metadata of the dtype
   indicates the resulting python type. In h5py 2 this didn't matter,
   therefore strings stored in datasets using h5py 2 are returned with an
   opaque object dtype for which decoding to utf8 has to be enforced if
   the py_obj_type indicated by the 'type' attribute is 'str'.
 - a change in how Datasets and Groups for which no explicit link exists in
   the file but a strong Python reference to the h5py.Dataset or h5py.Group
   object exists are handled when a reference to them is resolved. In h5py 2
   a ValueError is raised; in h5py 3.X an anonymous Dataset or Group is
   returned. Consequently, when simulating during test runs stale references
   encountered by hickle.load, all python references to the corresponding
   datasets and groups have to be dropped to make the reference appear stale.
   The alternative would be to check the name of the referred-to Dataset:
   if it is None the dataset is anonymous. Resolving the name of non-anonymous
   Datasets and Groups, which always represents the full path name and
   not just the local name, is quite expensive and costly in terms of
   runtime and resources and shall be avoided.

This commit provides fixes for these changes, which severely affect proper
operation of hickle. String values of attributes are compared against
the 'bytes' and 'str' type variants of the constant strings. Lookup tables
like key_base_type, hkl_types and hkl_container_types include the key for
an entry both as 'str' and as 'bytes' type key. A dedicated optional load_fcn
is provided fixing the difference in how the dtype of utf8 strings is computed
by h5py 2 and h5py 3. Test functions are updated accordingly to properly
mock stale reference conditions and the changes in how strings are stored.
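
The attribute related change boils down to something like the following
(illustrative only; file and attribute names arbitrary):

    import h5py

    with h5py.File('demo.h5', 'w') as f:
        f.attrs['base_type'] = b'pickle'      # stored as a byte string

    with h5py.File('demo.h5', 'r') as f:
        value = f.attrs['base_type']
        # h5py 2.x returns b'pickle' (bytes); h5py 3.x decodes it and returns
        # 'pickle' (str), which is why lookups now accept both variants of a key.
        assert value in (b'pickle', 'pickle')
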
@hernot
Contributor Author

hernot commented Oct 17, 2021

Ok, fixes done. The numpy dependency is still not changed because it is affected by the issue current dill versions have with dumping type(numpy.dtype) class objects for numpy >= 1.21, which are required for annotating numpy.dtype objects with their proper py_object_type. A future dill release will have a dedicated fix for it. In general this will occur for any project which, like numpy, uses dynamic classes and alike. And it is only to be expected that the dill project will add further workarounds and fixes for popular packages like numpy, but not for smaller or personal projects. So the only solution in the current state is either to combine the latest numpy with the latest dill including the fix, or to manually resort to numpy < 1.20. Both solutions are quite limiting and reduce the usability of hickle.

After testing, the following options seem feasible and reasonable for hickle. I need to know which route to go:

  • keep the requirement as is until dill has officially released the fix and the dtype test does not fail any more. This, after updating the requirements, needs clear knowledge of which dill/numpy combinations work and, as stated above, can be quite limiting to the usage of hickle.
  • change the requirement to only skip the numpy 1.20.X versions and disable the dtype test. This needs a clear indication in the README that currently no dtype objects can be dumped with hickle when using numpy >= 1.21.
  • drop dill and resort to vanilla python pickle. This needs further testing and verification by you two, @telegraphic and @1313e, to ensure that no unexpected issues occur. For this I have prepared in my forked hickle repo the dedicated testing-and-playing-only dill-or-not-dill branch.

NOTE
Windows runners on GitHub actions seem not to expose the most consistent configuration. Some of them seem to be configured to globally allow read access to files opened for writing while others prohibit this, causing the error described in the log message of the latest unplanned commit. I just stumbled across this when preparing the dill-or-not-dill branch within my forked repo. I wanted to spare you the despair I faced before accepting what my gut suspected from the start.

EDIT
Found another bug breaking package, module and main script loader module support. I fear I deleted the relevant code accidentally when cleaning up coverage flagging # pragma: nocover comments. Luckily I found it. Let me know if I should squash all the small fixes into one single commit or keep them as they are.

hernot added 14 commits October 17, 2021 14:41
…cleanup. Without package, module and script loader support is broken
…properly and stable under any circumstances?
…properly and stable under any circumstances?
… collector, try forcing collection after file flushing and before closing
On the GitHub actions windows runners it was observed for python 3.7 32 bit
and 64 bit and python 3.8 64 bit that h5py.copy aborted with a PermissionError
exception when copying hdf5 files which are opened for writing. On the
remaining three windows runners as well as the linux and mac runners copying
was possible without any issue.

Adapted the test_ReferenceManager_context and test_ReferenceManager tests to
close the file accessed through the h5_data fixture before calling
shutil.copy.

Fixed a spurious "Unrecognized typecode" ValueError exception occurring
during h5py.File.close. These are caused by the python garbage collector
kicking in while h5py.close is in a critical section. By disabling
garbage collection before the call to h5py.File.close and re-enabling it
afterwards this can be solved for all python versions. This error could be
observed most consistently on windows 32 bit running python 3.7 with
h5py 2.10.0.
…cleanup. Without package, module and script loader support is broken
@telegraphic
Owner

Hi @hernot -- catching up on this now. I think a larger issue is that dill shouldn't be pickling numpy dtypes!

Numpy dtypes should be handled by create_np_dtype. Any idea why this isn't happening?

@hernot
Contributor Author

hernot commented Oct 20, 2021

@telegraphic
It is happening; create_np_dtype is working properly. The problem is the creation of the py_obj_type attribute string, which is a pickle string of, for example, np.dtype(int).__class__. That works flawlessly with current dill for numpy < 1.20. With numpy >= 1.20 the numpy project has changed a few things, including that they now use dynamic class objects for encoding the class of each live dtype.

If you look at type.mro(np.dtype(int).__class__) it looks like the following for numpy >= 1.20: [<class np.dtype[int]>, <class np.dtype>, <class object>]. As you can see, the actual class has a repr string and a __qualname__ which is not a valid python expression. When this change was first published with numpy 1.20.X this caused a problem with standard pickle too. But in numpy >= 1.21 they have added a copyreg.dispatch_table entry for type(np.dtype) type objects which handles the dumping of the new-style np.dtype objects.

Dill does not make use of this table, so it is not able to properly handle this case. Last weekend I had a discussion with the numpy and dill projects, with the result that the dill project has added a workaround for exactly this case which will be available in the next release. Looking at the workaround, it is not the only one the dill project uses to handle special cases like the new way numpy encodes the classes of its dtype objects.

In the light of the now official possibility to add custom loaders to hickle outside the official hickle/loaders directory, it is to be expected that similar issues will occur more frequently with custom classes from users who will report them as issues on the hickle project. Given the fact that the dill project has reasons to still stick to a python 2 style pickler and not utilize copyreg and the other extensions introduced by python 3, it is to be expected that there will only be dill-side workarounds for issues related to projects larger than the dill project.

This on the other hand will sooner or later limit the outreach and use cases of hickle, which, as far as I understand, was and is intended by you @telegraphic and @1313e to be as compatible with standard pickle as possible and feasible. Therefore users have the ability to add their own loaders for their very local and specific classes and cases. If they are not picklable by pickle then hickle should not handle them either. But if they are properly handled by standard pickle I would expect hickle to handle them too, and I would be surprised if it didn't, just because dill refuses to pickle the class object of my fancy object.

Hope I could shed some light on the core of the issue and on what breaks when changing the numpy requirement from < 1.20 to != 1.20 in order to include numpy versions >= 1.21.

@1313e
Collaborator

1313e commented Oct 20, 2021

I think with all that, we should drop dill and go back to pickle again.
I am somewhat surprised that dill does not use copyreg, because I think it should use that, but if they don't, then there is no reason to use dill anymore.
After all, copyreg is how you add more objects to pickle.
@telegraphic Thoughts?

@telegraphic
Owner

Thanks @hernot for the explanation! If pickle handles this correctly, I'm happy to switch to it as it is part of Python core. This major v5 release is the optimal time to make the change too!

I think pickling with protocol=4 is a good option? (https://www.python.org/dev/peps/pep-3154/) (Assuming this plays nicely with numpy @hernot?)

@hernot
Contributor Author

hernot commented Oct 20, 2021

@telegraphic Sorry, I misread your question at first. Normally pickle will anyway select the highest protocol version available in the respective python version.

In Python 3.5, 3.6 and 3.7 the default protocol is 3 and the highest is 4, whereas from 3.8 on the default is 4 and the highest is 5.
Therefore, yes, as of today there will be no problem reading files generated with any of the python versions currently supported by hickle. And from the documentation it seems that this will not change soon, at least not for python 3.9, 3.10 and the forthcoming python 3.11.
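
For reference, the relevant values can be checked directly (per the Python documentation):

    import pickle
    print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
    # Python 3.5-3.7: 3 and 4;  Python 3.8+: 4 and 5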

…irements updated to allow removal at any time from requirements.txt and requirements32.txt while it stays a requirement for testing. Removed the explicit six requirement as no loader for it is available yet and it is likely installed by other packages in an appropriate version anyway.
@telegraphic
Owner

@1313e and @hernot -- where are we on this? Ready to pull the trigger?

@1313e
Collaborator

1313e commented Dec 13, 2021

Yeah, I am fine with this merging.

@telegraphic
Owner

@hernot double checking you're done with the dill cleanout and ready for the big merge?

@hernot
Contributor Author

hernot commented Dec 13, 2021

Yes, I'm done.

  1. hickle 5 files are dumped without dill
  2. legacy loading of hickle <= 4.x files still needs dill.
    Therefore dill is still in the requirements for legacy loading, but the README.md indicates that it is only needed for legacy loading.
    It can be removed at any time from requirements.txt but must stay in requirements_test.txt as long as there is the need for legacy loading of hickle <= 4.x files.

@telegraphic telegraphic merged commit 9bedbec into telegraphic:dev-v5 Dec 16, 2021