Implements issue telegraphic#139 H4EP002 Memoisation scheme (object and type)

Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager, a
dictionary-type object. For storing object instance references it is
used as a plain Python dict which, when dumping, stores the reference
to the py_obj and the related node under id(py_obj) as key. On load
the id of the h_node is used as key for storing the shared reference
to the restored object.
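
The keying scheme can be illustrated with a plain dict standing in for
the ReferenceManager; the file name and the toy list below are invented
for the sketch, only the key choices mirror the _dump/_load code in this
commit:

    import h5py
    import numpy as np

    # a plain dict stands in for the ReferenceManager of this commit
    memo = {}

    with h5py.File("demo_memo.h5", "w") as f:
        shared = [1, 2, 3]                       # a py_obj encountered twice
        node = f.create_dataset("data_0", data=np.array(shared))

        # dump side: key is id(py_obj); py_obj is kept in the value so that
        # its id() stays valid until the whole structure has been written
        memo[id(shared)] = (node, shared)
        assert memo[id(shared)][0] is node       # second encounter reuses the node

        # load side: key is the low-level id of the h5py node
        restored = node[()].tolist()
        memo[node.id] = restored
        assert memo.get(node.id) is restored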

Additional references to the same object are represented by
h5py.Datasets with their dtype set to ref_dtype. They are created by
assigning the h5py.Reference object returned by the h5py.Dataset.ref or
h5py.Group.ref attribute. On load these datasets are resolved by the
filter iterator method of the ExpandReferenceContainer class and the
referenced node is yielded as sub_item of the reference dataset.
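
In plain h5py terms (the file and dataset names are invented for the
sketch; hickle wraps this in the ReferenceManager and
ExpandReferenceContainer machinery) creating and resolving such a
reference dataset looks roughly like this:

    import h5py
    import numpy as np

    ref_dtype = h5py.special_dtype(ref=h5py.Reference)

    with h5py.File("demo_ref.h5", "w") as f:
        first = f.create_dataset("data_0", data=np.arange(3))  # first occurrence
        # additional occurrence: a dataset whose dtype is ref_dtype and whose
        # value is the h5py.Reference returned by the .ref attribute
        link = f.create_dataset("data_1", data=first.ref, dtype=ref_dtype)

        # resolving the reference yields the originally dumped node again
        target = f[link[()]]
        assert target.name == "/data_0"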

Type Memoisation
================
The 'type' attribute of all nodes, except datasets which contain pickle
strings or expose ref_dtype as their dtype, now contains a reference
to the appropriate py_obj_type entry in the global 'hickle_types_table'.
This table hosts one dataset for each py_obj_type and base_type
encountered by hickle.dump.

Each py_obj_type is represented by a numbered dataset containing the
corresponding pickle string. The base_types are represented by empty
datasets whose names are the names of the base_types as defined by the
class_register tables of the loaders. No entry is stored for object,
b'pickle' or for hickle.ReferenceType, b'!node-reference!' as these
can be resolved implicitly on load.
The 'base_type' attribute of a py_obj_type entry refers to the
base_type used to encode it and required to properly restore it again
from the hickle file.
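
The resulting layout can be sketched with bare h5py calls; the entry
names, the 'V1' dtype of the empty base_type dataset and the attribute
encodings below are illustrative guesses, the real table is built by
ReferenceManager.store_type:

    import pickle
    import h5py
    import numpy as np

    ref_dtype = h5py.special_dtype(ref=h5py.Reference)

    with h5py.File("demo_types.h5", "w") as f:
        table = f.create_group("hickle_types_table")

        # empty dataset named after the base_type registered by the loader
        base_entry = table.create_dataset("list", shape=None, dtype="V1")

        # numbered dataset holding the pickle string of the py_obj_type ...
        type_entry = table.create_dataset("0", data=np.array(pickle.dumps(list)))
        # ... whose 'base_type' attribute points back at the base_type entry
        type_entry.attrs.create("base_type", base_entry.ref, dtype=ref_dtype)

        # a data node's 'type' attribute then references the py_obj_type entry
        node = f.create_dataset("data", data=np.arange(3))
        node.attrs.create("type", type_entry.ref, dtype=ref_dtype)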

The entries in the 'hickle_types_table' are managed by the
ReferenceManager.store_type and ReferenceManager.resolve_type methods.
The latter also takes care of properly distinguishing pickle datasets
from reference datasets and of resolving hickle 4.0.X dict_item
groups.

The ReferenceManager is implemented as a context manager and thus can
and shall be used within a with statement to ensure proper cleanup. Each
file has its own ReferenceManager instance, therefore different data can
be dumped to distinct files which are open in parallel; a usage sketch
follows below. The basic management of managers is provided by the
BaseManager base class, which can be used to build further managers, for
example to activate loaders only when specific feature flags are passed
to the hickle.dump method or encountered by hickle.load in the file
attributes. The BaseManager class has to be subclassed.
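
A minimal sketch of the calling pattern used by hickle.dump in this
commit (ReferenceManager is internal API; the file and group names are
invented and the recursive _dump call is only hinted at):

    import h5py
    from hickle.lookup import ReferenceManager

    with h5py.File("demo.hkl", "w") as f:
        h_root_group = f.create_group("root")
        # one manager per open file; the with block guarantees cleanup
        with ReferenceManager.create_manager(h_root_group) as memo:
            # hickle.dump would now call
            #     _dump(py_obj, h_root_group, 'data', memo, **kwargs)
            pass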

Other changes:
==============
 - lookup.register_class and class_register tables have an additional
   memoise flag indicating whether py_obj shall be remembered for
   representing and resolving multiple references to it or whether it
   shall be dumped and restored every time it is encountered (see the
   register sketch after this list)

 - lookup.hickle_types table entries include the memoise flag as third entry

 - lookup.load_loader: the tuple returned in addition to py_obj_type
   includes the memoise flag

 - hickle.load: whether to use the load_fn stored in
   lookup.hkl_types_dict or a PyContainer object stored in
   hkl_container_dict is decided based on the is_container flag returned
   by ReferenceManager.resolve_type instead of checking whether the
   processed node is of type h5py.Group

 - dtype of string and bytes datasets is now set to 'S1' instead of 'u8'
   and shape is set to (1,len)
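
For loader authors the extended rows look like those in
hickle/loaders/load_builtins.py of this diff; the sketch below just
restates one such row (entry order: py_obj_type, base_type, create_fn,
load_fn, container_class, memoise), assuming the loader module of this
commit is importable:

    from hickle.loaders.load_builtins import (
        create_scalar_dataset, load_scalar_dataset,
    )

    class_register = [
        # immutable scalars are cheap to re-dump, so memoisation is disabled
        [int, b"int", create_scalar_dataset, load_scalar_dataset, None, False],
    ]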
hernot committed Feb 19, 2021
1 parent 74103f9 commit 032a333
Showing 12 changed files with 1,163 additions and 242 deletions.
18 changes: 15 additions & 3 deletions hickle/helpers.py
@@ -81,8 +81,20 @@ def __init__(self,h5_attrs, base_type, object_type,_content = None):
# when calling the append method
self._content = _content if _content is not None else []

def filter(self,items):
yield from items
def filter(self,h_parent):
"""
PyContainer type child classes may overload this function to
filter and preprocess the content of the h_parent h5py.Group or
h5py.Dataset to ensure it can be properly processed by recursive
calls to the hickle._load function.
By default it yields from h_parent.items().
For examples see:
hickle.lookup.ExpandReferenceContainer.filter
hickle.loaders.load_scipy.SparseMatrixContainer.filter
"""
yield from h_parent.items()

def append(self,name,item,h5_attrs):
"""
@@ -160,7 +172,7 @@ def __getitem__(self,*args,**kwargs):

class no_compression(dict):
"""
subclass of dict which which temporarily removes any compression or data filter related
named dict comprehension which temporarily removes any compression or data filter related
arguments from the passed iterable.
"""
def __init__(self,mapping,**kwargs):
146 changes: 79 additions & 67 deletions hickle/hickle.py
@@ -43,7 +43,8 @@
from .helpers import PyContainer, NotHicklable, nobody_is_my_name
from .lookup import (
hkl_types_dict, hkl_container_dict, load_loader, load_legacy_loader ,
create_pickled_dataset, load_nothing, fix_lambda_obj_type
create_pickled_dataset, load_nothing, fix_lambda_obj_type,ReferenceManager,
link_dtype
)


@@ -69,11 +70,6 @@ class ToDoError(Exception): # pragma: no cover
def __str__(self):
return "Error: this functionality hasn't been implemented yet."

class SerializedWarning(UserWarning):
""" An object type was not understood
The data will be serialized using pickle.
"""

# %% FUNCTION DEFINITIONS
def file_opener(f, path, mode='r'):
"""
@@ -142,49 +138,65 @@ def file_opener(f, path, mode='r'):
# DUMPERS #
###########

def _dump(py_obj, h_group, name, attrs={} , **kwargs):
def _dump(py_obj, h_group, name, memo, attrs={} , **kwargs):
""" Dump a python object to a group within an HDF5 file.
This function is called recursively by the main dump() function.
Args:
Parameters:
-----------
py_obj: python object to dump.
h_group (h5.File.group): group to dump data into.
name (bytes): name of the resulting hdf5 group or dataset
memo (ReferenceManager): the ReferenceManager object
responsible for handling all object and type memoisation
related issues
attrs (dict): additional attributes to be stored along with the
resulting hdf5 group or hdf5 dataset
kwargs (dict): keyword arguments to be passed to create_dataset
function
"""

py_obj_id = id(py_obj)
py_obj_ref = memo.get(py_obj_id,None)
if py_obj_ref is not None:
# reference datasets have neither base_type nor py_obj_type set, as they
# can be distinguished from pickled data by their ref_dtype dtype; on load
# they are implicitly assigned the b'!node-reference!' base_type and
# hickle.lookup.NodeReference as their py_obj_type
h_link = h_group.create_dataset(name,data = py_obj_ref[0].ref,dtype = link_dtype)
h_link.attrs.update(attrs)
return

# Check if we have an unloaded loader for the provided py_obj and
# retrive the most apropriate method for crating the corresponding
# retrieve the most appropriate method for creating the corresponding
# representation within HDF5 file
if isinstance(
py_obj,
(types.FunctionType, types.BuiltinFunctionType, types.MethodType, types.BuiltinMethodType, type)
):
py_obj_type,create_dataset,base_type = object,create_pickled_dataset,b'pickle'
else:
py_obj_type, (create_dataset, base_type) = load_loader(py_obj.__class__)
py_obj_type, (create_dataset, base_type,memoise) = load_loader(py_obj.__class__)
try:
h_node,h_subitems = create_dataset(py_obj, h_group, name, **kwargs)

# loop through list of all subitems and recursively dump them
# to HDF5 file
for h_subname,py_subobj,h_subattrs,sub_kwargs in h_subitems:
_dump(py_subobj,h_node,h_subname,h_subattrs,**sub_kwargs)
# add addtional attributes and set 'base_type' and 'type'
# attributes accordingly
h_node.attrs.update(attrs)

# only explicitly store base_type and type if not dumped by
# create_pickled_dataset
if create_dataset is not create_pickled_dataset:
h_node.attrs['base_type'] = base_type
h_node.attrs['type'] = np.array(pickle.dumps(py_obj_type))
return
except NotHicklable:

# ask pickle to try to store
h_node,h_subitems = create_pickled_dataset(py_obj, h_group, name, reason = str(NotHicklable), **kwargs)
h_node.attrs.update(attrs)
else:
# store base_type and type unless py_obj had to be pickled by create_pickled_dataset
memo.store_type(h_node,py_obj_type,base_type,**kwargs)

# add additional attributes and set 'base_type' and 'type'
# attributes accordingly
h_node.attrs.update((name,attr) for name,attr in attrs.items() if name != 'type' )

# ask pickle to try to store
# if h_node shall be memoised for representing multiple references
# to the same py_obj instance in the hdf5 file store h_node
# in the memo dictionary. Store py_obj along with h_node to ensure
# py_obj_id which represents the memory address of py_obj refers
# to py_obj until the whole structure is stored within hickle file.
if memoise:
memo[py_obj_id] = (h_node,py_obj)

# loop through list of all subitems and recursively dump them
# to HDF5 file
for h_subname,py_subobj,h_subattrs,sub_kwargs in h_subitems:
_dump(py_subobj,h_node,h_subname,memo,h_subattrs,**sub_kwargs)


def dump(py_obj, file_obj, mode='w', path='/', **kwargs):
@@ -235,7 +247,8 @@ def dump(py_obj, file_obj, mode='w', path='/', **kwargs):
h_root_group.attrs["HICKLE_VERSION"] = __version__
h_root_group.attrs["HICKLE_PYTHON_VERSION"] = py_ver

_dump(py_obj, h_root_group,'data', **kwargs)
with ReferenceManager.create_manager(h_root_group) as memo:
_dump(py_obj, h_root_group,'data', memo ,**kwargs)
finally:
# Close the file if requested.
# Closing a file twice will not cause any problems
Expand Down Expand Up @@ -268,7 +281,7 @@ def __init__(self,h5_attrs, base_type, object_type): # pragma: no cover
raise RuntimeError("Cannot load container proxy for %s data type " % base_type)


def no_match_load(key): # pragma: no cover
def no_match_load(key,*args,**kwargs): # pragma: no cover
"""
If no match is made when loading a dataset, an exception needs to be raised
"""
@@ -355,11 +368,12 @@ def load(file_obj, path='/', safe=True):
# even though stated otherwise in the documentation. Activate workarounds
# just in case issues arise. Especially as corresponding lambdas in
# load_numpy are not needed anymore and thus have been removed.
pickle_loads = fix_lambda_obj_type
_load(py_container, 'data',h_root_group['data'],pickle_loads = fix_lambda_obj_type,load_loader = load_legacy_loader)
with ReferenceManager.create_manager(h_root_group,fix_lambda_obj_type) as memo:
_load(py_container, 'data',h_root_group['data'],memo,load_loader = load_legacy_loader)
return py_container.convert()
# 4.1.x file and newer
_load(py_container, 'data',h_root_group['data'],pickle_loads = pickle.loads,load_loader = load_loader)
with ReferenceManager.create_manager(h_root_group,pickle_loads) as memo:
_load(py_container, 'data',h_root_group['data'],memo,load_loader = load_loader)
return py_container.convert()

# Else, raise error
@@ -376,7 +390,7 @@ def load(file_obj, path='/', safe=True):



def _load(py_container, h_name, h_node,pickle_loads=pickle.loads,load_loader = load_loader):
def _load(py_container, h_name, h_node,memo,load_loader = load_loader):
""" Load a hickle file
Recursive function to load hdf5 data into a PyContainer()
@@ -386,33 +400,29 @@ def _load(py_container, h_name, h_node,pickle_loads=pickle.loads,load_loader = l
h_name (string): the name of the resulting h5py object group or dataset
h_node (h5 group or dataset): h5py object, group or dataset, to spider
and load all datasets.
pickle_loads (FunctionType,MethodType): defaults to pickle.loads and will
be switched to fix_lambda_obj_type if file to be loaded was created by
hickle 4.0.x version
memo (ReferenceManager): the ReferenceManager object
responsible for handling all object and type memoisation
related issues
load_loader (FunctionType,MethodType): defaults to lookup.load_loader and
will be switched to load_legacy_loader if file to be loaded was
created by hickle 4.0.x version
"""

# load base_type of node. if not set assume that it contains
# pickled object data to be restored through load_pickled_data or
# PickledContainer object in case of group.
base_type = h_node.attrs.get('base_type',b'pickle')
if base_type == b'pickle':
# pickled dataset or group assume its object_type to be object
# as true object type is anyway handled by load_pickled_data or
# PickledContainer
py_obj_type = object
else:
# extract object_type and ensure loader beeing able to handle is loaded
# loading is controlled through base_type, object_type is just required
# to allow load_fn or py_subcontainer to properly restore and cast
# py_obj to proper object type
py_obj_type = pickle_loads(h_node.attrs.get('type',None))
py_obj_type,_ = load_loader(py_obj_type)
# if h_node has already been loaded because a reference to it was encountered earlier
# directly append it to its parent container and return
node_ref = memo.get(h_node.id,h_node)
if node_ref is not h_node:
py_container.append(h_name,node_ref,h_node.attrs)
return

# load the type information of node.
py_obj_type,base_type,is_container = memo.resolve_type(h_node)
py_obj_type,(_,_,memoise) = load_loader(py_obj_type)

# Either a file, group, or dataset
if isinstance(h_node, h5.Group):
if is_container:
# Either a h5py.Group representing the structure of complex objects or
# a h5py.Dataset representing a h5py.Reference to the node of an object
# referred to from multiple places within the object structure to be dumped

py_container_class = hkl_container_dict.get(base_type,NoMatchContainer)
py_subcontainer = py_container_class(h_node.attrs,base_type,py_obj_type)
@@ -421,16 +431,18 @@ def _load(py_container, h_name, h_node,pickle_loads=pickle.loads,load_loader = l
# to be handled by container class provided by loader only
# as loader has all the knowledge required to properly decide
# if sort is necessary and how to sort and at what stage to sort
for h_key,h_subnode in py_subcontainer.filter(h_node.items()):
_load(py_subcontainer, h_key, h_subnode, pickle_loads, load_loader)
for h_key,h_subnode in py_subcontainer.filter(h_node):
_load(py_subcontainer, h_key, h_subnode, memo , load_loader)

# finalize subitem and append to parent container.
# finalize subitem
sub_data = py_subcontainer.convert()
py_container.append(h_name,sub_data,h_node.attrs)

else:
# must be a dataset load it and append to parent container
load_fn = hkl_types_dict.get(base_type, no_match_load)
data = load_fn(h_node,base_type,py_obj_type)
py_container.append(h_name,data,h_node.attrs)
sub_data = load_fn(h_node,base_type,py_obj_type)
py_container.append(h_name,sub_data,h_node.attrs)
# store loaded object for properly restoring additional references to it
if memoise:
memo[h_node.id] = sub_data

31 changes: 15 additions & 16 deletions hickle/loaders/load_builtins.py
@@ -68,7 +68,7 @@ def create_none_dataset(py_obj, h_group, name, **kwargs):
Returns:
corresponding h5py.Dataset and empty subitems list
"""
return h_group.create_dataset(name, data=bytearray(b'None'),**kwargs),()
return h_group.create_dataset(name, shape = None,dtype = 'V1',**no_compression(kwargs)),()


def check_iterable_item_type(first_item,iter_obj):
@@ -118,17 +118,16 @@ def create_listlike_dataset(py_obj, h_group, name,list_len = -1,item_dtype = Non

if isinstance(py_obj,(str,bytes)):
# strings and bytes are stored as array of bytes with strings encoded using utf8 encoding
dataset = h_group.create_dataset(
name,
data = bytearray(py_obj,"utf8") if isinstance(py_obj,str) else bytearray(py_obj),
**kwargs
)
string_data = bytearray(py_obj,"utf8") if isinstance(py_obj,str) else memoryview(py_obj)
string_data = np.array(string_data,copy=False)
string_data.dtype = 'S1'
dataset = h_group.create_dataset( name, data = string_data,shape = (1,string_data.size), **kwargs)
dataset.attrs["str_type"] = py_obj.__class__.__name__.encode("ascii")
return dataset,()

if len(py_obj) < 1:
# listlike object is empty just store empty dataset
return h_group.create_dataset(name,shape=None,dtype=int,**no_compression(kwargs)),()
return h_group.create_dataset(name,shape=None,dtype='int',**no_compression(kwargs)),()

if list_len < 0:
# neither length nor dtype of items is known, compute them now
@@ -246,14 +245,14 @@ def load_scalar_dataset(h_node,base_type,py_obj_type):
Returns:
resulting python object of type py_obj_type
"""
data = h_node[()] if h_node.size < 2 else bytearray(h_node[()])
data = h_node[()] if h_node.size < 2 else memoryview(h_node[()])


return py_obj_type(data) if data.__class__ is not py_obj_type else data

def load_none_dataset(h_node,base_type,py_obj_type):
"""
returns None value as represented by underlying dataset
returns None value as represented by underlying empty dataset
"""
return None

@@ -275,8 +274,8 @@ def load_list_dataset(h_node,base_type,py_obj_type):
str_type = h_node.attrs.get('str_type', None)
content = h_node[()]
if str_type == b'str':

if "bytes" in h_node.dtype.name:
# decode bytes representing python string before final conversion
if h_node.dtype.itemsize > 1 and 'bytes' in h_node.dtype.name:
# string dataset 4.0.x style convert it back to python string
content = np.array(content, copy=False, dtype=str).tolist()
else:
@@ -397,11 +396,11 @@ def convert(self):
[set, b"set", create_setlike_dataset, load_list_dataset,SetLikeContainer],
[bytes, b"bytes", create_listlike_dataset, load_list_dataset],
[str, b"str", create_listlike_dataset, load_list_dataset],
[int, b"int", create_scalar_dataset, load_scalar_dataset],
[float, b"float", create_scalar_dataset, load_scalar_dataset],
[complex, b"complex", create_scalar_dataset, load_scalar_dataset],
[bool, b"bool", create_scalar_dataset, load_scalar_dataset],
[None.__class__, b"None", create_none_dataset, load_none_dataset]
[int, b"int", create_scalar_dataset, load_scalar_dataset, None, False],
[float, b"float", create_scalar_dataset, load_scalar_dataset, None, False],
[complex, b"complex", create_scalar_dataset, load_scalar_dataset, None, False],
[bool, b"bool", create_scalar_dataset, load_scalar_dataset, None, False],
[None.__class__, b"None", create_none_dataset, load_none_dataset, None, False]
]

exclude_register = []
19 changes: 14 additions & 5 deletions hickle/loaders/load_numpy.py
@@ -73,7 +73,10 @@ def create_np_array_dataset(py_obj, h_group, name, **kwargs):
if "str" in dtype.name:
if py_obj.ndim < 1:
# convert string to utf8 encoded bytearray
h_node = h_group.create_dataset(name,data = bytearray(py_obj.tolist(),"utf8"),**kwargs)
string_data = bytearray(py_obj.item(),"utf8") if 'bytes' not in dtype.name else memoryview(py_obj.item())
string_data = np.array(string_data,copy = False)
string_data.dtype = 'S1'
h_node = h_group.create_dataset(name,data = string_data,shape=(1,string_data.size),**kwargs)
sub_items = ()
else:
# store content as list of strings
@@ -138,8 +141,14 @@ def load_ndarray_dataset(h_node,base_type,py_obj_type):
restores ndarray like object from dataset
"""
dtype = np.dtype(h_node.attrs['np_dtype'].decode('ascii'))
if "str" in dtype.name and "bytes" not in h_node.dtype.name:
return np.array(bytes(h_node[()]).decode("utf8"),dtype=dtype)
if "str" in dtype.name:
string_data = h_node[()]
if h_node.dtype.itemsize <= 1 or 'bytes' not in h_node.dtype.name:
# in hickle 4.0.X np.arrays containing multiple strings are
# not converted to a list of strings but saved as a bytes array,
# consequently the itemsize of its dtype is > 1
string_data = bytes(string_data).decode("utf8")
return np.array(string_data,copy=False,dtype=dtype)
if issubclass(py_obj_type,np.matrix):
return py_obj_type(data=h_node[()],dtype=dtype)
# TODO how to restore other ndarray derived object_types
@@ -209,11 +218,11 @@ def convert(self):
# %% REGISTERS
class_register = [
[np.dtype, b"np_dtype", create_np_dtype, load_np_dtype_dataset],
[np.number, b"np_scalar", create_np_scalar_dataset, load_np_scalar_dataset],
[np.number, b"np_scalar", create_np_scalar_dataset, load_np_scalar_dataset,None,False],

# for all scalars which are not derived from np.number which itself is np.generic subclass
# to properly catch and handle they will be caught by the following
[np.generic, b"np_scalar", create_np_scalar_dataset, load_np_scalar_dataset],
[np.generic, b"np_scalar", create_np_scalar_dataset, load_np_scalar_dataset,None,False],

[np.ndarray, b"ndarray", create_np_array_dataset, load_ndarray_dataset,NDArrayLikeContainer],
[np.ma.core.MaskedArray, b"ndarray_masked", create_np_masked_array_dataset, None,NDMaskedArrayContainer],