Implements issue telegraphic#139 H4EP002 Memoisation scheme (object and type)

Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager, a
dictionary-type object. For storing object instance references it is
used as a plain Python dict which, when dumping, stores the reference
to the py_obj and the related node under id(py_obj) as key. On load
the id of the h_node is used as key for storing the shared reference
to the restored object.
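
The keying scheme can be illustrated with a plain dict standing in for
the ReferenceManager; the file name and the toy list below are invented
for the sketch, only the key choices mirror the _dump/_load code in this
commit:

    import h5py
    import numpy as np

    # a plain dict stands in for the ReferenceManager of this commit
    memo = {}

    with h5py.File("demo_memo.h5", "w") as f:
        shared = [1, 2, 3]                       # a py_obj encountered twice
        node = f.create_dataset("data_0", data=np.array(shared))

        # dump side: key is id(py_obj); py_obj is kept in the value so that
        # its id() stays valid until the whole structure has been written
        memo[id(shared)] = (node, shared)
        assert memo[id(shared)][0] is node       # second encounter reuses the node

        # load side: key is the low-level id of the h5py node
        restored = node[()].tolist()
        memo[node.id] = restored
        assert memo.get(node.id) is restored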

Additional references to the same object are represented by
h5py.Datasets with their dtype set to ref_dtype. They are created by
assigning the h5py.Reference object returned by the h5py.Dataset.ref or
h5py.Group.ref attribute. On load these datasets are resolved by the
filter iterator method of the ExpandReferenceContainer class and the
referenced node is yielded as sub_item of the reference dataset.
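
In plain h5py terms (the file and dataset names are invented for the
sketch; hickle wraps this in the ReferenceManager and
ExpandReferenceContainer machinery) creating and resolving such a
reference dataset looks roughly like this:

    import h5py
    import numpy as np

    ref_dtype = h5py.special_dtype(ref=h5py.Reference)

    with h5py.File("demo_ref.h5", "w") as f:
        first = f.create_dataset("data_0", data=np.arange(3))  # first occurrence
        # additional occurrence: a dataset whose dtype is ref_dtype and whose
        # value is the h5py.Reference returned by the .ref attribute
        link = f.create_dataset("data_1", data=first.ref, dtype=ref_dtype)

        # resolving the reference yields the originally dumped node again
        target = f[link[()]]
        assert target.name == "/data_0"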

Type Memoisation
================
The 'type' attribute of all nodes, except datasets which contain pickle
strings or expose ref_dtype as their dtype, now contains a reference
to the appropriate py_obj_type entry in the global 'hickle_types_table'.
This table hosts one dataset for each py_obj_type and base_type
encountered by hickle.dump.

Each py_obj_type is represented by a numbered dataset containing the
corresponding pickle string. The base_types are represented by empty
datasets whose names are the names of the base_types as defined by the
class_register tables of the loaders. No entry is stored for object,
b'pickle' or for hickle.ReferenceType, b'!node-reference!' as these
can be resolved implicitly on load.
The 'base_type' attribute of a py_obj_type entry refers to the
base_type used to encode it and required to properly restore it again
from the hickle file.
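
The resulting layout can be sketched with bare h5py calls; the entry
names, the 'V1' dtype of the empty base_type dataset and the attribute
encodings below are illustrative guesses, the real table is built by
ReferenceManager.store_type:

    import pickle
    import h5py
    import numpy as np

    ref_dtype = h5py.special_dtype(ref=h5py.Reference)

    with h5py.File("demo_types.h5", "w") as f:
        table = f.create_group("hickle_types_table")

        # empty dataset named after the base_type registered by the loader
        base_entry = table.create_dataset("list", shape=None, dtype="V1")

        # numbered dataset holding the pickle string of the py_obj_type ...
        type_entry = table.create_dataset("0", data=np.array(pickle.dumps(list)))
        # ... whose 'base_type' attribute points back at the base_type entry
        type_entry.attrs.create("base_type", base_entry.ref, dtype=ref_dtype)

        # a data node's 'type' attribute then references the py_obj_type entry
        node = f.create_dataset("data", data=np.arange(3))
        node.attrs.create("type", type_entry.ref, dtype=ref_dtype)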

The entries in the 'hickle_types_table' are managed by the
ReferenceManager.store_type and ReferenceManager.resolve_type methods.
The latter also takes care of properly distinguishing pickle datasets
from reference datasets and of resolving hickle 4.0.X dict_item
groups.

The ReferenceManager is implemented as a context manager and thus can
and shall be used within a with statement to ensure proper cleanup. Each
file has its own ReferenceManager instance, therefore different data can
be dumped to distinct files which are open in parallel; a usage sketch
follows below. The basic management of managers is provided by the
BaseManager base class, which can be used to build further managers, for
example to activate loaders only when specific feature flags are passed
to the hickle.dump method or encountered by hickle.load in the file
attributes. The BaseManager class has to be subclassed.
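
A minimal sketch of the calling pattern used by hickle.dump in this
commit (ReferenceManager is internal API; the file and group names are
invented and the recursive _dump call is only hinted at):

    import h5py
    from hickle.lookup import ReferenceManager

    with h5py.File("demo.hkl", "w") as f:
        h_root_group = f.create_group("root")
        # one manager per open file; the with block guarantees cleanup
        with ReferenceManager.create_manager(h_root_group) as memo:
            # hickle.dump would now call
            #     _dump(py_obj, h_root_group, 'data', memo, **kwargs)
            pass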

Other changes:
==============
 - lookup.register_class and class_register tables have an additional
   memoise flag indicating whether py_obj shall be remembered for
   representing and resolving multiple references to it or whether it
   shall be dumped and restored every time it is encountered (see the
   register sketch after this list)

 - lookup.hickle_types table entries include the memoise flag as third entry

 - lookup.load_loader: the tuple returned in addition to py_obj_type
   includes the memoise flag

 - hickle.load: whether to use the load_fn stored in
   lookup.hkl_types_dict or a PyContainer object stored in
   hkl_container_dict is decided based on the is_container flag returned
   by ReferenceManager.resolve_type instead of checking whether the
   processed node is of type h5py.Group

 - dtype of string and bytes datasets is now set to 'S1' instead of 'u8'
   and shape is set to (1,len)
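
For loader authors the extended rows look like those in
hickle/loaders/load_builtins.py of this diff; the sketch below just
restates one such row (entry order: py_obj_type, base_type, create_fn,
load_fn, container_class, memoise), assuming the loader module of this
commit is importable:

    from hickle.loaders.load_builtins import (
        create_scalar_dataset, load_scalar_dataset,
    )

    class_register = [
        # immutable scalars are cheap to re-dump, so memoisation is disabled
        [int, b"int", create_scalar_dataset, load_scalar_dataset, None, False],
    ]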
hernot committed Feb 19, 2021
1 parent 74103f9 commit 032a333
Showing 12 changed files with 1,163 additions and 242 deletions.
18 changes: 15 additions & 3 deletions hickle/helpers.py
@@ -81,8 +81,20 @@ def __init__(self,h5_attrs, base_type, object_type,_content = None):
# when calling the append method
self._content = _content if _content is not None else []

def filter(self,items):
yield from items
def filter(self,h_parent):
"""
PyContainer type child classes may overload this function to
filter and preprocess the content of the h_parent h5py.Group or
h5py.Dataset to ensure it can be properly processed by recursive
calls to the hickle._load function.
By default it yields from h_parent.items().
For examples see:
hickle.lookup.ExpandReferenceContainer.filter
hickle.loaders.load_scipy.SparseMatrixContainer.filter
"""
yield from h_parent.items()

def append(self,name,item,h5_attrs):
"""
@@ -160,7 +172,7 @@ def __getitem__(self,*args,**kwargs):

class no_compression(dict):
"""
subclass of dict which which temporarily removes any compression or data filter related
named dict comprehension which temporarily removes any compression or data filter related
arguments from the passed iterable.
"""
def __init__(self,mapping,**kwargs):
146 changes: 79 additions & 67 deletions hickle/hickle.py
@@ -43,7 +43,8 @@
from .helpers import PyContainer, NotHicklable, nobody_is_my_name
from .lookup import (
hkl_types_dict, hkl_container_dict, load_loader, load_legacy_loader ,
create_pickled_dataset, load_nothing, fix_lambda_obj_type
create_pickled_dataset, load_nothing, fix_lambda_obj_type,ReferenceManager,
link_dtype
)


@@ -69,11 +70,6 @@ class ToDoError(Exception): # pragma: no cover
def __str__(self):
return "Error: this functionality hasn't been implemented yet."

class SerializedWarning(UserWarning):
""" An object type was not understood
The data will be serialized using pickle.
"""

# %% FUNCTION DEFINITIONS
def file_opener(f, path, mode='r'):
"""
@@ -142,49 +138,65 @@ def file_opener(f, path, mode='r'):
# DUMPERS #
###########

def _dump(py_obj, h_group, name, attrs={} , **kwargs):
def _dump(py_obj, h_group, name, memo, attrs={} , **kwargs):
""" Dump a python object to a group within an HDF5 file.
This function is called recursively by the main dump() function.
Args:
Parameters:
-----------
py_obj: python object to dump.
h_group (h5.File.group): group to dump data into.
name (bytes): name of the resulting hdf5 group or dataset
memo (ReferenceManager): the ReferenceManager object
responsible for handling all object and type memoisation
related issues
attrs (dict): additional attributes to be stored along with the
resulting hdf5 group or hdf5 dataset
kwargs (dict): keyword arguments to be passed to create_dataset
function
"""

py_obj_id = id(py_obj)
py_obj_ref = memo.get(py_obj_id,None)
if py_obj_ref is not None:
# reference datasets have neither base_type nor py_obj_type set, as they
# can be distinguished from pickled data by their ref_dtype dtype; on load
# they are implicitly assigned the b'!node-reference!' base_type and
# hickle.lookup.NodeReference as their py_obj_type
h_link = h_group.create_dataset(name,data = py_obj_ref[0].ref,dtype = link_dtype)
h_link.attrs.update(attrs)
return

# Check if we have an unloaded loader for the provided py_obj and
# retrive the most apropriate method for crating the corresponding
# retrieve the most appropriate method for creating the corresponding
# representation within HDF5 file
if isinstance(
py_obj,
(types.FunctionType, types.BuiltinFunctionType, types.MethodType, types.BuiltinMethodType, type)
):
py_obj_type,create_dataset,base_type = object,create_pickled_dataset,b'pickle'
else:
py_obj_type, (create_dataset, base_type) = load_loader(py_obj.__class__)
py_obj_type, (create_dataset, base_type,memoise) = load_loader(py_obj.__class__)
try:
h_node,h_subitems = create_dataset(py_obj, h_group, name, **kwargs)

# loop through list of all subitems and recursively dump them
# to HDF5 file
for h_subname,py_subobj,h_subattrs,sub_kwargs in h_subitems:
_dump(py_subobj,h_node,h_subname,h_subattrs,**sub_kwargs)
# add addtional attributes and set 'base_type' and 'type'
# attributes accordingly
h_node.attrs.update(attrs)

# only explicitly store base_type and type if not dumped by
# create_pickled_dataset
if create_dataset is not create_pickled_dataset:
h_node.attrs['base_type'] = base_type
h_node.attrs['type'] = np.array(pickle.dumps(py_obj_type))
return
except NotHicklable:

# ask pickle to try to store
h_node,h_subitems = create_pickled_dataset(py_obj, h_group, name, reason = str(NotHicklable), **kwargs)
h_node.attrs.update(attrs)
else:
# store base_type and type unless py_obj had to be pickled by create_pickled_dataset
memo.store_type(h_node,py_obj_type,base_type,**kwargs)

# add additional attributes and set 'base_type' and 'type'
# attributes accordingly
h_node.attrs.update((name,attr) for name,attr in attrs.items() if name != 'type' )

# ask pickle to try to store
# if h_node shall be memoised for representing multiple references
# to the same py_obj instance in the hdf5 file store h_node
# in the memo dictionary. Store py_obj along with h_node to ensure
# py_obj_id which represents the memory address of py_obj refers
# to py_obj until the whole structure is stored within hickle file.
if memoise:
memo[py_obj_id] = (h_node,py_obj)

# loop through list of all subitems and recursively dump them
# to HDF5 file
for h_subname,py_subobj,h_subattrs,sub_kwargs in h_subitems:
_dump(py_subobj,h_node,h_subname,memo,h_subattrs,**sub_kwargs)


def dump(py_obj, file_obj, mode='w', path='/', **kwargs):
@@ -235,7 +247,8 @@ def dump(py_obj, file_obj, mode='w', path='/', **kwargs):
h_root_group.attrs["HICKLE_VERSION"] = __version__
h_root_group.attrs["HICKLE_PYTHON_VERSION"] = py_ver

_dump(py_obj, h_root_group,'data', **kwargs)
with ReferenceManager.create_manager(h_root_group) as memo:
_dump(py_obj, h_root_group,'data', memo ,**kwargs)
finally:
# Close the file if requested.
# Closing a file twice will not cause any problems
Expand Down Expand Up @@ -268,7 +281,7 @@ def __init__(self,h5_attrs, base_type, object_type): # pragma: no cover
raise RuntimeError("Cannot load container proxy for %s data type " % base_type)


def no_match_load(key): # pragma: no cover
def no_match_load(key,*args,**kwargs): # pragma: no cover
"""
If no match is made when loading a dataset, an exception needs to be raised
"""
@@ -355,11 +368,12 @@ def load(file_obj, path='/', safe=True):
# even though stated otherwise in the documentation. Activate workarounds
# just in case issues arise. Especially as corresponding lambdas in
# load_numpy are not needed anymore and thus have been removed.
pickle_loads = fix_lambda_obj_type
_load(py_container, 'data',h_root_group['data'],pickle_loads = fix_lambda_obj_type,load_loader = load_legacy_loader)
with ReferenceManager.create_manager(h_root_group,fix_lambda_obj_type) as memo:
_load(py_container, 'data',h_root_group['data'],memo,load_loader = load_legacy_loader)
return py_container.convert()
# 4.1.x file and newer
_load(py_container, 'data',h_root_group['data'],pickle_loads = pickle.loads,load_loader = load_loader)
with ReferenceManager.create_manager(h_root_group,pickle_loads) as memo:
_load(py_container, 'data',h_root_group['data'],memo,load_loader = load_loader)
return py_container.convert()

# Else, raise error
@@ -376,7 +390,7 @@ def load(file_obj, path='/', safe=True):



def _load(py_container, h_name, h_node,pickle_loads=pickle.loads,load_loader = load_loader):
def _load(py_container, h_name, h_node,memo,load_loader = load_loader):
""" Load a hickle file
Recursive function to load hdf5 data into a PyContainer()
@@ -386,33 +400,29 @@ def _load(py_container, h_name, h_node,pickle_loads=pickle.loads,load_loader = l
h_name (string): the name of the resulting h5py object group or dataset
h_node (h5 group or dataset): h5py object, group or dataset, to spider
and load all datasets.
pickle_loads (FunctionType,MethodType): defaults to pickle.loads and will
be switched to fix_lambda_obj_type if file to be loaded was created by
hickle 4.0.x version
memo (ReferenceManager): the ReferenceManager object
responsible for handling all object and type memoisation
related issues
load_loader (FunctionType,MethodType): defaults to lookup.load_loader and
will be switched to load_legacy_loader if file to be loaded was
created by hickle 4.0.x version
"""

# load base_type of node. if not set assume that it contains
# pickled object data to be restored through load_pickled_data or
# PickledContainer object in case of group.
base_type = h_node.attrs.get('base_type',b'pickle')
if base_type == b'pickle':
# pickled dataset or group assume its object_type to be object
# as true object type is anyway handled by load_pickled_data or
# PickledContainer
py_obj_type = object
else:
# extract object_type and ensure loader beeing able to handle is loaded
# loading is controlled through base_type, object_type is just required
# to allow load_fn or py_subcontainer to properly restore and cast
# py_obj to proper object type
py_obj_type = pickle_loads(h_node.attrs.get('type',None))
py_obj_type,_ = load_loader(py_obj_type)
# if h_node has already been loaded because a reference to it was encountered earlier
# directly append it to its parent container and return
node_ref = memo.get(h_node.id,h_node)
if node_ref is not h_node:
py_container.append(h_name,node_ref,h_node.attrs)
return

# load the type information of node.
py_obj_type,base_type,is_container = memo.resolve_type(h_node)
py_obj_type,(_,_,memoise) = load_loader(py_obj_type)

# Either a file, group, or dataset
if isinstance(h_node, h5.Group):
if is_container:
# Either a h5py.Group representing the structure of complex objects or
# a h5py.Dataset representing a h5py.Reference to the node of an object
# referred to from multiple places within the object structure to be dumped

py_container_class = hkl_container_dict.get(base_type,NoMatchContainer)
py_subcontainer = py_container_class(h_node.attrs,base_type,py_obj_type)
@@ -421,16 +431,18 @@ def _load(py_container, h_name, h_node,pickle_loads=pickle.loads,load_loader = l
# to be handled by container class provided by loader only
# as loader has all the knowledge required to properly decide
# if sort is necessary and how to sort and at what stage to sort
for h_key,h_subnode in py_subcontainer.filter(h_node.items()):
_load(py_subcontainer, h_key, h_subnode, pickle_loads, load_loader)
for h_key,h_subnode in py_subcontainer.filter(h_node):
_load(py_subcontainer, h_key, h_subnode, memo , load_loader)

# finalize subitem and append to parent container.
# finalize subitem
sub_data = py_subcontainer.convert()
py_container.append(h_name,sub_data,h_node.attrs)

else:
# must be a dataset load it and append to parent container
load_fn = hkl_types_dict.get(base_type, no_match_load)
data = load_fn(h_node,base_type,py_obj_type)
py_container.append(h_name,data,h_node.attrs)
sub_data = load_fn(h_node,base_type,py_obj_type)
py_container.append(h_name,sub_data,h_node.attrs)
# store loaded object for properly restoring additional references to it
if memoise:
memo[h_node.id] = sub_data

31 changes: 15 additions & 16 deletions hickle/loaders/load_builtins.py
@@ -68,7 +68,7 @@ def create_none_dataset(py_obj, h_group, name, **kwargs):
Returns:
corresponding h5py.Dataset and empty subitems list
"""
return h_group.create_dataset(name, data=bytearray(b'None'),**kwargs),()
return h_group.create_dataset(name, shape = None,dtype = 'V1',**no_compression(kwargs)),()


def check_iterable_item_type(first_item,iter_obj):
@@ -118,17 +118,16 @@ def create_listlike_dataset(py_obj, h_group, name,list_len = -1,item_dtype = Non

if isinstance(py_obj,(str,bytes)):
# strings and bytes are stored as array of bytes with strings encoded using utf8 encoding
dataset = h_group.create_dataset(
name,
data = bytearray(py_obj,"utf8") if isinstance(py_obj,str) else bytearray(py_obj),
**kwargs
)
string_data = bytearray(py_obj,"utf8") if isinstance(py_obj,str) else memoryview(py_obj)
string_data = np.array(string_data,copy=False)
string_data.dtype = 'S1'
dataset = h_group.create_dataset( name, data = string_data,shape = (1,string_data.size), **kwargs)
dataset.attrs["str_type"] = py_obj.__class__.__name__.encode("ascii")
return dataset,()

if len(py_obj) < 1:
# listlike object is empty just store empty dataset
return h_group.create_dataset(name,shape=None,dtype=int,**no_compression(kwargs)),()
return h_group.create_dataset(name,shape=None,dtype='int',**no_compression(kwargs)),()

if list_len < 0:
# neither length nor dtype of items is known, compute them now
@@ -246,14 +245,14 @@ def load_scalar_dataset(h_node,base_type,py_obj_type):
Returns:
resulting python object of type py_obj_type
"""
data = h_node[()] if h_node.size < 2 else bytearray(h_node[()])
data = h_node[()] if h_node.size < 2 else memoryview(h_node[()])


return py_obj_type(data) if data.__class__ is not py_obj_type else data

def load_none_dataset(h_node,base_type,py_obj_type):
"""
returns None value as represented by underlying dataset
returns None value as represented by underlying empty dataset
"""
return None

@@ -275,8 +274,8 @@ def load_list_dataset(h_node,base_type,py_obj_type):
str_type = h_node.attrs.get('str_type', None)
content = h_node[()]
if str_type == b'str':

if "bytes" in h_node.dtype.name:
# decode bytes representing python string before final conversion
if h_node.dtype.itemsize > 1 and 'bytes' in h_node.dtype.name:
# string dataset 4.0.x style convert it back to python string
content = np.array(content, copy=False, dtype=str).tolist()
else:
@@ -397,11 +396,11 @@ def convert(self):
[set, b"set", create_setlike_dataset, load_list_dataset,SetLikeContainer],
[bytes, b"bytes", create_listlike_dataset, load_list_dataset],
[str, b"str", create_listlike_dataset, load_list_dataset],
[int, b"int", create_scalar_dataset, load_scalar_dataset],
[float, b"float", create_scalar_dataset, load_scalar_dataset],
[complex, b"complex", create_scalar_dataset, load_scalar_dataset],
[bool, b"bool", create_scalar_dataset, load_scalar_dataset],
[None.__class__, b"None", create_none_dataset, load_none_dataset]
[int, b"int", create_scalar_dataset, load_scalar_dataset, None, False],
[float, b"float", create_scalar_dataset, load_scalar_dataset, None, False],
[complex, b"complex", create_scalar_dataset, load_scalar_dataset, None, False],
[bool, b"bool", create_scalar_dataset, load_scalar_dataset, None, False],
[None.__class__, b"None", create_none_dataset, load_none_dataset, None, False]
]

exclude_register = []
19 changes: 14 additions & 5 deletions hickle/loaders/load_numpy.py
@@ -73,7 +73,10 @@ def create_np_array_dataset(py_obj, h_group, name, **kwargs):
if "str" in dtype.name:
if py_obj.ndim < 1:
# convert string to utf8 encoded bytearray
h_node = h_group.create_dataset(name,data = bytearray(py_obj.tolist(),"utf8"),**kwargs)
string_data = bytearray(py_obj.item(),"utf8") if 'bytes' not in dtype.name else memoryview(py_obj.item())
string_data = np.array(string_data,copy = False)
string_data.dtype = 'S1'
h_node = h_group.create_dataset(name,data = string_data,shape=(1,string_data.size),**kwargs)
sub_items = ()
else:
# store content as list of strings
@@ -138,8 +141,14 @@ def load_ndarray_dataset(h_node,base_type,py_obj_type):
restores ndarray like object from dataset
"""
dtype = np.dtype(h_node.attrs['np_dtype'].decode('ascii'))
if "str" in dtype.name and "bytes" not in h_node.dtype.name:
return np.array(bytes(h_node[()]).decode("utf8"),dtype=dtype)
if "str" in dtype.name:
string_data = h_node[()]
if h_node.dtype.itemsize <= 1 or 'bytes' not in h_node.dtype.name:
# in hickle 4.0.X np.arrays containing multiple strings are
# not converted to a list of strings but saved as a bytes array,
# consequently the itemsize of its dtype is > 1
string_data = bytes(string_data).decode("utf8")
return np.array(string_data,copy=False,dtype=dtype)
if issubclass(py_obj_type,np.matrix):
return py_obj_type(data=h_node[()],dtype=dtype)
# TODO how to restore other ndarray derived object_types
@@ -209,11 +218,11 @@ def convert(self):
# %% REGISTERS
class_register = [
[np.dtype, b"np_dtype", create_np_dtype, load_np_dtype_dataset],
[np.number, b"np_scalar", create_np_scalar_dataset, load_np_scalar_dataset],
[np.number, b"np_scalar", create_np_scalar_dataset, load_np_scalar_dataset,None,False],

# for all scalars which are not derived from np.number which itself is np.generic subclass
# to properly catch and handle they will be caught by the following
[np.generic, b"np_scalar", create_np_scalar_dataset, load_np_scalar_dataset],
[np.generic, b"np_scalar", create_np_scalar_dataset, load_np_scalar_dataset,None,False],

[np.ndarray, b"ndarray", create_np_array_dataset, load_ndarray_dataset,NDArrayLikeContainer],
[np.ma.core.MaskedArray, b"ndarray_masked", create_np_masked_array_dataset, None,NDMaskedArrayContainer],