add simple caching utility #56
Comments
Did some preliminary research on other popular "caching" libraries. This SO question was helpful. First up, some takeaways:
Now, the other remotely feasible libraries:
On-disk pickled caches
Server-based caching solutions
These would be a radically different approach to computing... but maybe?
oops.
Here's my implementation for caching NOAA API calls:

from __future__ import absolute_import
import os
import toolz
import pickle
import inspect
import hashlib
import functools
from os.path import join
from sklearn.gaussian_process.kernels import RBF, _check_length_scale
from scipy.spatial.distance import pdist, squareform, cdist
import numpy as np
import pandas as pd
import shapely as shp
import shapely.geometry
import scipy.interpolate
import pyTC.settings


def get_error_type_indices(ftrs):
    io_indices = []
    fnf_indices = []
    other_indices = []
    for ftr in [f for f in ftrs if f.status == "error"]:
        # FileNotFoundError is a subclass of OSError, so it is checked first
        if isinstance(ftr.exception(), FileNotFoundError):
            fnf_indices.append(ftrs.index(ftr))
        elif isinstance(ftr.exception(), OSError):
            io_indices.append(ftrs.index(ftr))
        else:
            other_indices.append(ftrs.index(ftr))
    return {"io": io_indices, "fnf": fnf_indices, "other": other_indices}


@toolz.curry
def cache_result_in_pickle(func, cache_dir=None, makedirs=False, error="raise"):
    """
    Caches the results of a function in the specified directory

    Uses the python pickle module to store the results of a
    function call in a directory, with file names set to the
    sha256 hash of the function's arguments. Pass `redo=True`
    or delete the contents of the directory to reset the cache.

    Because the results are cached based only on function
    parameters, it is important that the function not have any
    side effects.

    Note that all function arguments are hashed to derive a
    cached filename, and that any change to any input will
    produce a new cached file. Therefore, functions that
    depend on complex, frequently changing objects, especially
    settings objects, should not be cached. Instead, cache
    lower-level functions with a small list of simple,
    explicit arguments.

    Note also that cached files are not cleaned up
    automatically, and therefore changes in the arguments to a
    function will result in a new set of cached files being
    saved without removing the older files. This could result
    in cache storage creep unless the cache is periodically
    cleared. Clearing the cache based on file creation date
    can be an important part of cache maintenance.

    .. todo::

        replace this function with a more complete
        implementation, e.g. the one described in
        [GH RhodiumGroup/rhg_compute_tools#56](https://github.com/RhodiumGroup/rhg_compute_tools/issues/56).

    Parameters
    ----------
    func : function
        function to decorate. cannot have `redo`, `cache_dir`,
        `makedirs`, or `error` as an argument, since the wrapper
        consumes these keywords.
    cache_dir : str
        path to the root directory used in caching. If not
        provided, will use the `DIR_DATA_CACHE` attribute
        from `pyTC.settings.Settings()`, either one passed as `ps`
        to the wrapped func, or the default settings object if
        none is provided.
    makedirs : bool, optional
        if True, create the cache directory for the function
        (including parents) if it does not exist. Default False.
    error : str, optional
        how to handle a failure to write the cache file. One of
        `'raise'` (default; re-raise the error), `'ignore'`
        (return the result without caching it), or `'remove'`
        (delete the cache file, then raise a `RuntimeError`).

    Returns
    -------
    decorated : function
        Function, with cached results

    Examples
    --------

    .. code-block:: python

        >>> @cache_result_in_pickle(cache_dir=(tmpdir + '/cache'), makedirs=True)
        ... def long_running_func(i):
        ...     import time
        ...     time.sleep(0.1)
        ...     return i
        ...

    Initial calls will execute the function fully

    .. code-block:: python

        >>> long_running_func(1)  # > 0.1s
        1

    Subsequent calls will be much faster

    .. code-block:: python

        >>> long_running_func(1)  # << 0.1 s
        1

    Changing the arguments will result in re-evaluation

    .. code-block:: python

        >>> long_running_func(3)  # > 0.1s
        3

    Cached results are stored in the specified directory, under a
    subdirectory for each decorated function:

    .. code-block:: python

        >>> os.listdir(
        ...     tmpdir + '/cache/pyTC.utilities.long_running_func'
        ... )  # doctest: +NORMALIZE_WHITESPACE
        ...
        ['259ca9884c55ef7e909c0558978d73f915c6454d8e38bc576e8d48179138491a',
        '57630b792604ad1c663441890cda34728ffcb2c04d6b29dc720fd810318b61b6']

    Deleting these files would reset the cache without error. The cache can
    also be refreshed on a per-call basis by passing `redo=True` to the
    function call:

    .. code-block:: python

        >>> long_running_func(1, redo=True)  # > 0.1s
        1

    The parameters `'cache_dir'`, `'makedirs'`, and `'error'` can also be
    overridden at function call:

    .. code-block:: python

        >>> long_running_func(1, cache_dir=(tmpdir + '/cache2'))
        1
        >>> os.listdir(
        ...     tmpdir + '/cache2/pyTC.utilities.long_running_func'
        ... )  # doctest: +NORMALIZE_WHITESPACE
        ...
        ['259ca9884c55ef7e909c0558978d73f915c6454d8e38bc576e8d48179138491a']
    """
    funcname = ".".join([func.__module__, func.__name__])
    sig = inspect.Signature.from_callable(func)

    # decoration-time values serve as defaults for per-call overrides
    default_cache_dir = cache_dir
    default_makedirs = makedirs
    default_error = error

    @functools.wraps(func)
    def inner(*args, redo=False, cache_dir=None, makedirs=None, error=None, **kwargs):
        if cache_dir is None:
            cache_dir = default_cache_dir
        if makedirs is None:
            makedirs = default_makedirs
        if error is None:
            error = default_error
        if error is None:
            error = "raise"
        error = str(error).lower()
        assert error in [
            "raise",
            "ignore",
            "remove",
        ], "error must be one of `'raise'`, `'ignore'`, or `'remove'`"

        # fall back to the settings object if no cache_dir was given
        if cache_dir is None:
            ps = kwargs.get("ps")
            if ps is None:
                ps = pyTC.settings.Settings()
            cache_dir = ps.DIR_DATA_CACHE

        # hash the fully bound arguments (defaults applied) to name the file
        bound_args = sig.bind(*args, **kwargs)
        bound_args.apply_defaults()
        sha = hashlib.sha256(pickle.dumps(bound_args))
        path = os.path.join(cache_dir, funcname, sha.hexdigest())

        if not redo:
            try:
                with open(path, "rb") as f:
                    return pickle.load(f)
            except (OSError, IOError):
                # cache miss: fall through and compute the result
                pass

        res = func(*args, **kwargs)

        try:
            if makedirs:
                os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb+") as f:
                pickle.dump(res, f)
        except (OSError, IOError, ValueError) as e:
            if error == "raise":
                raise
            elif error == "remove":
                try:
                    os.remove(path)
                except (IOError):
                    pass
                raise RuntimeError from e
            else:
                # case error == 'ignore'
                pass

        return res

    return inner
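
The docstring's note about clearing the cache by file date could be handled with a small helper. Here is a minimal sketch, not part of the implementation above: `clear_stale_cache` and its `max_age_days` parameter are hypothetical names, and it uses modification time as a portable stand-in for creation time.

```python
import os
import time


def clear_stale_cache(cache_dir, max_age_days=30, now=None):
    """Delete files under ``cache_dir`` last modified over ``max_age_days`` ago.

    Hypothetical helper (an assumption, not part of the posted code).
    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day

    removed = []
    # The decorator writes one subdirectory per function, so walk recursively.
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # The file vanished or could not be removed; skip it.
                continue
    return removed
```

Because a missing cache file is just treated as a cache miss on read, removing old files this way should not break later calls; they would simply be recomputed.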
Some sort of easy caching utility would be great. Something like a decorator that can accept a filepattern and an overwrite argument on write.
Does something like this already exist? Also would be great to have this work with intake!
Proposed implementation
Lots to still work out here, but here's a stab:
This could be extended with a number of format-specific decorators quite easily
Proposed usage
These could then be used in a variety of ways.
No arguments on decoration requires that a path be provided when called:
Providing a storage pattern allows you to set up a complex directory structure:
We can also pass reader/writer kwargs for more complex IO:
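
As a rough illustration of the usage described above, here is a minimal sketch of a file-pattern decorator. Everything in it is an assumption for illustration: `cache_to_file`, the `cache_path` and `overwrite` call-time keywords, and the JSON reader/writer defaults are hypothetical and are not an existing rhg_compute_tools API.

```python
import functools
import inspect
import json
import os


def _read_json(path, **kwargs):
    with open(path, "r") as f:
        return json.load(f, **kwargs)


def _write_json(obj, path, **kwargs):
    with open(path, "w") as f:
        json.dump(obj, f, **kwargs)


def cache_to_file(pattern=None, reader=_read_json, writer=_write_json,
                  reader_kwargs=None, writer_kwargs=None):
    """Hypothetical decorator: cache a function's result at a templated path."""
    reader_kwargs = reader_kwargs or {}
    writer_kwargs = writer_kwargs or {}

    def decorator(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def inner(*args, cache_path=None, overwrite=False, **kwargs):
            if cache_path is None:
                if pattern is None:
                    raise ValueError(
                        "pass `cache_path` or decorate with a storage pattern"
                    )
                # Fill the pattern from the call's arguments, defaults included.
                bound = sig.bind(*args, **kwargs)
                bound.apply_defaults()
                cache_path = pattern.format(**bound.arguments)

            # Reuse the stored result unless the caller asks to overwrite it.
            if not overwrite and os.path.exists(cache_path):
                return reader(cache_path, **reader_kwargs)

            result = func(*args, **kwargs)
            os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
            writer(result, cache_path, **writer_kwargs)
            return result

        return inner

    return decorator
```

Under these assumptions, decorating `summarize(region, year)` with `@cache_to_file("cache/{region}/{year}.json", writer_kwargs={"indent": 2})` would store `summarize("us", 2020)` at `cache/us/2020.json`, while a bare `@cache_to_file()` would require `cache_path=...` on every call. Format-specific variants could be built by swapping the `reader` and `writer` callables.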
Once the argument hashing in the TODO referenced above is implemented, we could handle arbitrarily complex argument calls, which will be hashed to form a unique, stable file name, e.g.:
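
A minimal sketch of how such a hashed file name might be derived, mirroring the `inspect`/`pickle`/`hashlib` approach used in the implementation posted in the comments. `stable_cache_name` is a hypothetical helper, and pinning the pickle protocol is an assumption made here to keep the bytes comparable across runs.

```python
import hashlib
import inspect
import pickle


def stable_cache_name(func, *args, **kwargs):
    """Hypothetical helper: derive a file name from a call's bound arguments."""
    sig = inspect.signature(func)
    # Binding normalizes positional vs. keyword calls to the same mapping.
    bound = sig.bind(*args, **kwargs)
    bound.apply_defaults()
    # Pin the protocol so the serialized bytes do not change with the default.
    payload = pickle.dumps(list(bound.arguments.items()), protocol=4)
    return hashlib.sha256(payload).hexdigest()
```

With this sketch, `f(1, 2)` and `f(a=1, b=2)` map to the same name because both bind to the same arguments. Note that pickle output is not a canonical encoding, so equal-looking but complex arguments (sets, nested objects) may still hash differently.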
TODO