-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify what subset of fsspec's interface we use #137
Comments
FYI @CiaranEvans @sharkinsspatial class AzureBlobStorageStore(MutableMapping):
def __init__(self, container_client, root=""):
if len(root):
assert root[0] != "/"
assert root[-1] == "/"
self.container_client = container_client
self.root = root
def __getitem__(self, key):
key = os.path.join(self.root, key)
with self.container_client.get_blob_client(key) as bc:
try:
stream = bc.download_blob()
except ResourceNotFoundError as e:
raise KeyError(key) from e
data = stream.readall()
return data
def __setitem__(self, key, value):
key = os.path.join(self.root, key)
# bug in zarr? xarray?
if hasattr(value, "size") and value.size == 1 and hasattr(value, "tobytes"):
value = value.tobytes()
with self.container_client.get_blob_client(key) as bc:
bc.upload_blob(value, overwrite=True)
def __delitem__(self, key):
key = os.path.join(self.root, key)
with self.container_client.get_blob_client(key) as bc:
bc.delete_blob()
def __iter__(self):
prefix_len = len(self.root)
return (
x["name"][prefix_len:] for x in self.container_client.list_blobs(self.root)
)
return self.lisdir()
def __len__(self):
return len(list(self.container_client.list_blobs(self.root)))
class AzureBlobStorageFS:
def __init__(self, container_client):
self.container_client = container_client
def get_mapper(self, root):
return AzureBlobStorageStore(self.container_client, root)
def isdir(self, path):
return True usage target = FSSpecTarget(fs_remote, f"{frequency}/{region}.zarr/")
recipe.target = target |
Yes I feel your pain! This is partly what prompted pangeo-data/pangeo-integration-tests#1. It is relatively easy to write your own limited cloud storage adaptor. On the other hand, we have @martindurant on contract to help us with all fsspec-related issues. I would like to see if we can't fix things upstream first, rather than forking. |
This sort of goes around the circle of the specific mappers that exist now in zarr. Yes, you can certainly make things simpler by specialising your storage backend to be mappers only. I do worry about multiple implementations, though - when we learn about how to do things well for a particular backend, we would need to port it to both places. Also, this mechanism skips other things fsspec might offer, such as the discussion about non-local caching: pangeo-forge can implement this, but I would rather it was upstream and useful to a wider community. |
Zarr also has its own bespoke ABS store: https://github.com/zarr-developers/zarr-python/blob/master/zarr/storage.py#L2185 |
@TomAugspurger what is @rabernat @martindurant I've no real understanding of the underlying fsspec/adlfs stuff, but I'm definitely also experiencing some 'fun' with Azure Blob Storage and recipes: pangeo-forge/pangeo-forge-azure-bakery#4 (comment) |
Regardless of what route we choose here, I agree that formalizing the the fsspec interface we use is a good idea. Since #133, all fsspec-related stuff has lived exclusively in storage.py, so it should be relatively easy. |
Agreed we don't want to fork away from fsspec or reimplement stuff. @CiaranEvans was also hitting issues in adlfs / fsspec and I wanted to get them unblocked by sharing that snippet. @CiaranEvans the full context is at https://gist.github.com/TomAugspurger/c1773efa06d71443f135eb1af8832b82, which includes defining The potential action item for this issue is to decide if we can rely on just Zarr's usage of MutableMapping (+ maybe That said, just limiting the API surface wouldn't have fixed things in this case, as the bugs I'm hit by the mutable mapping API codepath. |
I'm trying to push some data through pangeo-forge-recipes, and local testing is working fine. When I go to write to blob storage, I'm running into a convoluted set of bugs conspiring to make things fail (partially written files, stale caches, differences in implementations, plain bugs, ...)
To make some progress on my problem, I wrote a very thin wrapper around the
azure.storage.blob
interface that implements aMutableMapping
, essentially replacingadlfs.AzureBlobFileSystem.get_mapper()
.Given the extremely difficult task fsspec is trying to solve, I wonder if we can come up with a subset of the fsspec interface that we're using, in addition to Zarr's reliance on the MutableMapping interface. Currently,
FSSpecTarget
takes an instance offsspec
'sAbstractFileSystem
, so in theory a target would need to implement the entire interface. But I think we use much less of it. So far, I have a fake "AbstractFileSystem" that implementsisdir(self, path) -> bool
get_mapper(self, path) -> MutableMapping
and with that I've been able to process some data using the XarrayToZarr recipe.
So this issue would be to identify
The text was updated successfully, but these errors were encountered: