Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify what subset of fsspec's interface we use #137

Open
TomAugspurger opened this issue May 15, 2021 · 7 comments
Open

Clarify what subset of fsspec's interface we use #137

TomAugspurger opened this issue May 15, 2021 · 7 comments

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 15, 2021

I'm trying to push some data through pangeo-forge-recipes, and local testing is working fine. When I go to write to blob storage, I'm running into a convoluted set of bugs conspiring to make things fail (partially written files, stale caches, differences in implementations, plain bugs, ...)

To make some progress on my problem, I wrote a very thin wrapper around the azure.storage.blob interface that implements a MutableMapping, essentially replacing adlfs.AzureBlobFileSystem.get_mapper().

Given the extremely difficult task fsspec is trying to solve, I wonder if we can come up with a subset of the fsspec interface that we're using, in addition to Zarr's reliance on the MutableMapping interface. Currently, FSSpecTarget takes an instance of fsspec's AbstractFileSystem, so in theory a target would need to implement the entire interface. But I think we use much less of it. So far, I have a fake "AbstractFileSystem" that implements

  • isdir(self, path) -> bool
  • get_mapper(self, path) -> MutableMapping

and with that I've been able to process some data using the XarrayToZarr recipe.

So this issue would be to identify

  1. If this is worth doing (I'll continue to investigate the issues I'm seeing with adlfs / fsspec, but like I said, it's convoluted)
  2. If so, what pangeo-forge needs out of a target filesystem.
@TomAugspurger
Copy link
Contributor Author

FYI @CiaranEvans @sharkinsspatial

class AzureBlobStorageStore(MutableMapping):
    def __init__(self, container_client, root=""):
        if len(root):
            assert root[0] != "/"
            assert root[-1] == "/"
        self.container_client = container_client
        self.root = root

    def __getitem__(self, key):
        key = os.path.join(self.root, key)
        with self.container_client.get_blob_client(key) as bc:
            try:
                stream = bc.download_blob()
            except ResourceNotFoundError as e:
                raise KeyError(key) from e
            data = stream.readall()
        return data

    def __setitem__(self, key, value):
        key = os.path.join(self.root, key)
        # bug in zarr? xarray?
        if hasattr(value, "size") and value.size == 1 and hasattr(value, "tobytes"):
            value = value.tobytes()

        with self.container_client.get_blob_client(key) as bc:
            bc.upload_blob(value, overwrite=True)

    def __delitem__(self, key):
        key = os.path.join(self.root, key)
        with self.container_client.get_blob_client(key) as bc:
            bc.delete_blob()

    def __iter__(self):
        prefix_len = len(self.root)
        return (
            x["name"][prefix_len:] for x in self.container_client.list_blobs(self.root)
        )
        return self.lisdir()

    def __len__(self):
        return len(list(self.container_client.list_blobs(self.root)))


class AzureBlobStorageFS:
    def __init__(self, container_client):
        self.container_client = container_client

    def get_mapper(self, root):
        return AzureBlobStorageStore(self.container_client, root)

    def isdir(self, path):
        return True

usage

    target = FSSpecTarget(fs_remote, f"{frequency}/{region}.zarr/")
    recipe.target = target

@rabernat
Copy link
Contributor

When I go to write to blob storage, I'm running into a convoluted set of bugs conspiring to make things fail (partially written files, stale caches, differences in implementations, plain bugs, ...)

Yes I feel your pain! This is partly what prompted pangeo-data/pangeo-integration-tests#1.

It is relatively easy to write your own limited cloud storage adaptor. On the other hand, we have @martindurant on contract to help us with all fsspec-related issues. I would like to see if we can't fix things upstream first, rather than forking.

@martindurant
Copy link
Contributor

This sort of goes around the circle of the specific mappers that exist now in zarr. Yes, you can certainly make things simpler by specialising your storage backend to be mappers only. I do worry about multiple implementations, though - when we learn about how to do things well for a particular backend, we would need to port it to both places. Also, this mechanism skips other things fsspec might offer, such as the discussion about non-local caching: pangeo-forge can implement this, but I would rather it was upstream and useful to a wider community.
Obviously I am not impartial here.

@rabernat
Copy link
Contributor

@ciaransweet
Copy link

@TomAugspurger what is fs_remote? Can I just drop the above classes into https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/add-k8s-cluster/flow_test/transform_flow.py for the time being? Or where would you put them?

@rabernat @martindurant I've no real understanding of the underlying fsspec/adlfs stuff, but I'm definitely also experiencing some 'fun' with Azure Blob Storage and recipes: pangeo-forge/pangeo-forge-azure-bakery#4 (comment)

@rabernat
Copy link
Contributor

Regardless of what route we choose here, I agree that formalizing the the fsspec interface we use is a good idea. Since #133, all fsspec-related stuff has lived exclusively in storage.py, so it should be relatively easy.

@TomAugspurger
Copy link
Contributor Author

Agreed we don't want to fork away from fsspec or reimplement stuff. @CiaranEvans was also hitting issues in adlfs / fsspec and I wanted to get them unblocked by sharing that snippet. @CiaranEvans the full context is at https://gist.github.com/TomAugspurger/c1773efa06d71443f135eb1af8832b82, which includes defining fs_remote.

The potential action item for this issue is to decide if we can rely on just Zarr's usage of MutableMapping (+ maybe listdir and rmdir), which fsspec implements. That would maybe limit the API surface that pangeo-forge is exposed to. Currently it looks like we use isdir and mkdir to initialize the Bucket / Storage account. Maybe that could be done ahead of time by the bakery.

That said, just limiting the API surface wouldn't have fixed things in this case, as the bugs I'm hit by the mutable mapping API codepath.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants