-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Content-addressable storage transformer (v3 protocol extension) #82
Comments
Hi @jakirkham, is this the kind of thing you were thinking of? cc @Carreau |
Why not create meta-keys that are |
@alimanfoo: your proposal for a "Content-addressable storage transformer" is really interesting. In the global management of ESGF climate model data, we have the need to provide checksums for sets of netCDF files with version timestamp. Around the world different nodes may take a copy of a data set (set of netCDF files) and verify the content against a manifest of checksums. An alternative world-view would be to put the data sets into Zarr, and to manage the versioning internally. Your approach looks like it would cope with that use case. There are 3 potential advantages:
Have you had any interest from others to take this forward? |
We are currently still focusing on spec v3, I don't believe doing a content addressable storage will be too hard to do on top of spec v3. |
Thanks for writing this up @alimanfoo! 😄 Yeah this is the kind of thing I was thinking about. Agree Matthias. This would be more intended as something on top of v3. Also to your subpoint on listable stores, yeah the idea was to capture the full listing under one JSON object. So it tries to capture the use case that motivated things like consolidated metadata and avoid listing as a requirement. Though maybe there are some wrinkles we would need to work out here. @agstephens, I think you have more-or-less captured the motivations behind such an extension. Though I proposed the idea, clearly have not had time to push it forward myself. If this is something you'd be interested in exploring, that would be very helpful. 😄 |
Thanks @agstephens for your comment, it's great to know this is potentially useful. As John and Matthias have said this will be easier to develop as an extension to the v3 core protocol, so we should probably stay focused on the v3 spec and implementations for the time being so we have a solid foundation to build on. But please do let us know if this extension is something you'd be interested in helping with at any point, either writing a proper spec or doing a prototype implementation. |
Hi @jakirkham, just to say that this proposal doesn't create a full listing of all metadata objects in a single JSON document, i.e., it doesn't replace consolidated metadata. The metadata would still be scattered into lots of separate objects, one for each node in the hierarchy. All it does is create a layer of indirection when writing or reading any object in the store, that allows you to read objects as at a given time point, and also verify objects haven't got corrupted. Hope that makes sense. |
Also learned about |
A content addressable storage layer like this was part of Mandoline, a precursor to Zarr that I used years ago: Use cases like "time-travel" were indeed the main motivation for this feature. So I think this could be useful, but it's worth keeping in mind that the size of this metadata can add up for arrays with lots of chunks. For such cases, a separate database layer for keeping track of metadata might make sense -- the metadata is too small to make sense to put in the object stores typically used for array chunks, which are focused on throughput rather than latency. |
@shoyer your suggestion about separating out a database layer from the objects is really interesting. I have heard others (@rabernat) make similar suggestions about some files staying on POSIX. In an idealised future where everything lives in the cloud you would need to make sure the database is as accessible as the objects themselves. So you would need an API that queried the DB, and I suppose the content could be consolidated into a single "metadata" construct to avoid time wasted in excessive queries. |
FWIW, I've started some investigating with http://datalad.org which is based on git-annex. From my limited experience, that means you can choose whether you publish the data and/or the history to each remote. |
I am playing around a bit with zarr on top of IPFS and have a use case similar to what @agstephens wrote: distributing datasets globally where only some nodes store some of the data. IPFS basically implements a global content addressable file system. Blocks are hashed, then compiled to files (which are a list of blocks). The files are hashed again and are compiled to directories, thus forming a Merkle tree. This structure maps very well to the tree structure of a zarr dataset. So by adding a content addressable layer below zarr in stead of in the middle, I think the implementation could be a bit simpler. In fact, I think it already works quite ok with the current, unmodified version of zarr. The downside of using a content addressable layer below probably is that we won't benefit from all the different store implementations which are already there. We'd essentially have to use what IPFS (or whatever content addressable filesystem) can run on. So for now, I don't see which variant would turn out to be the better one in the end. There's however one thing which I think is missing and which would be beneficial for both variants (storage transformer and zarr on content addressable storage): it might be very useful to find a way to expose a globally unique content-id (for datasets, variables and chunks) to the user API if such a thing is present in the underlying store. This would create some possibilities for optimization:
The end result of all of those operations shouldn't depend on the content-id being visible to higher level APIs or the user due to automatic deduplication. But in the cases listed above, a large amount of re-hashing and data transfers could be avoided by providing this information. |
We have just discovered IPFS and Filecoin and are now very interested in this conversation. Tagging @jbusecke, @cisaacstern. Perhaps some folks from Protocol Labs might be able to weight in on this and help think about the Zarr / IPFS implementation. |
Hey team, the Filecoin/IPFS community is more than happy to help here. @d70-t can you send me an email at [email protected] please and we can discuss over the phone. We can report back our findings to the team here. |
Sorry to be very late to this discussion. I wonder if ReferenceFileSystem provides the amount of indirection being talked about here. It is an fsspec implementation, so already works with zarr (v2!), and each file is either a short binary embedded in the reference structure (for metadata) or a link to some bytes range of some URL. The list of keys is, at its simplest, just a dictionary. Similarly, name hashing and write mode could be implemented as an fsspec backend without need for explicit code in zarr or even an extension. Of course, you may argue that codifying this process as an extension is exactly the point of this thread. |
I just had another thought about the idea of making a content id available to higher level APIs and tracking it through computations: So this wouldn't go as far as making the CID available to users (which would still be very good), but would already provide some of the benefits and that even for a larger set of underlying filesystems. |
Sounds interesting and viable -- I wonder if you or someone else tried to come up with some prototypical implementation following that idea? |
We certainly are using ReferenceFileSystem and zarr to create virtual datasets consisting of binary blobs inside other files - and it works great! That's a mapping of path->(other path, offset, size), so similar but different to the discussion here - but adapting or making a new driver for content addressing inspired by that implementation should not be too hard. |
This issue describes a concept for zarr v3 protocol extension which enables content-addressable storage to be layered on top of any underlying store. It is a thought experiment only, not a concrete proposal. Elaboration of suggestions at #76.
Goals:
The protocol extension introduces a layer of indirection to the storage protocol. This can be thought of as a transformation layer which sits above the store and modifies the key/value store operations.
When attempting to store a given (key, value) pair, the storage transformer hashes the value, and the hash is used to obtain a content-addressable key.
E.g., if storing an encoded chunk value under key 'data/foo/bar/0.0', if the hash for the value is 'abcdefghijklmnopqrstuvwxyz', then the content-addressable key would be something like 'content/a/b/c/d/efghijklmnopqrstuvwxyz'. The way that the transformer generates content-addressable keys could be configured with depth, width and hash algorithm.
The transformer would then issue a request to the underlying store to set (content-addressable-key, value).
To keep track of the location of the content, a content metadata document would be created, which records the content-addressable key, together with any other metadata such as timestamp of creation. This could be a JSON document, and could record multiple versions of the content. E.g.:
This JSON document could be stored under the original key, in this case 'data/foo/bar/0.0'.
In other words, when the transformer receives a request set(key, value), it does the following:
When the transformation layer receives a request to get(key, value), it does the following:
The transformation layer could also expose an API to set the time state for reading. E.g., if time state is set to time T, then when the transformation layer receives a request to get(key, value), it does the following:
In order to discover that content-addressable storage transformer extension is in use, this could be declared in the zarr entry point metadata document, e.g., zarr.json like:
When a zarr v3 implementation opened a hierarchy using this extension, it could recognise that when parsing the entry point metadata document, and insert the appropriate store transformer if supported. I.e., a user opening such a hierarchy would not need to know that content-addressable storage was used, the implementation would discover that for itself.
There are several potential advantages of this scheme:
The group and array metadata keys would exist in the store as normal, and so any functionality inspecting the keys to infer which groups and arrays are present in the hierarchy would still work unchanged. I.e., the transformation layer could just pass through the list, list_pre and list_dir operations to the underlying store and everything should work as normal.
The metadata that tracks the content locations would be decentralised, allowing all of the same degrees of parallelism that the normal store provides. I.e., chunks could be written in parallel, and arrays could be created in parallel, without requiring any locking or synchronisation.
Because the extension is a transformation layer, any type of underlying store could be used. It would also be fine to migrate data between different types of storage, e.g., create data on a local filesystem, then copy up to object storage.
This extension does provide a slightly stronger guarantee against the underlying store getting corrupted by partially-successful writes, because the content metadata documents would only ever get updated after a successful content write operation.
Notes:
The text was updated successfully, but these errors were encountered: