Generalize to other storage systems #21
In principle it sounds great. For the non-synchronized array/chunk implementations this sounds relatively straightforward, i.e., zarr really does just need a MutableMapping where it can store and retrieve bytes. For the synchronized implementations I'm not sure how to handle the locking. I guess the locking is at a level above the MutableMapping interface, because executing [...] Very happy to discuss. |
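The "zarr really just needs a MutableMapping" idea can be made concrete with a plain dict keyed by chunk coordinates. This is only an illustrative sketch (the key format and helper names are invented, and zlib stands in for whatever compressor the array uses):

```python
import zlib

# A plain dict already satisfies the MutableMapping contract zarr would
# need: store and retrieve opaque bytes under string keys.
store = {}

def put_chunk(store, coords, data: bytes) -> None:
    # Key chunks by grid coordinates, e.g. "0.1" for chunk (0, 1).
    store[".".join(map(str, coords))] = zlib.compress(data)

def get_chunk(store, coords) -> bytes:
    return zlib.decompress(store[".".join(map(str, coords))])

put_chunk(store, (0, 1), b"some chunk bytes")
print(get_chunk(store, (0, 1)))  # b'some chunk bytes'
```

Any object with the same mapping interface (backed by a directory, a zip file, S3, HDFS, ...) could be swapped in without touching the slicing/compression logic.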
Perhaps we supply both a [...] When dealing with distributed storage/computation we rarely care about storing in place. It's far more common to make a completely new dataset as output. |
Right, but when you are storing the output, what happens depends on how [...] Yes, maybe a MutableMapping and some other interface that allows one to acquire [...] |
Hrm, yes I see. For distributed computing locks are hard. Short term I see two cheap options: [...] |
I guess if the storage (MutableMapping) and locking interfaces were [...] There would be plenty of scope for users to blow their own feet off here, [...] |
I'm not particularly picky about the API here, I'm quite willing to jump through a couple of hoops. However I'm also not too worried about people shooting off their own feet. Specifying MutableMappings and Locks is probably a bit of a hurdle and unlikely to be attempted by the casual user. |
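Supplying a MutableMapping and a Lock together need not be onerous. Here is a minimal sketch of how the two interfaces could compose; `LockedStore` is an invented name for illustration, not zarr's API:

```python
import threading
from collections.abc import MutableMapping

class LockedStore(MutableMapping):
    """Wrap any MutableMapping so that every mutation holds a lock.

    A sketch of composing storage with synchronization; the lock could
    equally be an inter-process or distributed lock with the same
    acquire/release protocol.
    """

    def __init__(self, inner, lock=None):
        self.inner = inner
        self.lock = lock or threading.Lock()

    def __getitem__(self, key):
        return self.inner[key]

    def __setitem__(self, key, value):
        with self.lock:
            self.inner[key] = value

    def __delitem__(self, key):
        with self.lock:
            del self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)

store = LockedStore({})
store["0.0"] = b"chunk bytes"  # write happens under the lock
```

A casual user would never see this; only someone plugging in a custom backend would need to think about which lock implementation to pass.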
I think this could be very elegant. It would require some fairly [...] |
Do we know of any students or other folks looking for a project? One of my goals for using zarr for this is that I wouldn't have to load the array handling bits into my head, but could restrict myself to creating MutableMappings. |
It would take some work to get to that point I'm afraid, although many [...] |
👍 I came here to suggest something exactly like this. I was going to suggest some sort of generic filesystem API, but MutableMappings sounds cleaner. Guarding against concurrent attempts to write to the same chunk does seem hard for arbitrary storage systems, but I think a little bit less safety is probably OK. |
@shoyer is there anyone in the xarray community who might be willing to take this on? Also, what are your thoughts regarding the NetCDF data model and [...]? |
@alimanfoo can you provide a more detailed list of steps that would be necessary to accomplish this? How would you go about implementing this? |
Well, you can always try @pwolfram :).
We basically need two things to make this happen, both very straightforward: [...] |
It would also be interesting to have a tar file backend. Having things in a single file can be convenient for moving data around. |
I agree that single file storage is convenient, especially for sharing datasets. I would recommend Zip over Tar though: Tar doesn't support random access, while Zip does. There is, fortunately, already a [...] |
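The random-access point is easy to demonstrate with the standard library: a zip archive carries a central directory, so any member can be read directly by name, whereas a tar archive has to be scanned from the start to locate a member. A small sketch (in-memory archives, illustrative chunk keys):

```python
import io
import tarfile
import zipfile

# Zip: the central directory at the end of the file lets us jump
# straight to any member without reading the others.
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w") as zf:
    zf.writestr("0.0", b"chunk a")
    zf.writestr("0.1", b"chunk b")
with zipfile.ZipFile(zbuf) as zf:
    print(zf.read("0.1"))  # b'chunk b', located via the index

# Tar: no index; tarfile must walk header-by-header from the start
# of the archive to find a member, which scales poorly for many chunks.
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w") as tf:
    info = tarfile.TarInfo("0.0")
    data = b"chunk a"
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))
```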
I am interested in this although time is very limited right now and I have to make sure the scope is in "needed to get science done". However, it would be great to have access to a clean bundling capability via tar/zip file and this may be useful/easy within the context of a distributed/dask/xarray integration. See also pydata/xarray#798 |
I'll give this some thought over the next few days and try to write down [...] |
I've given this some thought over the weekend and have an initial sketch for how to design the API and refactor the existing code. The good news is I think this will solve several problems in one go, and make the internal architecture much simpler with clear separation of concerns. So I'm convinced it's well worth the effort. That said, I do think the code and tests need to be completely restructured, so it's not a trivial piece of work. I wish I'd had the foresight to do it this way first time, but hey, that's why open source is good. I've pushed my initial API sketch up to a new "refactor" branch. I will give some notes and discussion below. Please note that this is just an initial sketch and I'm sure it will need to be modified/refined.

The first step is to separate out the storage layer. I propose the zarr.store.base.ArrayStore abstract class, defining the interface to storage for a single array. This class has a [...] Existing code for storing arrays in memory and on disk would be refactored into the zarr.store.memory.MemoryStore and zarr.store.directory.DirectoryStore classes respectively. New implementations of storage layers, e.g., using a ZipFile or S3, would live alongside these as separate sub-modules. Implementation of the [...]

The second step is to separate the synchronization (i.e., locking) functionality. I propose the zarr.sync.ArraySynchronizer abstract class, defining the interface to synchronization operations for a single array. The most important method is the [...] Existing code for doing thread-based locking and inter-process locking would be refactored into classes [...]

Once APIs for storage and synchronization are defined, we could implement two classes, Array and SynchronizedArray. These two classes would replace all of the existing array classes. When instantiating an [...]

The last issue is how to deal with the operations to get data from a chunk and set or modify data in a chunk, and the lowest-level operations to compress and decompress data using blosc.
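The proposed storage split could look something like the following. The class names (ArrayStore, MemoryStore, DirectoryStore) come from the comment above; the method names and signatures are guesses for illustration only, since the actual interface lives on the "refactor" branch:

```python
import os
from abc import ABC, abstractmethod

class ArrayStore(ABC):
    """Sketch of the proposed storage abstraction for a single array.

    Hypothetical method names: the real interface may differ.
    """

    @abstractmethod
    def get_chunk(self, key: str) -> bytes: ...

    @abstractmethod
    def put_chunk(self, key: str, data: bytes) -> None: ...

class MemoryStore(ArrayStore):
    """Keep compressed chunks in a dict, as the in-memory layout does."""

    def __init__(self):
        self._data = {}

    def get_chunk(self, key):
        return self._data[key]

    def put_chunk(self, key, data):
        self._data[key] = data

class DirectoryStore(ArrayStore):
    """One file per chunk inside a directory, as the on-disk layout does."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def get_chunk(self, key):
        with open(os.path.join(self.path, key), "rb") as f:
            return f.read()

    def put_chunk(self, key, data):
        with open(os.path.join(self.path, key), "wb") as f:
            f.write(data)
```

A ZipFile- or S3-backed store would subclass ArrayStore the same way, which is what keeps the "where do the bytes go" question out of the array code.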
Previously I had implemented a set of Chunk classes to encapsulate all of this, but in doing this refactoring I realise that these classes are unnecessary, i.e., all the chunk classes can be deleted. This not only simplifies the code, but it also removes a source of overhead, because no state needs to be maintained for any chunk, other than holding the compressed data for each chunk in a store. To work this last part of the API through I've implemented the [...]

Given this API, all of the existing tests would also need to be refactored. Again the good news is that the cleaner separation of concerns should also simplify the internal architecture of the tests, although this too is a fairly substantial piece of work. Any comments or thoughts on this are very welcome. |
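The "no per-chunk state" point can be made concrete: chunk access reduces to a pair of pure functions over bytes, with nothing retained between calls. In this sketch zlib stands in for blosc purely to keep it dependency-free, and the function names are invented:

```python
import zlib

def encode_chunk(chunk: bytes) -> bytes:
    # Compress chunk data on its way into the store.
    # zlib is a stand-in here; the real code would call blosc.
    return zlib.compress(chunk)

def decode_chunk(data: bytes) -> bytes:
    # Decompress chunk data on its way out of the store.
    return zlib.decompress(data)

# No Chunk object and no retained state: the store holds compressed
# bytes, and these two functions are all that's needed at access time.
store = {}
store["0.0"] = encode_chunk(b"\x00" * 1000)
restored = decode_chunk(store["0.0"])
```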
Looks like you put a lot of thought into this, and this separation of the storage layer seems a good idea to me too. Just wanted to point out that [...] |
Thanks Francesc. Looking forward to blosc2. |
Just to say #22 has work in progress on refactoring to address this issue. I'm pretty excited about this and am going to try and use bits and pieces of free time over the coming weeks to push this forward, but progress may be slow so if anyone else would like to contribute please feel free to jump in. |
A note regarding the possibility of using a zip file as storage. It looks like it is not possible to update an entry in a zip file. Calling writestr('foo', somebytes) more than once will create two 'foo' entries within the zip file. Therefore using a zip file to store a zarr array would only work under the limited circumstance that each chunk of the array is only ever written once. This would mean that calls to [...] |
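The duplicate-entry behaviour is easy to verify with the standard library (Python even emits a UserWarning about the duplicate name):

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("foo", b"first")
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # suppress "Duplicate name" warning
        zf.writestr("foo", b"second")    # appends, does NOT replace

with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())  # ['foo', 'foo'] -- both entries remain in the file
```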
Yeah, zip files don't support clean in-place updates. Semantically everything works fine, but you'll get a lot of entries in the file that are no longer useful. There are other single-file archive formats; it's tricky to find one that does everything. I still think that Zip is a good choice for sending datasets around, though probably not for workflows that involve a great deal of mutation. It's also possible to do a sort of garbage collection on the Zip file to eliminate the zombie entries. This requires a full read/write. |
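The garbage-collection idea sketched above could be as simple as rewriting the archive keeping only the last entry per name. This is an illustrative sketch, not an existing utility; `vacuum_zip` is an invented name:

```python
import io
import zipfile

def vacuum_zip(src: zipfile.ZipFile, dst_buf) -> zipfile.ZipFile:
    """Copy only the most recent entry for each name into a fresh
    archive, dropping superseded duplicates. Requires reading and
    rewriting the whole archive, as noted above."""
    latest = {}
    for info in src.infolist():
        latest[info.filename] = info  # later entries overwrite earlier ones
    with zipfile.ZipFile(dst_buf, "w") as dst:
        for name, info in latest.items():
            # read() with a ZipInfo targets that specific entry,
            # so we copy the surviving version of each name.
            dst.writestr(name, src.read(info))
    return zipfile.ZipFile(dst_buf)

# Usage: build an archive containing a zombie entry, then vacuum it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("foo", b"first")
import warnings
with zipfile.ZipFile(buf, "a") as zf:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        zf.writestr("foo", b"second")

clean = vacuum_zip(zipfile.ZipFile(buf), io.BytesIO())
print(clean.namelist())  # ['foo'] -- only the live entry survives
```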
I want something very similar to `zarr` on S3 and I'm pondering the easiest way to get there. One approach is to generalize `zarr` to accept pluggable byte storage solutions.

Currently, I believe that `zarr` effectively treats the file system as a `MutableMapping` into which it can deposit and retrieve bytes. If this is the case then what are your thoughts on actually using the `MutableMapping` interface instead of touching files directly? That way I could provide MutableMappings that use file systems, zip files, s3, hdfs, etc. This nicely isolates a lot of the "where do I put this block of bytes" logic from the array slicing and compression logic.

For concreteness, here is a `MutableMapping` that loads/stores data in a directory on the file system: https://github.com/mrocklin/zict/blob/master/zict/file.py