
Generalize to other storage systems #21

Closed
mrocklin opened this issue Apr 14, 2016 · 23 comments

Comments

@mrocklin
Contributor

I want something very similar to zarr on S3 and I'm pondering the easiest way to get there. One approach is to generalize zarr to accept pluggable byte storage solutions.

Currently, I believe that zarr effectively treats the file system as a MutableMapping into which it can deposit and retrieve bytes. If this is the case then what are your thoughts on actually using the MutableMapping interface instead of touching files directly? That way I could provide MutableMappings that use file systems, zip files, s3, hdfs, etc.. This nicely isolates a lot of the "where do I put this block of bytes" logic from the array slicing and compression logic.

For concreteness, here is a MutableMapping that loads/stores data in a directory on the file system. https://github.com/mrocklin/zict/blob/master/zict/file.py
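For illustration, a minimal file-backed MutableMapping in the spirit of zict.File might look like the following sketch (the `FileStore` name and details are illustrative, not zict or zarr code):

```python
import os
from collections.abc import MutableMapping

class FileStore(MutableMapping):
    """Toy MutableMapping that keeps each value as a file on disk.

    A minimal sketch in the spirit of zict.File; illustrative only.
    """

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key)

    def __getitem__(self, key):
        try:
            with open(self._path(key), 'rb') as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        with open(self._path(key), 'wb') as f:
            f.write(value)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        return iter(os.listdir(self.directory))

    def __len__(self):
        return len(os.listdir(self.directory))
```

An array layer built on this interface never touches files directly; it only gets and sets bytes by key.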

@alimanfoo
Member

In principle it sounds great.

For the non-synchronized array/chunk implementations this sounds relatively straightforward, i.e., zarr really does just need a MutableMapping where it can store and retrieve bytes. For the synchronized implementations I'm not sure how to handle the locking. I guess the locking is at a level above the MutableMapping interface, because executing __setitem__() on a chunk includes both a retrieval operation (get existing bytes) and a store operation (put modified bytes). Currently for the persistent synchronized chunks zarr uses a fasteners inter-process lock, which depends on access to a file system.

Very happy to discuss.

@mrocklin
Contributor Author

mrocklin commented Apr 15, 2016

Perhaps we supply both a MutableMapping and a Lock-like object? Current pairs might be (dict, threading.Lock) or (zict.File, fasteners.InterProcessLock)?

When dealing with distributed storage/computation we rarely care about storing in place. It's far more common to make a completely new dataset as output.
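As a sketch of the pairing idea: any read-modify-write on a chunk would hold the supplied lock around both the retrieval and the store. The `update_chunk` helper and its signature below are hypothetical, not zarr API:

```python
import threading

def update_chunk(store, lock, key, transform):
    """Apply `transform` to the bytes stored at `key`, under `lock`.

    `store` is any MutableMapping and `lock` any context-manager
    lock; illustrative only. The lock guards the whole
    get-then-set sequence so concurrent writers cannot interleave.
    """
    with lock:
        old = store.get(key, b'')
        store[key] = transform(old)

# One possible pairing: in-memory dict plus a thread lock.
store, lock = {}, threading.Lock()
update_chunk(store, lock, 'chunk.0', lambda b: b + b'\x01')
```

Swapping in `(zict.File(...), fasteners.InterProcessLock(...))` would give the persistent, multi-process variant without changing the array logic.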

@alimanfoo
Member

Right, but when you are storing the output, what happens depends on how your storage operations are aligned with chunk boundaries. If they are not exactly aligned then two storage operations may need to store data within the same chunk. So you need to sync access to the chunk, even when writing new outputs.

Yes, maybe a MutableMapping and some other interface that allows acquiring and releasing a lock for a given key?

Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health http://cggh.org
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Email: [email protected]
Web: http://purl.org/net/aliman
Twitter: https://twitter.com/alimanfoo
Tel: +44 (0)1865 287721

@mrocklin
Contributor Author

Hrm, yes I see. For distributed computing locks are hard. Short term I see two cheap options:

  1. Only store along chunk boundaries. In the special case of dask.array I'm completely comfortable guaranteeing this.
  2. Only ensure safe writing in the single machine case where our processes have access to a shared file system. Accessing S3, Zip files, or other storage solutions still has value outside of distributed computing.

@alimanfoo
Member

I guess if the storage (MutableMapping) and locking interfaces were decoupled then there would be flexibility to accommodate these options. I.e., a user could decide to use S3 for storage and not use any locking because they know writes will always be aligned with chunk boundaries. Or a user could choose S3 for storage and then choose thread-based locking because they know they will be executing within a multi-threaded context. Or a user could choose S3 for storage and choose locking based on access to a shared file-system. Any combination of storage and locking would be possible, although some would not be advisable depending on what the user wants to do and how they want to run it.

There would be plenty of scope for users to blow their own feet off here, so maybe a lower level API provides the full flexibility, and a higher-level API provides convenience functions based on sensible combinations of storage and locking for some common usage patterns?


@mrocklin
Contributor Author

I'm not particularly picky about the API here, I'm quite willing to jump through a couple of hoops. However I'm also not too worried about people shooting off their own feet. Specifying MutableMappings and Locks is probably a bit of a hurdle and unlikely to be attempted by the casual user.

@alimanfoo
Member

I think this could be very elegant. It would require some fairly substantial restructuring of the existing code, but I like the decoupling of concerns and think it would be worth it. I'm sorry I wouldn't have any time to work on this myself at least until after the summer, but I'd be more than happy to discuss ideas and review code.


@mrocklin
Contributor Author

Do we know of any students or other folks looking for a project? One of my goals for using zarr for this is that I wouldn't have to load the array handling bits into my head, but could restrict myself to creating MutableMappings.

@alimanfoo
Member

It would take some work to get to that point, I'm afraid, although many aspects of the array-handling logic could be left untouched; it is more about refactoring.


@shoyer
Contributor

shoyer commented Apr 15, 2016

👍 I came here to suggest something exactly like this. I was going to suggest some sort of generic filesystem API, but MutableMappings sounds cleaner.

Guarding against concurrent attempts to write to the same chunk does seem hard for arbitrary storage systems, but I think a little bit less safety is probably OK.

@mrocklin
Contributor Author

@shoyer is there anyone in the xarray community who might be willing to take this on?

Also, what are your thoughts regarding the NetCDF data model and zarr? How do we manage related variables and coordinates?

@mrocklin
Contributor Author

@alimanfoo can you provide a more detailed list of steps that would be necessary to accomplish this? How would you go about implementing this?

@shoyer
Contributor

shoyer commented Apr 15, 2016

@shoyer is there anyone in the xarray community who might be willing to take this on?

Well, you can always try @pwolfram :).

Also, what are your thoughts regarding the NetCDF data model and zarr? How do we manage related variables and coordinates?

We basically need two things to make this happen, both very straightforward:

  • zhdf: a hierarchical storage system for zarrays that supports metadata. The obvious approach is to use the filesystem with __zattr__ JSON files to hold metadata.
  • znetcdf: mapping from the netCDF data model to zhdf. h5netcdf is a good example of what this entails. This should be even easier, since we don't have HDF5 dimension scales to worry about. We just need to settle on a convention for how to name dimensions in __zattr__ files.
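To make the first bullet concrete, the attributes file could simply be JSON. The `__zattr__` file name comes from the comment above, while the `_dimensions` key below is an illustrative convention only, not a settled spec:

```python
import json

# Hypothetical contents of a per-array __zattr__ JSON file:
# named dimensions (netCDF-style) plus ordinary user attributes.
attrs = {
    "_dimensions": ["time", "lat", "lon"],  # illustrative key name
    "units": "kelvin",
}

# Round-trip through JSON, as a zhdf-style layer might do on disk.
encoded = json.dumps(attrs, indent=2)
decoded = json.loads(encoded)
```

Because JSON round-trips cleanly, the same convention works whether the bytes live on a file system, in a zip file, or on S3.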

@shoyer
Contributor

shoyer commented Apr 15, 2016

It would also be interesting to have a tar file backend. Having things in a single file can be convenient for moving data around.

@mrocklin
Contributor Author

I agree that single file storage is convenient, especially for sharing datasets. I would recommend Zip over Tar though. Tar doesn't support random access while Zip does. There is, fortunately, already a Zip(MutableMapping) within the zict library. It was actually the first implementation in there and the reason for the name zict.
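A zip-backed store could look roughly like the following sketch (this `ZipStore` is illustrative, not the zict implementation; note that entries can effectively only be written once, and deletion is unsupported, reflecting the format's limits):

```python
import zipfile
from collections.abc import MutableMapping

class ZipStore(MutableMapping):
    """Write-once MutableMapping over a zip archive. Illustrative only."""

    def __init__(self, path):
        # mode 'a' creates the archive if absent, else appends to it
        self.zf = zipfile.ZipFile(path, mode='a')

    def __getitem__(self, key):
        # ZipFile.read raises KeyError for a missing entry name
        return self.zf.read(key)

    def __setitem__(self, key, value):
        self.zf.writestr(key, value)

    def __delitem__(self, key):
        raise NotImplementedError("zip entries cannot be removed in place")

    def __iter__(self):
        return iter(self.zf.namelist())

    def __len__(self):
        return len(self.zf.namelist())

    def close(self):
        self.zf.close()
```

Random access via the central directory is what makes this workable where tar is not.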

@pwolfram

I am interested in this, although time is very limited right now and I have to keep the scope within "needed to get science done". However, it would be great to have access to a clean bundling capability via a tar/zip file, and this may be useful/easy within the context of a distributed/dask/xarray integration. See also pydata/xarray#798

@alimanfoo
Member

I'll give this some thought over the next few days and try to write down some ideas on the API and how to do the refactoring.


@alimanfoo
Member

I've given this some thought over the weekend and have an initial sketch for how to design the API and refactor the existing code. The good news is I think this will solve several problems in one go, and make the internal architecture much simpler with clear separation of concerns. So I'm convinced it's well worth the effort. That said, I do think the code and tests need to be completely restructured, so it's not a trivial piece of work. I wish I'd had the foresight to do it this way first time, but hey, that's why open source is good.

I've pushed my initial API sketch up to a new "refactor" branch. I will give some notes and discussion below. Please note that this is just an initial sketch and I'm sure will need to be modified/refined.

The first step is to separate out the storage layer. I propose the zarr.store.base.ArrayStore abstract class, defining the interface to storage for a single array. This class has a meta property, holding a MutableMapping for storing essential configuration metadata for the array such as shape, chunk shape, dtype, compression options, etc. The data property is a MutableMapping that maps chunk indices to bytes objects holding compressed chunk data. For example, data[0, 0] returns compressed bytes for the chunk at index (0, 0). The attrs property is a MutableMapping storing user-defined attributes. The cbytes property returns the total size of compressed data held for the array. It may also be useful to implement an initialized property returning the number of chunks that have been initialised with some data.
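A sketch of that interface, plus a trivial in-memory implementation, using the property names described in this proposal (illustrative only, not the code on the refactor branch):

```python
from abc import ABC, abstractmethod

class ArrayStore(ABC):
    """Proposed storage interface for a single array (sketch)."""

    @property
    @abstractmethod
    def meta(self):
        """MutableMapping of configuration metadata (shape, dtype, ...)."""

    @property
    @abstractmethod
    def data(self):
        """MutableMapping from chunk index tuples to compressed bytes."""

    @property
    @abstractmethod
    def attrs(self):
        """MutableMapping of user-defined attributes."""

    @property
    @abstractmethod
    def cbytes(self):
        """Total size in bytes of compressed data held for the array."""

class MemoryStore(ArrayStore):
    """Simplest concrete store: a plain dict behind each mapping."""

    def __init__(self):
        self._meta, self._data, self._attrs = {}, {}, {}

    @property
    def meta(self):
        return self._meta

    @property
    def data(self):
        return self._data

    @property
    def attrs(self):
        return self._attrs

    @property
    def cbytes(self):
        return sum(len(v) for v in self._data.values())
```

A DirectoryStore, ZipStore or S3Store would implement the same four properties over their respective backends.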

Existing code for storing arrays in memory and on disk would be refactored into the zarr.store.memory.MemoryStore and zarr.store.directory.DirectoryStore classes respectively. New implementations of storage layers, e.g., using a ZipFile or S3, would live alongside these as separate sub-modules. Implementation of the MemoryStore class could be very simple, with the meta, data and attrs properties each being a dict.

The second step is to separate the synchronization (i.e., locking) functionality. I propose the zarr.sync.ArraySynchronizer abstract class, defining the interface to synchronization operations for a single array. The most important method is the lock_chunk method, which implements a lock on a single chunk.

Existing code for doing thread-based locking and inter-process locking would be refactored into classes zarr.sync.ThreadSynchronizer and zarr.sync.ProcessSynchronizer respectively.
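A thread-based synchronizer along these lines might look like the following sketch (the lock_chunk method name comes from the description above; the implementation details are illustrative):

```python
import threading
from collections import defaultdict
from contextlib import contextmanager

class ThreadSynchronizer:
    """Sketch of the proposed ArraySynchronizer interface."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    @contextmanager
    def lock_chunk(self, key):
        # protect the lock table itself while fetching the chunk lock
        with self._mutex:
            lock = self._locks[key]
        # then hold the per-chunk lock for the caller's critical section
        with lock:
            yield
```

A ProcessSynchronizer would expose the same lock_chunk interface but back it with, e.g., fasteners inter-process locks on a shared file system.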

Once APIs for storage and synchronization are defined, we could implement two classes, Array and SynchronizedArray. These two classes would replace all of the existing array classes. When instantiating an Array an ArrayStore is provided as the only constructor argument. When instantiating a SynchronizedArray both an ArrayStore and an ArraySynchronizer are provided.

The last issue is how to deal with the operations to get data from a chunk and set or modify data in a chunk, and the lowest-level operations to compress and decompress data using blosc. Previously I had implemented a set of Chunk classes to encapsulate all of this, but in doing this refactoring I realise that these classes are unnecessary, i.e., all the chunk classes can be deleted. This not only simplifies the code, but also removes a source of overhead, because no state needs to be maintained for any chunk, other than holding the compressed data for each chunk in a store.

To work this last part of the API through I've implemented the __getitem__ and __setitem__ methods on the Array class. All blosc-related operations could now be encapsulated within a very simple cython extension which I've put at blosc.pyx. Hopefully this provides enough information to figure out how to begin reorganising the existing code. Details of how to get the blosc interface right might need to get modified when this actually hits the ground.

Given this API, all of the existing tests would also need to be refactored. Again the good news is that the cleaner separation of concerns should also simplify the internal architecture of the tests, although this too is a fairly substantial piece of work.

Any comments or thoughts on this very welcome.

@FrancescAlted

Looks like you've put a lot of thought into this, and this separation of the storage layer seems a good idea to me too.

Just wanted to point out that blosc_compress_ctx() and blosc_decompress_ctx() will be removed in Blosc2 and replaced by newer blosc2_compress_ctx() and blosc2_decompress_ctx(). This is because contexts will be fully supported in Blosc2, and that will allow for a faster operation (i.e. contexts can be reused). More info on these new Blosc2 functions: https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc.h#L535

@alimanfoo alimanfoo mentioned this issue Apr 19, 2016
@alimanfoo
Member

Thanks Francesc. Look forward to blosc2.


@alimanfoo
Member

Just to say #22 has work in progress on refactoring to address this issue. I'm pretty excited about this and am going to try and use bits and pieces of free time over the coming weeks to push this forward, but progress may be slow so if anyone else would like to contribute please feel free to jump in.

@alimanfoo
Member

A note regarding the possibility of using a zip file as storage. It looks like it is not possible to update an entry in a zip file. Calling writestr('foo', somebytes) more than once will create two 'foo' entries within a zip file. Therefore using a zip file to store a zarr array would only work under the limited circumstances that each chunk of the array is only ever written once. This would mean that calls to __setitem__ on the array would have to be perfectly aligned with chunk boundaries.
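The behaviour described above is easy to demonstrate with the standard library's zipfile module: writing the same name twice appends a second entry rather than replacing the first.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # zipfile warns about duplicate names
    with zipfile.ZipFile(buf, mode='w') as zf:
        zf.writestr('foo', b'first')
        zf.writestr('foo', b'second')  # appends, does not replace

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()   # two 'foo' entries now exist
    latest = zf.read('foo') # read() resolves to the most recent entry
```

So reads see the latest write, but the stale bytes for the first entry remain in the archive forever.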

@mrocklin
Contributor Author

Yeah, zip files don't support clean in-place updates. Semantically everything works fine but you'll get a lot of entries in the file that are no longer useful. There are other single-file archive formats, it's tricky to find one that does everything. I still think that Zip is a good choice for sending datasets around, though probably not for workflows that involve a great deal of mutation.

It's also possible to do a sort of garbage collection on the Zip file to eliminate the zombie entries. This requires a full read/write.

alimanfoo added a commit that referenced this issue May 17, 2016
Refactoring for v1.0. Resolves #27, #25, #21, #7.