
Integration with dask/distributed (xarray backend design) #798

Closed
2 tasks done
pwolfram opened this issue Mar 21, 2016 · 59 comments

@pwolfram
Contributor

pwolfram commented Mar 21, 2016

Dask (https://github.com/dask/dask) currently provides on-node parallelism for medium-size data problems. However, analyzing large climate data sets is a big-data problem that requires multi-node parallelism. A likely solution is the integration of distributed (https://github.com/dask/distributed) with dask. Distributed is now integrated with dask and its benefits are already starting to be realized, e.g., see http://matthewrocklin.com/blog/work/2016/02/26/dask-distributed-part-3.

Thus, this issue is designed to identify, at a high level, the steps needed to perform this integration. As stated by @shoyer, it will

"definitely require some refactoring of the xarray backend system to make this work cleanly, but that's OK -- the xarray backend system is indicated as experimental/internal API precisely because we hadn't figured out all the use cases yet.

To be honest, I've never been entirely happy with the design we took there (we use inheritance rather than composition for backend classes), but we did get it to work for our use cases. Some refactoring with an eye towards compatibility with dask distributed seems like a very worthwhile endeavor. We do have the benefit of a pretty large test suite covering existing use cases."

Thus, we have the chance to make xarray big-data capable as well as provide improvements to the backend.

To this end, I'm starting this issue to help begin the design process following the xarray mailing list discussion some of us have been having (@shoyer, @mrocklin, @rabernat).

Task To Do List:

@pwolfram
Contributor Author

See also dask/dask#922

@mrocklin
Contributor

Copying over a comment from that issue:

Yes, so the problem as I see it is that, for serialization and open-file reasons we want to use a function like the following:

import netCDF4

def get_chunk_of_array(filename, datapath, slice):
    with netCDF4.Dataset(filename) as f:
        return f.variables[datapath][slice]

However, this opens and closes many files, which while robust, is slow. We can alleviate this by maintaining an LRU cache in a global variable so that it is created separately per process.

from toolz import memoize

# LRUDict: an LRU dict that closes evicted files (e.g., something like zict.LRU)
cache = LRUDict(size=100, on_eviction=lambda file: file.close())

netCDF4_Dataset = memoize(netCDF4.Dataset, cache=cache)

def get_chunk_of_array(filename, datapath, slice):
    f = netCDF4_Dataset(filename)
    return f.variables[datapath][slice]

I'm happy to supply the memoize function with toolz and an appropriate LRUDict object with other microprojects that I can publish if necessary.

We would then need to use such a function within the dask.array and xarray codebases.

Anyway, that's one approach. Thoughts welcome.
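The memoize-plus-LRU idea above can be sketched with a tiny self-contained cache. This is a hypothetical illustration (names are made up; in practice something like zict.LRU plus toolz.memoize would be used), with `open_func` standing in for netCDF4.Dataset:

```python
from collections import OrderedDict

class LRUFileCache:
    """Minimal sketch of an LRU cache of open file handles.
    Evicted handles are closed automatically, as in the snippet above."""

    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self._files = OrderedDict()

    def get(self, filename, open_func):
        if filename in self._files:
            self._files.move_to_end(filename)  # mark as most recently used
            return self._files[filename]
        handle = open_func(filename)
        self._files[filename] = handle
        if len(self._files) > self.maxsize:
            _, oldest = self._files.popitem(last=False)  # least recently used
            oldest.close()  # close on eviction
        return handle
```

Because the cache lives in a per-process global, each worker process maintains its own pool of open files rather than trying to serialize open handles between machines.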

@pwolfram
Contributor Author

Here is an example of a use case for a nanmean over ensembles in collaboration with @mrocklin and following http://matthewrocklin.com/blog/work/2016/02/26/dask-distributed-part-3: https://gist.github.com/mrocklin/566a8d5c3f6721abf36f

@pwolfram
Contributor Author

@shoyer and @mrocklin, I've updated the summary above in the PR description with a to-do list. Do either of you see any obvious tasks I missed? If so, can you please update the to-do list so that I can see what needs to be done to modify the backend for the dask/distributed integration?

@pwolfram
Contributor Author

Repeating @mrocklin:

Dask.array writes data to any object that supports numpy style setitem syntax like the following:

dataset[my_slice] = my_numpy_array

Objects like h5py.Dataset and netcdf objects support this syntax.

So dask.array would work today without modification if we had such an object that represented many netcdf files at once and supported numpy-style setitem syntax, placing the numpy array properly across the right files. This work could happen easily without deep knowledge of either project.

Alternatively, we could make the dask.array.store function optionally lazy so that users (or xarray) could call store many times before triggering execution.
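As a sketch of the first option, here is a minimal hypothetical object that accepts numpy-style setitem and routes a global slice along the first axis to the right per-file target. Plain numpy arrays stand in for the per-file netCDF variables; only simple slices along axis 0 are handled:

```python
import numpy as np

class MultiFileTarget:
    """Hypothetical sketch: one writable target per file, logically
    concatenated along axis 0. Supports numpy-style setitem for
    plain slices along the first axis."""

    def __init__(self, targets, sizes):
        self.targets = targets
        # global start offset of each file's block along axis 0
        self.offsets = np.cumsum([0] + list(sizes))

    def __setitem__(self, key, value):
        start = key.start or 0
        stop = key.stop if key.stop is not None else self.offsets[-1]
        for i, target in enumerate(self.targets):
            lo, hi = self.offsets[i], self.offsets[i + 1]
            o_lo, o_hi = max(start, lo), min(stop, hi)
            if o_lo < o_hi:  # this file overlaps the requested region
                target[o_lo - lo:o_hi - lo] = value[o_lo - start:o_hi - start]
```

dask.array.store could then write into such an object the same way it writes into an h5py.Dataset, with each assignment landing in the correct file.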

@pwolfram
Contributor Author

This issue of connecting to dask/distributed may also be connected with #463, #591, and #524.

@mrocklin
Contributor

I believe that robustly supporting HDF/NetCDF reads with the mechanism mentioned above will resolve most problems from a dask.array perspective. I have no doubt that other things will arise, though. Switching from shared to distributed memory always comes with (surmountable) obstacles.

@shoyer
Member

shoyer commented Mar 25, 2016

I agree with @mrocklin that the LRUCache for file-like objects should take care of things from the dask.array perspective. It should also solve #463 in a very clean way. We'll just need to reorganize things a bit to make use of it.

@pwolfram
Contributor Author

Thanks @shoyer. If you can provide some guidance on bounds for the reorganization that would be really great. I want your and @jhamman's feedback on this before I try a solution. The trick is just to make the time, as always, and I may have some time this coming weekend.

@pwolfram
Contributor Author

pwolfram commented Apr 2, 2016

Another note in support of this issue, especially "robustly support HDF/NetCDF reads": I am having problems with "NetCDF: HDF error" as previously reported by @rabernat in #463. Thus, a solution here will save time and may arguably be on the critical path of some workflows, because fewer jobs will fail and require baby-sitting/restarts, especially when running multiple jobs at once.

@mrocklin
Contributor

mrocklin commented Apr 2, 2016

FWIW I've uploaded a tiny LRU dict implementation to a new zict project (which also has some other stuff):

http://zict.readthedocs.org/en/latest/

pip install zict

from zict import LRU
d = LRU(100, dict())

There are a number of good alternatives out there though for LRU dictionaries.

@pwolfram
Contributor Author

pwolfram commented Apr 4, 2016

Thanks @mrocklin! This has been really helpful and was what I needed to get going.

A prelim design I'm seeing is to modify the NetCDF4DataStore class https://github.com/pydata/xarray/blob/master/xarray/backends/netCDF4_.py#L170 to meet these requirements:

  1. At __init__, try to open the file via the LRU cache. I think the LRU dict has to be a global, because the open-file limit is an attribute of the system, correct?
  2. For each read from a file, ensure it hasn't been closed, via a @ds.getter property method. If it has, reopen it via the LRU cache. This is OK because for a read the file is essentially read-only. The LRU closes out stale entries to prevent "too many open files" errors. Checking this should be fast.
  3. sync is only for a write, but it seems like it should follow the same approach.

A clean way to do this is just to make sure that each time self.ds is called, it is re-validated via the LRU cache. This should be able to be implemented via property getter methods https://docs.python.org/2/library/functions.html#property.

Unless I'm missing something big, I don't think this change will require a large refactor, but it is quite possible I overlooked something important. @shoyer and @mrocklin, do you see any obvious pitfalls in this scope? If not, it shouldn't be too hard to implement.
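Requirement 2 above — re-validating self.ds through the cache on every access via a property getter — could be sketched like this (all names hypothetical; a plain dict stands in for the global LRU cache):

```python
class NetCDFStoreSketch:
    """Hypothetical sketch: every access to .ds goes through a shared,
    per-process cache, reopening the file if it was evicted/closed."""

    _global_cache = {}  # stand-in for the global LRU cache

    def __init__(self, filename, opener):
        self._filename = filename
        self._opener = opener  # e.g. netCDF4.Dataset, or a fake in tests

    @property
    def ds(self):
        # Re-validate on every access: reopen if evicted or closed.
        f = self._global_cache.get(self._filename)
        if f is None or getattr(f, 'closed', False):
            f = self._opener(self._filename)
            self._global_cache[self._filename] = f
        return f
```

The check is a dict lookup plus an attribute test, so routing every self.ds access through it should indeed be cheap.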

@fmaussion
Member

Sorry if I am just producing noise here (I am not a specialist), but I have two naive questions:

To 1.: How will you handle concurrent access to the LRU cache if it's a global variable?

To 2.: Once the file has been closed by the LRU, won't it also be erased from it? So a simple "if file in LRU:" could suffice as a test of whether the file has been closed or not?

@shoyer
Member

shoyer commented Apr 4, 2016

I think the LRU dict has to be a global because because the file restriction is an attribute of the system, correct?

Correct, the LRU dict should be global. I believe the file restriction is generally per-process, and creating a global dict should assure that works properly.

For each read from a file, ensure it hasn't been closed via a @ds.getter property method. If so, reopen it via the LRU cache. This is ok because for a read the file is essentially read-only. The LRU closes out stale entries to prevent the too many open file errors. Checking this should be fast.

The challenge is that we only call the .get_variables() method (and hence self.ds) once on a DataStore when a Dataset is opened from disk. I think we need to refactor NetCDF4ArrayWrapper to take a filename instead, and use something like @mrocklin's netcdf_Dataset.

My bigger concern was how to make use of a method like futures_to_dask_arrays. But it looks like that may actually not be necessary, at least if we are happy to open all netCDF files (and read out the metadata) from a master process.
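The refactor described above — having the array wrapper hold a filename rather than a live file handle — might look like the following sketch. All names are hypothetical; `opener` stands in for a memoized/cached netCDF4.Dataset:

```python
class FilenameArrayWrapper:
    """Hypothetical sketch: store a filename and variable name instead of
    an open file object, so the wrapper pickles cleanly and can be shipped
    to remote dask workers."""

    def __init__(self, filename, varname, opener):
        self.filename = filename
        self.varname = varname
        self.opener = opener  # e.g. a cached/memoized netCDF4.Dataset

    def __getitem__(self, key):
        # (Re)open through the cache on every access; cheap when cached.
        ds = self.opener(self.filename)
        return ds.variables[self.varname][key]
```

Since the wrapper carries only strings plus a picklable opener, it sidesteps the serialization problem of shipping open netCDF4 handles between processes.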

@pwolfram
Contributor Author

pwolfram commented Apr 4, 2016

Just to be clear, we are talking about this https://github.com/mrocklin/hdf5lazy/blob/master/hdf5lazy/core.py#L83 for @mrocklin's netcdf_Dataset, right?

@pwolfram
Contributor Author

pwolfram commented Apr 4, 2016

@fmaussion, for

  1. The LRU cache should be used serially for the read initially, but something more like @mrocklin's netcdf_Dataset appears to be needed as @shoyer points out. I need to think about this more.
  2. I was thinking we would keep track of the file name outside the LRU and only use the filename to open up datasets inside the LRU if they aren't already open. Agreed that if file in LRU should designate whether the file is open.

@pwolfram
Contributor Author

pwolfram commented Apr 4, 2016

@shoyer, if we are happy to open all netCDF files and read out the metadata from a master process, that would imply that we would open a file, read the metadata, and then close it, correct?

Array access should then follow something like the @mrocklin's netcdf_Dataset approach, right?

@shoyer
Member

shoyer commented Apr 4, 2016

@pwolfram I was referring to this comment for @mrocklin's netCDF4_Dataset.

@shoyer
Member

shoyer commented Apr 4, 2016

@shoyer, if we are happy to open all netCDF files and read out the metadata from a master process, that would imply that we would open a file, read the metadata, and then close it, correct?

Array access should then follow something like the @mrocklin's netcdf_Dataset approach, right?

Yes, this is correct.

In principle, if we have a very large number of files containing many variables each, we might want to do the read in parallel using futures, and then use something like futures_to_dask_arrays to bring them together. That seems much trickier to integrate into our current backend approach.

@mrocklin
Contributor

It's probably best to avoid futures within xarray, so far they're only in the distributed memory scheduler. I think that ideally we create graphs that can be used robustly in either. I think that the memoized netCDF4_Dataset approach can probably do this just fine. Is there anything that is needed from me to help push this forward?

@kynan

kynan commented Oct 20, 2016

Has this issue progressed since?

Being able to distribute loading of files to a dask cluster and composing an xarray Dataset from data on remote workers would be a great feature.

Is @mrocklin's blog post from Feb 2016 still the reference for remote data loading on a cluster? Adapting it to loading xarray Datasets rather than plain arrays is not straightforward since there is no way to combine futures representing Datasets out of the box.

@mrocklin
Contributor

I haven't worked on this but agree that it is important.

@pwolfram
Contributor Author

@kynan, I'm still interested in this but have not had time to advance this further. Are you interested in contributing to this too?

I view this as a key component of future climate analysis workflows. This may also be something that is addressed at the upcoming hackathon at Columbia with @rabernat early next month.

Also, I suspect that both @mrocklin and @shoyer would be willing to continue to provide key advice because this appears to be aligned with their interests too (please correct me if I'm wrong in this assessment).

@mrocklin
Contributor

Definitely happy to support from the Dask side.

I think that the LRU method described above is feasible.

@mrocklin
Contributor

If XArray devs want to chat sometime I suspect we could hammer out a plan fairly quickly. My hope is that once a plan exists then a developer will arise to implement that plan. I'm free all of today and tomorrow.

@pwolfram
Contributor Author

@mrocklin, I would be happy to chat because I am interested in seeing this happen (e.g., eventually contributing code). The question is whether we need additional expertise from @shoyer, @jhamman, @rabernat etc who likely have a greater in-depth understanding of xarray than me. Perhaps this warrants an email to the wider list?

@mrocklin
Contributor

I agree that this conversation needs expertise from a core xarray developer. I suspect that this change is more likely to happen in xarray than in dask.array. Happy to continue the conversation wherever. I do have a slight preference to switch to real-time at some point though. I suspect that we can hash this out in a moderate number of minutes.

@shoyer
Member

shoyer commented Nov 8, 2016

We have something very hacky working with #1095

I'm also going to see if I can get something working with the LRU cache, since that seems closer to the solution we want eventually.

@mrocklin
Contributor

mrocklin commented Nov 8, 2016

FYI Dask is committed to maintaining this: https://github.com/dask/zict/blob/master/zict/lru.py

@shoyer
Member

shoyer commented Nov 8, 2016

One slight subtlety is writes. We'll need to switch from 'w' to 'a' mode the second time we open a file.
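That subtlety could be handled by remembering which files this process has already created. The helper below is a hypothetical sketch (using plain builtin open for illustration; with netCDF the opener would be netCDF4.Dataset):

```python
_already_created = set()  # filenames this process has opened for writing

def open_for_write(filename, opener=open):
    """First open truncates ('w'); any reopen (e.g. after an LRU eviction)
    appends ('a') so earlier writes are not clobbered."""
    mode = 'a' if filename in _already_created else 'w'
    _already_created.add(filename)
    return opener(filename, mode)
```

The set, like the LRU cache itself, would live as a per-process global next to the cache.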

@shoyer
Member

shoyer commented Nov 8, 2016

A few other thoughts on thread safety with the LRU approach:

  1. We need a global lock to ensure internal consistency of the LRU cache, and so that we don't overwrite files without closing them. It probably makes sense to put this in the memoize function.
  2. We need separate, per-file locks to ensure that we don't evict files in the process of reading or writing data from them (which would cause segfaults). As a stop-gap measure, we could simply refuse to evict files until we can acquire a lock, but more broadly this suggests that strict LRU is not quite right. Instead, we want to evict the least-recently-used unlocked item.
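Evicting the least-recently-used *unlocked* item, as point 2 suggests, could look like this sketch (hypothetical helper; the cache is an OrderedDict of open files, oldest first, with a threading.Lock per key):

```python
import threading
from collections import OrderedDict

def evict_one(cache, locks):
    """Evict the least-recently-used entry whose per-file lock is free.
    Returns the evicted key, or None if every file is currently in use."""
    for key in list(cache):  # OrderedDict iterates oldest -> newest
        if locks[key].acquire(blocking=False):  # skip files being read/written
            try:
                cache.pop(key).close()
                return key
            finally:
                locks[key].release()
    return None  # everything locked; caller may block, retry, or grow the cache
```

Returning None rather than blocking leaves the stop-gap choice (wait on the oldest lock vs. temporarily exceed the cache size) to the caller.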

@kynan

kynan commented Nov 8, 2016

Great to see this moving! I take it the workshop was productive?

How does #1095 work in the scenario of a distributed scheduler with remote workers? Do I understand correctly that all workers and the client would need to see the same shared filesystem from where NetCDF files are read?

@mrocklin
Contributor

mrocklin commented Nov 8, 2016

Yes.


@kynan

kynan commented Nov 22, 2016

When using xarray with the dask.distributed scheduler it would be useful to be able to persist intermediate DataArrays / Datasets on remote workers.

There could be a persist method analogous to the compute method introduced in #1024. Potential issues with this approach are:

  1. What are the semantics of this operation for the general case where dask or distributed are not used?
  2. Is it justified to add an operation which is rather specific to the distributed scheduler?

(Could create a separate issue for this if preferred).

@mrocklin
Contributor

mrocklin commented Nov 22, 2016

One solution is to create protocols on the Dask side to enable dask.distributed.Client.persist itself to work on XArray objects. This keeps scheduler-specific details like persist on the scheduler side.

@shoyer
Member

shoyer commented Nov 28, 2016

@mrocklin Any thoughts on my thread safety concerns (#798 (comment)) for the LRU cache? I suppose the simplest thing to do is to simply refuse to evict a file until the per-file lock is released, but I can see that strategy failing pretty badly in edge cases.

@mrocklin
Contributor

A lock on the LRU cache makes sense to me.

We need separate, per file locks, to ensure that we don't evict files in the process of reading or writing data from them (which would cause segfaults). As a stop-gap measure, we could simply refuse to evict files until we can acquire a lock, but more broadly this suggests that strict LRU is not quite right. Instead, we want to evict the least-recently-used unlocked item

If it were me I would just block on the evicted file until it becomes available (the stop-gap measure) until it became a performance problem.

@mrocklin mrocklin mentioned this issue Dec 20, 2016
shoyer added a commit to shoyer/xarray that referenced this issue Dec 22, 2016
…writing

Fixes pydata#1172

The serializable lock will be useful for dask.distributed or multi-processing
(xref pydata#798, pydata#1173, among others).
shoyer added a commit that referenced this issue Jan 4, 2017
…ing (#1179)

* Switch to shared Lock (SerializableLock if possible) for reading and writing

Fixes #1172

The serializable lock will be useful for dask.distributed or multi-processing
(xref #798, #1173, among others).

* Test serializable lock

* Use conda-forge for builds

* remove broken/fragile .test_lock
@pwolfram
Contributor Author

@mrocklin and @shoyer, we now have dask.distributed and xarray support. Should this issue be closed?

@mrocklin
Contributor

Has anyone used XArray on NetCDF data on a cluster without resorting to any tricks?

@mrocklin
Contributor

mrocklin commented Jun 1, 2017

@shoyer, regarding per-file locking: this probably only matters if we are writing as well, yes?

Here is a small implementation of a generic file-open cache. I haven't yet decided on an eviction policy, but either LRU or random (filtered by closeable files) should work OK.

from collections import defaultdict
from contextlib import contextmanager
import threading

class OpenCache(object):
    def __init__(self, maxsize=100):
        self.refcount = defaultdict(lambda: 0)
        self.maxsize = maxsize
        self.cache = {}
        self.lock = threading.Lock()

    @contextmanager
    def open(self, myopen, fn, mode='r'):
        assert 'r' in mode
        key = (myopen, fn, mode)
        with self.lock:
            try:
                file = self.cache[key]
            except KeyError:
                file = myopen(fn, mode=mode)
                self.cache[key] = file

            self.refcount[key] += 1

            if len(self.cache) > self.maxsize:
                # Clear old files intelligently (eviction policy TBD;
                # only entries with refcount == 0 are safe to close)
                pass

        try:
            yield file
        finally:
            with self.lock:
                self.refcount[key] -= 1

import h5py

cache = OpenCache()
with cache.open(h5py.File, 'myfile.hdf5') as f:
    x = f['/data/x']
    y = x[:1000, :1000]

Is this still useful?

I'm curious to hear from users like @pwolfram and @rabernat who may be running into the many file problem about what the current pain points are.

@jhamman
Member

jhamman commented Jan 13, 2019

Closing this old issue. The final checkbox in @pwolfram's original post was completed in #2261.
