Understanding parallelism with Dask within Iris #3919
Hi @thomascrocker, I am not as familiar with dask.bag (henceforth db) as I am with dask.array (da). Indeed, I think that is most likely the case. Having said that, I must ask for a bit of clarification:
Lastly, maybe you could consider saving yourself the trouble of developing this software by using any one from the following list:
Thanks for getting back to me.

1.) I'm not using the distributed scheduler at all. The dask bag by default uses the multiprocessing scheduler, which requires no setup. Our slurm system has multiple nodes; each node has 42 CPUs and ~242 GB of memory. It's not possible to submit multi-node jobs in our setup (although I understand one can get around this by creating a dask.distributed scheduler to do so). The multiprocessing scheduler allows me to launch parallel jobs on a single node and can create a new process on each CPU requested. So if I launch my job requesting 24 CPUs, I'll get 24 different processes running on a single node. The multiprocessing scheduler handles launching these jobs and collecting the results automatically.

2 and 3.)
For example:
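(An illustrative sketch rather than my actual code; it assumes a daily precipitation cube called `cube` in units of mm per day and a 1 mm per day wet-day threshold:)

```python
import iris.analysis

# Count of wet days per grid box over the whole period; the threshold is
# expressed in the cube's own units (assumed to be mm per day here).
wet_day_count = cube.collapsed(
    "time",
    iris.analysis.COUNT,
    function=lambda values: values >= 1.0,
)
```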
I have been using ESMValTool and did look into the extreme events recipe; however, the data I'm using is from an RCM, and I think the extra rotated latitude and longitude dimensions caused the diagnostic to fail. My R isn't good enough to fix it, unfortunately. (Incidentally, the data is from the same source as this issue: ESMValGroup/ESMValCore#772.) For the future, though, I'd be very interested in making proper use of dask array. If you could share an example of how you invoke a parallel calculation on iris cubes using the dask array approach, I'd be interested to see it. Thanks
On 1), I see, so it is a single-node job; it's just that the node has many cores. I am afraid the terminology in this area (nodes, sockets, CPUs, cores, hyperthreads, ...) is always a bit confusing, but it seems like what you are doing is working. I would still suggest giving distributed a try: it works very well (and is preferred by some) on a single machine (see here), and gives you some nice things like the dashboard and certain profiling support. Plus it offers a straightforward path to multi-node jobs later on.

On 2) and 3), a couple of points, though I am afraid more details and a discussion of percentile-based indices is a bit too much for me right now. As a side note, if possible, try to change the units of your thresholds to the units of your cube; that is much less work for the computer. As for using ..., in the simple case of ...
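Coming back to distributed on a single machine, a minimal sketch would be something along these lines (the worker count and memory limit are placeholders to adapt to your node):

```python
from dask.distributed import Client, LocalCluster

# Scheduler and workers all live on the local node; no multi-node setup needed.
cluster = LocalCluster(n_workers=24, threads_per_worker=1, memory_limit="8GB")
client = Client(cluster)
print(client.dashboard_link)  # link to the dashboard mentioned above

# From here on, dask computations (including iris' lazy operations) are
# executed on these workers instead of the default local scheduler.
```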
Thanks for the tips! I've begun experimenting with distributed to see if I can avoid manually breaking up my data for the dask bag, and it... kind of works. So, if I do this...
I get this error when it tries to save...
My understanding is that the actual compute isn't called until it's needed, so it only actually happens on the save. I managed to work around it by doing...
Accessing the data attribute first forces the computation, and the save then works.
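In sketch form (simplified; the computation producing `rx1d` is omitted):

```python
import iris

# `rx1d` is assumed to be a cube whose data is still lazy (dask-backed).
assert rx1d.has_lazy_data()

rx1d.data                   # touching .data realises the data, in parallel, via dask
iris.save(rx1d, "rx1d.nc")  # the save now only writes already-computed values
```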
Glad it worked! The save thing is a longstanding known issue, unfortunately; touching the data first, i.e. `rx1d.data` followed by `iris.save(rx1d, "rx1d.nc")`, is the usual workaround. Oh, and I would be very curious if this makes any difference in terms of memory and runtime compared to your bags approach.
Thanks, the "touch" approach worked. Weirdly I tried to run a test dask bag version shortly afterward, and ended up with a bunch of runtime errors from the multiprocessing library. I wonder if the cluster started by my previous script was not shutdown properly.. I suspect you are right though and the dask bag approach would likely take longer to run and use more memory.. Unfortunately I am constrained to using it for the time being to handle calculation of for example, cumulative wet days etc. since I have defined those using custom aggregators built with numpy functionality. Looking at the dask documentation though, I think I may be able to rewrite these to use dask arrays rather than numpy, in which case I hope the distributed approach would work. A lot of these habits I have I think stem from having to work around iris features that didn't support lazy evaluation in the past. The newer versions definitely seem to have made big improvements in this area though. |
@thomascrocker, sounds good! In fact, moving from numpy to dask often involves only minimal overhead, frequently even none at all, thanks to numpy's dispatching mechanisms. A notable exception is when dask consciously refuses to implement a numpy method, in which case some more thought is necessary. Perhaps the biggest example of that is sorting, where a true sort is not well suited to the data-local parallelization done in dask, and a better-suited alternative is offered instead. Anyway, do you think some of the lessons you learned are missing from the user guide right now? Would it be worth adding a section or two?
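To illustrate the dispatching: many numpy functions called on a dask array simply hand the work over to dask and stay lazy, e.g.:

```python
import numpy as np
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# np.mean dispatches to dask's implementation via __array_function__, so
# nothing is computed here and the result is itself a lazy dask array.
m = np.mean(x, axis=0)
print(type(m))            # <class 'dask.array.core.Array'>
print(m.compute().shape)  # (10000,)
```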
Possibly. Maybe the sections on lazy data and lazy evaluation could be expanded a little to make clear that these will take advantage of parallel processing, provided this has been initialised earlier in the script in some way. I'd be happy to help out further, but it will have to be in the new year, as I'm currently very busy writing a report that is due at the end of this year.
Hi @zklaus,
Following up on the discussion in #3218 (comment):
As I mentioned, in the past I've always found that even if an iris operator supports lazy evaluation, executing the code with multiple cores doesn't invoke the calculations in parallel. As a result, I've got into the habit of setting up my parallel calculations manually. It could well be, though, that newer versions of iris are capable of doing this; if so, I'd appreciate some guidance on how to enable it.
As an example, the analysis I'm working on at the moment involves calculating climate indices on a cube with 656 x 310 grid boxes and 30 years of daily data. The netCDF file itself is almost 9 GB.
For calculating the yearly indices I use the following dask bag based approach:
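(In outline only: a simplified sketch with placeholder names rather than the full script.)

```python
import dask.bag as db
import iris
import iris.analysis
from iris.coord_categorisation import add_year

cube = iris.load_cube("pr_daily_1981-2010.nc")  # hypothetical filename
add_year(cube, "time", name="year")

# Manually split the data into one cube per year ...
yearly_cubes = [cube.extract(iris.Constraint(year=y)) for y in range(1981, 2011)]

def annual_index(yearly_cube):
    # Placeholder index: annual maximum of the daily values.
    return yearly_cube.collapsed("time", iris.analysis.MAX)

# ... and let a dask bag map the calculation over the pieces, one process per CPU.
results = db.from_sequence(yearly_cubes).map(annual_index).compute(num_workers=30)
```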
I'm running on the SPICE system at the Met Office, so I submit that as a slurm job with 30 CPUs and get the result back.
What I'm struggling with at the moment, though, is slightly more complex: I need to calculate the baseline percentiles for some indices. This involves calculating a percentile value for each day of year, using a 5-day window centred on that day; see, for example, number 10 at http://etccdi.pacificclimate.org/list_27_indices.shtml
So to do that I have added a day-of-year coordinate, then I loop over the days, creating a constraint of +/- 2 days around that day of year. This gives 150 time points (5 days per year across the 30 years) from which the percentiles are calculated at each grid point (using `collapsed` and the PERCENTILE aggregator).
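In outline, the loop looks something like this (simplified; the percentile value and the handling of the year-end wrap-around are placeholders):

```python
import iris
import iris.analysis
from iris.coord_categorisation import add_day_of_year

# `cube` is assumed to hold the 30 years of daily data.
add_day_of_year(cube, "time", name="day_of_year")

baseline_percentiles = []
for doy in range(1, 366):
    window = set(range(doy - 2, doy + 3))  # 5-day window centred on this day
    constraint = iris.Constraint(day_of_year=lambda cell, w=window: cell.point in w)
    subset = cube.extract(constraint)      # ~150 time points: 5 days x 30 years
    baseline_percentiles.append(
        subset.collapsed("time", iris.analysis.PERCENTILE, percent=90)
    )
```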
If I submit that as is as a slurm job with multiple CPUs, I don't see any speed-up vs submitting with only 1, i.e. the extra cores are not being used to do the calculations in parallel.
In order to get speed-ups, I need to again follow the dask bag approach and create a dask bag for each day of year, which I then map functions over and call compute on to calculate in parallel.