Dask support improvements #367
Conversation
…ray, and set a maximum block size for chunks if not specified
```python
# Sometimes compute() has to be called twice to return a Numpy array,
# so we need to check here if this is the case and call the first compute()
if isinstance(dask_array.ravel()[0].compute(), da.Array):
    dask_array = dask_array.compute()
```
This is to avoid issues seen with @Cadair's data - interestingly a similar issue was seen with a different dataset in glue (glue-viz/glue#2399), so this may be a common issue, which is why it might be worth having the workaround here.
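As an aside, a hypothetical standalone helper capturing the same workaround might look like this (this is just an illustrative sketch, not code from the PR; `to_numpy` and `max_passes` are made-up names):

```python
import dask.array as da
import numpy as np

def to_numpy(dask_array, max_passes=2):
    # Keep calling compute() until we get a concrete array back; with some
    # datasets a single compute() has been seen to return another dask array.
    result = dask_array
    for _ in range(max_passes):
        if isinstance(result, da.Array):
            result = result.compute()
    return np.asarray(result)
```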
```python
if block_size:
    output_array_dask = da.empty(shape_out, chunks=block_size)
else:
    output_array_dask = da.empty(shape_out).rechunk(block_size_limit=8 * 1024**2)
```
If we don't do this, then for a 4k by 4k image the default chunk is the whole image so there is no gain from parallelizing. For small images that is true anyway (no gain from parallelization) but I do think the default chunk size is too big.
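A minimal sketch of the effect (illustrative only): a 4k by 4k float64 array is about 128 MB, which dask will typically keep as a single chunk unless the chunk byte size is capped:

```python
import dask.array as da

# Default auto-chunking often yields one 4096x4096 chunk for this size of array.
whole_image = da.empty((4096, 4096))
print(whole_image.chunks)

# Capping the chunk byte size forces a finer decomposition that can be parallelized.
rechunked = da.empty((4096, 4096)).rechunk(block_size_limit=8 * 1024**2)
print(rechunked.chunks)
```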
Codecov Report
```
@@           Coverage Diff            @@
##            main     #367     +/-  ##
========================================
- Coverage   92.50%   92.29%   -0.21%
========================================
  Files          24       24
  Lines         840      844       +4
========================================
+ Hits          777      779       +2
- Misses         63       65       +2
```
@Cadair - recording this here because I'm not sure where else to put it, but if I run:

```python
import numpy as np
import dkist
from astropy.wcs.utils import pixel_to_pixel

ds = dkist.Dataset.from_asdf("BLKGA_test.asdf")
N = 10000000
pixel_to_pixel(ds.flat[0][0].wcs, ds.flat[1][0].wcs, np.ones(N), np.ones(N))
```

I get CPU usage showing multi-threaded spikes. I don't think this is a problem per se, although it would be nice to know where it is coming from. But if some of the transformation is indeed multi-threaded, it does explain why I was having a hard time gaining performance with multi-processing, since in your case the coordinate transforms are the bulk of the time.
Simpler example:

```python
import numpy as np
import dkist
import dask

dask.config.set(scheduler='single-threaded')
ds = dkist.Dataset.from_asdf("BLKGA_test.asdf")
N = 100000000
ds.flat[0][0].wcs.low_level_wcs.pixel_to_world_values(np.ones(N), np.ones(N))
```

Interestingly, there are two spikes in CPU, one near the start of the transform and one near the end (ignore the ones close to t=0), regardless of how many coordinates there are to transform.
I'm going to merge this as-is, as I am now going to do a reasonably big refactor of the dask stuff and want to keep it separate from this simple diff.
Just to close the loop on the question of what in the solar examples is multi-threaded: it looks like this is because GWCS uses astropy modeling, and in particular AffineTransformation2D, which in turn uses a matrix operation that is multi-threaded; limiting that threading has to be done before Numpy and Scipy are imported.
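A sketch of how that kind of threading is commonly capped, assuming it comes from the BLAS layer behind Numpy (the use of OMP_NUM_THREADS here is an assumption, not something confirmed in this thread):

```python
# Assumption: the extra threads come from the BLAS library used by Numpy,
# so capping OMP_NUM_THREADS before the first numpy/scipy import keeps the
# matrix multiplication single-threaded.
import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before numpy is imported

import numpy as np  # imported only after the environment variable is set
```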
This is a work in progress as I try to improve the performance of the parallelization.
Issues I'm still seeing:
Performance notes:
I have tried reprojecting a (32000, 8600) array to a similar WCS and when using parallel mode, it is quite a bit faster (56 seconds with 16 processes, instead of 4m26s in synchronous mode). When using (2048, 2048) chunks, using 4 processes (... s) is almost as fast as 16 processes (1m26s instead of 56s). Using (4096, 4096) chunks (to try and reduce overhead) actually makes things slower for 16 processes, presumably because it's not an efficient division of the data. Using (1024, 1024) chunks makes things a little faster (52s) with 16 processes. Using (512, 512) chunks makes things a little faster still (49s), and going all the way down to (256, 256) is surprisingly still ok (49s).
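For reference, a sketch of the kind of call being timed above, assuming the `parallel` and `block_size` keywords this PR works on; `data`, `wcs_in`, and `wcs_out` are placeholder names, not taken from the actual benchmark script:

```python
from reproject import reproject_interp

# Placeholder names: data/wcs_in describe the large input image, wcs_out the target WCS.
array_out, footprint = reproject_interp(
    (data, wcs_in),            # input array and its WCS
    wcs_out,                   # output WCS, similar to the input in this test
    shape_out=(32000, 8600),   # same shape as the large test array above
    block_size=(1024, 1024),   # chunk size that performed well in the timings
    parallel=16,               # number of processes to use
)
```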
Using dask-distributed works well, but one needs (for now) to change how the dask graph is computed, pointing it at a dask.distributed cluster instead of the default processes scheduler.
With these changes, using 16 single-threaded local workers and the above large array example, with (1024, 1024) chunks, the runtime is 1 minute (compared to 52s with the regular processes scheduler). With (2048, 2048) chunks, the runtime is 1m6s, so similar. It would be interesting to try out an example with multiple machines to see how efficient the network communication is.
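A sketch of what that setup might look like, assuming a local dask.distributed cluster (none of this is taken from the actual diff):

```python
# Assumption: a local dask.distributed cluster with single-threaded workers,
# mirroring the "16 single-threaded local workers" configuration above.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=16, threads_per_worker=1)
client = Client(cluster)  # once created, the client becomes the default scheduler

# Any subsequent .compute() on the dask graph now runs on the local cluster.
result = output_array_dask.compute()
```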
With the actual reprojection stubbed out inside reproject_single_block, the synchronous mode takes 13s and parallel with 16 processes takes 15s, so that gives us an idea of the overhead of the chunking itself before any reprojection. Taking that into account, the main parallelizable part of the code then takes 30s in parallel compared to 1m07s, so still not massively faster.

TODOs:

- reproject_interp