This is a meta-issue to think about/discuss dask support, because there are a number of different aspects to consider.
Internal dask usage
Since #314 was merged, we now use dask internally to parallelize the reprojection over the output array in the interpolation reprojection. We still need to:
Implement this for the adaptive reprojection
Implement this for the 'exact' reprojection, removing the previous custom parallelization code
Note that this only parallelizes over the output array, and it still requires a Numpy array as the input.
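To make this concrete, here is a rough sketch of how parallelizing over output chunks with dask can work. The map_blocks-based structure, the crpix-shifting trick (which assumes a FITS-style 2D WCS), and the helper names are illustrative assumptions on my part rather than the actual internals from #314:

```python
import dask.array as da
from reproject import reproject_interp


def _reproject_block(output_block, array_in=None, wcs_in=None, wcs_out=None,
                     block_info=None):
    # Work out which region of the output array this block covers.
    (jmin, _), (imin, _) = block_info[None]['array-location']
    # Build a WCS for just this output region by shifting the reference pixel
    # (crpix[0] is the x/column axis, crpix[1] is the y/row axis).
    wcs_block = wcs_out.deepcopy()
    wcs_block.wcs.crpix[0] -= imin
    wcs_block.wcs.crpix[1] -= jmin
    result, _ = reproject_interp((array_in, wcs_in), wcs_block,
                                 shape_out=output_block.shape)
    return result


def blocked_reproject(array_in, wcs_in, wcs_out, shape_out,
                      block_size=(1024, 1024)):
    # An empty dask array with the output shape drives the chunking; each
    # output chunk is then reprojected independently (and hence in parallel).
    output = da.empty(shape_out, chunks=block_size)
    return da.map_blocks(_reproject_block, output, dtype=float,
                         array_in=array_in, wcs_in=wcs_in, wcs_out=wcs_out)
```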
Output dask arrays
Users might want to get dask arrays out of the reproject functions and lazily do the reprojection as needed. With the above implemented, this shouldn't be too difficult, because internally the blocked reprojection already produces dask arrays that we only convert to Numpy arrays at the last minute. We should be able to add a keyword argument to control whether to just return a dask array instead. We could have return_type='numpy'/return_type='dask' for flexibility if we want to add anything else in the future.
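For illustration, this is roughly how it could look from the user's point of view; return_type='dask' is the keyword proposed above and does not exist yet, and the WCSes here are just placeholders to keep the snippet self-contained:

```python
import numpy as np
from astropy.wcs import WCS
from reproject import reproject_interp

# Minimal input/output WCSes purely to make the example runnable.
wcs_in = WCS(naxis=2)
wcs_in.wcs.ctype = ['RA---TAN', 'DEC--TAN']
wcs_in.wcs.crval = [10., 20.]
wcs_in.wcs.cdelt = [-0.001, 0.001]
wcs_in.wcs.crpix = [128., 128.]

wcs_out = wcs_in.deepcopy()
wcs_out.wcs.crval = [10.01, 20.01]

data = np.random.random((256, 256))

# With the proposed return_type='dask', the result would be a lazy dask array
# that is only evaluated when actually needed.
array_out, footprint = reproject_interp((data, wcs_in), wcs_out,
                                        shape_out=(256, 256),
                                        return_type='dask')
array_out = array_out.compute()  # materialize only when required
```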
Input dask arrays
This is the trickier one and there are several use cases to consider:
Input where all dimensions are to be reprojected
In the general case, we might have an N-dimensional input that we want to reproject to a new N-dimensional WCS, and all pixel/world axes might be correlated. It is then non-trivial to define a strategy that operates on the input in chunks, because any output pixel might depend on any combination of input pixels. The simplest approach is to dump the dask array to a temporary file, read it back in as a Numpy memmap, use that as a regular input to the internal reproject functions, and then make use of the parallelization over output chunks to improve performance. The advantage of this is that e.g. scipy's map_coordinates does the right thing with memmap arrays and won't load the whole array into memory, so it's possible to keep memory usage very low. The main cost is disk space, because the whole array has to be dumped to disk. There is precedent for this approach, for instance in spectral-cube.
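A minimal sketch of that spill-to-memmap step, with a hypothetical helper name (nothing like this exists in reproject yet):

```python
import tempfile

import dask.array as da
import numpy as np


def dask_to_memmap(array, directory=None):
    """Spill a dask array to a temporary .npy file and reopen it as a memmap."""
    tmp = tempfile.NamedTemporaryFile(suffix='.npy', dir=directory, delete=False)
    memmapped = np.lib.format.open_memmap(tmp.name, mode='w+',
                                          dtype=array.dtype, shape=array.shape)
    # Stream the dask array into the file chunk by chunk, never holding the
    # whole array in memory at once.
    da.store(array, memmapped, lock=True)
    return memmapped
```

The returned memmap can then be fed to the existing reproject functions like any other Numpy array, and only the parts of the file that map_coordinates actually touches get read back into memory.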
Note that if dask/dask-image#237 gets merged and released, we would have the option, for the interpolation reprojection only, of avoiding writing the dask array out and doing the whole reprojection 'natively' with dask, but that is a special case that won't apply to other reprojection algorithms.
Input with broadcasted/extra dimensions
In #332, @svank added the ability to pass, for example, a 3D data cube but only reproject two of the dimensions to a new celestial WCS, keeping the additional dimension as a dummy dimension. The approach in that PR was to reproject all pixels along the dummy dimensions at the same time, because they require the same coordinate transformations. However, if the input is huge, then since we can't chunk over the dimensions that need to be reprojected, the only sensible option is to chunk over the extra dimensions. For example, if we had a (10000, 10000, 10000) spectral cube and wanted to reproject just the celestial dimensions, we could chunk over the spectral dimension and operate on chunks of, say, (10, 10000, 10000) of the input array. Internally, the first dimension would still be handled 'efficiently' as an extra/broadcasted dimension within each chunk.
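A rough sketch of what chunking over the extra dimension could look like, assuming the broadcasting support from #332 and using a made-up function name and chunk size:

```python
import numpy as np
from reproject import reproject_interp


def reproject_celestial_in_chunks(cube, wcs_in, wcs_out, shape_out, chunk=10):
    # shape_out is the 2D celestial output shape; the spectral axis (axis 0)
    # is left untouched and simply carried along.
    output = np.empty((cube.shape[0],) + shape_out, dtype=float)
    for start in range(0, cube.shape[0], chunk):
        stop = min(start + chunk, cube.shape[0])
        # Each (chunk, ny, nx) slab is reprojected in one go, with the leading
        # axis treated as an extra/broadcast dimension.
        output[start:stop], _ = reproject_interp(
            (cube[start:stop], wcs_in.celestial), wcs_out.celestial,
            shape_out=shape_out)
    return output
```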
One open question is whether, when doing this, we also want to offer the option of chunking over the output array, as is already possible. I think the answer is yes: we might want to essentially chain the two, i.e. chunk first over the input array, then for each chunk proceed as in the general case by writing that chunk out to a Numpy memmap and carrying on with the rest of the reprojection, optionally chunking over the output. This approach is still memory efficient but is also more disk efficient, because we would only ever write chunks of the input array to disk rather than the whole thing.
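Chaining the two levels of chunking could then look something like the sketch below, reusing the dask_to_memmap helper sketched earlier. The parallel= and block_size= keywords are the ones introduced by #314, and combining them with broadcast dimensions in a single call is precisely the part that would still need to be implemented, so this is aspirational rather than something that works today:

```python
import numpy as np
from reproject import reproject_interp


def reproject_chained(dask_cube, wcs_in, wcs_out, shape_out, chunk=10):
    output = np.empty((dask_cube.shape[0],) + shape_out, dtype=float)
    for start in range(0, dask_cube.shape[0], chunk):
        stop = min(start + chunk, dask_cube.shape[0])
        # Spill only this slab to disk rather than the whole cube...
        slab = dask_to_memmap(dask_cube[start:stop])
        # ...then reproject it, optionally also chunking over the output.
        output[start:stop], _ = reproject_interp(
            (slab, wcs_in.celestial), wcs_out.celestial, shape_out=shape_out,
            parallel=True, block_size=(2048, 2048))
    return output
```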
Input where all dimensions have to be reprojected but not all axes are correlated
A common use case for N-dimensional data might be that one does want to reproject a spectral cube along both the spatial and spectral dimensions, but the spatial and spectral dimensions are separable in both the input and output WCS. In this case, we could internally separate the problem into two successive reprojections and then follow the approach for dealing with broadcasted/extra dimensions. This doesn't require any dask per se; it's more that we could write a helper function that separates the calculation into multiple steps.
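As a sketch of what such a helper could do, assuming a FITS-style cube WCS where .celestial and .sub(['spectral']) give the two independent parts, and with a hypothetical function name:

```python
import numpy as np
from scipy.interpolate import interp1d
from reproject import reproject_interp


def reproject_separable(cube, wcs_in, wcs_out, shape_out):
    # Step 1: reproject the celestial axes, broadcasting over the spectral
    # axis (axis 0) as in #332.
    celestial_done, _ = reproject_interp(
        (cube, wcs_in.celestial), wcs_out.celestial, shape_out=shape_out[1:])

    # Step 2: interpolate along the spectral axis from the input spectral
    # coordinates onto the output spectral grid.
    spec_in = wcs_in.sub(['spectral']).pixel_to_world_values(
        np.arange(cube.shape[0]))
    spec_out = wcs_out.sub(['spectral']).pixel_to_world_values(
        np.arange(shape_out[0]))
    interpolator = interp1d(spec_in, celestial_done, axis=0,
                            bounds_error=False, fill_value=np.nan)
    return interpolator(spec_out)
```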
cc @svank @Cadair in case you have any thoughts/comments!