Add dask array support #44

Open
1 of 3 tasks
thewtex opened this issue May 25, 2018 · 9 comments
Labels
enhancement New feature or request

Comments

thewtex commented May 25, 2018

Stages:

  1. Basic support for dask.array
  2. Stream dask.array
  3. Use dask.delayed or dask.futures to help downsample images.

CC: @mrocklin @dani-lbnl
xref #43 scisprints/2018_05_sklearn_skimage_dask#12

mrocklin commented May 26, 2018 via email

thewtex added the enhancement label Oct 29, 2018

thewtex commented Oct 29, 2018

@mrocklin good suggestion regarding coarsen! 👍

Some experiments are summarized here:

https://gist.github.com/thewtex/079c59ea71a1da59bea5e14cb6b73d1f

For large image support, it looks like we should not convert NumPy arrays, ITK Images, etc. to Dask arrays just for downsampling, because of the extra memory required, the time required for chunking, and the downsampling performance. However, I will look into adding first-class Dask array support: for large images that already are Dask arrays, use them directly with coarsen. That avoids the extra memory and the time spent allocating it.
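
For reference, a minimal sketch of the coarsen-based approach (the shape, chunks, and reduction here are placeholders, not the gist's exact parameters):

```python
import numpy as np
import dask.array as da

# Small stand-in for the real ~6.7 GB volume used in the gist.
volume = np.random.random((160, 200, 200)).astype(np.float32)

# Wrap in a Dask array and block-average each 10x10x10 neighborhood.
x = da.from_array(volume, chunks=(80, 100, 100))
small = da.coarsen(np.mean, x, {0: 10, 1: 10, 2: 10}, trim_excess=True)
print(small.compute().shape)  # (16, 20, 20)
```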

CC: @jakirkham

@mrocklin

In general I agree that the main advantage here is probably for larger-than-memory data. However, there are likely several things you can do to accelerate things on the Dask side here. For example, you might pass name=False to from_array to avoid that cost (it's hashing the data to find a deterministic key that you don't care about).
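
A sketch of that suggestion (array size and chunking here are placeholders):

```python
import numpy as np
import dask.array as da

arr = np.random.random((256, 256, 160))  # stand-in for the real multi-GB volume

# Default: from_array hashes the full array to build a deterministic task name,
# which is costly for large inputs.
x_hashed = da.from_array(arr, chunks=(128, 128, 160))

# name=False skips the hashing and uses a random token instead.
x_fast = da.from_array(arr, chunks=(128, 128, 160), name=False)
```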

@hmaarrfk might have some other suggestions.

I'm not sure if I'm reading the notebook correctly, but it seems like you have three approaches that perform as follows. Is this right?

  • NumPy: 40 s
  • Dask array: 3 s
  • Some ITK specialized code: 1.5 s

If this is correct then I'm curious, what does the ITK specialized code do differently than Numpy?

hmaarrfk commented Oct 29, 2018

It seems to me you have a 6.7 GB dataset you are working on. I feel your pain. ITK is going at about 4 GB/s. You are shrinking by 10, 10, 10, is that correct?

I've found numpy's code to be pretty slow and inefficient for large images. Slowdowns typically come from a few places:

  • Generous calls to np.ascontiguousarray, which do wonders for small data (copying it is almost free) but are very expensive for large nd-arrays.
  • np.pad and np.block were optimized for small arrays, where repeated concatenation is actually much faster than building up the metadata required for single-copy padding/blocking. Please chime in on MAINT: Rewrite numpy.pad without concatenate numpy/numpy#11358 to express support, as some believe this is a "micro optimization". As a note, an np.block algorithm for large arrays just got in, so its performance should be much better in 1.16 for arrays larger than 512x512.

The reason this is important is that scikit-image calls np.pad to pad the array before downscale_local_mean. @lagru seems to have written _fast_pad for scikit-image, which implements the algorithm above, but only for pad widths of 1. I would support any PR that extends that functionality to width > 1, but I would much prefer that numpy actually update their algorithm with @lagru's work.

Since you are downsampling by 10, and some of the dimensions are 2048, it will pad to add those extra 2 points and thus copy the array twice, once for each dimension it needs to pad.
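
For concreteness, a quick sketch of the padding arithmetic (the 2048 x 2048 x 1600 shape comes from the summary below):

```python
import math

shape = (2048, 2048, 1600)
factors = (10, 10, 10)

# Each axis is padded up to the next multiple of its factor, so a 2048-long
# axis grows to 2050 (2 extra samples), while 1600 needs no padding.
pad = [math.ceil(s / f) * f - s for s, f in zip(shape, factors)]
print(pad)  # [2, 2, 0] -> two axes padded, hence two full copies of the array
```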

Finally, the downscale_local_mean function uses, in my mind, a very inefficient algorithm.

Eventually it calls this function
https://github.com/scikit-image/scikit-image/blob/972ce191ba6361356de17274bcf7ef2bb4d989e8/skimage/measure/block.py#L77
and for your 3D image it creates a 6D view where the strides are arranged so that dims 0:3 equal your final shape and dims 3:6 are what you want to average over. But it makes 3 calls to np.mean instead of 1, and thus allocates 3 large arrays in the process. I'm not even sure whether np.mean handles non-contiguous arrays without copying. You better hope it doesn't copy; it really might, so you have to check.
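
A sketch of that 6D-view idea, collapsing the three reductions into a single call (illustrative shapes, using skimage.util.view_as_blocks rather than the internal block_reduce machinery):

```python
import numpy as np
from skimage.util import view_as_blocks

arr = np.random.random((160, 200, 200)).astype(np.float32)

# 6D strided view: dims 0:3 are the downsampled shape, dims 3:6 are the
# 10x10x10 blocks to average over. No data is copied here.
blocks = view_as_blocks(arr, (10, 10, 10))   # shape (16, 20, 20, 10, 10, 10)

# A single reduction over the block axes allocates one output array instead of
# three intermediates from chained np.mean calls.
small = blocks.mean(axis=(-3, -2, -1))       # shape (16, 20, 20)
```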

See this PR: scikit-image/scikit-image#3522
It fails the median test, which was almost surely wrong to begin with. Help writing a benchmark would be appreciated!

Dask's coarsen provides a trim_excess parameter that seems like an interesting way to deal with this excess, and it could probably be integrated into scikit-image.

Summarizing the places where I see slowdowns:

  1. You are downsampling 2048, 2048, 1600 by 10, 10, 10, which causes the array to get padded. The input array should be contiguous and its shape should be divisible by the shrink factors (see the trimming sketch after this list).
  2. Try my dev branch of skimage.
  3. If you are on Linux, try the NumPy dev branch to make use of kernel optimizations for large memory allocations.
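
A possible version of the trimming in point 1 (shapes are stand-ins; whether cropping is acceptable for the data is a separate question):

```python
import numpy as np

factors = (10, 10, 10)
arr = np.zeros((205, 204, 160), dtype=np.uint8)  # small stand-in for 2048x2048x1600

# Crop each axis down to the nearest multiple of its shrink factor
# (e.g. 2048 -> 2040 for the real data); slicing returns a view, not a copy.
trimmed = arr[tuple(slice(0, (s // f) * f) for s, f in zip(arr.shape, factors))]
print(trimmed.shape)  # (200, 200, 160)
```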

PS: you can probably speed up your large array creation by following some of the not-so-hacky suggestions here: numpy/numpy#11919, though these shouldn't be as necessary in numpy 1.16.

Dask seems to use a similar strategy to numpy: reshape the array (free), then call mean once over the new shape.
Your dask code could be optimized in a few places:

  1. I got hung up on the name parameter too. See the discussion: from_array performance when no name is provided dask/dask#3946
  2. Use persist. It is unlikely that you care about creating a final contiguous array (unless you do).
  3. Remove the call to astype. This is pretty contentious in scikit-image. I've tried to see if using smaller dtypes would help; it just depends on how the compiler can optimize them. It isn't obvious that Cython is the right tool to ensure fast operations on uint8.

Finally, if you really want to hack, provide a dtype to coarsen and to local_means and feed in float32.
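
Putting points 1 to 3 and the dtype hack together, a rough sketch (the chunking and the float32 accumulator wrapper are placeholder choices here, not the notebook's code):

```python
import numpy as np
import dask.array as da

def mean_f32(a, axis=None):
    # Accumulate the block means in float32 instead of converting the whole
    # volume up front with astype.
    return np.mean(a, axis=axis, dtype=np.float32)

volume = np.zeros((160, 200, 200), dtype=np.uint8)  # stand-in for the real data

# 1. name=False avoids hashing the full array for a deterministic task name.
x = da.from_array(volume, chunks=(80, 100, 100), name=False)

# Block-average 10x10x10 neighborhoods; trim_excess handles non-divisible shapes.
small = da.coarsen(mean_f32, x, {0: 10, 1: 10, 2: 10}, trim_excess=True)

# 2. persist() keeps the downsampled chunks in memory without forcing them into
#    one contiguous array (use compute() only if you really need that).
small = small.persist()
```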

@hmaarrfk

How is dask shrinking 256x256 by 10x10? Is it trimming each chunk, or just the whole array?

thewtex commented Oct 30, 2018

In general I agree that the main advantage here is probably for larger-than-memory data.

Yes, the prospect of eventually handling larger-than-memory data with a distributed system is exciting ❗

However, there are likely several things you can do to accelerate things on the Dask side here. For example, you might pass name=False to from_array to avoid that cost (it's hashing the data to find a deterministic key that you don't care about).

Ah, that is good to know! Yes, adding name=False makes the conversion to a Dask Array much, much faster (10 s to 4.2 ms).

If this is correct then I'm curious, what does the ITK specialized code do differently than Numpy?

Right, this is a compile-time, optimized-to-the-task implementation. More details are here (PDF).

thewtex commented Oct 30, 2018

It seems to me you have a 6.7 GB dataset you are working on. I feel your pain. ITK is going at about 4 GB/s. You are shrinking by 10, 10, 10, is that correct?

Yes, this was a rough preview to understand whether we can get interactive performance when dynamically downsampling for visualization (first implementation in #98).

thewtex commented Oct 30, 2018

The reason this is important is that scikit-image calls np.pad to pad the array before downscale_local_mean. @lagru seems to have written _fast_pad for scikit-image, which implements the algorithm above, but only for pad widths of 1. I would support any PR that extends that functionality to width > 1, but I would much prefer that numpy actually update their algorithm with @lagru's work.

Since you are downsampling by 10, and some of the dimensions are 2048, it will pad to add those extra 2 points and thus copy the array twice, once for each dimension it needs to pad.

Good points, @hmaarrfk.

Dask's coarsen provides a trim_excess parameter that seems like an interesting way to deal with this excess, and it could probably be integrated into scikit-image.

👍

The padding also generates "artifacts" on the border for this use case.

thewtex commented Oct 30, 2018

you are downsampling 2048, 2048, 1600 by 10, 10, 10 which causes the array to get padded. The input array should be contiguous and its shape should be divisible by the shrink factors.

For this use case, we have to accept all data as it comes. All data is welcome :-).

PS: you can probably speed up your large array creation by following some of the not-so-hacky suggestions here: numpy/numpy#11919, though these shouldn't be as necessary in numpy 1.16.

Nicely done!

2. Use `persist`. It is unlikely that you care about creating a final contiguous array (unless you do).

3. Remove the call to `astype`. This is pretty contentious in scikit-image. I've tried to see if using smaller dtypes would help; it just depends on how the compiler can optimize them. It isn't obvious that Cython is the right tool to ensure fast operations on uint8.

Finally, if you really want to hack, provide a dtype to coarsen and to local_means and feed in float32.

Good ideas, but these are unfortunately constraints imposed by the use case.

Thanks for the reviews @hmaarrfk @mrocklin !
