Find `pdist` with known shape #71

jakirkham · 2017-10-06T21:58:25Z

Instead of using `triu` and masking out the upper triangle from the flattened array, simply slice out the portions of each row that fall in the upper triangle and use `concatenate` to join them together into the result.

This confers a number of benefits. First, the shape of the result is known rather than unknown. Second, no zero arrays are generated in the process. Third, the compatibility functions for reshaping and generating indices can be dropped, reducing the maintenance overhead.

As far as optimizations go, this should be similar to `triu`, as it drops the same chunks that `triu` would have. For chunks that are dropped completely, this strategy will be more performant than `triu`, as it will not generate zeros in their place. For chunks along the diagonal, performance should be similar, as there will still be some duplication of effort for these computations. However, slicing out the selections of interest should be a little more performant than calling NumPy's `triu`, since it uses less memory and potentially avoids a copy.
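To make the difference concrete, here is a minimal sketch comparing the two strategies on a small example. It is not the library's code: the names `square`, `mask`, `masked`, and `sliced` are illustrative, the distance matrix is built with plain NumPy as a stand-in for whatever `pdist` computes internally, and the chunk size is arbitrary.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
points = rng.random((6, 3))
n = len(points)

# Dense Euclidean distance matrix wrapped as a chunked Dask Array.
# (Assumption: a stand-in for the square matrix built inside pdist.)
square_np = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
square = da.from_array(square_np, chunks=(3, 3))

# Masking strategy: index the flattened array with a boolean mask.
# Dask cannot know how many elements survive until the mask is computed.
mask = da.from_array(np.triu(np.ones((n, n), dtype=bool), k=1), chunks=(3, 3))
masked = square.ravel()[mask.ravel()]

# Slicing strategy: take the part of each row right of the diagonal
# and join the pieces. Every piece has a known length.
sliced = da.concatenate([square[i, i + 1:] for i in range(n - 1)])

print("masked shape:", masked.shape)
print("sliced shape:", sliced.shape)
```

Both arrays hold the same values in the same order once computed; only the sliced version reports its length up front, which downstream code can rely on without triggering a computation.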

jakirkham · 2017-10-06T23:18:56Z

dask_distance/__init__.py

-    result = _compat._ravel(result)[_compat._ravel(mask)]
+    result = dask.array.concatenate([
+        result[i, i + 1:] for i in range(0, len(result) - 1)
+    ])


Am a little concerned about how this performs for large numbers of points, since it makes the graph balloon with `getitem` entries. It would be good to cut this down somehow, but it is not obvious to me how without reusing the old masking strategy.

That said, it seems to do OK given reasonable chunk sizes when playing around with it locally. So perhaps this is not worth worrying about until use cases that have issues present themselves.
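One rough way to gauge the concern is to count graph entries and output chunks as the number of points grows. The sketch below assumes a zero-filled stand-in for the square distance matrix; `condensed_by_slicing` and `graph_size` are hypothetical helpers written for this illustration, not functions from the library.

```python
import dask.array as da


def condensed_by_slicing(square):
    """Join the above-diagonal part of each row (the strategy in this PR)."""
    n = square.shape[0]
    return da.concatenate([square[i, i + 1:] for i in range(n - 1)])


def graph_size(n, chunk):
    """Return (number of graph entries, number of output chunks)."""
    # Zero-filled stand-in for an n x n distance matrix.
    square = da.zeros((n, n), chunks=(chunk, chunk))
    condensed = condensed_by_slicing(square)
    return len(condensed.__dask_graph__()), condensed.numblocks[0]


for n in (10, 40, 160):
    tasks, blocks = graph_size(n, chunk=20)
    print(f"n={n}: {tasks} graph entries, {blocks} output chunks")
```

Each row contributes at least one slice and one output chunk, so both counts grow at least linearly with the number of points, whatever the chunk size of the input.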

jakirkham · 2017-10-08T00:17:57Z

dask_distance/__init__.py

-
-    result = _compat._ravel(result)[_compat._ravel(mask)]
+    result = dask.array.concatenate([
+        result[i, i + 1:] for i in range(0, len(result) - 1)


Missed using `irange` here. Fixing in PR #80.

jakirkham · 2017-10-09T00:57:54Z

Should have dropped this check too. Dropping in PR #85.

jakirkham force-pushed the pdist_knwn_shape branch 2 times, most recently from c5aa4ac to c8e510f Compare October 7, 2017 00:22

jakirkham added 4 commits October 6, 2017 20:30

Drop internal implementation of ravel

74a8e0b

As we are no longer using `_ravel` in `pdist`, we no longer need to keep our own internal implementation of `_ravel`. So drop the function itself and associated tests. Should lighten our maintenance burden.

Drop internal implementation of indices

3bb9063

As we are no longer using `_indices` in `pdist`, we no longer need to keep our own internal implementation of `_indices`. So drop the function itself and associated tests. Should lighten our maintenance burden.

Test that pdist has a known, expected shape

630d1c1

Make sure that before computing `pdist` (while we simply have a Dask Array) the shape is known and matches the expected shape of an equivalent computation from NumPy.
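A check along these lines might look like the sketch below. It is only an illustration: `lazy_condensed` is a hypothetical stand-in for the function under test, the reference result is built with NumPy's `triu_indices`, and the real test in the repository may be structured differently.

```python
import math

import numpy as np
import dask.array as da


def lazy_condensed(square):
    # Hypothetical stand-in for the function under test.
    n = square.shape[0]
    return da.concatenate([square[i, i + 1:] for i in range(n - 1)])


def test_condensed_shape_known():
    square_np = np.arange(25, dtype=float).reshape(5, 5)
    square = da.from_array(square_np, chunks=(2, 2))

    # Reference result from an equivalent NumPy computation.
    np_result = square_np[np.triu_indices(5, k=1)]
    d_result = lazy_condensed(square)

    # Checked before computing anything: no NaN (unknown) dimensions,
    # and the shape matches the NumPy result.
    assert not any(math.isnan(s) for s in d_result.shape)
    assert d_result.shape == np_result.shape


test_condensed_shape_known()
```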

jakirkham force-pushed the pdist_knwn_shape branch from c8e510f to 630d1c1 Compare October 7, 2017 00:32

jakirkham merged commit 2a1cd2e into master Oct 7, 2017

jakirkham deleted the pdist_knwn_shape branch October 7, 2017 00:41

jakirkham mentioned this pull request Oct 9, 2017

Drop unneeded Dask version check from tests #85

Merged

jakirkham changed the title ~~Find pdist with known shape~~ Find pdist with known shape Oct 9, 2017
