Find `pdist` with known shape #71
Conversation
```diff
-result = _compat._ravel(result)[_compat._ravel(mask)]
+result = dask.array.concatenate([
+    result[i, i + 1:] for i in range(0, len(result) - 1)
+])
```
Am a little concerned about this performance-wise for large numbers of points. Reason being this makes the graph balloon with `getitem` entries. Would be good if we could cut this down somehow, but it is not obvious to me how without reusing the old masking strategy.
That said, it seems to do ok given reasonable chunk sizes when playing around with it locally. So perhaps this is not worth worrying about until use cases that have issues present themselves.
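A rough sketch of the concern raised above (the sizes and the `task_count` helper are illustrative, not from the PR): each row slice contributes its own `getitem` tasks, so the task graph grows with the number of points.

```python
import dask.array as da

def task_count(n, chunk=10):
    # Build a placeholder n x n "distance matrix" and apply the same
    # slice-and-concatenate pattern as the PR to get the condensed form.
    full = da.zeros((n, n), chunks=(chunk, chunk))
    condensed = da.concatenate([full[i, i + 1:] for i in range(n - 1)])
    # Number of tasks in the resulting graph.
    return len(condensed.__dask_graph__())

# One slice (plus its getitem tasks) per row: the graph balloons with n.
assert task_count(200) > task_count(20)
```

With reasonable chunk sizes the growth stays manageable, which matches the local observation above.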
Force-pushed from c5aa4ac to c8e510f
Instead of using `triu` and masking out the upper triangle from the flattened array, simply slice out portions corresponding to the upper triangle and use `concatenate` to join them all together into the result. This confers a number of benefits. First, the shape of the result is known instead of being unknown. Second, no zero arrays are generated in the process. Third, compatibility functions related to reshaping and generating indices can be dropped, reducing the maintenance overhead.

As far as optimizations go, this should be similar to `triu`, as it results in dropping out the same chunks that `triu` would have. For the chunks that are dropped out completely, this strategy will be more performant than `triu`, as it will not generate zeros in their place. For the chunks along the diagonal, this should have similar performance, as there will be some duplication of effort for these computations. However, slicing out the selections of interest should be a little more performant than calling NumPy's `triu` due to using less memory and potentially avoiding a copy.
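A minimal sketch of the slicing strategy described above, assuming a full square distance matrix as the starting point (the names `pts` and `full` and the small sizes are illustrative, not from the PR):

```python
import numpy as np
import dask.array as da

n = 5
rng = np.random.default_rng(0)
pts = rng.random((n, 3))

# Full n x n pairwise distance matrix, wrapped as a chunked Dask Array.
full_np = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
full = da.from_array(full_np, chunks=(2, 2))

# Slice each row's strict upper triangle and concatenate the pieces
# into the condensed 1-D result, instead of triu + mask + ravel.
condensed = da.concatenate([full[i, i + 1:] for i in range(n - 1)])

# The shape is known up front: n * (n - 1) / 2 entries, no NaN dims.
assert condensed.shape == (n * (n - 1) // 2,)
```

Row `i` contributes `n - i - 1` entries, so the pieces sum to the condensed length, and chunks entirely below the diagonal never enter the graph at all.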
As we are no longer using `_ravel` in `pdist`, we no longer need to keep our own internal implementation of `_ravel`. So drop the function itself and associated tests. Should lighten our maintenance burden.
As we are no longer using `_indices` in `pdist`, we no longer need to keep our own internal implementation of `_indices`. So drop the function itself and associated tests. Should lighten our maintenance burden.
Make sure that before computing `pdist` (while we simply have a Dask Array) the shape is known and matches the expected shape of an equivalent computation from NumPy.
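A hypothetical check mirroring the description above (array contents are placeholders): while the result is still a lazy Dask Array, its shape should already be known and equal the condensed length `n * (n - 1) / 2` that an equivalent NumPy computation would produce.

```python
import dask.array as da

n = 7
full = da.zeros((n, n), chunks=(3, 3))
condensed = da.concatenate([full[i, i + 1:] for i in range(n - 1)])

# Shape is known before any computation happens and matches the
# expected condensed pdist length.
assert condensed.shape == (n * (n - 1) // 2,)
assert condensed.compute().shape == condensed.shape
```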
Force-pushed from c8e510f to 630d1c1
```diff
-result = _compat._ravel(result)[_compat._ravel(mask)]
+result = dask.array.concatenate([
+    result[i, i + 1:] for i in range(0, len(result) - 1)
+])
```
Missed using `irange` here. Fixing in PR #80.