Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find pdist with known shape #71

Merged
merged 4 commits into from
Oct 7, 2017
Merged

Find pdist with known shape #71

merged 4 commits into from
Oct 7, 2017

Conversation

jakirkham
Copy link
Owner

Instead of using triu and masking out the upper triangle from the flattened array, simply slice out portions corresponding to the upper triangle and use concatenate to join them all together into the result.

This confers a number of benefits. First that the shape of the result is known instead of being unknown. Second no zero arrays are generated in the process. Third compatibility functions related to reshaping and generating indices can be dropped reducing the maintenance overhead.

As far as optimizations go, this should be similar to triu as it results in dropping out the same chunks that triu would have. For the chunks that are dropped out completely, this strategy will be more performant than triu as it will not generate zeros in there place. For the chunks along the diagonal, this should have similar performance as there will be some duplication of effort for these computations. However slicing out the selections of interest should be a little more performant than calling NumPy's triu due to using less memory and potentially avoiding a copy.

result = _compat._ravel(result)[_compat._ravel(mask)]
result = dask.array.concatenate([
result[i, i + 1:] for i in range(0, len(result) - 1)
])
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am a little concerned about this performance-wise for large numbers of points. Reason being this makes the graph balloon with getitem entries. Would be good if we could cut this down somehow, but it is not obvious to me how without reusing the old masking strategy.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, it seems to do ok given reasonable chunk sizes when playing around with it locally. So perhaps this is not worth worrying about until use cases that have issues present themselves.

@jakirkham jakirkham force-pushed the pdist_knwn_shape branch 2 times, most recently from c5aa4ac to c8e510f Compare October 7, 2017 00:22
Instead of using `triu` and masking out the upper triangle from the
flattened array, simply slice out portions corresponding to the upper
triangle and use `concatenate` to join them all together into the
result.

This confers a number of benefits. First that the shape of the
result is known instead of being unknown. Second no zero arrays are
generated in the process. Third compatibility functions related to
reshaping and generating indices can be dropped reducing the maintenance
overhead.

As far as optimizations go, this should be similar to `triu` as it
results in dropping out the same chunks that `triu` would have. For the
chunks that are dropped out completely, this strategy will be more
performant than `triu` as it will not generate zeros in there place. For
the chunks along the diagonal, this should have similar performance as
there will be some duplication of effort for these computations. However
slicing out the selections of interest should be a little more
performant than calling NumPy's `triu` due to using less memory and
potentially avoiding a copy.
As we are no longer using `_ravel` in `pdist`, we no longer need to keep
our own internal implementation of `_ravel`. So drop the function itself
and associated tests. Should lighten our maintenance burden.
As we are no longer using `_indices` in `pdist`, we no longer need to
keep our own internal implementation of `_indices`. So drop the function
itself and associated tests. Should lighten our maintenance burden.
Make sure that before computing `pdist` (while we simply have a Dask
Array) the shape is known and matches the expected shape of an
equivalent computation from NumPy.
@jakirkham jakirkham merged commit 2a1cd2e into master Oct 7, 2017
@jakirkham jakirkham deleted the pdist_knwn_shape branch October 7, 2017 00:41

result = _compat._ravel(result)[_compat._ravel(mask)]
result = dask.array.concatenate([
result[i, i + 1:] for i in range(0, len(result) - 1)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed using irange here. Fixing in PR ( #80 ).

@jakirkham
Copy link
Owner Author

Should have dropped this check too. Dropping in PR ( #85 ).

@jakirkham jakirkham changed the title Find pdist with known shape Find pdist with known shape Oct 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant