Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Compute] Implement TopK/BottomK #17579

Closed
asfimport opened this issue Sep 19, 2017 · 5 comments
Closed

[C++][Compute] Implement TopK/BottomK #17579

asfimport opened this issue Sep 19, 2017 · 5 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Sep 19, 2017

Heap-based topk can compute these indices in O(n log k) time

Reporter: Wes McKinney / @wesm
Assignee: Alexander Ocsa / @aocsa

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-1565. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Another possibility is to use quickselect.

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
I reframed this issue as a query processing task. We need to be able to compute TopK/BottomK with chunked data

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
cc @bkietz

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
Daniel Lemire compared various approaches (modifications of the heap approach or modifications of quickselect) and found that the heap wins in his tests: https://lemire.me/blog/2017/06/21/top-speed-for-top-k-queries/

If we want this to perfectly emulate Pandas's nlargest/nsmallest then there are some additional complications (inputs must have a defined order to resolve ties, and/or we must be able to track and output duplicate values).

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
Issue resolved by pull request 11019
#11019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant