Added collectNumberOrderedElements #45

Open
wants to merge 1 commit into
base: main
from


@benraha benraha commented Oct 9, 2024

Many of our solutions select only a fixed, usually small, number of rows, chosen by ordering on a column. DataFu has dedupTopN, which uses a window function, and dedupWithCombiner, which is limited to taking one record per grouping. The window function in dedupTopN is inefficient because it orders all of the rows in each group, and it is very susceptible to skew. dedupWithCombiner won't let us take more than one row.

This PR introduces a solution: a class that implements DeclarativeAggregate, so that schemas do not have to be declared explicitly, and that uses the combiner to avoid skew and to benefit from Codegen.
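
To illustrate why a combiner-style aggregate helps here, below is a minimal sketch in plain Python, not the Spark code in this PR. It assumes ascending order, and all names (`update`, `merge`, `top_n_per_group`) are hypothetical. Each partition keeps at most `n` rows per group, and only those small buffers are merged, so no group is ever fully sorted or moved in one piece.

```python
import heapq


def update(buffer, row, n, order_key):
    """Fold one row into a partial buffer that never holds more than n rows."""
    return heapq.nsmallest(n, buffer + [row], key=order_key)


def merge(left, right, n, order_key):
    """Combine two partial buffers, again keeping only the n smallest rows."""
    return heapq.nsmallest(n, left + right, key=order_key)


def top_n_per_group(partitions, n, group_key, order_key):
    """Hypothetical driver: partial aggregation per partition, then a merge step."""
    # "Map side": build a bounded buffer per group inside each partition.
    partials = []
    for rows in partitions:
        buffers = {}
        for row in rows:
            g = group_key(row)
            buffers[g] = update(buffers.get(g, []), row, n, order_key)
        partials.append(buffers)

    # "Reduce side": merge the small per-partition buffers for each group.
    result = {}
    for buffers in partials:
        for g, buf in buffers.items():
            result[g] = merge(result.get(g, []), buf, n, order_key)
    return result


# Illustrative data only: (group, value) rows spread over two partitions.
partitions = [
    [("a", 5), ("a", 1), ("b", 7)],
    [("a", 3), ("a", 2), ("b", 4), ("b", 9)],
]
top2 = top_n_per_group(partitions, 2, lambda r: r[0], lambda r: r[1])
```

With this shape, a heavily skewed group costs at most `n` rows per partition at merge time, whereas a window function has to gather and order the whole group.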

@benraha benraha changed the title Added the first version of collectNumberOrderedElements Added collectNumberOrderedElements Oct 9, 2024
@eyala
Contributor

eyala commented Dec 9, 2024

Did you specify that DataFu should build with Spark 3.3 or 3.4? I think your PR assumes a newer DeclarativeAggregate interface than the one we currently have, and that's why the build is failing in our CI.

I'm planning on pushing code that will upgrade us to these versions, so that will probably make your PR pass tests.
