Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve aggregation code readability #12335

Open
lewiszlw opened this issue Sep 5, 2024 · 8 comments
Open

Improve aggregation code readability #12335

lewiszlw opened this issue Sep 5, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@lewiszlw
Copy link
Member

lewiszlw commented Sep 5, 2024

Is your feature request related to a problem or challenge?

I'm reading aggregation code recently, found it's hard to read due to different AggregateMode, these modes are mixed up in the code.

Describe the solution you'd like

Might split stream based on AggregateMode?

Describe alternatives you've considered

No response

Additional context

No response

@lewiszlw lewiszlw added the enhancement New feature or request label Sep 5, 2024
@jayzhan211
Copy link
Contributor

I think it is a good chance to add documentation that helps yourself and others, or refactor the code to improve the codebase.

@lewiszlw
Copy link
Member Author

lewiszlw commented Sep 6, 2024

Yeah, I will try. I'm not familiar with the whole code yet, I'll keep reading.

@2010YOUY01
Copy link
Contributor

This would be a valuable improvement. Now the execution behavior is determined by AggregateMode + other more subtle inner states like spill_state.xxx, and some functions will handle several less related paths at the same time, it's a good idea to separate them (or at least add more doc)
However if it's a big refactor, it's better to wait until #11943 gets merged to avoid conflicts

@Rachelint
Copy link
Contributor

This would be a valuable improvement. Now the execution behavior is determined by AggregateMode + other more subtle inner states like spill_state.xxx, and some functions will handle several less related paths at the same time, it's a good idea to separate them (or at least add more doc) However if it's a big refactor, it's better to wait until #11943 gets merged to avoid conflicts

I think at least we should separate partial and the teminals(Final, Single...).
I am making a poc trying to improve the performance of partial, and I found it is too hard to make it, because all modes are mixed up...
I am worried that it will become unmaintainable in future...

@eejbyfeldt
Copy link
Contributor

I was also reading the aggregation code and came up with one possible simplification. What do you think about the idea of a change like this #12517 I think it helps simplify the logic as it removes the is_stream_merging flag and lets that be handled as just another aggregation with final mode.

@Rachelint
Copy link
Contributor

Rachelint commented Sep 18, 2024

I was also reading the aggregation code and came up with one possible simplification. What do you think about the idea of a change like this #12517 I think it helps simplify the logic as it removes the is_stream_merging flag and lets that be handled as just another aggregation with final mode.

I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.

@eejbyfeldt
Copy link
Contributor

I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.

Yeah, I guess it a bit of a double edge sword that it just moves the complexity elsewhere. But I could think well together with your suggestion of

I think at least we should separate partial and the teminals(Final, Single...).

Because without doing something like this the partial would still need all the logic for the terminal. Or at least my understand is that the stream merging is basically just performing a terminal aggregation.

@Rachelint
Copy link
Contributor

I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.

Yeah, I guess it a bit of a double edge sword that it just moves the complexity elsewhere. But I could think well together with your suggestion of

I think at least we should separate partial and the teminals(Final, Single...).

Because without doing something like this the partial would still need all the logic for the terminal. Or at least my understand is that the stream merging is basically just performing a terminal aggregation.

Maybe we can refactor the codes thoroughly after #11943 (which can obviously improve aggr performance) is merged.

I think the partial and other terminals are more reasonable and maintainable to be split into two different structs, and placed into different files.
Because obviously that, some paths only belong to partial aggr:

  • Skipping
  • Emit early due to memory usage
    ...

And some paths only belong to terminals:

  • Spilling due to memory usage
    ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants