colexec: hash aggregator doesn't maintain the partial ordering when spilling to disk #63159

yuzefovich · 2021-04-06T18:27:54Z

Currently, the vectorized hash aggregator doesn't maintain the partial ordering if it has to spill to disk. Consider the following logic test, which fails on the fakedist-disk config:

statement ok
create table ab (a int, b int, index(a) storing (b));
insert into ab values (1,1),(3,3),(2,2),(5,5),(0,0),(1,1);

query III
select a, b, count(*) from ab group by a,b order by a
----
0  0  1
1  1  2
2  2  1
3  3  1
5  5  1

The issue is present only on 21.1 because earlier releases didn't have disk spilling support. There are several possible ways to mitigate this problem, and as the first step I will look into supporting the partial ordering in the external hash aggregator.
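To illustrate why partitioning on disk can lose the ordering, here is a toy sketch in Go. It is not the actual colexec code: the `row` and `aggRow` types, the `partitionedHashAgg` function, and the stand-in hash `(31*a + b) % numPartitions` are all hypothetical. It only models the general idea that rows are split into partitions by a hash of the grouping columns and that each partition is aggregated and emitted in turn.

```go
package main

import "fmt"

// row is a toy (a, b) tuple; aggRow is one group with its count(*).
type row struct{ a, b int }
type aggRow struct{ a, b, count int }

// partitionedHashAgg loosely mimics a hash aggregation that has spilled:
// rows are split into numPartitions by a stand-in hash of the grouping
// columns, each partition is aggregated on its own, and the results are
// emitted partition by partition.
func partitionedHashAgg(input []row, numPartitions int) []aggRow {
	partitions := make([][]row, numPartitions)
	for _, r := range input {
		p := (31*r.a + r.b) % numPartitions // hypothetical hash function
		partitions[p] = append(partitions[p], r)
	}
	var out []aggRow
	for _, part := range partitions {
		counts := make(map[row]int)
		var order []row // first-seen order of groups within the partition
		for _, r := range part {
			if _, ok := counts[r]; !ok {
				order = append(order, r)
			}
			counts[r]++
		}
		for _, r := range order {
			out = append(out, aggRow{r.a, r.b, counts[r]})
		}
	}
	return out
}

func main() {
	// The rows from the logic test, as they would arrive ordered on a.
	input := []row{{0, 0}, {1, 1}, {1, 1}, {2, 2}, {3, 3}, {5, 5}}
	for _, r := range partitionedHashAgg(input, 3) {
		fmt.Println(r.a, r.b, r.count)
	}
}
```

With three partitions, the groups come out with `a` in the order 0, 3, 2, 5, 1: each partition is internally ordered, but the concatenation is not, even though the input was ordered on `a`.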

yuzefovich · 2021-04-07T03:37:45Z

I think that fixing this on the execution side will be too invasive and could be error-prone because we use the same component, hashBasedPartitioner, to support disk spilling for hash joins, hash aggregation, and unordered distinct. Adjusting hashBasedPartitioner to maintain the partial ordering in the case of hash aggregation will require reworking the state transitions so that we can work on "chunks" one at a time. The code there is quite complex already, and I'm extremely worried about introducing even more complexity.

So I think that, at least in the short term, it is better to fix it from the optimizer side. An idea that was mentioned is using a segmented sort + streaming aggregation in this case, or, alternatively, planning a general sort after the hash aggregation. cc @rytaft @RaduBerinde
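As a rough illustration of the segmented sort + streaming aggregation idea, here is a minimal Go sketch. It assumes the input is already ordered on `a` and that the grouping columns are `(a, b)`; the `segmentedSort` and `streamingCount` functions and the sample data are hypothetical and are not taken from the optimizer or from colexec.

```go
package main

import (
	"fmt"
	"sort"
)

type row struct{ a, b int }
type aggRow struct{ a, b, count int }

// segmentedSort sorts rows on b within each run of equal a values.
// It assumes the input is already ordered on a, so the result is
// ordered on (a, b).
func segmentedSort(rows []row) {
	for start := 0; start < len(rows); {
		end := start
		for end < len(rows) && rows[end].a == rows[start].a {
			end++
		}
		seg := rows[start:end]
		sort.Slice(seg, func(i, j int) bool { return seg[i].b < seg[j].b })
		start = end
	}
}

// streamingCount computes count(*) per (a, b) group over input that is
// ordered on (a, b): a group ends as soon as the key changes, so the
// output keeps the input ordering.
func streamingCount(rows []row) []aggRow {
	var out []aggRow
	for _, r := range rows {
		if n := len(out); n > 0 && out[n-1].a == r.a && out[n-1].b == r.b {
			out[n-1].count++
			continue
		}
		out = append(out, aggRow{r.a, r.b, 1})
	}
	return out
}

func main() {
	// Ordered on a, but not on b within each a.
	rows := []row{{1, 2}, {1, 1}, {1, 2}, {2, 5}, {3, 1}, {3, 0}}
	segmentedSort(rows)
	for _, r := range streamingCount(rows) {
		fmt.Println(r.a, r.b, r.count)
	}
}
```

Because each segment is sorted independently, only one run of equal `a` values has to be buffered at a time, and the aggregated output remains ordered on `a`.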

yuzefovich · 2021-04-07T03:39:04Z

I guess another idea would be to fall back to the row-by-row processor in such a case, but I think that would be quite unfortunate, and I would treat it as the last resort. The advantage of this approach is that it'll be a very small change (like 3 lines of code).

yuzefovich · 2021-04-08T14:50:20Z

Another option is to plan an explicit external sort to restore the partial ordering that the hash aggregator doesn't maintain when it spills to disk. This might be a bit finicky when the columns from the partial ordering are not output by the aggregator: we'll need to insert "fake" any_not_null aggregates for those and then project them out.
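A toy Go sketch of that shape, for a query like `select count(*) from ab group by a` with a required ordering on `a`: the aggregation carries `a` along in a helper column (standing in for the "fake" any_not_null aggregate), a sort restores the ordering, and the helper column is then projected out. All names here (`aggOut`, `hashAggWithHelper`, `sortAndProject`) are hypothetical, and an in-memory sort stands in for the external sort.

```go
package main

import (
	"fmt"
	"sort"
)

type row struct{ a, b int }

// aggOut is what the toy aggregator emits per group: anyA plays the role
// of the helper any_not_null(a) column, count is count(*).
type aggOut struct{ anyA, count int }

// hashAggWithHelper groups on a and emits groups in first-seen order,
// which makes no promise about the ordering on a.
func hashAggWithHelper(input []row) []aggOut {
	idx := make(map[int]int) // group key a -> position in out
	var out []aggOut
	for _, r := range input {
		i, ok := idx[r.a]
		if !ok {
			i = len(out)
			idx[r.a] = i
			out = append(out, aggOut{anyA: r.a})
		}
		out[i].count++
	}
	return out
}

// sortAndProject restores the ordering on the helper column and then
// drops it, returning only the counts.
func sortAndProject(groups []aggOut) []int {
	sort.SliceStable(groups, func(i, j int) bool {
		return groups[i].anyA < groups[j].anyA
	})
	counts := make([]int, len(groups))
	for i, g := range groups {
		counts[i] = g.count
	}
	return counts
}

func main() {
	input := []row{{1, 1}, {3, 3}, {2, 2}, {5, 5}, {0, 0}, {1, 1}}
	fmt.Println(sortAndProject(hashAggWithHelper(input)))
}
```

The extra column exists only so that the sort has something to order on; the final projection returns the counts in `a` order without exposing `a` itself.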

RaduBerinde · 2021-04-08T15:59:53Z

That sounds error-prone. I agree that the cleanest solution is in the optimizer; I will work on it.

rytaft · 2021-04-08T16:57:06Z

Thank you @RaduBerinde!

yuzefovich added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker branch-release-21.1 labels Apr 6, 2021

yuzefovich self-assigned this Apr 6, 2021

yuzefovich removed their assignment Apr 7, 2021

RaduBerinde self-assigned this Apr 8, 2021

yuzefovich assigned yuzefovich and unassigned RaduBerinde Apr 8, 2021

yuzefovich mentioned this issue Apr 9, 2021

colexec: fix hash aggregator when spilling to disk #63372

Merged

craig bot closed this as completed in 71a023f Apr 9, 2021

yuzefovich mentioned this issue Apr 9, 2021

release-21.1: colexec: fix hash aggregator when spilling to disk #63408

Merged

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023
