
Implement Streaming Aggregation: Do not break pipeline in aggregation if group by columns are ordered #6034

Closed
wants to merge 42 commits

Conversation

mustafasrepo
Contributor

@mustafasrepo mustafasrepo commented Apr 17, 2023

Which issue does this PR close?

Closes #5133.

Rationale for this change

As discussed in the document, if some of the expressions in the GROUP BY clause are already ordered, we can generate aggregator results without breaking the pipeline. Consider the query below:

SELECT a, b,
   SUM(c) as summation1
   FROM annotated_data
   GROUP BY a, b

If the source is ordered by a, b, we can calculate the group by values in a streaming fashion (this corresponds to Fully Streaming in the document). When the value of the a, b columns changes, the corresponding group will not receive any more values (otherwise it would contradict the ordering).

If the source is ordered by a only, we can still calculate the group by values in a streaming fashion (this corresponds to Partial Streaming in the document). In this case, however, results are generated only when the value of column a changes.

What changes are included in this PR?

This PR enables us to produce non-pipeline-breaking results under the Fully Streaming and Partial Streaming conditions. Please see the document for a more detailed discussion and for what these terms refer to.

Since the behavior of the executor changes with the existing ordering (if the result can be calculated in a streaming fashion the output will have an ordering, otherwise it will not), this functionality requires us to calculate output_ordering dynamically. For this reason, this PR accompanies the API change from fn output_ordering(&self) -> Option<&[PhysicalSortExpr]> to fn output_ordering(&self) -> Option<Vec<PhysicalSortExpr>>, which supports dynamic calculation of the output ordering. See the corresponding PR for more information.
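
As a rough illustration of why the owned return type helps, here is a minimal sketch with simplified, hypothetical types (SortExpr stands in for PhysicalSortExpr; ExecPlan and OrderedAggregate are not the real DataFusion definitions): the owned Vec lets the ordering be computed on demand from the operator's state instead of being borrowed from a pre-stored slice.

#[derive(Clone, Debug)]
struct SortExpr {
    column: String,
    descending: bool,
}

trait ExecPlan {
    // New-style signature: returns an owned Vec, so it can be built dynamically.
    fn output_ordering(&self) -> Option<Vec<SortExpr>>;
}

struct OrderedAggregate {
    input_ordering: Vec<SortExpr>,
    group_by_is_ordered: bool,
}

impl ExecPlan for OrderedAggregate {
    fn output_ordering(&self) -> Option<Vec<SortExpr>> {
        // When the aggregation can run in streaming mode, the output preserves
        // the input ordering of the group by columns; otherwise there is none.
        if self.group_by_is_ordered {
            Some(self.input_ordering.clone())
        } else {
            None
        }
    }
}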

Are these changes tested?

Yes, the aggregate_fuzz.rs file contains randomized tests that check whether the streamed implementation and the existing version produce the same results.

I have also run the benchmarks and verified that there is no regression. Benchmark results can be found below.

Query Main Branch Comparison
QQuery 1 2528.72ms 2476.56ms no change
QQuery 2 430.39ms 428.26ms no change
QQuery 3 1808.89ms 1808.69ms no change
QQuery 4 1475.49ms 1472.43ms no change
QQuery 5 1826.64ms 1829.95ms no change
QQuery 6 1599.18ms 1587.20ms no change
QQuery 7 2123.82ms 2119.41ms no change
QQuery 8 1829.00ms 1820.43ms no change
QQuery 9 2229.39ms 2193.94ms no change
QQuery 10 1827.89ms 1797.32ms no change
QQuery 11 416.38ms 417.49ms no change
QQuery 12 1655.84ms 1628.03ms no change
QQuery 13 673.42ms 686.45ms no change
QQuery 14 1550.93ms 1518.83ms no change
QQuery 15 3061.58ms 2999.56ms no change
QQuery 16 229.33ms 237.00ms no change
QQuery 17 5935.75ms 5963.68ms no change
QQuery 18 4655.64ms 4653.72ms no change
QQuery 19 1829.92ms 1794.32ms no change
QQuery 20 2137.18ms 2129.47ms no change
QQuery 21 4349.41ms 4234.84ms no change
QQuery 22 355.48ms 352.52ms no change

Are there any user-facing changes?

API change (the output_ordering signature change described above).

@mustafasrepo
Contributor Author

Should we create a different group stream instead of adding new methods to GroupedHashAggregateStream?

Since most of the code is shared with the existing GroupedHashAggregateStream, we chose this implementation. Otherwise we would need to duplicate a lot of code.

@alamb
Contributor

alamb commented Apr 24, 2023

I ran some preliminary benchmarks against this branch and it seems like some queries have gotten slightly slower:

alamb@aal-dev:~/benchmarking/feature%2Fstream_groupby4$ python3 ~/arrow-datafusion/benchmarks/compare.py tpch_sf1_parquet_mem.json tpch_sf1_mem_branch.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃           -o ┃           -o ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │     770.75ms │     760.05ms │    no change │
│ QQuery 2     │     289.80ms │     312.71ms │ 1.08x slower │
│ QQuery 3     │     174.31ms │     175.09ms │    no change │
│ QQuery 4     │     106.65ms │     104.51ms │    no change │
│ QQuery 5     │     477.41ms │     480.83ms │    no change │
│ QQuery 6     │      38.15ms │      37.78ms │    no change │
│ QQuery 7     │    1071.70ms │    1082.32ms │    no change │
│ QQuery 8     │     252.64ms │     264.53ms │    no change │
│ QQuery 9     │     581.89ms │     598.15ms │    no change │
│ QQuery 10    │     332.62ms │     339.50ms │    no change │
│ QQuery 11    │     282.02ms │     291.65ms │    no change │
│ QQuery 12    │     145.87ms │     152.48ms │    no change │
│ QQuery 13    │     679.94ms │     680.18ms │    no change │
│ QQuery 14    │      59.35ms │      58.90ms │    no change │
│ QQuery 15    │      96.58ms │      96.56ms │    no change │
│ QQuery 16    │     251.37ms │     266.31ms │ 1.06x slower │
│ QQuery 17    │    2435.04ms │    2539.73ms │    no change │
│ QQuery 18    │    3021.24ms │    3272.84ms │ 1.08x slower │
│ QQuery 19    │     142.99ms │     153.61ms │ 1.07x slower │
│ QQuery 20    │     925.24ms │    1058.29ms │ 1.14x slower │
│ QQuery 21    │    1423.51ms │    1407.18ms │    no change │
│ QQuery 22    │     148.12ms │     144.86ms │    no change │
└──────────────┴──────────────┴──────────────┴──────────────┘
alamb@aal-dev:~/benchmarking/feature%2Fstream_groupby4$ python3 ~/arrow-datafusion/benchmarks/compare.py tpch_sf1_parquet_main.json  tpch_sf1_parquet_branch.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │    1470.88ms │    1456.73ms │    no change │
│ QQuery 2     │     394.00ms │     422.56ms │ 1.07x slower │
│ QQuery 3     │     564.83ms │     540.62ms │    no change │
│ QQuery 4     │     222.21ms │     221.49ms │    no change │
│ QQuery 5     │     717.36ms │     702.32ms │    no change │
│ QQuery 6     │     460.41ms │     454.66ms │    no change │
│ QQuery 7     │    1216.67ms │    1230.08ms │    no change │
│ QQuery 8     │     717.35ms │     731.94ms │    no change │
│ QQuery 9     │    1337.85ms │    1326.94ms │    no change │
│ QQuery 10    │     765.05ms │     787.99ms │    no change │
│ QQuery 11    │     337.95ms │     344.80ms │    no change │
│ QQuery 12    │     329.03ms │     327.28ms │    no change │
│ QQuery 13    │    1105.33ms │    1170.87ms │ 1.06x slower │
│ QQuery 14    │     449.61ms │     450.68ms │    no change │
│ QQuery 15    │     405.36ms │     417.38ms │    no change │
│ QQuery 16    │     330.18ms │     349.17ms │ 1.06x slower │
│ QQuery 17    │    2772.98ms │    2891.72ms │    no change │
│ QQuery 18    │    3592.01ms │    3802.18ms │ 1.06x slower │
│ QQuery 19    │     769.32ms │     771.99ms │    no change │
│ QQuery 20    │    1237.75ms │    1326.82ms │ 1.07x slower │
│ QQuery 21    │    1663.89ms │    1633.89ms │    no change │
│ QQuery 22    │     197.52ms │     202.74ms │    no change │
└──────────────┴──────────────┴──────────────┴──────────────┘

Script I used is here: https://github.com/alamb/datafusion-benchmarking/blob/628151e3e3d27ff6e5242052d017f71dcd0d80ef/bench.sh

I am rerunning the numbers to see if I can reproduce the results

Contributor

@alamb alamb left a comment

I think the structure to make a single GroupHash operator support both ordered and unordered data is very clever. Thank you @mustafasrepo

However, I suspect the overhead of this tracking is slowing down existing aggregation.

If this is in fact slowing down queries, then I think we should make separate operators for streaming and non-streaming aggregation (as @mingmwang suggests and as I tried to say in #5133 (comment)).

I will post my next benchmark run shortly

datafusion-examples/examples/custom_datasource.rs
datafusion/core/src/physical_plan/aggregates/row_hash.rs
.last()
.map(|item| item.ordered_columns.clone());

if let Some(last_ordered_columns) = last_ordered_columns {
Contributor

I may be misunderstanding this code, but it seems like it is tracking, per group, whether the group can be emitted or not. As I understand the “Partial Streaming” / “Partitioned Streaming” section of https://docs.google.com/document/d/16rm5VR1nGkY6DedMCh1NUmThwf3RduAweaBH9b1h6AY/edit#heading=h.uapxuhfa9wyi, the entire hash table could instead be flushed each time a new value of date is seen:

(screenshot from the design document, 2023-04-24)

Perhaps with the obvious vectorization of only checking on record batch boundaries, or something
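
A minimal sketch of that idea (illustrative names and a toy i64 SUM aggregate, not the actual row_hash.rs code): because the input is ordered on the prefix column, every group accumulated so far is final once that prefix advances, so the whole table can be drained and emitted.

use std::collections::HashMap;

fn process_batch(
    table: &mut HashMap<String, i64>,  // group key -> running SUM
    current_key: &mut Option<String>,  // last seen value of the ordered prefix
    batch: &[(String, String, i64)],   // (ordered_col, other_group_col, value)
    emit: &mut dyn FnMut(String, i64),
) {
    for (ordered, group, v) in batch {
        if current_key.as_deref() != Some(ordered.as_str()) {
            // The ordered prefix advanced: every accumulated group is final.
            for (key, agg) in table.drain() {
                emit(key, agg);
            }
            *current_key = Some(ordered.clone());
        }
        let key = format!("{ordered}/{group}");
        *table.entry(key).or_insert(0) += *v;
    }
    // The vectorized variant mentioned above would compare the prefix only at
    // record batch boundaries instead of once per row.
}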

@alamb
Contributor

alamb commented Apr 24, 2023

I got similar results in my next performance run:

****** TPCH SF1 (Parquet) ******
+ python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/feature%2Fstream_groupby4/tpch_sf1_parquet_main.json /home/alamb/benchmarking/feature%2Fstream_groupby4/tpch_sf1_parquet_branch.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │    1439.84ms │    1472.45ms │    no change │
│ QQuery 2     │     408.59ms │     428.13ms │    no change │
│ QQuery 3     │     541.64ms │     550.25ms │    no change │
│ QQuery 4     │     218.62ms │     227.71ms │    no change │
│ QQuery 5     │     690.76ms │     702.12ms │    no change │
│ QQuery 6     │     447.31ms │     456.74ms │    no change │
│ QQuery 7     │    1223.94ms │    1262.33ms │    no change │
│ QQuery 8     │     707.30ms │     731.55ms │    no change │
│ QQuery 9     │    1331.92ms │    1309.76ms │    no change │
│ QQuery 10    │     791.17ms │     813.92ms │    no change │
│ QQuery 11    │     338.22ms │     338.76ms │    no change │
│ QQuery 12    │     330.78ms │     329.38ms │    no change │
│ QQuery 13    │    1102.45ms │    1134.98ms │    no change │
│ QQuery 14    │     450.43ms │     447.24ms │    no change │
│ QQuery 15    │     401.69ms │     407.17ms │    no change │
│ QQuery 16    │     335.41ms │     370.62ms │ 1.10x slower │
│ QQuery 17    │    2793.22ms │    2984.28ms │ 1.07x slower │
│ QQuery 18    │    3602.06ms │    3855.55ms │ 1.07x slower │
│ QQuery 19    │     757.20ms │     772.06ms │    no change │
│ QQuery 20    │    1208.05ms │    1409.62ms │ 1.17x slower │
│ QQuery 21    │    1662.33ms │    1672.38ms │    no change │
│ QQuery 22    │     191.45ms │     195.55ms │    no change │
└──────────────┴──────────────┴──────────────┴──────────────┘
+ echo '****** TPCH SF1 (mem) ******'
****** TPCH SF1 (mem) ******
+ python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/feature%2Fstream_groupby4/tpch_sf1_mem_main.json /home/alamb/benchmarking/feature%2Fstream_groupby4/tpch_sf1_mem_branch.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃           -o ┃           -o ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │     731.33ms │     759.80ms │     no change │
│ QQuery 2     │     266.90ms │     315.18ms │  1.18x slower │
│ QQuery 3     │     168.24ms │     181.07ms │  1.08x slower │
│ QQuery 4     │     107.01ms │     107.07ms │     no change │
│ QQuery 5     │     469.84ms │     472.69ms │     no change │
│ QQuery 6     │      37.35ms │      38.33ms │     no change │
│ QQuery 7     │    1119.56ms │    1121.01ms │     no change │
│ QQuery 8     │     253.48ms │     258.02ms │     no change │
│ QQuery 9     │     595.14ms │     604.48ms │     no change │
│ QQuery 10    │     326.75ms │     343.41ms │  1.05x slower │
│ QQuery 11    │     262.07ms │     293.92ms │  1.12x slower │
│ QQuery 12    │     141.94ms │     151.47ms │  1.07x slower │
│ QQuery 13    │     645.04ms │     698.31ms │  1.08x slower │
│ QQuery 14    │      51.78ms │      48.35ms │ +1.07x faster │
│ QQuery 15    │     100.32ms │     114.44ms │  1.14x slower │
│ QQuery 16    │     240.47ms │     285.68ms │  1.19x slower │
│ QQuery 17    │    2448.05ms │    2590.14ms │  1.06x slower │
│ QQuery 18    │    3131.16ms │    3295.09ms │  1.05x slower │
│ QQuery 19    │     147.72ms │     150.86ms │     no change │
│ QQuery 20    │     931.54ms │    1052.21ms │  1.13x slower │
│ QQuery 21    │    1454.51ms │    1438.62ms │     no change │
│ QQuery 22    │     144.76ms │     141.84ms │     no change │
└──────────────┴──────────────┴──────────────┴───────────────┘

@alamb
Contributor

alamb commented Apr 24, 2023

For reference, here is the same benchmark run against main itself:

****** TPCH SF1 (Parquet) ******
+ python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/alamb-main/tpch_sf1_parquet_main.json /home/alamb/benchmarking/alamb-main/tpch_sf1_parquet\
_branch.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │    1430.86ms │    1423.29ms │     no change │
│ QQuery 2     │     399.75ms │     405.00ms │     no change │
│ QQuery 3     │     520.40ms │     525.56ms │     no change │
│ QQuery 4     │     218.29ms │     223.87ms │     no change │
│ QQuery 5     │     693.57ms │     685.46ms │     no change │
│ QQuery 6     │     416.62ms │     423.02ms │     no change │
│ QQuery 7     │    1258.17ms │    1243.79ms │     no change │
│ QQuery 8     │     690.25ms │     687.29ms │     no change │
│ QQuery 9     │    1304.02ms │    1288.01ms │     no change │
│ QQuery 10    │     770.91ms │     748.94ms │     no change │
│ QQuery 11    │     356.32ms │     336.55ms │ +1.06x faster │
│ QQuery 12    │     335.14ms │     329.12ms │     no change │
│ QQuery 13    │    1170.83ms │    1146.78ms │     no change │
│ QQuery 14    │     422.25ms │     421.47ms │     no change │
│ QQuery 15    │     391.14ms │     381.71ms │     no change │
│ QQuery 16    │     348.38ms │     344.13ms │     no change │
│ QQuery 17    │    2860.96ms │    2838.27ms │     no change │
│ QQuery 18    │    3726.11ms │    3734.67ms │     no change │
│ QQuery 19    │     728.53ms │     737.35ms │     no change │
│ QQuery 20    │    1250.75ms │    1208.06ms │     no change │
│ QQuery 21    │    1688.40ms │    1757.45ms │     no change │
│ QQuery 22    │     192.36ms │     190.43ms │     no change │
└──────────────┴──────────────┴──────────────┴───────────────┘
+ echo '****** TPCH SF1 (mem) ******'
****** TPCH SF1 (mem) ******
+ python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/alamb-main/tpch_sf1_mem_main.json /home/alamb/benchmarking/alamb-main/tpch_sf1_mem_branch.\
json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃           -o ┃           -o ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │     759.07ms │     770.73ms │     no change │
│ QQuery 2     │     269.81ms │     291.05ms │  1.08x slower │
│ QQuery 3     │     180.67ms │     161.61ms │ +1.12x faster │
│ QQuery 4     │     105.16ms │     105.46ms │     no change │
│ QQuery 5     │     467.46ms │     466.11ms │     no change │
│ QQuery 6     │      38.08ms │      42.72ms │  1.12x slower │
│ QQuery 7     │    1170.05ms │    1147.43ms │     no change │
│ QQuery 8     │     249.06ms │     238.74ms │     no change │
│ QQuery 9     │     613.38ms │     609.98ms │     no change │
│ QQuery 10    │     342.67ms │     327.23ms │     no change │
│ QQuery 11    │     279.84ms │     281.69ms │     no change │
│ QQuery 12    │     143.94ms │     146.57ms │     no change │
│ QQuery 13    │     676.22ms │     668.79ms │     no change │
│ QQuery 14    │      53.06ms │      51.73ms │     no change │
│ QQuery 15    │      98.47ms │      92.68ms │ +1.06x faster │
│ QQuery 16    │     244.93ms │     257.73ms │  1.05x slower │
│ QQuery 17    │    2473.33ms │    2503.47ms │     no change │
│ QQuery 18    │    3150.26ms │    3169.32ms │     no change │
│ QQuery 19    │     154.90ms │     150.53ms │     no change │
│ QQuery 20    │     969.12ms │     929.69ms │     no change │
│ QQuery 21    │    1476.10ms │    1457.34ms │     no change │
│ QQuery 22    │     148.71ms │     143.20ms │     no change │
└──────────────┴──────────────┴──────────────┴───────────────┘

@ozankabak
Contributor

For reference, here is the same benchmark run against main itself:

This is very helpful. The variance seems larger than one expects.

It seems there may be a tiny slow-down, on the order of half the noise variance (in high-cardinality cases?). @mustafasrepo and I just had a meeting to go over why this could be. He will respond explaining our theory in greater detail, answer your other questions, and maybe even suggest a fix/improvement.

Based on the discussion afterwards and the final numbers, we can reach a consensus on whether we should have two impls with some code duplication, or use the current structure -- we will then take the necessary steps accordingly.

Thanks for all the reviews!

@mingmwang
Contributor

mingmwang commented Apr 25, 2023

@ozankabak @mustafasrepo

I strongly suggest having a separate implementation (Exec) for Streaming Aggregation. This is similar to how we separate HashJoinExec/SortMergeJoinExec and UnionExec/InterleaveExec.
With the split of physical plans, the plans will convey clear information about what kinds of real physical operators they are composed of.
With the split of physical plans, we can keep each operator's code base (HashAggregation and StreamingHashAggregation) relatively simple.
We can also keep a relatively lightweight grouping state for each operator. The memory layout of the grouping state is critical for performance; for hash aggregation performance, we currently still have huge gaps compared with DuckDB.

/// The state that is built for each output group.
#[derive(Debug)]
pub struct GroupState {
    /// The actual group by values, stored sequentially
    group_by_values: OwnedRow,

    ordered_columns: Option<Vec<ScalarValue>>,
Contributor

I think this is not efficient. We should avoid using Vec-of-Vec structures in the critical data structs. The Vec itself is essentially a pointer to some other memory address. Because GroupState is held in a global Vec, if we store ordered_columns in another Vec, the memory access pattern when the code accesses that member will be very random.

You can implement something similar to arrow's GenericStringArray to achieve a better memory layout.
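
For illustration, here is a minimal sketch of the kind of flattened layout being suggested (values plus offsets, in the spirit of arrow's GenericStringArray); the types are simplified placeholders, not the actual GroupState fields.

struct OrderedColumnsBuffer {
    values: Vec<u64>,    // all ordered-column values for all groups, back to back
    offsets: Vec<usize>, // offsets[i]..offsets[i + 1] is the slice for group i
}

impl OrderedColumnsBuffer {
    fn new() -> Self {
        Self { values: Vec::new(), offsets: vec![0] }
    }

    // Append the ordered-column values of a new group.
    fn push_group(&mut self, group_values: &[u64]) {
        self.values.extend_from_slice(group_values);
        self.offsets.push(self.values.len());
    }

    // Borrow the values of group `i` without any per-group allocation.
    fn group(&self, i: usize) -> &[u64] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

Group i's ordered values are then a contiguous slice of one shared buffer, so iterating over all groups touches memory sequentially instead of chasing one heap allocation per group.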

Comment on lines +917 to +920
/// Prune the groups from `self.aggr_state.group_states` that are in the
/// `GroupStatus::Emitted` state (meaning the result of this group has already been emitted,
/// and we are sure that these groups cannot receive new rows).
fn prune(&mut self) {
Contributor

@mingmwang mingmwang Apr 25, 2023

I think this is good for keeping the global group_states relatively small.
But is it possible that in some cases only a small percentage is pruned, so we pay for extra copies? I'm not sure when the prune will be triggered.

Contributor Author

ordered_columns stores the section of the group by expression that defines the ordering in GroupState. When a different ordered_columns value is received, we are sure that previous groups with different ordered_columns are finalized (they will no longer receive new values). At the end of group_aggregate_batch we iterate over self.aggr_state.group_states and mark the groups whose ordered_columns differ from the ordered_columns of the most recent (last) group as prunable.

As an example, if the table is like the one below, and we know that it satisfies ORDER BY a ASC:

a
1
1
2
2
3
3

and the group by clause is GROUP BY a, the groups with ordered_columns = Some(vec![1]) and ordered_columns = Some(vec![2]) will be pruned, since they differ from ordered_columns = Some(vec![3]). However, the last group is not pruned because we can still receive values with 3 for column a.
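
A minimal sketch of this marking-and-pruning logic (illustrative types only; OrderedCols stands in for Vec<ScalarValue>, and this is not the actual row_hash.rs code):

#[derive(Clone, PartialEq)]
struct OrderedCols(Vec<i64>); // stand-in for the ordered section of the group key

struct Group {
    ordered_columns: Option<OrderedCols>,
    emitted: bool,
}

fn mark_and_prune(groups: &mut Vec<Group>, emit: &mut dyn FnMut(&Group)) {
    // Ordered prefix of the most recent (last) group; earlier groups with a
    // different prefix can no longer receive rows.
    let last = match groups.last().and_then(|g| g.ordered_columns.clone()) {
        Some(cols) => cols,
        None => return,
    };
    for g in groups.iter_mut() {
        if g.ordered_columns.as_ref().map_or(false, |c| *c != last) {
            emit(g); // output the finalized group
            g.emitted = true;
        }
    }
    // Drop emitted groups so the aggregation state stays small.
    groups.retain(|g| !g.emitted);
}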

@mustafasrepo
Contributor Author

Regarding the performance downgrade: I have examined the code to see where it occurs. In the new implementation we extend GroupState so that we can determine whether a group is prunable or not, and prune it safely.
Specifically, we were adding ordered_columns: Vec<ScalarValue> to store the ordered section of the group (when this value changes, the group is finalized and can be pruned). This required creating an empty vector for each group, even for groups that are never prunable, which may degrade performance in high-cardinality cases. I have changed the type of this member from ordered_columns: Vec<ScalarValue> to ordered_columns: Option<Vec<ScalarValue>> to prevent unnecessary empty vector creation. Below are my test results.

Query main branch Change
QQuery 1 1887.95ms 1896.02ms no change
QQuery 2 347.76ms 349.14ms no change
QQuery 3 1473.75ms 1475.31ms no change
QQuery 4 1442.71ms 1436.79ms no change
QQuery 5 1497.76ms 1501.52ms no change
QQuery 6 1171.16ms 1174.17ms no change
QQuery 7 1801.61ms 1801.15ms no change
QQuery 8 1500.70ms 1503.70ms no change
QQuery 9 1771.67ms 1771.04ms no change
QQuery 10 1464.79ms 1467.14ms no change
QQuery 11 348.81ms 349.58ms no change
QQuery 12 1623.71ms 1613.43ms no change
QQuery 13 640.64ms 643.67ms no change
QQuery 14 1225.25ms 1228.61ms no change
QQuery 15 2350.93ms 2362.08ms no change
QQuery 16 224.96ms 230.69ms no change
QQuery 17 3580.82ms 3584.29ms no change
QQuery 18 3396.31ms 3425.83ms no change
QQuery 19 1401.42ms 1405.69ms no change
QQuery 20 1543.74ms 1576.21ms no change
QQuery 21 4267.18ms 4258.72ms no change
QQuery 22 341.99ms 343.75ms no change

However, I cannot say for certain that this has fixed the problem.
I think we have 3 options:

  1. Extend GroupedHashAggregateStream to support streaming aggregation (current approach).
  2. Create a new kind of Stream for streaming aggregation (use it from AggregateExec).
  3. Create a new kind of Executor, such as StreamingAggregateExec, for streaming use cases.

My judgement on each option is as follows.
Pros of the first approach:

  • Introduces the least amount of change and reuses a lot of common code.

Cons of the first approach:

  • Possibly degrades performance.
  • Extends the state with members that are not used in all cases.

Pros of the 2nd approach:

  • No performance penalty.
  • Clear state.
  • We can still produce non-pipeline-breaking results, given that there is an existing ordering for the group by expressions.

Pros of the 3rd approach:

  • No performance penalty.
  • Clear state.

Cons of the 3rd approach:

  • We can only produce non-pipeline-breaking results when the source is marked as unbounded. (For bounded sources, even if there is an existing ordering that could yield a non-pipeline-breaking result, we cannot produce one without an additional rule to change the executor.)

We can pursue any one of the above approaches. If the community has a preference, we can pursue that approach.

@ozankabak
Contributor

@alamb, can you try measuring again on your end? I wonder if you will find a similar result to @mustafasrepo's after his last change. If you also see no (or very little) performance change, I propose we get the ball rolling by merging this. We can always refactor the code to the 2nd approach with a follow-on PR.

mustafasrepo and others added 2 commits April 25, 2023 18:18
@alamb
Contributor

alamb commented Apr 25, 2023

I am running the benchmarks again

@alamb
Contributor

alamb commented Apr 25, 2023

+ echo '****** TPCH SF1 (Parquet) ******'
****** TPCH SF1 (Parquet) ******
+ python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/feature%2Fstream_groupby4/tpch_sf1_parquet_main.json /home/alamb/benchmarking/feature%2Fstream_groupby4/tpch_sf1_parqu\
et_branch.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │    1440.81ms │    1466.22ms │    no change │
│ QQuery 2     │     382.39ms │     404.11ms │ 1.06x slower │
│ QQuery 3     │     543.05ms │     552.98ms │    no change │
│ QQuery 4     │     231.01ms │     220.95ms │    no change │
│ QQuery 5     │     693.54ms │     687.44ms │    no change │
│ QQuery 6     │     426.04ms │     449.61ms │ 1.06x slower │
│ QQuery 7     │    1185.19ms │    1188.84ms │    no change │
│ QQuery 8     │     702.14ms │     709.04ms │    no change │
│ QQuery 9     │    1300.47ms │    1316.84ms │    no change │
│ QQuery 10    │     773.33ms │     785.76ms │    no change │
│ QQuery 11    │     332.28ms │     353.08ms │ 1.06x slower │
│ QQuery 12    │     330.60ms │     321.46ms │    no change │
│ QQuery 13    │    1088.95ms │    1150.66ms │ 1.06x slower │
│ QQuery 14    │     426.84ms │     438.99ms │    no change │
│ QQuery 15    │     390.98ms │     400.00ms │    no change │
│ QQuery 16    │     328.41ms │     351.25ms │ 1.07x slower │
│ QQuery 17    │    2761.33ms │    2798.95ms │    no change │
│ QQuery 18    │    3650.22ms │    3674.77ms │    no change │
│ QQuery 19    │     724.28ms │     776.02ms │ 1.07x slower │
│ QQuery 20    │    1214.26ms │    1322.29ms │ 1.09x slower │
│ QQuery 21    │    1651.66ms │    1685.36ms │    no change │
│ QQuery 22    │     196.26ms │     196.06ms │    no change │
└──────────────┴──────────────┴──────────────┴──────────────┘
+ echo '****** TPCH SF1 (mem) ******'
****** TPCH SF1 (mem) ******
+ python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/feature%2Fstream_groupby4/tpch_sf1_mem_main.json /home/alamb/benchmarking/feature%2Fstream_groupby4/tpch_sf1_mem_branc\
h.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃           -o ┃           -o ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │     758.37ms │     778.09ms │     no change │
│ QQuery 2     │     277.40ms │     309.41ms │  1.12x slower │
│ QQuery 3     │     173.75ms │     188.35ms │  1.08x slower │
│ QQuery 4     │     113.69ms │     110.27ms │     no change │
│ QQuery 5     │     475.95ms │     461.74ms │     no change │
│ QQuery 6     │      36.47ms │      37.38ms │     no change │
│ QQuery 7     │    1083.01ms │    1063.32ms │     no change │
│ QQuery 8     │     259.31ms │     246.75ms │     no change │
│ QQuery 9     │     624.32ms │     575.27ms │ +1.09x faster │
│ QQuery 10    │     310.52ms │     351.58ms │  1.13x slower │
│ QQuery 11    │     284.14ms │     282.22ms │     no change │
│ QQuery 12    │     148.13ms │     145.27ms │     no change │
│ QQuery 13    │     659.01ms │     719.22ms │  1.09x slower │
│ QQuery 14    │      52.88ms │      48.78ms │ +1.08x faster │
│ QQuery 15    │      90.71ms │     103.68ms │  1.14x slower │
│ QQuery 16    │     233.70ms │     258.77ms │  1.11x slower │
│ QQuery 17    │    2403.54ms │    2550.52ms │  1.06x slower │
│ QQuery 18    │    2964.21ms │    3217.22ms │  1.09x slower │
│ QQuery 19    │     139.10ms │     149.13ms │  1.07x slower │
│ QQuery 20    │     930.97ms │    1042.61ms │  1.12x slower │
│ QQuery 21    │    1399.42ms │    1408.19ms │     no change │
│ QQuery 22    │     140.09ms │     138.46ms │     no change │
└──────────────┴──────────────┴──────────────┴───────────────┘

I am running with https://github.com/alamb/datafusion-benchmarking/blob/87ee101b70b15dd4529f124d65189b0fb87e09b7/bench.sh

Running on a gcp machine e2-standard-8:

cat /proc/cpuinfo
...

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping        : 0
microcode       : 0xffffffff
cpu MHz         : 2200.164
cache size      : 56320 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpui\
d tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adj\
ust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips        : 4400.32
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

I am hoping to make the benchmarks easier to run / reproduce. I also plan to take another close look at this PR tomorrow

@alamb
Contributor

alamb commented Apr 25, 2023

@tustvold do you have any thoughts about why this code / PR could be making the grouping operator seemingly slow down?

@alamb
Contributor

alamb commented Apr 25, 2023

Create a new kind of Stream for streaming aggregation (use it from AggregateExec).

This is what makes sense to me, though I am not sure what @mingmwang thinks

@tustvold
Contributor

to prevent unnecessary empty vector creation

FWIW an empty vector just contains a NonNull::dangling pointer; it doesn't allocate anything, so wrapping it in an Option is redundant. That being said, the additions to GroupState will make it non-trivially larger, which could conceivably cause performance regressions.

I'm not very familiar with the group by code anymore, but the use of per-group allocations does immediately stand out to me as at risk of thrashing the memory allocator, and consequently making the code not just slow but also wildly unpredictable. I'm aware this isn't something introduced by this PR, but revisiting this design (see #4973) may make it easier to make changes in this area without introducing regressions.
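
As a small illustration of the first point (an assumption check, not part of the PR): Vec::new() does not allocate, and on current rustc Option<Vec<T>> occupies the same space as Vec<T> thanks to the niche optimization, so the Option wrapper neither avoids an allocation nor shrinks GroupState.

fn main() {
    // An empty Vec holds a dangling pointer and capacity 0; nothing is allocated.
    let v: Vec<u64> = Vec::new();
    assert_eq!(v.capacity(), 0);

    // The niche optimization makes the Option wrapper free in terms of size.
    assert_eq!(
        std::mem::size_of::<Vec<u64>>(),
        std::mem::size_of::<Option<Vec<u64>>>(),
    );
    println!("empty Vec capacity: {}", v.capacity());
}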

@ozankabak
Contributor

ozankabak commented Apr 25, 2023

I'm not very familiar with the group by code anymore, but the use of per-group allocations does immediately stand out to me as at risk of thrashing the memory allocator, and consequently making the code not just slow but also wildly unpredictable. I'm aware this isn't something introduced by this PR, but revisiting this design (see #4973) may make it easier to make changes in this area without introducing regressions.

I agree, we will be happy to take part in that effort.

Going back to the scope of this PR: If everyone agrees, we will generate another changeset with Approach 2, and @alamb can check how the performance looks there (since our local benchmarks do not show a difference even now).

We can then discuss any trade-offs w.r.t code re-use/duplication and performance differences. We can move forward with either version based on the outcome of that discussion. We can also talk about the next steps for subsequent performance work in this context. Sounds good?

@mingmwang
Contributor

Create a new kind of Stream for streaming aggregation (use it from AggregateExec).

This is what makes sense to me, though I am not sure what @mingmwang thinks

Yes, I am OK with Option 2 as a follow-up ticket.

@mustafasrepo
Contributor Author

I have implemented version 2. You can find it here. I have measured its performance; the results can be found below.

Query main branch Change
QQuery 1 1894.27ms 1930.49ms no change
QQuery 2 353.20ms 350.90ms no change
QQuery 3 1494.18ms 1518.88ms no change
QQuery 4 1458.10ms 1487.40ms no change
QQuery 5 1535.74ms 1553.12ms no change
QQuery 6 1193.48ms 1219.73ms no change
QQuery 7 1840.05ms 1854.09ms no change
QQuery 8 1548.89ms 1558.78ms no change
QQuery 9 1817.28ms 1837.66ms no change
QQuery 10 1500.78ms 1524.82ms no change
QQuery 11 344.51ms 347.98ms no change
QQuery 12 1648.40ms 1678.82ms no change
QQuery 13 635.02ms 644.33ms no change
QQuery 14 1249.13ms 1277.28ms no change
QQuery 15 2403.47ms 2449.88ms no change
QQuery 16 225.48ms 227.18ms no change
QQuery 17 3598.06ms 3628.50ms no change
QQuery 18 3406.65ms 3458.96ms no change
QQuery 19 1423.90ms 1457.02ms no change
QQuery 20 1567.27ms 1592.57ms no change
QQuery 21 4310.73ms 4371.10ms no change
QQuery 22 341.31ms 348.51ms no change

Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Support StreamAggregation / streaming group by
6 participants