Array agg groups accumulator, second attempt #11096

Closed

@markusa380 markusa380 commented Jun 24, 2024

Continuation of #10149

Which issue does this PR close?

Closes #10145.
Closes #10149

Rationale for this change

See the issue.

What changes are included in this PR?

GroupsAccumulator for array_agg aggregation function for:

  • Primitive types
  • String type

Not included:

  • Accumulating arrays of any level of nesting.
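For readers unfamiliar with the idea, here is a minimal sketch of what a groups accumulator for array_agg does: it keeps one list per group and appends each incoming value to the list selected by its group index. `ArrayAggGroups` and its methods are hypothetical names for illustration only; this is not DataFusion's `GroupsAccumulator` trait, and it works on plain slices rather than Arrow arrays.

```rust
/// Hypothetical, simplified stand-in for a groups accumulator for `array_agg`.
/// It is NOT DataFusion's actual trait; names and signatures are assumptions.
struct ArrayAggGroups<T> {
    /// One accumulated list per group, indexed by group index.
    lists: Vec<Vec<Option<T>>>,
}

impl<T: Clone> ArrayAggGroups<T> {
    fn new() -> Self {
        Self { lists: Vec::new() }
    }

    /// Append each value to the list of its group.
    /// `group_indices[i]` is the group that `values[i]` belongs to.
    fn update_batch(
        &mut self,
        values: &[Option<T>],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(values.len(), group_indices.len());
        // Grow the per-group storage if new groups appeared in this batch.
        if self.lists.len() < total_num_groups {
            self.lists.resize(total_num_groups, Vec::new());
        }
        for (value, &group) in values.iter().zip(group_indices) {
            self.lists[group].push(value.clone());
        }
    }

    /// Return the accumulated list for every group.
    fn evaluate(self) -> Vec<Vec<Option<T>>> {
        self.lists
    }
}

fn main() {
    let mut acc = ArrayAggGroups::new();
    // Two batches, two groups; the null in group 0 is kept.
    acc.update_batch(&[Some(1), Some(2), None], &[0, 1, 0], 2);
    acc.update_batch(&[Some(4)], &[1], 2);
    println!("{:?}", acc.evaluate());
}
```

The point of this shape is that all groups share one accumulator instance, so a batch is processed in a single pass instead of dispatching to one accumulator object per group.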

Are these changes tested?

Extended tests in aggregate.slt

Are there any user-facing changes?

Yes, IGNORE NULLS now works with array_agg.


I have not yet addressed the comments on the original PR; instead, I first want to find out whether my approach to null handling is generally valid.

The problem with null handling is that this implementation originally ignored null values.
However, a test added to aggregate.slt in the meantime asserts that nulls are included in the result.
I asked on Discord which behavior is correct and was told that DF tries to follow PG and Spark.
I therefore checked both, and it turns out that PG includes nulls while Spark does not:

>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import array_agg
>>> data = [("value1",), (None,), ("value3",), (None,), ("value5",)]
>>> df = spark.createDataFrame(data, ["column1"], StringType())
>>> df.count()
5
>>> df.agg(array_agg('column1').alias('r')).collect()
[Row(r=['value1', 'value3', 'value5'])]

However:

-- create
CREATE TABLE EMPLOYEE (
  empId INTEGER PRIMARY KEY,
  name TEXT,
  dept TEXT NOT NULL
);

-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Clark', 'Sales');
INSERT INTO EMPLOYEE VALUES (0002, 'Dave', 'Accounting');
INSERT INTO EMPLOYEE VALUES (0003, 'Ava', 'Sales');
INSERT INTO EMPLOYEE VALUES (0004, NULL, 'Sales');

-- fetch 
SELECT ARRAY_AGG(name), dept FROM EMPLOYEE group by dept;

    array_agg     |    dept
------------------+------------
 {Dave}           | Accounting
 {Clark,Ava,NULL} | Sales
(2 rows)

So I decided to make everyone happy by adding support for IGNORE NULLS.
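The two behaviors compared above can be sketched in plain Rust. This is only an illustration of the semantics, not the PR's implementation: `array_agg_by_dept` and its `ignore_nulls` flag are hypothetical, and the rows are the ones from the EMPLOYEE example.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: collect `name` values into one list per `dept`.
/// With `ignore_nulls = false` nulls are kept (the PG behavior shown above);
/// with `ignore_nulls = true` they are dropped (the Spark behavior shown above).
fn array_agg_by_dept<'a>(
    rows: &[(Option<&'a str>, &'a str)],
    ignore_nulls: bool,
) -> BTreeMap<&'a str, Vec<Option<&'a str>>> {
    let mut out: BTreeMap<&'a str, Vec<Option<&'a str>>> = BTreeMap::new();
    for &(name, dept) in rows {
        // Create the group first so it exists even if all its values are null.
        let list = out.entry(dept).or_default();
        if name.is_none() && ignore_nulls {
            continue;
        }
        list.push(name);
    }
    out
}

fn main() {
    let rows = [
        (Some("Clark"), "Sales"),
        (Some("Dave"), "Accounting"),
        (Some("Ava"), "Sales"),
        (None, "Sales"),
    ];
    println!("{:?}", array_agg_by_dept(&rows, false));
    println!("{:?}", array_agg_by_dept(&rows, true));
}
```

With `ignore_nulls = false` the Sales group ends with a null entry, matching the `{Clark,Ava,NULL}` row above; with `ignore_nulls = true` it contains only Clark and Ava.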

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Jun 24, 2024

alamb commented Jun 25, 2024

I will try to find time to review this tomorrow. Thank you @markusa380


alamb commented Jun 28, 2024

I plan to review this tomorrow morning

@markusa380

@alamb I suggest you don't review this yet; it looks like there is a major conflict with e9e2951.
I will need some more time to resolve it.

@alamb alamb left a comment

🤔 The benchmark results appear to be mixed

--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ array_agg-groups-accumulator-v2 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.86ms │                          0.87ms │     no change │
│ QQuery 1     │    95.00ms │                         97.90ms │     no change │
│ QQuery 2     │   205.66ms │                        204.51ms │     no change │
│ QQuery 3     │   210.28ms │                        207.06ms │     no change │
│ QQuery 4     │  2239.73ms │                       2191.26ms │     no change │
│ QQuery 5     │  1962.98ms │                       1998.95ms │     no change │
│ QQuery 6     │    87.71ms │                         83.54ms │     no change │
│ QQuery 7     │    97.26ms │                         99.92ms │     no change │
│ QQuery 8     │  3419.52ms │                       3293.90ms │     no change │
│ QQuery 9     │  2406.86ms │                       2392.59ms │     no change │
│ QQuery 10    │   863.62ms │                        865.42ms │     no change │
│ QQuery 11    │   950.55ms │                        959.03ms │     no change │
│ QQuery 12    │  2117.07ms │                       2157.11ms │     no change │
│ QQuery 13    │  4621.89ms │                       4507.84ms │     no change │
│ QQuery 14    │  2842.78ms │                       2888.43ms │     no change │
│ QQuery 15    │  2538.73ms │                       2412.01ms │     no change │
│ QQuery 16    │  6070.69ms │                       5941.28ms │     no change │
│ QQuery 17    │  6010.76ms │                       5856.32ms │     no change │
│ QQuery 18    │ 12305.28ms │                      11909.09ms │     no change │
│ QQuery 19    │   171.37ms │                        175.83ms │     no change │
│ QQuery 20    │  2666.03ms │                       2744.37ms │     no change │
│ QQuery 21    │  3356.16ms │                       3446.94ms │     no change │
│ QQuery 22    │  8498.75ms │                       9339.77ms │  1.10x slower │
│ QQuery 23    │ 21538.59ms │                      22489.59ms │     no change │
│ QQuery 24    │  1312.53ms │                       1368.81ms │     no change │
│ QQuery 25    │  1113.37ms │                       1188.67ms │  1.07x slower │
│ QQuery 26    │  1435.31ms │                       1501.12ms │     no change │
│ QQuery 27    │  3960.65ms │                       4002.54ms │     no change │
│ QQuery 28    │ 30742.95ms │                      28935.23ms │ +1.06x faster │
│ QQuery 29    │  1039.04ms │                       1033.84ms │     no change │
│ QQuery 30    │  2535.98ms │                       2563.35ms │     no change │
│ QQuery 31    │  3294.44ms │                       3302.96ms │     no change │
│ QQuery 32    │ 17200.49ms │                      17215.20ms │     no change │
│ QQuery 33    │  9561.27ms │                       9715.35ms │     no change │
│ QQuery 34    │  9528.20ms │                       9708.13ms │     no change │
│ QQuery 35    │  4253.23ms │                       4191.71ms │     no change │
│ QQuery 36    │   351.77ms │                        351.60ms │     no change │
│ QQuery 37    │   219.39ms │                        238.54ms │  1.09x slower │
│ QQuery 38    │   193.37ms │                        203.92ms │  1.05x slower │
│ QQuery 39    │  1171.07ms │                       1154.55ms │     no change │
│ QQuery 40    │    97.56ms │                         97.61ms │     no change │
│ QQuery 41    │    85.25ms │                         87.23ms │     no change │
│ QQuery 42    │   102.14ms │                        104.12ms │     no change │
└──────────────┴────────────┴─────────────────────────────────┴───────────────┘

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                              ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base)                         │ 173476.11ms │
│ Total Time (array_agg-groups-accumulator-v2)   │ 173227.96ms │
│ Average Time (main_base)                       │   4034.33ms │
│ Average Time (array_agg-groups-accumulator-v2) │   4028.56ms │
│ Queries Faster                                 │           1 │
│ Queries Slower                                 │           4 │
│ Queries with No Change                         │          38 │
└────────────────────────────────────────────────┴─────────────┘

--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ array_agg-groups-accumulator-v2 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │  333.51ms │                        339.62ms │    no change │
│ QQuery 2     │  126.50ms │                        134.88ms │ 1.07x slower │
│ QQuery 3     │  165.32ms │                        163.39ms │    no change │
│ QQuery 4     │   99.88ms │                        102.41ms │    no change │
│ QQuery 5     │  202.28ms │                        203.27ms │    no change │
│ QQuery 6     │   92.01ms │                         91.54ms │    no change │
│ QQuery 7     │  278.23ms │                        278.15ms │    no change │
│ QQuery 8     │  200.50ms │                        213.43ms │ 1.06x slower │
│ QQuery 9     │  319.18ms │                        324.96ms │    no change │
│ QQuery 10    │  275.71ms │                        284.99ms │    no change │
│ QQuery 11    │  101.88ms │                        103.48ms │    no change │
│ QQuery 12    │  155.04ms │                        158.06ms │    no change │
│ QQuery 13    │  323.16ms │                        332.34ms │    no change │
│ QQuery 14    │  121.68ms │                        120.21ms │    no change │
│ QQuery 15    │  174.75ms │                        176.83ms │    no change │
│ QQuery 16    │   99.55ms │                        110.40ms │ 1.11x slower │
│ QQuery 17    │  315.42ms │                        315.47ms │    no change │
│ QQuery 18    │  446.77ms │                        436.15ms │    no change │
│ QQuery 19    │  200.79ms │                        198.34ms │    no change │
│ QQuery 20    │  195.76ms │                        194.69ms │    no change │
│ QQuery 21    │  343.41ms │                        334.32ms │    no change │
│ QQuery 22    │   76.56ms │                         74.73ms │    no change │
└──────────────┴───────────┴─────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                              ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                         │ 4647.90ms │
│ Total Time (array_agg-groups-accumulator-v2)   │ 4691.66ms │
│ Average Time (main_base)                       │  211.27ms │
│ Average Time (array_agg-groups-accumulator-v2) │  213.26ms │
│ Queries Faster                                 │         0 │
│ Queries Slower                                 │         3 │
│ Queries with No Change                         │        19 │
└────────────────────────────────────────────────┴───────────┘
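As a reading aid for the "Change" column, the labels in both tables are consistent with comparing the relative difference against the baseline to a 5% noise threshold. The sketch below reproduces that arithmetic; the threshold and the exact rule are assumptions inferred from the numbers above, not taken from the benchmark tooling, and `classify` is a hypothetical helper.

```rust
/// Hypothetical helper: label a timing change relative to the baseline.
/// The 5% threshold passed in `main` is an assumption inferred from the tables.
fn classify(base_ms: f64, branch_ms: f64, threshold: f64) -> String {
    // Relative change of the branch with respect to the baseline.
    let change = (branch_ms - base_ms) / base_ms;
    if change > threshold {
        format!("{:.2}x slower", branch_ms / base_ms)
    } else if change < -threshold {
        format!("+{:.2}x faster", base_ms / branch_ms)
    } else {
        "no change".to_string()
    }
}

fn main() {
    // Rows taken from the tables above.
    println!("{}", classify(8498.75, 9339.77, 0.05)); // first table, Query 22
    println!("{}", classify(30742.95, 28935.23, 0.05)); // first table, Query 28
    println!("{}", classify(2538.73, 2412.01, 0.05)); // first table, Query 15
    println!("{}", classify(99.55, 110.40, 0.05)); // tpch, Query 16
}
```

Under that assumption, every flagged query sits within about 5% to 11% of the baseline, while the total times of the two branches differ by well under 1% in each benchmark.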

@@ -59,6 +62,8 @@ pub struct NullState {
 /// If `seen_values[i]` is false, have not seen any values that
 /// pass the filter yet for group `i`
 seen_values: BooleanBufferBuilder,
+
+seen_nulls: Int64BufferBuilder,
Maintaining NullState typically happens in the inner loop of aggregation, so I am worried about the performance impact here.

I will run some benchmarks to gather data.

@@ -132,7 +138,7 @@ impl NullState {
 mut value_fn: F,
 ) where
 T: ArrowPrimitiveType + Send,
-F: FnMut(usize, T::Native) + Send,
+F: FnMut(usize, Option<T::Native>) + Send,
I think the docs should be updated as well to reflect this change.
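To make the signature change concrete, here is a simplified analogue of a callback that receives `Option<T>` instead of `T`, so that the caller sees null rows rather than having them skipped. This `accumulate` function is a hypothetical stand-in working on plain slices; it is not the `NullState` method from the diff and omits filters and Arrow types.

```rust
/// Hypothetical, simplified analogue of an accumulate loop: `value_fn` is
/// called once per row with the row's group index and its value, and null
/// rows are passed through as `None` instead of being skipped.
fn accumulate<T: Copy, F>(group_indices: &[usize], values: &[Option<T>], mut value_fn: F)
where
    F: FnMut(usize, Option<T>),
{
    assert_eq!(group_indices.len(), values.len());
    for (&group, &value) in group_indices.iter().zip(values) {
        value_fn(group, value);
    }
}

fn main() {
    let mut lists: Vec<Vec<Option<i64>>> = vec![Vec::new(); 2];
    // The callback decides what to do with nulls; here it keeps them.
    accumulate(&[0, 1, 0], &[Some(10), None, Some(30)], |group, value| {
        lists[group].push(value)
    });
    println!("{:?}", lists);
}
```

Any documentation that still describes the callback as being invoked only for non-null values would no longer match a signature of this shape, which is presumably what the review comment is asking to update.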

@alamb alamb marked this pull request as draft June 28, 2024 21:21
alamb commented Jun 28, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Aug 28, 2024
@github-actions github-actions bot closed this Sep 4, 2024
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Development

Successfully merging this pull request may close these issues.

Implement GroupsAccumulator for array_agg aggregation function
2 participants