Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: why nullable of list item is set to true #11626

Merged
merged 11 commits into from
Jul 26, 2024
76 changes: 76 additions & 0 deletions datafusion/functions-aggregate/COMMENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Why Is List Item Always Nullable?

## Motivation

There were independent proposals to make the `nullable` setting of list
items in accumulator state be configurable. This meant adding additional
fields which captured the `nullable` setting from schema in planning for
the first argument to the aggregation function, and the returned value.

These fields were to be added to `StateFieldArgs`. But then we found out
that aggregate computation does not depend on it, and it can be avoided.

This document exists to make that reasoning explicit.

## Background

The list data type is used in the accumulator state for a few aggregate
functions like:

- `sum`
- `count`
- `array_agg`
- `bit_and`, `bit_or` and `bit_xor`
- `nth_value`

In all of the above cases the data type of the list item is equivalent
to either the first argument of the aggregate function or the returned
value.

For example, in `array_agg` the data type of item is equivalent to the
first argument and the definition looks like this:

```rust
// `args` : `StateFieldArgs`
// `input_type` : data type of the first argument
let mut fields = vec![Field::new_list(
format_state_name(self.name(), "array_agg"),
Field::new("item", args.input_type.clone(), true /* nullable of list item */ ),
false, // nullable of list itself
)];
```

For all the aggregates listed above, the list item is always defined as
nullable.

## Computing Intermediate State

By setting `nullable` to be always `true` like this we ensure that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is another rationale that the intermediate results need to be able to represent "saw no rows" (e.g that partition had no values)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nth_value accumulator, when now rows are present in the partition, then no values are added to the intermediate state.

I haven't checked the other aggregates though. So I don't know for certain if this is the case always. I'll verify and make a follow-on PR if any differences exist. I think we've only looked deeper into nth_value and array_agg (by @jayzhan211) at the moment.

Copy link
Contributor

@alamb alamb Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense -- I vaguely remember that the null was needed in one of the aggregators to distinguish between

  1. only empty lists had been seen []
  2. No lists at all had been seen NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Thanks for the pointer. I'll keep this in mind while making pass through the aggregates next time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a minor copy change to disambiguate that the "Computing Intermediate State" section is talking about the nullability of the list item rather than the nullability of the list container.

Sorry for the confusion. I was not clear earlier.

aggregate computation works even when nulls are present. The advantage
of doing it this way is that it eliminates the need for additional code
and special treatment of nulls in the accumulator state.

## Nullable Of List Itself

The `nullable` of list itself is dependent on the aggregate. In the
case of `array_agg` the list is nullable(`true`), meanwhile for `sum`
the list is not nullable(`false`).
2 changes: 2 additions & 0 deletions datafusion/functions-aggregate/src/array_agg.rs
Original file line number Diff line number Diff line change
Expand Up @@ -92,13 +92,15 @@ impl AggregateUDFImpl for ArrayAgg {
if args.is_distinct {
return Ok(vec![Field::new_list(
format_state_name(args.name, "distinct_array_agg"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
true,
)]);
}

let mut fields = vec![Field::new_list(
format_state_name(args.name, "array_agg"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
true,
)];
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/bit_and_or_xor.rs
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,7 @@ impl AggregateUDFImpl for BitwiseOperation {
args.name,
format!("{} distinct", self.name()).as_str(),
),
// See COMMENTS.md to understand why nullable is set to true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Field::new("item", args.return_type.clone(), true),
false,
)])
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/count.rs
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,7 @@ impl AggregateUDFImpl for Count {
if args.is_distinct {
Ok(vec![Field::new_list(
format_state_name(args.name, "count distinct"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
false,
)])
Expand Down
5 changes: 1 addition & 4 deletions datafusion/functions-aggregate/src/nth_value.rs
Original file line number Diff line number Diff line change
Expand Up @@ -134,10 +134,7 @@ impl AggregateUDFImpl for NthValueAgg {
fn state_fields(&self, args: StateFieldsArgs) -> Result<Vec<Field>> {
let mut fields = vec![Field::new_list(
format_state_name(self.name(), "nth_value"),
// TODO: The nullability of the list element should be configurable.
// The hard-coded `true` should be changed once the field for
// nullability is added to `StateFieldArgs` struct.
// See: https://github.com/apache/datafusion/pull/11063
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
false,
)];
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/sum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,7 @@ impl AggregateUDFImpl for Sum {
if args.is_distinct {
Ok(vec![Field::new_list(
format_state_name(args.name, "sum distinct"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.return_type.clone(), true),
false,
)])
Expand Down