Implement GroupColumn Decimal128Array #13505

alamb · 2024-11-20T21:28:41Z

Is your feature request related to a problem or challenge?

In #12269 @jayzhan211 made significant improvements to how group values are stored in multi-column aggregations.

Specifically for queries like

SELECT ... FROM ... GROUP BY col1, ... colN

The improvement relies on implementing specialized versions of GroupColumn for the types of col1 ... colN.

We have implemented the primitive types and Strings/StringViews now, but we have not implemented all types

This means queries like

SELECT ... FROM ... GROUP BY int_col, decimal_col

Will fall back to the slower (but general) GroupValuesRows:

datafusion/datafusion/physical-plan/src/aggregates/group_values/row.rs

Lines 40 to 41 in a6586cc

    /// representation.
    pub struct GroupValuesRows {
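To make the fallback concrete, here is a minimal, dependency-free sketch of the kind of decision described above. It is not the DataFusion code: `ColType`, `Strategy`, `has_specialized_column`, `choose_strategy`, and the `decimal_supported` flag are all hypothetical names invented for illustration. The only assumption taken from this issue is that a single unsupported column type forces the whole GROUP BY onto the general row-based path.

```rust
/// Toy stand-in for the column types in a GROUP BY
/// (hypothetical enum, not arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int32,
    Utf8,
    Decimal128 { precision: u8, scale: i8 },
}

/// Which group-value storage strategy gets picked.
#[derive(Debug, PartialEq)]
enum Strategy {
    /// One specialized builder per column (the fast path).
    ColumnWise,
    /// General row-format fallback (slower, but handles every type).
    Rows,
}

/// Hypothetical support check. `decimal_supported` models whether a
/// specialized Decimal128 column implementation exists yet.
fn has_specialized_column(t: ColType, decimal_supported: bool) -> bool {
    match t {
        ColType::Int32 | ColType::Utf8 => true,
        ColType::Decimal128 { .. } => decimal_supported,
    }
}

/// The fast path is only usable when every grouping column is supported.
fn choose_strategy(cols: &[ColType], decimal_supported: bool) -> Strategy {
    if cols.iter().all(|&t| has_specialized_column(t, decimal_supported)) {
        Strategy::ColumnWise
    } else {
        Strategy::Rows
    }
}
```

With `decimal_supported = false`, grouping by an Int32 column and a Decimal128 column picks `Strategy::Rows`, which mirrors the behavior this issue wants to eliminate.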

Describe the solution you'd like

Implement GroupColumn for Decimal128 types.

You can see how to do this here:

datafusion/datafusion/physical-plan/src/aggregates/group_values/mod.rs

Lines 117 to 121 in e4bd579

    macro_rules! downcast_helper {
        ($t:ty, $d:ident) => {
            return Ok(Box::new(GroupValuesPrimitive::<$t>::new($d.clone())))
        };
    }

@jonathanc-n also made a really nice PR here

feat: Support faster multi-column grouping ( GroupColumn) for Date/Time/Timestamp types #13457 (comment)

and then make sure there are tests for each of those types in queries that group on multiple columns

Describe alternatives you've considered

No response

Additional context

Here is an example for how this was done for Strings: #12809


alamb · 2024-11-20T22:12:03Z

BTW we can verify that this is working as expected after merging

Minor: Add debug log message for creating GroupValuesRows #13506

Then you can do

cd datafusion-cli
RUST_LOG=debug cargo run -- -c "create or replace table foo(x decimal(10,3), y int) as values (10.0, 100), (21.2, 200), (33.0, 300); select count(*) from foo group by x, y";

You should not see any lines about Creating GroupValuesRows. Here is what is printed out on main:

[2024-11-20T22:08:58Z DEBUG datafusion_physical_plan::aggregates::group_values::row] Creating GroupValuesRows for schema: Field { name: "x", data_type: Decimal128(10, 3), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "y", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }

jonathanc-n · 2024-11-20T22:17:24Z

take

jonathanc-n · 2024-11-21T23:20:43Z

@alamb For this PR, will it need its own custom column implementation for Decimal128 instead of instantiate_primitive!, similar to how byte, byteview, stringview, etc. are dealt with? I am thinking it might, because of the type's parameters (precision and scale).

alamb · 2024-11-22T12:08:46Z

> @alamb For this PR, will it need its own custom column implementation for Decimal128 instead of instantiate_primitive!, similar to how byte, byteview, stringview, etc. are dealt with? I am thinking it might, because of the type's parameters (precision and scale).

I think you can use Vec<i128> for the underlying storage (i.e. use the PrimitiveGroup struct with the underlying primitive type), and then call this function to adjust the type at the end:

datafusion/datafusion/functions-aggregate-common/src/utils.rs

Line 51 in 207e855

    pub fn adjust_output_array(data_type: &DataType, array: ArrayRef) -> Result<ArrayRef> {

For example like this:

datafusion/datafusion/physical-plan/src/aggregates/topk/heap.rs

Line 156 in 207e855

    let vals = adjust_output_array(&self.data_type, vals).expect("Type is incorrect");

So the idea is that you keep the actual group values as native types (i128 in this case, the native type of Decimal128), since we are only comparing their values.

Does that make sense?
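The idea above can be sketched without any arrow dependency. This is only a toy model under stated assumptions: `DecimalGroupColumn`, `equal_to`, `append_val`, `build`, and `intern` are illustrative names, not the actual GroupColumn trait or PrimitiveGroup struct, and the linear search stands in for the hash table a real implementation would use. The point it tries to show is that grouping only compares raw i128 values, while precision and scale are carried along untouched and re-attached once, at output time.

```rust
/// Toy group-value column for a decimal input (hypothetical type).
/// Values are stored as raw i128; precision and scale are kept on the side.
struct DecimalGroupColumn {
    precision: u8,
    scale: i8,
    values: Vec<Option<i128>>,
}

impl DecimalGroupColumn {
    fn new(precision: u8, scale: i8) -> Self {
        Self { precision, scale, values: Vec::new() }
    }

    /// Compare a stored group value with an incoming value.
    /// Only the raw integers (and nullness) matter here.
    fn equal_to(&self, group_idx: usize, incoming: Option<i128>) -> bool {
        self.values[group_idx] == incoming
    }

    /// Store a new group value and return its group index.
    fn append_val(&mut self, v: Option<i128>) -> usize {
        self.values.push(v);
        self.values.len() - 1
    }

    /// Emit the stored values together with the decimal parameters,
    /// i.e. the type is only "adjusted" at the very end.
    fn build(self) -> (u8, i8, Vec<Option<i128>>) {
        (self.precision, self.scale, self.values)
    }
}

/// Assign a group index to every input value.
/// Linear search for brevity; a real implementation would hash.
fn intern(col: &mut DecimalGroupColumn, input: &[Option<i128>]) -> Vec<usize> {
    input
        .iter()
        .map(|&v| {
            (0..col.values.len())
                .find(|&g| col.equal_to(g, v))
                .unwrap_or_else(|| col.append_val(v))
        })
        .collect()
}
```

For a decimal(10,3) column, 10.0 and 21.2 would be stored as the raw integers 10000 and 21200, so interning `[10000, 21200, 10000]` produces two groups.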

jonathanc-n · 2024-11-22T19:59:12Z

Yep, thanks!

jayzhan211 · 2024-11-23T01:53:55Z


It seems adjust_output_array clones the array. Would it be better to use with_data_type to avoid the clone?

datafusion/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs

Line 211 in c0ca4b4

    Ok(vec![Arc::new(array.with_data_type(self.data_type.clone()))])

alamb · 2024-11-23T11:28:13Z

> It seems adjust_output_array clones the array. Would it be better to use with_data_type to avoid the clone?

I think they are roughly equivalent - if we can avoid cloning the array that certainly seems better
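As a rough illustration of the trade-off being discussed, here is a toy model of "re-tag the type by consuming the array" versus "copy the values into a new array". `ToyType`, `ToyArray`, and `cast_by_copy` are hypothetical and say nothing about how arrow arrays or adjust_output_array are actually implemented. The sketch only assumes that a with_data_type-style method takes the array by value, so the existing buffer can be moved rather than duplicated.

```rust
/// Toy type tag (hypothetical, not arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int128,
    Decimal128 { precision: u8, scale: i8 },
}

/// Toy array: a type tag plus a buffer of raw values.
#[derive(Debug, Clone)]
struct ToyArray {
    data_type: ToyType,
    values: Vec<i128>,
}

impl ToyArray {
    /// Consumes `self` and changes only the type tag.
    /// The `Vec` is moved, so no values are copied.
    fn with_data_type(self, data_type: ToyType) -> Self {
        Self { data_type, ..self }
    }

    /// Copying alternative: borrows `self` and duplicates the buffer.
    fn cast_by_copy(&self, data_type: ToyType) -> Self {
        Self { data_type, values: self.values.clone() }
    }
}
```

Comparing `values.as_ptr()` before and after each call shows the difference: the moved version keeps the same allocation, while the copied version points at a new one.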

This was referenced Nov 20, 2024:

[EPIC] Improvements to GroupColumn multi-column aggregation performance #12680 (Open)

Minor: Add debug log message for creating GroupValuesRows #13506 (Merged)

alamb mentioned this issue Nov 20, 2024:

feat: Support faster multi-column grouping ( GroupColumn) for Date/Time/Timestamp types #13457 (Merged)

github-actions bot assigned jonathanc-n Nov 20, 2024

jonathanc-n linked a pull request Nov 26, 2024 that will close this issue:

feat: Add GroupColumn Decimal128Array #13564 (Open)
