
Materialize Dictionaries in Group Keys #7647

Open
tustvold opened this issue Sep 25, 2023 · 12 comments · Fixed by #8291
Labels
enhancement New feature or request

Comments

@tustvold
Contributor

Is your feature request related to a problem or challenge?

Currently, grouping on a dictionary column will return dictionary-encoded group keys. Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to yield significant returns. Additionally, following #7587, computing the dictionary is a non-trivial operation that could be eliminated.

Describe the solution you'd like

When grouping on a dictionary column, e.g. Dictionary(DataType::Int32, DataType::Utf8), the returned schema should be the underlying value type, i.e. DataType::Utf8.
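For illustration, a minimal sketch of the proposed schema mapping; the helper function is hypothetical, invented for this example, not DataFusion's actual code:

```rust
use arrow_schema::DataType;

/// Hypothetical helper: the type a group key column would take under this
/// proposal, materializing dictionaries to their underlying value type.
fn group_key_output_type(input: &DataType) -> DataType {
    match input {
        DataType::Dictionary(_, value_type) => value_type.as_ref().clone(),
        other => other.clone(),
    }
}

fn main() {
    let input = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert_eq!(group_key_output_type(&input), DataType::Utf8);
}
```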

Describe alternatives you've considered

No response

Additional context

No response

@qrilka
Contributor

qrilka commented Oct 7, 2023

Hi @tustvold
I wanted to look into this ticket; I see the TODO you left in #7587. Could you suggest an easy way to simulate this situation, e.g. in datafusion-cli, or point me to a similar test in the code base?

@alamb
Contributor

alamb commented Oct 7, 2023

You can create dictionary-encoded columns using arrow_cast

Here is an example

Here is a better example (using string dictionaries):
https://github.com/apache/arrow-datafusion/blob/e23d34bae60bb2f9c496241e218bab795af3af83/datafusion/sqllogictest/test_files/topk.slt#L223-L232
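For a self-contained reproduction, here is a sketch using the DataFusion Rust API; the table contents are made up for the example, and it assumes a DataFusion version with the arrow_cast SQL function:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // arrow_cast builds a dictionary-encoded string column to group on
    let df = ctx
        .sql(
            "SELECT d, count(*) \
             FROM (SELECT arrow_cast(column1, 'Dictionary(Int32, Utf8)') AS d \
                   FROM (VALUES ('foo'), ('foo'), ('bar')) AS v) AS s \
             GROUP BY d",
        )
        .await?;
    // The group key's type in the output schema shows whether it is
    // still dictionary-encoded after aggregation
    println!("{}", df.schema());
    df.show().await?;
    Ok(())
}
```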

@qrilka
Contributor

qrilka commented Oct 14, 2023

@tustvold I've spent some time getting familiar with the code for aggregations and I have a couple of questions. Maybe you could help me here?
I see that for aggregates, the initial plan has, in the base case, 2 nested AggregateExecs: Partial and Final(Partitioned). And every AggregateExec uses RowConverter to convert arrays into rows for its input, and back from rows into arrays for its output.
I assume that these conversions are basically inevitable, because RecordBatches are sent between execution stages and aggregation internally operates on rows, not arrays. Is my understanding correct here?
It looks to me like it makes sense to convert from Dictionary to Utf8 in the first AggregateExec, but there doesn't seem to be a mode like AggregateInitial. What could be the proper way out here? Maybe a conversion step could be added before aggregation if any of the fields has a dictionary type? What would you advise?
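For reference, a sketch of the round trip being described, using the arrow_row API directly; this is a standalone illustration, not DataFusion's actual aggregation code:

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, DictionaryArray, Int32Array, StringArray};
use arrow_row::{RowConverter, SortField};
use arrow_schema::ArrowError;

fn main() -> Result<(), ArrowError> {
    // A dictionary-encoded group key column: ["foo", "bar", "foo"]
    let keys = Int32Array::from(vec![0, 1, 0]);
    let values = StringArray::from(vec!["foo", "bar"]);
    let col: ArrayRef = Arc::new(DictionaryArray::new(keys, Arc::new(values)));

    let converter = RowConverter::new(vec![SortField::new(col.data_type().clone())])?;

    // Arrays -> rows: the dictionary values are expanded into the row format
    let rows = converter.convert_columns(&[col])?;

    // Rows -> arrays: converting back to the dictionary type is the
    // non-trivial step this issue wants to avoid
    let arrays = converter.convert_rows(rows.iter())?;
    println!("{:?}", arrays[0].data_type());
    Ok(())
}
```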

@qrilka
Contributor

qrilka commented Nov 15, 2023

@alamb maybe you could help me here?

@tustvold
Contributor Author

tustvold commented Nov 15, 2023

You should not need to do any conversion; the conversion is already being done by RowConverter. Currently there is a cast annotated with a TODO linking back to this ticket. It should just be a case of removing this cast and updating the schema logic within the plan to account for this.

@qrilka
Contributor

qrilka commented Nov 15, 2023

@tustvold I saw the conversion in RowConverter, but as I understand it, it happens multiple times: i.e. we convert the dictionary into Utf8 in AggregatePartial, then cast the results back to dictionary, and then we do the same in AggregateFinal.

Please let me know if I get your idea right: in the case of AggregatePartial followed by AggregateFinal, the schema for the first one should have Utf8 as its output and the second aggregate will then use Utf8 both as its input and as output.

qrilka added a commit to qrilka/arrow-datafusion that referenced this issue Nov 21, 2023
Given that group keys inherently have few repeated values, especially
when grouping on a single column, the use of dictionary encoding is
unlikely to be yielding significant returns
qrilka added a commit to qrilka/arrow-datafusion that referenced this issue Dec 1, 2023
alamb pushed a commit that referenced this issue Dec 1, 2023
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this issue Dec 14, 2023
@alamb
Contributor

alamb commented Jan 5, 2024

TLDR: I recommend we revert this change and reopen this ticket while we reconsider how to handle this case better.

Background

This ticket caused a functional regression for us downstream in #8738 (a query that used to run started erroring).

The cause of the issue is that the LogicalPlan and ExecutionPlan schemas no longer match. I started exploring making them match in #8766.

However, while working on that issue, I thought more about this change, and I am not sure the change described in this issue is good for multi-column groupings at all.

Rationale

Dictionary encoding on single-column group keys is bad because there is no repetition in the data, and therefore the dictionary encoding is pure overhead. For example, given

SELECT ... GROUP BY a                                                                

Dictionary Encoding is worse than native encoding:

┌──────┐                   ┌──────┐      ┌────┐     
│ foo  │                   │ foo  │      │ 0  │     
├──────┤                   ├──────┤      ├────┤     
│ bar  │                   │ bar  │      │ 1  │     
├──────┤                   ├──────┤      ├────┤     
│ baz  │  ────────▶        │ baz  │      │ 2  │     
├──────┤                   ├──────┤      ├────┤     
│  ff  │                   │  ff  │      │ 3  │     
├──────┤                   ├──────┤      ├────┤     
│  ..  │                   │  ..  │      │ .. │     
├──────┤                   ├──────┤      ├────┤     
│ aaz  │                   │ aaz  │      │9999│     
└──────┘                   └──────┘      └────┘     
                                                    
 Group Values            values array   keys array  
   (distinct                                        
 values of a)                                       
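To make the overhead concrete, here is an illustrative sketch (arrays invented for the example) comparing the same distinct values with and without dictionary encoding:

```rust
use std::sync::Arc;
use arrow_array::{Array, DictionaryArray, Int32Array, StringArray};

fn main() {
    // The distinct group values, stored natively
    let plain = StringArray::from(vec!["foo", "bar", "baz", "ff"]);

    // The same values dictionary encoded: with no repetition, the keys
    // array is pure overhead on top of the values array
    let keys = Int32Array::from(vec![0, 1, 2, 3]);
    let values = StringArray::from(vec!["foo", "bar", "baz", "ff"]);
    let dict = DictionaryArray::new(keys, Arc::new(values));

    println!("native:     {} bytes", plain.get_array_memory_size());
    println!("dictionary: {} bytes", dict.get_array_memory_size());
}
```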

However, the story is different when there is a multi-column group key: in that case, dictionary encoding each column can be a significant performance improvement, as the encoding is applied to each column individually and each column may have substantial redundancy. For example, given this query

SELECT ... GROUP BY a,b
┌──────┐  ┌──────┐               ┌──────┐      ┌────┐         ┌──────┐      ┌────┐      
│ foo  │  │ tag1 │               │ foo  │      │ 0  │         │ tag1 │      │ 0  │      
├──────┤  ├──────┤               ├──────┤      ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag2 │               │ baz  │      │ 0  │         │ tag2 │      │ 1  │      
├──────┤  ├──────┤               └──────┘      ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag3 │ ────────▶                   │ 0  │         │ tag3 │      │ 2  │      
├──────┤  ├──────┤                             ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag4 │                             │ 0  │         │ tag4 │      │ 3  │      
├──────┤  ├──────┤                             ├────┤         └──────┘      ├────┤      
│  ..  │  │  ..  │                             │ .. │                       │ .. │      
├──────┤  ├──────┤                             ├────┤                       ├────┤      
│ baz  │  │ tag4 │                             │ 1  │                       │ 3  │      
└──────┘  └──────┘                             └────┘                       └────┘      
                                                                                        
       Group Values            values array   keys array    values array   keys array   
    (distinct values of             (a)           (a)            (b)           (b)      
           a,b)                                                                         
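By contrast, a companion sketch (again invented for illustration) showing why a low-cardinality column like b benefits: however many rows there are, only the distinct strings are stored, and the per-row cost is just a small integer key:

```rust
use arrow_array::{Array, DictionaryArray, Int32Array, StringArray};
use std::sync::Arc;

fn main() {
    // A low-cardinality, heavily repeated column like "b" in the diagram
    let n: i32 = 1_000_000;
    let keys = Int32Array::from_iter_values((0..n).map(|i| i % 4));
    let values = StringArray::from(vec!["tag1", "tag2", "tag3", "tag4"]);
    let dict = DictionaryArray::new(keys, Arc::new(values));

    // A million rows, but only four distinct strings are stored
    println!("rows: {}, distinct values: {}", dict.len(), dict.values().len());
}
```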

This could especially impact multi-phase grouping where dictionary encoding will save significant time hashing values for low cardinality string columns.

In fact, we think we may have seen a performance regression when picking up this change downstream as well, which could also be explained by the observation above.

Thus I recommend we revert this change via #8740 while we reconsider how to handle this case (maybe apply it just for single-column group by? Maybe do the dictionary encoding within the RowEncoder to avoid generating many redundant strings?).

@tustvold
Contributor Author

tustvold commented Jan 5, 2024

This could especially impact multi-phase grouping where dictionary encoding will save significant time hashing values for low cardinality string columns.

If anything, this is precisely the case where the old behaviour was especially inefficient. It will compute the rows, expanding the dictionary; perform the grouping; convert the rows back to strings; convert the strings back to dictionaries; then convert back to strings in the row format, again expanding the dictionaries; and repeat for each stage.

@alamb
Contributor

alamb commented Jan 5, 2024

I want to be clear that I have no particular evidence one way or the other about the performance implications of this particular change (and I probably confused the issue with speculation in #7647 (comment))

So what I would like to do is:

  1. Revert the change
  2. Reopen this ticket
  3. Work on getting a PR ready to put the change back in that both 1) doesn't cause a functional regression and 2) demonstrably improves performance

@qrilka
Contributor

qrilka commented Jan 6, 2024

@alamb what could be some evidence for 3.2? Is there anything in the code base, or maybe in some other ticket? The plan you show makes total sense.

@alamb
Contributor

alamb commented Jan 8, 2024

Thanks @qrilka

@alamb what could be some evidence for 3.2?

I think we need to create a benchmark that runs aggregation queries on dictionary-encoded string columns (I know the existing end-to-end TPCH and ClickBench benchmarks do not do this). I will file a ticket about this shortly.

@alamb
Contributor

alamb commented Jan 8, 2024

I filed #8791 to track adding additional test coverage, and I plan to work on that in the next few days.
