Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

physical-plan: Cast nested group values back to dictionary if necessary #12586

Merged
merged 1 commit into from
Sep 24, 2024

Conversation

brancz
Copy link
Contributor

@brancz brancz commented Sep 23, 2024

Which issue does this PR close?

Closes #12542

Rationale for this change

Non-nested arrays were already being dictionary encoded again if the input was dictionary encoded, so this just replicates that behavior for structs and lists.

What changes are included in this PR?

  1. Check whether there are any nested dictionaries to encode.
  2. Encode arrays that have nested dictionaries appropriately.

Note this may not be conclusive, I've only included lists and structs since those are the cases/types that I need and that I'm most familiar with. I think it'd be nice to get these changes in though and handle any further cases in separate PRs.

Are these changes tested?

Yes, added a unit test that tests precisely the scenario that led me to opening the issue.

Are there any user-facing changes?

No, only a bug fix. If the queries didn't panic before it could have been a change in behavior, but since all queries of this sort panicked before it's purely a bug fix.

@alamb @tustvold @andygrove

@github-actions github-actions bot added the physical-expr Physical Expressions label Sep 23, 2024
@brancz brancz force-pushed the nested-dict-materialize branch 2 times, most recently from c1a9dae to 4adf217 Compare September 23, 2024 11:25
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @brancz -- I have a suggestion for how to simplify the code. Let me know if it makes sense

Comment on lines 235 to 238
if expected.is_nested() && needs_nested_dictionary_encoding(expected, array)?
{
*array = dictionary_encode_nested(array.clone(), expected)?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me like dictionary_encode_nested would work on non nested types as well, and thus you could simply always call it (and avoid the special case for Dict above too).

That which would make this code simpler and easier to follow. You can probably avoid the clone if you wanted (not a huge deal give it is an Arc):

Suggested change
if expected.is_nested() && needs_nested_dictionary_encoding(expected, array)?
{
*array = dictionary_encode_nested(array.clone(), expected)?;
}
*array = dictionary_encode_nested(&array, expected)?;

Copy link
Contributor Author

@brancz brancz Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough! I was optimizing for not attempting to dictionary encode a bit prematurely. Changed it!

@brancz brancz force-pushed the nested-dict-materialize branch 3 times, most recently from 4888aac to d0e908a Compare September 24, 2024 15:10
@brancz
Copy link
Contributor Author

brancz commented Sep 24, 2024

Fixed all lints. Can you re-run tests?

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

BTW the code looks 👨‍🍳 to me -- once the CI passes I think this will be good to go

Also, once we have merged this CI runs on your future PR will run automaticallly

@brancz
Copy link
Contributor Author

brancz commented Sep 24, 2024

Also, once we have merged this CI runs on your future PR will run automatically

Awesome! 🥳

We have more things on our list that we'll work off over the next weeks/months, so that'll come in handy!

@brancz
Copy link
Contributor Author

brancz commented Sep 24, 2024

Huh weird, that last lint didn't fail for me locally, fixed it.

@brancz
Copy link
Contributor Author

brancz commented Sep 24, 2024

I'm really sorry, I think now it should truly be the last time that you need to re-run. I fixed it exactly the way clippy suggested.

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

Another thing that would make it slightly easier to review these PRs is if you pushed new commits rather than force pushing

Screenshot 2024-09-24 at 2 51 47 PM

All PRs get squash into a single commit on merge, so the commit history isn't preserved but makes it easy to see what has changed since the last revie

@brancz
Copy link
Contributor Author

brancz commented Sep 24, 2024

Got it!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 thank you very much @brancz 🙏

here is hoping that the next few PRs will go more smoothly!

@alamb alamb merged commit 7ccc5fb into apache:main Sep 24, 2024
24 checks passed
@brancz brancz deleted the nested-dict-materialize branch September 25, 2024 08:12
bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
@brancz brancz mentioned this pull request Oct 9, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic on aggregations on struct of dictionaries
2 participants