Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RLE and dictionary filter only enabled for UTF8 since format version 17. #3868

Merged
merged 1 commit into from
Feb 8, 2023

Conversation

KiterLuc
Copy link
Contributor

@KiterLuc KiterLuc commented Feb 7, 2023

This fixes an issue in the filter pipeline where we should only skip offsets unfiltering for RLE/dictionary filters for UTF8 strings starting at version 17.


TYPE: IMPROVEMENT
DESC: RLE and dictionary filter only enabled for UTF8 since format version 17.

@KiterLuc KiterLuc requested review from Shelnutt2 and ypatia February 7, 2023 22:43
This fixes an issue in the filter pipeline where we should only skip offsets unfiltering for RLE/dictionary filters for UTF8 strings starting at version 17.

---
TYPE: IMPROVEMENT
DESC: RLE and dictionary filter only enabled for UTF8 since format version 17.
@KiterLuc KiterLuc force-pushed the lr/rle-dict-filter-utf8-v17 branch from 340113a to 58d37ed Compare February 7, 2023 22:44
@KiterLuc KiterLuc changed the title RLE and dictionary filter only enabled since format version 17. RLE and dictionary filter only enabled for UTF8 since format version 17. Feb 7, 2023
Copy link
Member

@ypatia ypatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @KiterLuc !
Added a backport label for 2.14 but I wanted to check with @ihnorton : are we planning some maintenance release for 2.13 too? The reader_base.cc part of this bugfix is a zero day bug for RLE/dict when we have nullable string attributes, so it would need to be backported to every maintenance release we are planning to issue.

@Shelnutt2
Copy link
Member

Thanks for the fix @KiterLuc ! Added a backport label for 2.14 but I wanted to check with @ihnorton : are we planning some maintenance release for 2.13 too? The reader_base.cc part of this bugfix is a zero day bug for RLE/dict when we have nullable string attributes, so it would need to be backported to every maintenance release we are planning to issue.

We only enabled this for 2.14 right? Is 2.13 effected?

A further question I have is, we didn't allow creation of utf-8 attributes with RLE or Dictionary before format 17, so while the safety check is needed, its not possible for someone to have an array where this would cause a problem?

@ypatia
Copy link
Member

ypatia commented Feb 8, 2023

Thanks for the fix @KiterLuc ! Added a backport label for 2.14 but I wanted to check with @ihnorton : are we planning some maintenance release for 2.13 too? The reader_base.cc part of this bugfix is a zero day bug for RLE/dict when we have nullable string attributes, so it would need to be backported to every maintenance release we are planning to issue.

We only enabled this for 2.14 right? Is 2.13 effected?

A further question I have is, we didn't allow creation of utf-8 attributes with RLE or Dictionary before format 17, so while the safety check is needed, its not possible for someone to have an array where this would cause a problem?

first question:
The UTF8 part is for 2.14 only, yes. However, those are 2 separate bugfixes Luc is packing in this PR. The "nullable" part in reader_base.cc applies to STRING_ASCII as well and AFAIU it was there since day 0. (@KiterLuc a regression UT would help confirm/show that btw.). Also note that this is a reader bug, so what is written is with previous versions should be fine - no on-disk format affected.

second question:
I believe there is no problem for someone using UTF8 just from the wrong versions (from the first issue yes though, if he is using nullables). AFAIK Luc found the problem because of some backward_compatibility test failing. @KiterLuc can you confirm though ?

@KiterLuc
Copy link
Contributor Author

KiterLuc commented Feb 8, 2023

Thanks for the fix @KiterLuc ! Added a backport label for 2.14 but I wanted to check with @ihnorton : are we planning some maintenance release for 2.13 too? The reader_base.cc part of this bugfix is a zero day bug for RLE/dict when we have nullable string attributes, so it would need to be backported to every maintenance release we are planning to issue.

We only enabled this for 2.14 right? Is 2.13 effected?
A further question I have is, we didn't allow creation of utf-8 attributes with RLE or Dictionary before format 17, so while the safety check is needed, its not possible for someone to have an array where this would cause a problem?

first question: The UTF8 part is for 2.14 only, yes. However, those are 2 separate bugfixes Luc is packing in this PR. The "nullable" part in reader_base.cc applies to STRING_ASCII as well and AFAIU it was there since day 0. (@KiterLuc a regression UT would help confirm/show that btw.). Also note that this is a reader bug, so what is written is with previous versions should be fine - no on-disk format affected.

second question: I believe there is no problem for someone using UTF8 just from the wrong versions (from the first issue yes though, if he is using nullables). AFAIK Luc found the problem because of some backward_compatibility test failing. @KiterLuc can you confirm though ?

We have no choice to make those two changes at once. Otherwise the back compat tests will not pass. This was actually found by the back compat tests, which include a RLE UTF8 nullable attribute. The back compat tests would actually not pass today if we were to generate back compat arrays for format 17/18, that's why I didn't add a UT for it as we technically have one already.

I have also seen that the back compat arrays previous to format 17 don't RLE encode UTF8 so we are good there.

@KiterLuc KiterLuc merged commit 4aebf1c into dev Feb 8, 2023
@KiterLuc KiterLuc deleted the lr/rle-dict-filter-utf8-v17 branch February 8, 2023 18:18
github-actions bot pushed a commit that referenced this pull request Feb 8, 2023
…17. (#3868)

This fixes an issue in the filter pipeline where we should only skip offsets unfiltering for RLE/dictionary filters for UTF8 strings starting at version 17.

---
TYPE: IMPROVEMENT
DESC: RLE and dictionary filter only enabled for UTF8 since format version 17.
ihnorton pushed a commit that referenced this pull request Feb 8, 2023
…17. (#3868) (#3871)

This fixes an issue in the filter pipeline where we should only skip offsets unfiltering for RLE/dictionary filters for UTF8 strings starting at version 17.

---
TYPE: IMPROVEMENT
DESC: RLE and dictionary filter only enabled for UTF8 since format version 17.

Co-authored-by: KiterLuc <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants