Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential performance improvements for reading Parquet to StringViewArray/BinaryViewArray #5904

Closed
Tracked by #5374
alamb opened this issue Jun 17, 2024 · 5 comments · Fixed by #6004
Closed
Tracked by #5374
Assignees
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Jun 17, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In https://github.com/apache/arrow-rs/issues/ @ariesdevil @XiangpengHao and I implemented pretty fast reading of data in Parquet to Arrow StringViewArray

The solution we have so far is #5877 which doesn't copy the string data 🎉 , but does track a set of offsets which are then converted into StringViewArray

@ariesdevil had a more comprehensive approach in #5557 that built the StringViews directly from the encoded data but hadn't yet removed the string copies

Describe the solution you'd like
It may be worth looking at the StringViewDecoding to see if there is more performance to be had.

Specifically we can se the arrow_array_reader/StringViewArray and related benchmarks to profile and identify any additional potential improvements

Describe alternatives you've considered
It may be good enough now

Additional context

@alexwilcoxson-rel
Copy link
Contributor

Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

@XiangpengHao
Copy link
Contributor

XiangpengHao commented Jun 17, 2024

Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

The current gc function won't deduplicating strings, it only use GenericByteViewBuilder to create a new instance of the array.
I think it would be a great addition to implement the deduplicating logic. A straightforward approach is to use a hash table to track the location of the strings while building the GenericByteView.
It is not on my top priority list, but might give it a try when I have time.

@alamb
Copy link
Contributor Author

alamb commented Jun 18, 2024

Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

I filed #5910 to track discussing this option

@XiangpengHao
Copy link
Contributor

take

@alamb
Copy link
Contributor Author

alamb commented Jul 24, 2024

label_issue.py automatically added labels {'arrow'} from #5970

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
3 participants