Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[parquet]: feature gate functionality in parquet. #4764

Closed
ritchie46 opened this issue Sep 3, 2023 · 5 comments
Closed

[parquet]: feature gate functionality in parquet. #4764

ritchie46 opened this issue Sep 3, 2023 · 5 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@ritchie46
Copy link
Contributor

I want to try partially moving/doubling IO to arrow-rs and start migration incrementally.

Describe the solution you'd like
Start feature gating some functionality in parquet that reduces compile time.

@ritchie46 ritchie46 added the enhancement Any new improvement worthy of a entry in the changelog label Sep 3, 2023
@tustvold
Copy link
Contributor

tustvold commented Sep 3, 2023

Really cool to see this happening. That being said, my experience with feature flags has been that the combinatorial explosion becomes very hard to test. It also has a tendency to result in poor test iteration times as different crates set different sets of features forcing recompilation.

I have instead found splitting up into sub-crates to be a better approach to this. However, the current granularity is already pretty small, is there a particular crate that is showing up as a bottleneck during compilation?

@ritchie46
Copy link
Contributor Author

Really cool to see this happening.

Gotta start somewhere. :)

It also has a tendency to result in poor test iteration times as different crates set different sets of features forcing recompilation.

This should not hurt your test times as I will ensure the default features remain the same. This will only make it possible for end-users as polars to cherry pick some stuff. For testing we use cargo hack to test every feature gate (not all combinations). But I don't think it's needed here as I don't plan to mess with the defaults.

is there a particular crate that is showing up as a bottleneck during compilation?

The compile times in polars are already way longer than we want. That's why I'd like to cherry pick on functionality. For the parquet crate I was looking add feature gating the IPC readers and take kernels out. Again this should not have any impact on the arrow-rs defaults/ repo and consumers.

@tustvold
Copy link
Contributor

tustvold commented Sep 3, 2023

feature gating the IPC readers

The IPC readers are necessary to read the embedded arrow schema data, I'm also surprised if they are a meaningful bottleneck in compile times.

Yes arrow thrift encodes a base64-encoded flatbuffer 🤯

and take kernels out

The parquet crate doesn't make use of the take kernel, it does make use of the cast kernels though which indirectly use the cast kernel. This would be non-trivial to remove, especially for decimals

This should not hurt your test times as I will ensure the default features remain the same

Yes until someone uses disabled default features 😅 I dunno, I have found features to not really justify their downsides.

But I don't think it's needed here as I don't plan to mess with the defaults.

We still need to ensure compilation with non-default sets of feature flags? We currently already have a fairly long list of combinations in CI...

The compile times in polars are already way longer than we want. That's why I'd like to cherry pick on functionality

Perhaps we might do this empirically? If there are kernels that represent an outsize impact on compile times we then work out mitigation strategies for them? I strongly suspect that there are couple of particularly problematic sub-kernels (likely dictionaries) that have an outsize impact, and which it might be able to address this without having to resort to feature flags? I'm more than happy to help out with this. FWIW I have some ideas on how to make the take kernel less expensive. (Edit: nvm I already implemented this as #4705 I just forgot)

@ritchie46
Copy link
Contributor Author

I did some --timings --release builds on #4765

master feature perc
58.54 42.12 73%
42.6 33.8 79%
59.9 38.6 56%

There are some significant savings with some trivial features. Indeed as you said arrow-select and arrow-cast are the big boys. But I expect that by feature gating some code out we send much less code to the linker.

I understand your hesitation about feature gating, but I'd argue not all are the same. Some feature gates very naturally fit the code order and can gate entire modules. It can save a lot of cumulative compile times for downstream users. :) For the planet! 😉

@tustvold
Copy link
Contributor

tustvold commented Sep 4, 2023

So long as we're being data driven and the returns are significant, I can probably grin and bear it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants