-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster parquet DictEncoder (~20%) #2123
Conversation
Running benchmarks with just the change to ahash show no significant performance change. This is not entirely surprising as the current implementation uses crc32 which is very cheap to compute (although not DOS resistant). The change to hashbrown nets a non-trivial return where value encoding is the major bottleneck, this diminishes as additional overheads from nulls, lists, etc... take effect.
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2123 +/- ##
==========================================
- Coverage 83.71% 82.51% -1.20%
==========================================
Files 225 240 +15
Lines 59567 62234 +2667
==========================================
+ Hits 49865 51355 +1490
- Misses 9702 10879 +1177 ☔ View full report in Codecov by Sentry. |
@@ -49,6 +50,7 @@ serde_json = { version = "1.0", default-features = false, features = ["std"], op | |||
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] } | |||
futures = { version = "0.3", default-features = false, features = ["std"], optional = true } | |||
tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "fs", "rt", "io-util"] } | |||
hashbrown = { version = "0.12", default-features = false } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a feature 'inline-more" which is enabled by default in hashbrown which gives sometimes a bit better performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By disabling this here, we can delegate that decision downstream
|
||
impl<T: DataType> Encoder<T> for DictEncoder<T> { | ||
fn put(&mut self, values: &[T::T]) -> Result<()> { | ||
for i in values { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's a bottleneck, it might be faster to compute the hashes for values
in one go (i.e. vectorized)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me but I am concerned about the new dependencies as I believe some people use parquet after compiling to WASM or on embedded devices.
I am curious what other maintainers think too
@@ -30,6 +30,7 @@ edition = "2021" | |||
rust-version = "1.57" | |||
|
|||
[dependencies] | |||
ahash = "0.7" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem to be new dependencies (if optional features are not enabled)
|
||
state: ahash::RandomState, | ||
|
||
/// Used to provide a lookup from value to unique value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the replication of this pattern (maybe now in three places?) perhaps we can factor it into its own structure, mostly for readability as the use of HashMap
to implement a HashSet
takes some thought to totally grok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider this, but I was unsure where to put it. It can't live in arrow, as parquet needs to compile without arrow, but aside from creating a new crate I wasn't really sure where to put it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to get this in as I need it for #1764, we have time until the next release to address any issues. |
Benchmark runs are scheduled for baseline = 985760f and contender = 6ce4c4e. 6ce4c4e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
use std::hash::Hash; | ||
|
||
/// Storage trait for [`Interner`] | ||
pub trait Storage { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Part of #1764
Rationale for this change
The existing implementation is complex, and slower
What changes are included in this PR?
Gives the encoder the same treatment as #1861, switching to using ahash and hashbrown.
Are there any user-facing changes?
No