-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change default Parquet writer settings to match arrow-rs (except for compression & statistics) #11558
Conversation
… without the parquet feature enabled
datafusion/common/src/config.rs
Outdated
#[allow(dead_code)] | ||
/// Default value for [`parquet::WriterProperties::statistics_enabled`] | ||
pub const DEFAULT_STATISTICS_ENABLED: Option<&str> = Some("page"); | ||
#[allow(dead_code)] | ||
/// Default value for [`parquet::BloomFilterProperties::fpp`] | ||
pub const DEFAULT_BLOOM_FILTER_FPP: Option<f64> = | ||
Some(props::DEFAULT_BLOOM_FILTER_FPP); | ||
#[allow(dead_code)] | ||
/// Default value for [`parquet::BloomFilterProperties::ndv`] | ||
pub const DEFAULT_BLOOM_FILTER_NDV: Option<u64> = | ||
Some(props::DEFAULT_BLOOM_FILTER_NDV); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These will be addressed in the next PR.
datafusion/common/src/config.rs
Outdated
#[allow(dead_code)] | ||
/// Default value for [`parquet::WriterProperties::compression`] | ||
pub const DEFAULT_COMPRESSION: Compression = Compression::UNCOMPRESSED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow-rs parquet's default is uncompressed. Whereas the datafusion default is zstd(3).
I think it makes sense for datafusion's default to be compressed, which argues to change the other default instead. But I'm unclear as to why the parquet (arrow-rs) default is uncompressed. Is this leftover? Or is intentional for the ArrowWriterOptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this in #7692 and decided that DataFusion would have parquet compression by default. I am not sure if arrow-rs has different factors that make uncompressed by default preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a tension between raw read performance (uncompressed) and storage efficiency (compressed). @tustvold I think preferred arrow-rs to be on the read performance side of the equation
…further discussion
4828be1
to
d863f01
Compare
d863f01
to
38124ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld
I changed the title of this PR to more accurately reflect its contents / effects for others
In general I think DataFusion should be following the arrow writer's default values so I think this is a good step forward.
I left a suggestion that I think would dramatically simplify this PR and reduce its size, but I think we could also do it as a follow on PR
cc @devinjdangelo -- I don't recall any reason the defaults for the parquet writer in datafusion were meant to be different than the defaults in parquet-rs. Do you? |
The only default we changed intentionally after discussion was compression. I believe all others were intended to inherit the arrow-rs default. |
@wiedld could you make sure this rationale is documented somewhere (comments and the option's description?) |
3bd55de
to
3e85dc4
Compare
…rquet feature, instead rely upon regression testing
3e85dc4
to
d972416
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Confirm all other settings are equal. | ||
// First resolve the known discrepancies, (set as the same). | ||
// TODO: once we fix the above mis-matches, we should be able to remove this. | ||
let mut from_extern_parquet = | ||
session_config_from_writer_props(&default_writer_props); | ||
from_extern_parquet.global.compression = Some("zstd(3)".into()); | ||
from_extern_parquet.global.data_page_row_count_limit = usize::MAX; | ||
from_extern_parquet.global.column_index_truncate_length = None; | ||
from_extern_parquet.global.dictionary_enabled = None; | ||
from_extern_parquet.global.statistics_enabled = None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any rationale for not setting the statistics value to the same as in the arrow-rs writer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The statistics_enabled
and the bloom filter setting will require a bit more updates to do. In order to keep the diff small, (and more easily reviewable), I was going to do in a follow up PR. Is that ok @alamb ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SOunds good to me 👍 thank you
I restarted the failed pyarrow check as it seemed to be an infrastructure problem |
I'll plan to merge this tomorrow unless there are any other comments |
Thanks again @wiedld and @devinjdangelo |
Which issue does this PR close?
Part of #11367
Rationale for this change
Depending on how we construct our parquet writer settings, we are using a different set of default configurations as outlined here. This PR is the first step in reducing this misalignment.
What changes are included in this PR?
In order of commits:
Are these changes tested?
Yes.
Are there any user-facing changes?
No, except if anyone is relying upon this configuration misalignment in their code.