[SPARK-45481][SQL] Introduce a mapper for parquet compression codecs #43308
Conversation
ZSTD(CompressionCodecName.ZSTD);

// Some Parquet-supported compression codecs are not supported by Spark.
// LZ4_RAW(CompressionCodecName.LZ4_RAW)
IIRC, LZ4_RAW is already supported?
Thank you for the reminder. I will check it.
Can we retrigger tests?
The GA failure is unrelated.
 */
public enum ParquetCompressionCodecMapper {
  NONE(null),
  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),
Why make our own enum if there is already an enum-like list of codecs in parquet?
One reason is that Spark adds the compression codec none; another is staleness: before #43310, the Parquet-supported compression codecs and the Spark-supported ones were not in one-to-one correspondence.
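The shape of such a mapper can be sketched roughly as follows. This is a minimal, self-contained sketch, not Spark's actual code: the real enum wraps Parquet's `org.apache.parquet.hadoop.metadata.CompressionCodecName`, for which plain strings stand in here, and the name `ParquetCompressionCodecSketch` is made up for illustration.

```java
// A minimal sketch (not Spark's actual code): NONE is Spark's fake codec and
// maps to no Parquet codec; each other entry mirrors a Parquet codec name.
// Plain strings stand in for org.apache.parquet...CompressionCodecName.
enum ParquetCompressionCodecSketch {
    NONE(null),
    UNCOMPRESSED("UNCOMPRESSED"),
    SNAPPY("SNAPPY"),
    GZIP("GZIP"),
    LZO("LZO"),
    BROTLI("BROTLI"),
    LZ4("LZ4"),
    ZSTD("ZSTD");

    private final String parquetName;

    ParquetCompressionCodecSketch(String parquetName) {
        this.parquetName = parquetName;
    }

    // Parquet-side codec name, or null for Spark's NONE.
    String parquetName() {
        return parquetName;
    }
}
```

Keeping the mapping in one enum means the "fake" NONE codec is modeled explicitly instead of being special-cased wherever codec names are compared.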
Ok, looks fine
cc @viirya
Force-pushed from 8fe92b6 to 3745581.
+1, LGTM
checkCompressionCodec(CompressionCodecName.GZIP)
checkCompressionCodec(CompressionCodecName.SNAPPY)
checkCompressionCodec(CompressionCodecName.ZSTD)
checkCompressionCodec(ParquetCompressionCodec.UNCOMPRESSED)
Unrelated to this PR, but why were only four compression codecs tested here? Was the test case not updated when a new type was added?
I tested the other compression codecs, and the tests failed!
It seems the others are not supported yet.
ok
If I have time, I will try to cover these tests.
I got it now. lzo is supported by Cloudera Hadoop; Spark doesn't have it built-in.
|CREATE TABLE t(id int) USING hive
|OPTIONS(fileFormat '$fileFormat', compression '$compression')
|LOCATION '${path.toURI}'
Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)).foreach {
nit: Make Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)) a variable, and then use seq.foreach { case (fileFormat, compression) =>. Would the code below need to be reformatted?
To reduce the change here, let's use
Seq(
("orc", "ZLIB"),
("parquet", ParquetCompressionCodec.GZIP.name)).foreach { case (fileFormat, compression) =>
Force-pushed from 21b0527 to cdaa24c.
The GA failure is unrelated.
…ionCodec`

What changes were proposed in this pull request?
#43308 introduced a mapper for Parquet compression codecs. There are many places that call toLowerCase(Locale.ROOT) to get the lower-case name of a Parquet compression codec.

Why are the changes needed?
Add lowerCaseName for ParquetCompressionCodec.

Does this PR introduce any user-facing change?
'No'. New class.

How was this patch tested?
Existing test cases.

Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43571 from beliefer/SPARK-45481_followup.
Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
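The follow-up's idea can be sketched like this. Hedged: this is a stand-alone illustration, not the actual Spark code, and the enum name `CodecNameSketch` is made up; the point is only that the lower-case name is derived in one place on the enum.

```java
import java.util.Locale;

// Sketch of the follow-up's idea: compute the lower-case codec name in one
// place on the enum, instead of repeating toLowerCase(Locale.ROOT) at call sites.
enum CodecNameSketch {
    SNAPPY, GZIP, ZSTD;

    String lowerCaseName() {
        return name().toLowerCase(Locale.ROOT);
    }
}
```

Call sites can then write `CodecNameSketch.GZIP.lowerCaseName()` rather than scattering locale-sensitive string handling through the codebase.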
…rings copy from parquet|orc|avro compression codes

What changes were proposed in this pull request?
This PR follows up #43562, #43528 and #43308. The aim of this PR is to avoid copying magic strings from the parquet|orc|avro compression codecs. This PR also simplifies some test cases.

Why are the changes needed?
Avoid copying magic strings from the parquet|orc|avro compression codecs.

Does this PR introduce any user-facing change?
'No'.

How was this patch tested?
Existing test cases.

Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43604 from beliefer/parquet_orc_avro.
Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Currently, Spark supports all the Parquet compression codecs, but the Parquet-supported codecs and the Spark-supported ones are not in one-to-one correspondence, because Spark introduces a fake compression codec none.
On the other hand, there are a lot of magic strings copied from the Parquet compression codec names. Developers have to maintain their consistency by hand, which is error-prone and reduces development efficiency.
This PR introduces a mapper to Parquet's CompressionCodecName; refer to https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/CompressionCodecName.java

Why are the changes needed?
Let developers use Parquet compression codecs easily.
Does this PR introduce any user-facing change?
'No'.
Introduce a new class.
How was this patch tested?
Existing test cases.
Was this patch authored or co-authored using generative AI tooling?
'No'.