Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-45481][SQL] Introduce a mapper for parquet compression codecs #43308

Closed
wants to merge 2 commits into from

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Oct 10, 2023

What changes were proposed in this pull request?

Currently, Spark supported all the parquet compression codecs, but the parquet supported compression codecs and spark supported are not completely one-on-one due to Spark introduce a fake compression codecs none.
On the other hand, there are a lot of magic strings copy from parquet compression codecs. This issue lead to developers need to manually maintain its consistency. It is easy to make mistakes and reduce development efficiency.

The CompressionCodecName, refer: https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/CompressionCodecName.java

Why are the changes needed?

Let developers easy to use parquet compression codecs.

Does this PR introduce any user-facing change?

'No'.
Introduce a new class.

How was this patch tested?

Exists test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

ZSTD(CompressionCodecName.ZSTD);

// There are some parquet supported compression codec that doesn't supported by Spark.
// LZ4_RAW(CompressionCodecName.LZ4_RAW)
Copy link
Contributor

@LuciferYang LuciferYang Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#41507

IIRC, LZ4_RAW is already supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reminder. I will check it.

@srowen
Copy link
Member

srowen commented Oct 18, 2023

Can we retrigger tests?

@beliefer
Copy link
Contributor Author

ping @dongjoon-hyun @srowen @wangyum

@beliefer
Copy link
Contributor Author

The GA failure is unrelated.

*/
public enum ParquetCompressionCodecMapper {
NONE(null),
UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make our own enum if there is already an enum-like list of codecs in parquet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason is Spark add the compression codecs none, another is out-of-date. Before #43310, the parquet supported compression codecs and spark supported are not completely one-on-one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looks fine

@beliefer beliefer requested a review from srowen October 24, 2023 01:47
@beliefer
Copy link
Contributor Author

cc @dongjoon-hyun

@beliefer beliefer requested a review from LuciferYang October 25, 2023 02:25
@beliefer
Copy link
Contributor Author

cc @viirya

@beliefer beliefer force-pushed the SPARK-45481 branch 2 times, most recently from 8fe92b6 to 3745581 Compare October 25, 2023 11:35
Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

checkCompressionCodec(CompressionCodecName.GZIP)
checkCompressionCodec(CompressionCodecName.SNAPPY)
checkCompressionCodec(CompressionCodecName.ZSTD)
checkCompressionCodec(ParquetCompressionCodec.UNCOMPRESSED)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this pr, but why were only four types of Compression Codec tested here? Was the test case not modified when a new type was added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the other compression codec, the tests failed!
It seems not supported the others yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I have time, I will try to cover these tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it now. lzo is supported by cloudera Hadoop. Spark doesn't have it built-in.

|CREATE TABLE t(id int) USING hive
|OPTIONS(fileFormat '$fileFormat', compression '$compression')
|LOCATION '${path.toURI}'
Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)).foreach {
Copy link
Contributor

@LuciferYang LuciferYang Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Make Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)) a variable, and then use seq.foreach { case (fileFormat, compression) =>. Would the code below need to be reformatted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reduce the change here, let's use

Seq(
  ("orc", "ZLIB"),
  ("parquet", ParquetCompressionCodec.GZIP.name)).foreach { case (fileFormat, compression) =>

@beliefer beliefer force-pushed the SPARK-45481 branch 2 times, most recently from 21b0527 to cdaa24c Compare October 26, 2023 11:33
@beliefer
Copy link
Contributor Author

The GA failure is unrelated.
Merged to master
@srowen @LuciferYang Thank you!

dongjoon-hyun pushed a commit that referenced this pull request Oct 30, 2023
…ionCodec`

### What changes were proposed in this pull request?
#43308 introduces a mapper for parquet compression codecs.
There are many place call `toLowerCase(Locale.ROOT)` to get the lower case name of parquet compression codecs.

### Why are the changes needed?
Add `lowerCaseName` for `ParquetCompressionCodec`.

### Does this PR introduce _any_ user-facing change?
'No'.
New class.

### How was this patch tested?
Exists test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43571 from beliefer/SPARK-45481_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Oct 31, 2023
…rings copy from parquet|orc|avro compression codes

### What changes were proposed in this pull request?
This PR follows up #43562, #43528 and #43308.
The aim of this PR is to avoid magic strings copy from `parquet|orc|avro` compression codes.

This PR also simplify some test cases.

### Why are the changes needed?
Avoid magic strings copy from parquet|orc|avro compression codes

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Exists test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43604 from beliefer/parquet_orc_avro.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants