[SPARK-45481][SQL] Introduce a mapper for parquet compression codecs #43308

beliefer · 2023-10-10T09:23:09Z

What changes were proposed in this pull request?

Spark currently supports all of the Parquet compression codecs, but the codecs supported by Parquet and those supported by Spark do not map one-to-one, because Spark introduces a fake compression codec, `none`.
In addition, many magic strings are copied from the Parquet compression codec names. As a result, developers have to keep them consistent by hand, which is error-prone and slows development.

For Parquet's `CompressionCodecName`, see: https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/CompressionCodecName.java
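
The idea can be sketched as a small enum. This is a minimal illustration, not the class added by this PR: `ParquetCodecName` is a stand-in for Parquet's real `CompressionCodecName` so that the example compiles without the Parquet dependency, and the names `SparkParquetCodec` and `fromString`, as well as the choice to map `NONE` to `UNCOMPRESSED`, are assumptions made for the sketch.

```java
import java.util.Locale;

// Stand-in for Parquet's CompressionCodecName, so the sketch has no external dependency.
enum ParquetCodecName { UNCOMPRESSED, SNAPPY, GZIP, ZSTD }

// Hypothetical mapper: one constant per codec name accepted on the Spark side.
enum SparkParquetCodec {
  // "none" exists only on the Spark side; this sketch assumes it means "uncompressed".
  NONE(ParquetCodecName.UNCOMPRESSED),
  UNCOMPRESSED(ParquetCodecName.UNCOMPRESSED),
  SNAPPY(ParquetCodecName.SNAPPY),
  GZIP(ParquetCodecName.GZIP),
  ZSTD(ParquetCodecName.ZSTD);

  private final ParquetCodecName parquetCodec;

  SparkParquetCodec(ParquetCodecName parquetCodec) {
    this.parquetCodec = parquetCodec;
  }

  // The Parquet-side codec that this Spark-side name resolves to.
  ParquetCodecName parquetCodec() {
    return parquetCodec;
  }

  // Case-insensitive lookup; throws IllegalArgumentException for unknown names.
  static SparkParquetCodec fromString(String name) {
    return valueOf(name.toUpperCase(Locale.ROOT));
  }
}
```

With a type like this, call sites can refer to `SparkParquetCodec.GZIP` instead of repeating the string `"gzip"`, and a misspelled codec becomes a compile error rather than a runtime surprise.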

Why are the changes needed?

Make it easier for developers to use the Parquet compression codecs.

Does this PR introduce any user-facing change?

'No'.
This PR only introduces a new class.

How was this patch tested?

Existing test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

LuciferYang · 2023-10-10T11:40:42Z

...n/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java

+  ZSTD(CompressionCodecName.ZSTD);
+
+  // There are some parquet supported compression codec that doesn't supported by Spark.
+  // LZ4_RAW(CompressionCodecName.LZ4_RAW)


#41507

IIRC, LZ4_RAW is already supported?

Thank you for the reminder. I will check it.

srowen · 2023-10-18T13:49:41Z

Can we retrigger tests?

beliefer · 2023-10-20T10:31:16Z

ping @dongjoon-hyun @srowen @wangyum

beliefer · 2023-10-20T11:39:00Z

The GA failure is unrelated.

srowen · 2023-10-21T21:41:09Z

...n/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java

+ */
+public enum ParquetCompressionCodecMapper {
+  NONE(null),
+  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),


Why make our own enum if there is already an enum-like list of codecs in parquet?

One reason is that Spark adds the compression codec `none`; another is that the codec list was out of date. Before #43310, the codecs supported by Parquet and those supported by Spark did not map one-to-one.

Ok, looks fine

beliefer · 2023-10-24T12:08:12Z

cc @dongjoon-hyun

beliefer · 2023-10-25T09:28:30Z

cc @viirya

LuciferYang

+1, LGTM

LuciferYang · 2023-10-26T08:45:45Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala

-    checkCompressionCodec(CompressionCodecName.GZIP)
-    checkCompressionCodec(CompressionCodecName.SNAPPY)
-    checkCompressionCodec(CompressionCodecName.ZSTD)
+    checkCompressionCodec(ParquetCompressionCodec.UNCOMPRESSED)


Unrelated to this pr, but why were only four types of Compression Codec tested here? Was the test case not modified when a new type was added?

I tested the other compression codecs, and the tests failed!
It seems the others are not supported yet.

If I have time, I will try to cover these tests.

I get it now. LZO is supported by Cloudera Hadoop; Spark doesn't have it built in.
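
One way to keep such a test from silently covering only a handful of codecs is to let the codec type itself say which codecs can run without extra libraries, and have the test iterate over that list. The sketch below is hypothetical: `TestableCodec`, the `builtIn` flag, and `builtInCodecs()` are illustrative names, and the flags are assumptions, except that LZO is marked as not built in, following the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical codec enum that records whether a codec works without extra libraries.
enum TestableCodec {
  UNCOMPRESSED(true),
  SNAPPY(true),
  GZIP(true),
  ZSTD(true),
  // Per the discussion above, LZO needs support that is not built in.
  LZO(false);

  private final boolean builtIn;

  TestableCodec(boolean builtIn) {
    this.builtIn = builtIn;
  }

  // Codecs a test suite can exercise on a plain build, in declaration order.
  static List<TestableCodec> builtInCodecs() {
    List<TestableCodec> result = new ArrayList<>();
    for (TestableCodec codec : values()) {
      if (codec.builtIn) {
        result.add(codec);
      }
    }
    return result;
  }
}
```

A test that loops over `TestableCodec.builtInCodecs()` picks up a newly added codec automatically, while codecs that need an external library stay visibly excluded in one place.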

LuciferYang · 2023-10-26T08:54:16Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

-                |CREATE TABLE t(id int) USING hive
-                |OPTIONS(fileFormat '$fileFormat', compression '$compression')
-                |LOCATION '${path.toURI}'
+  Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)).foreach {


nit: Make Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)) a variable, and then use seq.foreach { case (fileFormat, compression) =>. Would the code below need to be reformatted?

To reduce the change here, let's use

Seq( ("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)).foreach { case (fileFormat, compression) =>

beliefer · 2023-10-27T02:50:45Z

The GA failure is unrelated.
Merged to master
@srowen @LuciferYang Thank you!

…ionCodec`

### What changes were proposed in this pull request?
#43308 introduces a mapper for Parquet compression codecs. Many places call `toLowerCase(Locale.ROOT)` to get the lower-case name of a Parquet compression codec.

### Why are the changes needed?
Add `lowerCaseName` for `ParquetCompressionCodec`.

### Does this PR introduce _any_ user-facing change?
'No'. New class.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43571 from beliefer/SPARK-45481_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
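
The `lowerCaseName` idea from this follow-up can be sketched as follows. This is an illustration under assumptions, not the actual Spark class: `CodecWithLowerCaseName` is a hypothetical enum, and only the accessor name `lowerCaseName` is taken from the commit message.

```java
import java.util.Locale;

// Hypothetical codec enum; only the lowerCaseName() idea comes from the commit message.
enum CodecWithLowerCaseName {
  UNCOMPRESSED, SNAPPY, GZIP, ZSTD;

  // Computed once per constant with Locale.ROOT, so the result does not
  // depend on the JVM's default locale.
  private final String lowerCaseName = name().toLowerCase(Locale.ROOT);

  String lowerCaseName() {
    return lowerCaseName;
  }
}
```

Centralizing the conversion removes the repeated `toLowerCase(Locale.ROOT)` calls and makes it harder to forget the locale argument. The locale matters: under a Turkish locale, `"GZIP".toLowerCase(...)` yields `gzıp` with a dotless `ı`, which would not match the expected codec name.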

…rings copy from parquet|orc|avro compression codes

### What changes were proposed in this pull request?
This PR follows up #43562, #43528 and #43308. The aim of this PR is to avoid magic strings copied from the `parquet|orc|avro` compression codes. This PR also simplifies some test cases.

### Why are the changes needed?
Avoid magic strings copied from the parquet|orc|avro compression codes.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43604 from beliefer/parquet_orc_avro.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

beliefer force-pushed the SPARK-45481 branch from 57436ac to 0a2d3e4 Compare October 10, 2023 09:23

github-actions bot added the SQL label Oct 10, 2023

LuciferYang reviewed Oct 10, 2023

beliefer force-pushed the SPARK-45481 branch from 0a2d3e4 to 752fc7b Compare October 20, 2023 07:45

srowen reviewed Oct 21, 2023

beliefer requested a review from srowen October 24, 2023 01:47

beliefer requested a review from LuciferYang October 25, 2023 02:25

beliefer force-pushed the SPARK-45481 branch 2 times, most recently from 8fe92b6 to 3745581 Compare October 25, 2023 11:35

[SPARK-45481][SQL] Introduce a mapper for parquet compression codecs

a0ed406

beliefer force-pushed the SPARK-45481 branch from 3745581 to a0ed406 Compare October 25, 2023 11:37

LuciferYang approved these changes Oct 26, 2023

beliefer force-pushed the SPARK-45481 branch 2 times, most recently from 21b0527 to cdaa24c Compare October 26, 2023 11:33

Update code

d5a0269

beliefer force-pushed the SPARK-45481 branch from cdaa24c to d5a0269 Compare October 26, 2023 11:49

beliefer closed this in 62a3868 Oct 27, 2023

beliefer mentioned this pull request Oct 28, 2023

[SPARK-45481][SQL][FOLLOWUP] Add lowerCaseName for ParquetCompressionCodec. #43571

Closed

beliefer mentioned this pull request Oct 31, 2023

[SPARK-45481][SPARK-45664][SPARK-45711][SQL][FOLLOWUP] Avoid magic strings copy from parquet|orc|avro compression codes #43604

Closed
