[SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw #43310
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1014,12 +1014,12 @@ object SQLConf { | |
"`parquet.compression` is specified in the table-specific options/properties, the " + | ||
"precedence would be `compression`, `parquet.compression`, " + | ||
"`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " + | ||
- "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.") | ||
+ "snappy, gzip, lzo, brotli, lz4, lz4_raw, zstd.") | ||
.version("1.1.1") | ||
.stringConf | ||
.transform(_.toLowerCase(Locale.ROOT)) | ||
.checkValues( | ||
- Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4raw", "zstd")) | ||
+ Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4_raw", "zstd")) | ||
Review comment: Unfortunately, we cannot do it like this, because Apache Spark 3.5.0 has already been released.
Reply: I know. But 3.5.0 is the latest release; could we still fix it?
Reply: No. If we remove this value here, production jobs that use the existing configuration will fail with Spark 3.5.1.
||
.createWithDefault("snappy") | ||
|
||
val PARQUET_FILTER_PUSHDOWN_ENABLED = buildConf("spark.sql.parquet.filterPushdown") | ||
|
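The precedence described in the config doc string above (`compression`, then `parquet.compression`, then `spark.sql.parquet.compression.codec`) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Spark's implementation: the object `CodecPrecedenceSketch` and the helper `effectiveCodec` are hypothetical names, and a plain `Map` stands in for the table options.

```scala
import java.util.Locale

object CodecPrecedenceSketch {
  // Hypothetical helper, not part of Spark.
  // tableOptions: table-specific options/properties.
  // sessionCodec: the value of spark.sql.parquet.compression.codec.
  def effectiveCodec(tableOptions: Map[String, String], sessionCodec: String): String =
    tableOptions.get("compression")                     // highest precedence
      .orElse(tableOptions.get("parquet.compression"))  // second
      .getOrElse(sessionCodec)                          // session-level fallback
      .toLowerCase(Locale.ROOT)                         // names are compared in lower case
}
```

With this sketch, a table that sets both `compression` and `parquet.compression` resolves to the former, and a table that sets neither falls back to the session-level value.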
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -96,7 +96,7 @@ object ParquetOptions extends DataSourceOptions { | |
"lzo" -> CompressionCodecName.LZO, | ||
"brotli" -> CompressionCodecName.BROTLI, | ||
"lz4" -> CompressionCodecName.LZ4, | ||
- "lz4raw" -> CompressionCodecName.LZ4_RAW, | ||
Review comment: Ditto. We cannot delete this entry. For backward compatibility, we can only add a new entry, like the one on the next line.
Reply: This is a good idea.
+ "lz4_raw" -> CompressionCodecName.LZ4_RAW, | ||
"zstd" -> CompressionCodecName.ZSTD) | ||
|
||
def getParquetCompressionCodecName(name: String): String = { | ||
|
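The reviewer's suggestion in this hunk, keeping the released `lz4raw` key and adding `lz4_raw` next to it, might look roughly like the sketch below. It is an assumption-laden illustration, not the code that was merged: `CodecAliasSketch`, `Codec`, and `resolve` are hypothetical names, and the small `Codec` type is a stand-in for Parquet's `CompressionCodecName` so the example has no external dependencies.

```scala
import java.util.Locale

object CodecAliasSketch {
  // Hypothetical stand-in for CompressionCodecName.
  sealed trait Codec
  case object LZ4 extends Codec
  case object LZ4_RAW extends Codec
  case object ZSTD extends Codec

  // Both spellings map to the same codec, so a configuration written
  // against the already-released name keeps resolving.
  val shortNames: Map[String, Codec] = Map(
    "lz4" -> LZ4,
    "lz4raw" -> LZ4_RAW,  // legacy spelling, kept for backward compatibility
    "lz4_raw" -> LZ4_RAW, // new spelling
    "zstd" -> ZSTD)

  // Lower-case the name before lookup, as the config's transform does.
  def resolve(name: String): Option[Codec] =
    shortNames.get(name.toLowerCase(Locale.ROOT))
}
```

The design choice is additive: the new key is introduced without removing the old one, so neither existing jobs nor new configurations break.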
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -59,7 +59,7 @@ class ParquetCodecSuite extends FileSourceCodecSuite { | |
// Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available | ||
// on Maven Central. | ||
override protected def availableCodecs: Seq[String] = { | ||
- Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4raw") | ||
+ Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4_raw") | ||
Review comment: In this case, it currently succeeds. Do you know the difference, @beliefer?
Reply: This test case succeeds, but https://github.com/apache/spark/pull/43310/files#r1352405312 failed!
||
} | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,7 +29,16 @@ import org.apache.spark.sql.test.SharedSparkSession | |
|
||
class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSparkSession { | ||
test("Test `spark.sql.parquet.compression.codec` config") { | ||
- Seq("NONE", "UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "LZ4", "BROTLI", "ZSTD").foreach { c => | ||
+ Seq( | ||
+ "NONE", | ||
+ "UNCOMPRESSED", | ||
+ "SNAPPY", | ||
+ "GZIP", | ||
+ "LZO", | ||
+ "LZ4", | ||
+ "BROTLI", | ||
+ "ZSTD", | ||
+ "LZ4_RAW").foreach { c => | ||
withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> c) { | ||
val expected = if (c == "NONE") "UNCOMPRESSED" else c | ||
val option = new ParquetOptions(Map.empty[String, String], spark.sessionState.conf) | ||
|
@@ -105,7 +114,7 @@ class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSpar | |
|
||
test("Create parquet table with compression") { | ||
Seq(true, false).foreach { isPartitioned => | ||
- val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4") | ||
+ val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4", "LZ4_RAW") | ||
Review comment: Before this fix, if we use
Reply: If this is only a test utility method,
Reply: Yes.
||
codecs.foreach { compressionCodec => | ||
checkCompressionCodec(compressionCodec, isPartitioned) | ||
} | ||
|
What is the suffix of the file after changing to LZ4_RAW?
Before this PR:
part-00000-fc07d464-03b2-42d6-adc1-68a3adca1752.c000.lz4raw.parquet
After this PR:
part-00000-07244014-f31a-4097-8878-dd3630e721ce.c000.lz4raw.parquet
Since nothing changes at the file-name layer, this looks good.