[SPARK-25366][SQL] Zstd and brotli CompressionCodec are not supported for parquet files #22358
Conversation
But if the codecs are found, we support those compressions, no?
docs/sql-programming-guide.md
Outdated
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
Sets the compression codec used when writing Parquet files. If either `compression` or
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
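The documented precedence can be sketched in a few lines (an illustrative Python sketch, not Spark's actual implementation; only the option and config names are taken from the text above, and `resolve_parquet_codec` is a hypothetical helper):

```python
def resolve_parquet_codec(options, session_default):
    """Illustrates the documented precedence:
    `compression` > `parquet.compression` > `spark.sql.parquet.compression.codec`.
    `options` stands in for the table-specific options/properties;
    `session_default` for the session config value."""
    return (options.get("compression")
            or options.get("parquet.compression")
            or session_default)

# `compression` wins over both the table property and the session default.
print(resolve_parquet_codec(
    {"compression": "zstd", "parquet.compression": "gzip"}, "snappy"))  # zstd
```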
I prefer none, uncompressed, snappy, gzip, lzo, brotli (need install ...), lz4, zstd (need install ...).
Installation alone may not solve it.
none, uncompressed, snappy, gzip, lzo, brotli(need install brotli-codec), lz4, zstd(since Hadoop 2.9.0)
https://jira.apache.org/jira/browse/HADOOP-13578
https://github.com/rdblue/brotli-codec
https://jira.apache.org/jira/browse/HADOOP-13126
Got it, thanks @wangyum.
Is hadoop-2.9.x officially supported in Spark?
Ah, OK.
It uses reflection to acquire the Hadoop compression codec classes, which are not in hadoop-common-2.6.5.jar, hadoop-common-2.7.0.jar, or hadoop-common-3.1.0.jar.
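The idea behind that reflective lookup can be sketched as follows (a hedged Python analogue of the JVM's `Class.forName` resolution, not Parquet's actual Java code; `load_codec_class` is a hypothetical helper, and the failure mode mirrors the "Class ... was not found" errors quoted later in this PR):

```python
import importlib

def load_codec_class(dotted_name):
    """Resolve a class from its fully qualified name, raising if it is
    not importable -- analogous to a Hadoop codec class missing from the
    hadoop-common jar on the classpath."""
    module_name, _, class_name = dotted_name.rpartition(".")
    try:
        return getattr(importlib.import_module(module_name), class_name)
    except (ImportError, AttributeError) as exc:
        raise RuntimeError(f"Class {dotted_name} was not found") from exc

# A stdlib class resolves fine; a missing one raises, like the Parquet error.
print(load_codec_class("gzip.GzipFile"))
```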
Thanks. If the codecs are found, we support those compressions, but how do I find them? @HyukjinKwon
That's probably something we should document, or improve the error message. Ideally, we should fix the error message from Parquet. Don't you think?
Yeah, the error message comes from an external jar (parquet-common-1.10.0.jar).
Test build #95785 has finished for PR 22358 at commit
If the codecs are found, then we support them. One thing we should do might be to document how to explicitly provide the codec, but I am not sure how many users are confused about it.
Just FYI about related discussion: #21070 (comment)
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", "lz4", "brotli", "zstd"))
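The `.checkValues` guard in the Scala snippet above can be mirrored like this (a Python sketch for illustration only; the allowed-value set is copied from the snippet, and `check_codec` is a hypothetical name):

```python
# Allowed values, as listed in the .checkValues(Set(...)) call above.
VALID_CODECS = {"none", "uncompressed", "snappy", "gzip", "lzo",
                "lz4", "brotli", "zstd"}

def check_codec(value):
    """Lowercase then validate, mirroring
    .transform(_.toLowerCase(Locale.ROOT)).checkValues(...)."""
    v = value.lower()
    if v not in VALID_CODECS:
        raise ValueError(
            f"invalid codec {value!r}; allowed: {sorted(VALID_CODECS)}")
    return v

print(check_codec("GZIP"))  # gzip
```

Note that, as discussed in this thread, passing validation only means the name is recognized; the actual Hadoop codec class must still be on the classpath.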
I thought if you remove it from here, the user would not be able to use zstd or brotli even if it is installed/enabled/available?
I agree with you, removing is not a good idea.
Thanks.
Force-pushed from 1db036a to 5c478b9.
Test build #95852 has finished for PR 22358 at commit
I am 0 on this since it is worth
docs/sql-programming-guide.md
Outdated
@@ -964,7 +964,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
Sets the compression codec used when writing Parquet files. If either `compression` or
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli(need install brotliCodec), lz4, zstd(need install
ZStandardCodec before Hadoop 2.9.0).
I would just add a few lines for `brotli` and `zstd` below and leave the original text as is.
Force-pushed from 5c478b9 to dd86d3f.
Test build #95930 has finished for PR 22358 at commit
docs/sql-programming-guide.md
Outdated
@@ -965,6 +965,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` needs install `ZStandardCodec` before Hadoop 2.9.0, `brotli` needs install `brotliCodec`. |
`needs install` -> `needs to install`
I'm okay with it, but I would close this if no committer agrees with (approves) this for a long time.
Force-pushed from dd86d3f to 64aef6b.
Test build #95969 has finished for PR 22358 at commit
docs/sql-programming-guide.md
Outdated
@@ -965,6 +965,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` needs to install `ZStandardCodec` before Hadoop 2.9.0, `brotli` needs to install
`brotliCodec`. |
@HyukjinKwon How about adding a link? Users may not know where to download it.
`brotliCodec` -> [`brotli-codec`](https://github.com/rdblue/brotli-codec)
If the link is expected to be rather permanent, it's fine.
It is clearer to say "`zstd` requires `ZStandardCodec` to be installed".
Force-pushed from 64aef6b to 39eaf1d.
docs/sql-programming-guide.md
Outdated
@@ -965,6 +965,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` requires `ZStandardCodec` to be installed before Hadoop 2.9.0, `brotli` requires
`brotliCodec` to be installed. |
`brotliCodec` -> `BrotliCodec`
Test build #96312 has finished for PR 22358 at commit
Force-pushed from 39eaf1d to 0e5d0bc.
Test build #96314 has finished for PR 22358 at commit
I think a bit of documentation is OK.
What changes were proposed in this pull request?
Hadoop 2.6 and Hadoop 2.7 do not contain the zstd and brotli compression codecs, and Hadoop 3.1 contains only the zstd codec.
So I think we should remove zstd and brotli for the time being.
Setting spark.sql.parquet.compression.codec=brotli:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
Setting spark.sql.parquet.compression.codec=zstd:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
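Based on the two stack traces above, a pre-flight check could surface a clearer message than Parquet's BadConfigurationException (a hypothetical sketch; the codec-to-class mapping is taken from the traces, and nothing here is Spark or Parquet API):

```python
# Hypothetical mapping from the stack traces above: codecs that need an
# extra Hadoop codec class on the classpath.
HADOOP_CODEC_CLASSES = {
    "brotli": "org.apache.hadoop.io.compress.BrotliCodec",
    "zstd": "org.apache.hadoop.io.compress.ZStandardCodec",
}

def explain_missing_codec(codec):
    """Produce a friendlier hint than the raw BadConfigurationException."""
    cls = HADOOP_CODEC_CLASSES.get(codec)
    if cls is None:
        return f"{codec}: shipped with common Hadoop builds"
    return (f"{codec}: requires {cls} on the classpath "
            f"(otherwise Parquet raises BadConfigurationException)")

print(explain_missing_codec("zstd"))
```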
How was this patch tested?
Existing unit tests.