
[SPARK-25366][SQL] Zstd and brotli CompressionCodec are not supported for parquet files #22358

Closed

Conversation

@10110346 (Contributor) commented Sep 7, 2018

What changes were proposed in this pull request?

Hadoop 2.6 and Hadoop 2.7 do not contain the zstd and brotli compression codecs, and Hadoop 3.1 contains only the zstd codec.
So I think we should remove zstd and brotli for the time being.

set spark.sql.parquet.compression.codec=brotli:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.&lt;init&gt;(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.&lt;init&gt;(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.&lt;init&gt;(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)

set spark.sql.parquet.compression.codec=zstd:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.&lt;init&gt;(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.&lt;init&gt;(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.&lt;init&gt;(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
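A minimal reproduction sketch (assumptions: a local SparkSession on a Hadoop build without ZStandardCodec on the classpath; the output path is illustrative):

```scala
// Minimal reproduction sketch. Assumptions: local SparkSession, Hadoop
// without ZStandardCodec on the classpath; the output path is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zstd-repro").master("local[*]").getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

// The write fails with:
// BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
spark.range(10).write.parquet("/tmp/zstd-repro")
```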

How was this patch tested?

Existing unit tests.

@HyukjinKwon (Member):

But if the codecs are found, we support those compressions, no?

@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
Sets the compression codec used when writing Parquet files. If either `compression` or
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
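To make the precedence rule quoted above concrete, a small sketch (the DataFrame and output path are illustrative):

```scala
// Precedence sketch: the per-write `compression` option takes precedence
// over the session-wide conf, so this output is gzip-compressed despite
// the snappy session default. (Names and paths are illustrative.)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("precedence").master("local[*]").getOrCreate()
val df = spark.range(100).toDF("id")

spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.option("compression", "gzip").parquet("/tmp/gzip-out")
```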
Member:

I prefer none, uncompressed, snappy, gzip, lzo, brotli (need install ...), lz4, zstd (need install ...).

Contributor Author:

Installing it may not solve the problem.

Member:

none, uncompressed, snappy, gzip, lzo, brotli(need install brotli-codec), lz4, zstd(since Hadoop 2.9.0)

https://jira.apache.org/jira/browse/HADOOP-13578
https://github.com/rdblue/brotli-codec
https://jira.apache.org/jira/browse/HADOOP-13126

Contributor Author:

Got it, thanks @wangyum

Member:

Is hadoop-2.9.x officially supported in Spark?

Member:

I think so given the download page.
(screenshot of the Spark download page, 2018-09-08)

Member:

ah, ok.

@10110346 (Contributor Author) commented Sep 7, 2018

It uses reflection to acquire the Hadoop compression codec classes, which are not in hadoop-common-2.6.5.jar, hadoop-common-2.7.0.jar, or hadoop-common-3.1.0.jar:

BROTLI("org.apache.hadoop.io.compress.BrotliCodec", CompressionCodec.BROTLI, ".br"), ZSTD("org.apache.hadoop.io.compress.ZStandardCodec", CompressionCodec.ZSTD, ".zstd");

@10110346 (Contributor Author) commented Sep 7, 2018

Thanks. If the codecs are found, we support those compressions, but how do I find them? @HyukjinKwon

@HyukjinKwon (Member):

That's probably something we should document, or improve the error message. Ideally, we should fix the error message from Parquet. Don't you think?

@10110346 (Contributor Author) commented Sep 7, 2018

Yeah, the error message is output from an external jar (parquet-common-1.10.0.jar).
I think Spark + Parquet should avoid the Hadoop dependencies for zstd and brotli,
but maybe we can't solve that right away.

@SparkQA commented Sep 7, 2018

Test build #95785 has finished for PR 22358 at commit 1db036a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

If the codecs are found, then we support them. One thing we might do is document that users must explicitly provide the codec, but I am not sure how many users are confused about it.

@maropu (Member) commented Sep 8, 2018

Just FYI about related discussion: #21070 (comment)
cc: @rdblue

.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", "lz4", "brotli", "zstd"))
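For context, a sketch of what this validation does (assuming SQLConf's standard checkValues behavior, with `spark` an active SparkSession):

```scala
// Sketch of the checkValues effect (assumption: standard SQLConf validation).
// Allowed values are accepted; anything else fails fast when the conf is set,
// before any codec class is ever looked up.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")  // accepted

// spark.conf.set("spark.sql.parquet.compression.codec", "foo")
// => java.lang.IllegalArgumentException:
//    The value of spark.sql.parquet.compression.codec should be one of ...
```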
Member:

If you remove these values here, wouldn't the user be unable to use zstd or brotli even when the codec is installed/enabled/available?

Contributor Author:

I agree with you, removing is not a good idea.
Thanks.

@10110346 10110346 changed the title [SPARK-25366][SQL]Zstd and brotil CompressionCodec are not supported for parquet files [SPARK-25366][SQL]Zstd and brotli CompressionCodec are not supported for parquet files Sep 10, 2018
@SparkQA commented Sep 10, 2018

Test build #95852 has finished for PR 22358 at commit 5c478b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

I am 0 on this, since it comes down to the `Class org.apache.hadoop.io.compress.XXXCodec was not found` error message vs. the `need install ...` message.

@@ -964,7 +964,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
Sets the compression codec used when writing Parquet files. If either `compression` or
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
none, uncompressed, snappy, gzip, lzo, brotli(need install brotliCodec), lz4, zstd(need install
ZStandardCodec before Hadoop 2.9.0).
Member:

I would just add a few lines for brotli and zstd below and leave the original text as is.

@SparkQA commented Sep 11, 2018

Test build #95930 has finished for PR 22358 at commit dd86d3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -965,6 +965,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` needs install `ZStandardCodec` before Hadoop 2.9.0, `brotli` needs install `brotliCodec`.
Member:

needs install -> needs to install

@HyukjinKwon (Member):

I'm okay with this, but I would close it if no committer agrees with (approves) it for a long time.

@SparkQA commented Sep 12, 2018

Test build #95969 has finished for PR 22358 at commit 64aef6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -965,6 +965,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` needs to install `ZStandardCodec` before Hadoop 2.9.0, `brotli` needs to install
`brotliCodec`.
Member:

@HyukjinKwon How about adding a link? Users may not know where to download it.

`brotliCodec` -> [`brotli-codec`](https://github.com/rdblue/brotli-codec)

Member:

If the link is expected to be rather permanent, it's fine.

Contributor:

It is clearer to say "zstd requires ZStandardCodec to be installed".

@@ -965,6 +965,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` requires `ZStandardCodec` to be installed before Hadoop 2.9.0, `brotli` requires
`brotliCodec` to be installed.
Member:

brotliCodec -> BrotliCodec

@SparkQA commented Sep 20, 2018

Test build #96312 has finished for PR 22358 at commit 39eaf1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 20, 2018

Test build #96314 has finished for PR 22358 at commit 0e5d0bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

@srowen and @vanzin WDYT?

@srowen (Member) left a comment:

I think a bit of documentation is OK.

@asfgit closed this in 4d114fc on Sep 20, 2018