
[SPARK-25366][SQL] Zstd and brotli CompressionCodec are not supported for parquet files #22358

Closed

Conversation

@10110346 (Contributor) commented Sep 7, 2018

What changes were proposed in this pull request?

Hadoop 2.6 and Hadoop 2.7 do not contain the zstd and brotli compression codecs, and Hadoop 3.1 contains only the zstd codec.
So I think we should remove zstd and brotli for the time being.

set spark.sql.parquet.compression.codec=brotli:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.&lt;init&gt;(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.&lt;init&gt;(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.&lt;init&gt;(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)

set spark.sql.parquet.compression.codec=zstd:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.&lt;init&gt;(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.&lt;init&gt;(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.&lt;init&gt;(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
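A minimal reproduction sketch (assumptions: a local SparkSession on a Hadoop build without ZStandardCodec on the classpath; the output path is illustrative):

```scala
// Minimal reproduction sketch. Assumptions: local SparkSession, Hadoop
// without ZStandardCodec on the classpath; the output path is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zstd-repro").master("local[*]").getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

// The write fails with:
// BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
spark.range(10).write.parquet("/tmp/zstd-repro")
```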

How was this patch tested?

Existing unit tests.

@HyukjinKwon (Member):

But if the codecs are found, we support those compressions, no?

@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
Sets the compression codec used when writing Parquet files. If either `compression` or
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
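To make the precedence rule quoted above concrete, a small sketch (the DataFrame and output path are illustrative):

```scala
// Precedence sketch: the per-write `compression` option takes precedence
// over the session-wide conf, so this output is gzip-compressed despite
// the snappy session default. (Names and paths are illustrative.)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("precedence").master("local[*]").getOrCreate()
val df = spark.range(100).toDF("id")

spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.option("compression", "gzip").parquet("/tmp/gzip-out")
```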
Member:

I prefer none, uncompressed, snappy, gzip, lzo, brotli (need install ...), lz4, zstd (need install ...).

Contributor Author:

Installing it may not solve the problem.

Member:

none, uncompressed, snappy, gzip, lzo, brotli(need install brotli-codec), lz4, zstd(since Hadoop 2.9.0)

https://jira.apache.org/jira/browse/HADOOP-13578
https://github.com/rdblue/brotli-codec
https://jira.apache.org/jira/browse/HADOOP-13126

Contributor Author:

Got it, thanks @wangyum

Member:

Is hadoop-2.9.x officially supported in Spark?

Member:

I think so given the download page.
(screenshot of the Spark download page, 2018-09-08)

Member:

ah, ok.

@10110346 (Contributor Author) commented Sep 7, 2018

It uses reflection to acquire the Hadoop compression codec classes, which are not in hadoop-common-2.6.5.jar, hadoop-common-2.7.0.jar, or hadoop-common-3.1.0.jar:

BROTLI("org.apache.hadoop.io.compress.BrotliCodec", CompressionCodec.BROTLI, ".br"), ZSTD("org.apache.hadoop.io.compress.ZStandardCodec", CompressionCodec.ZSTD, ".zstd");

@10110346 (Contributor Author) commented Sep 7, 2018

Thanks. If the codecs are found, we support those compressions, but how do I find them? @HyukjinKwon

@HyukjinKwon (Member):

That's probably something we should document, or improve the error message. Ideally, we should fix the error message from Parquet. Don't you think?

@10110346 (Contributor Author) commented Sep 7, 2018

Yeah, the error message is output from an external jar (parquet-common-1.10.0.jar).
I think Spark + Parquet should avoid the Hadoop dependencies for zstd and brotli,
but maybe we can't solve that right away.

@SparkQA commented Sep 7, 2018

Test build #95785 has finished for PR 22358 at commit 1db036a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

If the codecs are found, then we support them. One thing we might do is document that users must explicitly provide the codec, but I am not sure how many users are confused about it.

@maropu (Member) commented Sep 8, 2018

Just FYI about related discussion: #21070 (comment)
cc: @rdblue

.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", "lz4", "brotli", "zstd"))
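For context, a sketch of what this validation does (assuming SQLConf's standard checkValues behavior, with `spark` an active SparkSession):

```scala
// Sketch of the checkValues effect (assumption: standard SQLConf validation).
// Allowed values are accepted; anything else fails fast when the conf is set,
// before any codec class is ever looked up.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")  // accepted

// spark.conf.set("spark.sql.parquet.compression.codec", "foo")
// => java.lang.IllegalArgumentException:
//    The value of spark.sql.parquet.compression.codec should be one of ...
```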
Member:

If you remove these values here, wouldn't the user be unable to use zstd or brotli even when the codec is installed/enabled/available?

Contributor Author:

I agree with you, removing is not a good idea.
Thanks.

@10110346 10110346 changed the title [SPARK-25366][SQL]Zstd and brotil CompressionCodec are not supported for parquet files [SPARK-25366][SQL]Zstd and brotli CompressionCodec are not supported for parquet files Sep 10, 2018
@SparkQA commented Sep 10, 2018

Test build #95852 has finished for PR 22358 at commit 5c478b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

I am 0 on this, since it comes down to the `Class org.apache.hadoop.io.compress.XXXCodec was not found` error message vs. the `need install ...` message.

@@ -964,7 +964,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
Sets the compression codec used when writing Parquet files. If either `compression` or
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
none, uncompressed, snappy, gzip, lzo, brotli(need install brotliCodec), lz4, zstd(need install
ZStandardCodec before Hadoop 2.9.0).
Member:

I would just add a few lines for brotli and zstd below and leave the original text as is.

@SparkQA commented Sep 11, 2018

Test build #95930 has finished for PR 22358 at commit dd86d3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -965,6 +965,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` needs install `ZStandardCodec` before Hadoop 2.9.0, `brotli` needs install `brotliCodec`.
Member:

needs install -> needs to install

@HyukjinKwon (Member):

I'm okay with this, but I would close it if no committer agrees with (approves) it for a long time.

@SparkQA commented Sep 12, 2018

Test build #95969 has finished for PR 22358 at commit 64aef6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -965,6 +965,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` needs to install `ZStandardCodec` before Hadoop 2.9.0, `brotli` needs to install
`brotliCodec`.
Member:

@HyukjinKwon How about adding a link? Users may not know where to download it.

`brotliCodec` -> [`brotli-codec`](https://github.com/rdblue/brotli-codec)

Member:

If the link is expected to be rather permanent, it's fine.

Contributor:

It is clearer to say "zstd requires ZStandardCodec to be installed".

@@ -965,6 +965,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
`parquet.compression` is specified in the table-specific options/properties, the precedence would be
`compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that `zstd` requires `ZStandardCodec` to be installed before Hadoop 2.9.0, `brotli` requires
`brotliCodec` to be installed.
Member:

brotliCodec -> BrotliCodec

@SparkQA commented Sep 20, 2018

Test build #96312 has finished for PR 22358 at commit 39eaf1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 20, 2018

Test build #96314 has finished for PR 22358 at commit 0e5d0bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

@srowen and @vanzin WDYT?

@srowen (Member) left a comment:

I think a bit of documentation is OK.

@asfgit closed this in 4d114fc on Sep 20, 2018