
[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', parquet.compression needs to be considered. #20076

Closed
wants to merge 21 commits

Conversation

@fjh100456 (Contributor) commented Dec 25, 2017

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', parquet.compression needs to be considered.

What changes were proposed in this pull request?

Since Hive 1.1, Hive allows users to set the Parquet compression codec via the table-level property parquet.compression. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858 . We already support orc.compression for ORC, so for external users it is more straightforward to support both. See the Stack Overflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties
On the Spark side, our table-level compression conf compression was added by #11464 in Spark 2.0.
We need to support both table-level confs. Users might also use the session-level conf spark.sql.parquet.compression.codec. The priority rule is:
if a compression codec configuration is found through Hive or Parquet table properties, the precedence is compression, then parquet.compression, then spark.sql.parquet.compression.codec. Acceptable values include: none, uncompressed, snappy, gzip, lzo.
After this change, the rule for Parquet is consistent with that for ORC.

Changes:
1. Acquire 'compressionCodecClassName' from parquet.compression as well; the precedence order is compression, parquet.compression, spark.sql.parquet.compression.codec, just like what we do in OrcOptions.

2. Change spark.sql.parquet.compression.codec to support "none". ParquetOptions already supports "none" as equivalent to "uncompressed", but the conf could not be set to "none" before.

3. Change compressionCode to compressionCodecClassName.

How was this patch tested?

Add test.
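The precedence rule described above can be sketched as a small standalone Scala snippet. This is an illustration only, not the actual `ParquetOptions` code: the table options and the session-level default (`spark.sql.parquet.compression.codec`) are modeled here as plain values.

```scala
// Standalone sketch of the codec-precedence rule this PR implements.
// NOT the real ParquetOptions source; table options and the session conf
// are modeled as plain Scala values.
object CodecPrecedenceSketch {
  // Stands in for the session-level conf spark.sql.parquet.compression.codec.
  val sessionLevelCodec: String = "snappy"

  def resolveCodec(tableOptions: Map[String, String]): String = {
    val name = tableOptions
      .get("compression")                              // 1. Spark's table-level conf
      .orElse(tableOptions.get("parquet.compression")) // 2. Hive/Parquet table property
      .getOrElse(sessionLevelCodec)                    // 3. session-level conf
      .toLowerCase
    // "none" is accepted as an alias for "uncompressed".
    if (name == "none") "uncompressed" else name
  }
}
```

For example, `resolveCodec(Map("compression" -> "gzip", "parquet.compression" -> "lzo"))` resolves to `"gzip"`, because the Spark table-level `compression` option takes precedence over the Hive/Parquet property.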

@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in [[shortParquetCompressionCodecNames]].
*/
val compressionCodecClassName: String = {
Member

Can we change compressionCodecClassName to compressionCodec instead?

Contributor Author

@HyukjinKwon Seems you're right.
@gatorsmile Are we mistaken? Shouldn't we change ParquetOptions's compressionCodec to compressionCodecClassName, since OrcOptions and TextOptions both use compressionCodec?

Member

compressionCodecClassName is a better name. We should change all the others to this.

Member

We could alternatively say compressionCodecName here. It's rather names like UNCOMPRESSED, LZO, etc in this case. For the text based sources, they are canonical class names so I am okay with compressionCodecClassName but for ORC and Parquet these are not classes.

Member

compressionCodecName is also fine to me.

Contributor Author

So, change all compressionCodecClassName and compressionCodec to compressionCodecName? In TextOptions, JSONOptions, and CSVOptions too?

Contributor Author

@gatorsmile @HyukjinKwon
In TextOptions, JSONOptions, and CSVOptions it's an "Option[String]", but in OrcOptions and ParquetOptions it's a "String".
Is it OK to just change compressionCodecClassName in OrcOptions and ParquetOptions to compressionCodecName?

Member

Let's do Parquet and ORC ones here for now if that's also fine to @gatorsmile.

@fjh100456 (Contributor Author)

cc @gatorsmile
No ORC configuration was found in "sql-programming-guide.md", so I did not add the precedence description for spark.sql.orc.compression.codec.

@gatorsmile (Member)

ok to test

@SparkQA commented Dec 25, 2017

Test build #85372 has finished for PR 20076 at commit 5124f1b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 25, 2017

Test build #85377 has finished for PR 20076 at commit 05e52b6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ParquetOptions(

@SparkQA commented Dec 25, 2017

Test build #85379 has finished for PR 20076 at commit 3cf0c04.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 25, 2017

Test build #85378 has finished for PR 20076 at commit 0c0f55d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 25, 2017

Test build #85380 has finished for PR 20076 at commit 10e5462.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ParquetOptions(

Fix test error
@SparkQA commented Dec 25, 2017

Test build #85381 has finished for PR 20076 at commit 2ab2d29.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please

@rxin (Contributor) commented Dec 26, 2017

Thanks for the PR. Why are we complicating the PR by doing the rename? Does this actually gain anything other than minor cosmetic changes? It makes the simple PR pretty long ...

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SQLTestUtils

class CompressionCodecSuite extends TestHiveSingleton with SQLTestUtils {
Member

This suite does not need TestHiveSingleton.

* limitations under the License.
*/

package org.apache.spark.sql.hive
Member

Move it to sql/core.

@HyukjinKwon (Member)

Sure, let's revert the rename then.

@SparkQA commented Dec 26, 2017

Test build #85388 has finished for PR 20076 at commit 2ab2d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fjh100456 (Contributor Author)

Well, I'll revert the renaming. Any comments? @gatorsmile

Move the test case to sql/core
Rename the test file and class name
@SparkQA commented Dec 26, 2017

Test build #85394 has finished for PR 20076 at commit e510b48.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CompressionCodecPrecedenceSuite extends SQLTestUtils with SharedSQLContext

@gatorsmile (Member)

Also add an end-to-end test case? For example, the one used in https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties ?

@fjh100456 (Contributor Author)

Does it mean what we do in the test case of the other PR, #19218? @gatorsmile

@SparkQA commented Dec 26, 2017

Test build #85400 has finished for PR 20076 at commit 9229e6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fjh100456 (Contributor Author)

@gatorsmile
I tested it manually and found that a table-level compression property set via SQL like the statement below still cannot take effect, even though the table properties are passed to a hadoopConf (just like what I do in #19218), because the property cannot be found in the properties of tableInfo. I am not familiar with the SQL parsing; where is the attribute information stored when the SQL is parsed?
CREATE table A using Parquet tblproperties ('parquet.compression' = 'gzip') ...

@gatorsmile (Member)

Try this?

CREATE TABLE A USING Parquet
OPTIONS('parquet.compression' = 'gzip')
AS SELECT 1 as col1
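One common way to confirm which codec a write actually used is to inspect the part-file names, since the codec name is typically embedded in them (e.g. `part-00000-....snappy.parquet`). A minimal standalone sketch of that check, assuming the usual suffix convention (with "gzip" appearing as ".gz" and uncompressed files carrying no codec suffix; the object and method names are hypothetical):

```scala
// Hypothetical helper for checking a Parquet part-file name against a
// configured codec, assuming the conventional codec suffixes.
object CodecFileCheck {
  // Maps a configured codec name to the expected file-name suffix.
  // "none"/"uncompressed" add no codec suffix; "gzip" appears as ".gz".
  def expectedSuffix(codec: String): String = codec.toLowerCase match {
    case "none" | "uncompressed" => ".parquet"
    case "gzip"                  => ".gz.parquet"
    case c                       => s".$c.parquet"
  }

  def usesCodec(fileName: String, codec: String): Boolean =
    fileName.endsWith(expectedSuffix(codec))
}
```

A test along these lines could list the files under the table's path after the CTAS above and assert that every part file matches the expected suffix for the configured codec.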

* limitations under the License.
*/

package org.apache.spark.sql
@viirya (Member) commented Dec 31, 2017

Should we move this to org.apache.spark.sql.execution.datasources.parquet? Seems it should not be at this package level.

Contributor Author

Thank you. I have moved it to org.apache.spark.sql.execution.datasources.parquet.

@@ -953,8 +953,10 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
<td><code>spark.sql.parquet.compression.codec</code></td>
<td>snappy</td>
<td>
Sets the compression codec use when writing Parquet files. Acceptable values include:
uncompressed, snappy, gzip, lzo.
Sets the compression codec use when writing Parquet files. If other compression codec
Contributor

s/use when/used when

@@ -323,11 +323,13 @@ object SQLConf {
.createWithDefault(false)

val PARQUET_COMPRESSION = buildConf("spark.sql.parquet.compression.codec")
.doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
"uncompressed, snappy, gzip, lzo.")
.doc("Sets the compression codec use when writing Parquet files. If other compression codec " +
Contributor

s/use when/used when

@@ -364,7 +366,9 @@ object SQLConf {
.createWithDefault(true)

val ORC_COMPRESSION = buildConf("spark.sql.orc.compression.codec")
.doc("Sets the compression codec use when writing ORC files. Acceptable values include: " +
.doc("Sets the compression codec use when writing ORC files. If other compression codec " +
Contributor

s/use when/used when

Contributor Author

Thank you. I have fixed them.

@fjh100456 (Contributor Author)

@gatorsmile
I have added two test cases. Please review them. Thank you very much.

@SparkQA commented Jan 2, 2018

Test build #85594 has finished for PR 20076 at commit 253b2a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CompressionCodecPrecedenceSuite extends SQLTestUtils with SharedSQLContext

@SparkQA commented Jan 2, 2018

Test build #85595 has finished for PR 20076 at commit d60dcd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSQLContext

@@ -27,7 +28,7 @@ import org.apache.spark.sql.internal.SQLConf
/**
* Options for the Parquet data source.
*/
private[parquet] class ParquetOptions(
Member

Can we revive private[parquet]?

Contributor Author

Yes, it should be revived. Thanks.

@SparkQA commented Jan 5, 2018

Test build #85716 has finished for PR 20076 at commit b5cd809.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|'parquet.compression'='$compressionCodec')""".stripMargin
val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
sql(s"""CREATE TABLE $tableName USING Parquet $options $partitionCreate
|as select 1 as col1, 2 as p""".stripMargin)
Member

    val options =
      s"""
        |OPTIONS('path'='${rootDir.toURI.toString.stripSuffix("/")}/$tableName',
        |'parquet.compression'='$compressionCodec')
       """.stripMargin
    val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
    sql(
      s"""
        |CREATE TABLE $tableName USING Parquet $options $partitionCreate
        |AS SELECT 1 AS col1, 2 AS p
      """.stripMargin)

.doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
"uncompressed, snappy, gzip, lzo.")
.doc("Sets the compression codec used when writing Parquet files. If other compression codec " +
"configuration was found through hive or parquet, the precedence would be `compression`, " +
Member

Sets the compression codec used when writing Parquet files. If either compression or parquet.compression is specified in the table-specific options/properties, the precedence would be compression, ...

Fix scala style
Change the description of spark.sql.parquet.compression
Change description
@SparkQA commented Jan 6, 2018

Test build #85741 has finished for PR 20076 at commit 1a8c654.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 6, 2018

Test build #85739 has finished for PR 20076 at commit 26c1c61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 6, 2018

Test build #85740 has finished for PR 20076 at commit 9466797.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Jan 6, 2018

Author: fjh100456 <[email protected]>

Closes #20076 from fjh100456/ParquetOptionIssue.

(cherry picked from commit 7b78041)
Signed-off-by: gatorsmile <[email protected]>
asfgit closed this in 7b78041 on Jan 6, 2018
8 participants