
[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', parquet.compression needs to be considered. #20076

Closed
wants to merge 21 commits

Conversation

@fjh100456 (Contributor) commented Dec 25, 2017

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', parquet.compression needs to be considered.

What changes were proposed in this pull request?

Since Hive 1.1, Hive allows users to set the Parquet compression codec via the table-level property parquet.compression. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858 . We already support orc.compression for ORC, so for external users it is more straightforward to support both. See the Stack Overflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties
On the Spark side, our table-level compression conf compression was added by #11464 in Spark 2.0.
We need to support both table-level confs. Users might also use the session-level conf spark.sql.parquet.compression.codec. The priority rule is:
if a compression codec configuration is found through Hive or Parquet table properties, the precedence is compression, then parquet.compression, then spark.sql.parquet.compression.codec. Acceptable values include: none, uncompressed, snappy, gzip, lzo.
After this change, the rule for Parquet is consistent with that for ORC.

Changes:
1. Acquire 'compressionCodecClassName' from parquet.compression as well; the precedence order is compression, parquet.compression, spark.sql.parquet.compression.codec, just like what we do in OrcOptions.

2. Change spark.sql.parquet.compression.codec to support "none". ParquetOptions already supports "none" as equivalent to "uncompressed", but the conf could not be set to "none" before.

3. Change compressionCode to compressionCodecClassName.

How was this patch tested?

Add test.
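The precedence rule described above can be sketched as a small standalone Scala snippet. This is an illustration only, not the actual `ParquetOptions` code: the table options and the session-level default (`spark.sql.parquet.compression.codec`) are modeled here as plain values.

```scala
// Standalone sketch of the codec-precedence rule this PR implements.
// NOT the real ParquetOptions source; table options and the session conf
// are modeled as plain Scala values.
object CodecPrecedenceSketch {
  // Stands in for the session-level conf spark.sql.parquet.compression.codec.
  val sessionLevelCodec: String = "snappy"

  def resolveCodec(tableOptions: Map[String, String]): String = {
    val name = tableOptions
      .get("compression")                              // 1. Spark's table-level conf
      .orElse(tableOptions.get("parquet.compression")) // 2. Hive/Parquet table property
      .getOrElse(sessionLevelCodec)                    // 3. session-level conf
      .toLowerCase
    // "none" is accepted as an alias for "uncompressed".
    if (name == "none") "uncompressed" else name
  }
}
```

For example, `resolveCodec(Map("compression" -> "gzip", "parquet.compression" -> "lzo"))` resolves to `"gzip"`, because the Spark table-level `compression` option takes precedence over the Hive/Parquet property.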

@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in [[shortParquetCompressionCodecNames]].
*/
val compressionCodecClassName: String = {
Member

Can we change compressionCodecClassName to compressionCodec instead?

Contributor Author

@HyukjinKwon Seems you're right.
@gatorsmile Are we mistaken? Shouldn't we change ParquetOptions's compressionCodec to compressionCodecClassName, since OrcOptions and TextOptions both use compressionCodec?

Member

compressionCodecClassName is a better name. We should change all the others to this.

Member

We could alternatively say compressionCodecName here. It's rather names like UNCOMPRESSED, LZO, etc in this case. For the text based sources, they are canonical class names so I am okay with compressionCodecClassName but for ORC and Parquet these are not classes.

Member

compressionCodecName is also fine to me.

Contributor Author

So, change all compressionCodecClassName and compressionCodec to compressionCodecName? In TextOptions, JSONOptions, and CSVOptions too?

Contributor Author

@gatorsmile @HyukjinKwon
In TextOptions, JSONOptions, and CSVOptions it's an "Option[String]", but in OrcOptions and ParquetOptions it's a "String".
Is it OK to just change compressionCodecClassName in OrcOptions and ParquetOptions to compressionCodecName?

Member

Let's do Parquet and ORC ones here for now if that's also fine to @gatorsmile.

@fjh100456 (Contributor Author)

cc @gatorsmile
No ORC configuration was found in "sql-programming-guide.md", so I did not add the precedence description for spark.sql.orc.compression.codec.

@gatorsmile (Member)

ok to test

@SparkQA commented Dec 25, 2017

Test build #85372 has finished for PR 20076 at commit 5124f1b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 25, 2017

Test build #85377 has finished for PR 20076 at commit 05e52b6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ParquetOptions(

@SparkQA commented Dec 25, 2017

Test build #85379 has finished for PR 20076 at commit 3cf0c04.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 25, 2017

Test build #85378 has finished for PR 20076 at commit 0c0f55d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 25, 2017

Test build #85380 has finished for PR 20076 at commit 10e5462.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ParquetOptions(

Fix test error
@SparkQA commented Dec 25, 2017

Test build #85381 has finished for PR 20076 at commit 2ab2d29.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please

@rxin (Contributor) commented Dec 26, 2017

Thanks for the PR. Why are we complicating the PR by doing the rename? Does this actually gain anything other than minor cosmetic changes? It makes the simple PR pretty long ...

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SQLTestUtils

class CompressionCodecSuite extends TestHiveSingleton with SQLTestUtils {
Member

This suite does not need TestHiveSingleton.

* limitations under the License.
*/

package org.apache.spark.sql.hive
Member

Move it to sql/core.

@HyukjinKwon (Member)

Sure, let's revert the rename then.

@SparkQA commented Dec 26, 2017

Test build #85388 has finished for PR 20076 at commit 2ab2d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fjh100456 (Contributor Author)

Well, I'll revert the renaming. Any comments? @gatorsmile

Move the test case to sql/core
Rename the test file and class name
@SparkQA commented Dec 26, 2017

Test build #85394 has finished for PR 20076 at commit e510b48.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CompressionCodecPrecedenceSuite extends SQLTestUtils with SharedSQLContext

@gatorsmile (Member)

Also add an end-to-end test case? For example, the one used in https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties ?

@fjh100456 (Contributor Author)

Does it mean what we do in the test case of the other PR, #19218? @gatorsmile

@SparkQA commented Dec 26, 2017

Test build #85400 has finished for PR 20076 at commit 9229e6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fjh100456 (Contributor Author)

@gatorsmile
I tested it manually and found that a table-level compression property set via SQL like the statement below still cannot take effect, even though the table properties are passed to a hadoopConf (just like what I do in #19218), because the property cannot be found in the properties of tableInfo. I am not familiar with the SQL parsing; where is the attribute information stored when the SQL is parsed?
CREATE table A using Parquet tblproperties ('parquet.compression' = 'gzip') ...

@gatorsmile (Member)

Try this?

CREATE TABLE A USING Parquet
OPTIONS('parquet.compression' = 'gzip')
AS SELECT 1 as col1
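One common way to confirm which codec a write actually used is to inspect the part-file names, since the codec name is typically embedded in them (e.g. `part-00000-....snappy.parquet`). A minimal standalone sketch of that check, assuming the usual suffix convention (with "gzip" appearing as ".gz" and uncompressed files carrying no codec suffix; the object and method names are hypothetical):

```scala
// Hypothetical helper for checking a Parquet part-file name against a
// configured codec, assuming the conventional codec suffixes.
object CodecFileCheck {
  // Maps a configured codec name to the expected file-name suffix.
  // "none"/"uncompressed" add no codec suffix; "gzip" appears as ".gz".
  def expectedSuffix(codec: String): String = codec.toLowerCase match {
    case "none" | "uncompressed" => ".parquet"
    case "gzip"                  => ".gz.parquet"
    case c                       => s".$c.parquet"
  }

  def usesCodec(fileName: String, codec: String): Boolean =
    fileName.endsWith(expectedSuffix(codec))
}
```

A test along these lines could list the files under the table's path after the CTAS above and assert that every part file matches the expected suffix for the configured codec.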

* limitations under the License.
*/

package org.apache.spark.sql
@viirya (Member) commented Dec 31, 2017

Should we move this to org.apache.spark.sql.execution.datasources.parquet? Seems it should not be at this package level.

Contributor Author

Thank you. I have moved it to org.apache.spark.sql.execution.datasources.parquet.

@@ -953,8 +953,10 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
<td><code>spark.sql.parquet.compression.codec</code></td>
<td>snappy</td>
<td>
Sets the compression codec use when writing Parquet files. Acceptable values include:
uncompressed, snappy, gzip, lzo.
Sets the compression codec use when writing Parquet files. If other compression codec
Contributor

s/use when/used when

@@ -323,11 +323,13 @@ object SQLConf {
.createWithDefault(false)

val PARQUET_COMPRESSION = buildConf("spark.sql.parquet.compression.codec")
.doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
"uncompressed, snappy, gzip, lzo.")
.doc("Sets the compression codec use when writing Parquet files. If other compression codec " +
Contributor

s/use when/used when

@@ -364,7 +366,9 @@ object SQLConf {
.createWithDefault(true)

val ORC_COMPRESSION = buildConf("spark.sql.orc.compression.codec")
.doc("Sets the compression codec use when writing ORC files. Acceptable values include: " +
.doc("Sets the compression codec use when writing ORC files. If other compression codec " +
Contributor

s/use when/used when

Contributor Author

Thank you. I have fixed them.

@fjh100456 (Contributor Author)

@gatorsmile
I have added two test cases. Please review them. Thank you very much.

@SparkQA commented Jan 2, 2018

Test build #85594 has finished for PR 20076 at commit 253b2a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CompressionCodecPrecedenceSuite extends SQLTestUtils with SharedSQLContext

@SparkQA commented Jan 2, 2018

Test build #85595 has finished for PR 20076 at commit d60dcd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSQLContext

@@ -27,7 +28,7 @@ import org.apache.spark.sql.internal.SQLConf
/**
* Options for the Parquet data source.
*/
private[parquet] class ParquetOptions(
Member

Can we revive private[parquet]?

Contributor Author

Yes, it should be revived. Thanks.

@SparkQA commented Jan 5, 2018

Test build #85716 has finished for PR 20076 at commit b5cd809.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|'parquet.compression'='$compressionCodec')""".stripMargin
val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
sql(s"""CREATE TABLE $tableName USING Parquet $options $partitionCreate
|as select 1 as col1, 2 as p""".stripMargin)
Member

    val options =
      s"""
        |OPTIONS('path'='${rootDir.toURI.toString.stripSuffix("/")}/$tableName',
        |'parquet.compression'='$compressionCodec')
       """.stripMargin
    val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
    sql(
      s"""
        |CREATE TABLE $tableName USING Parquet $options $partitionCreate
        |AS SELECT 1 AS col1, 2 AS p
      """.stripMargin)

.doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
"uncompressed, snappy, gzip, lzo.")
.doc("Sets the compression codec used when writing Parquet files. If other compression codec " +
"configuration was found through hive or parquet, the precedence would be `compression`, " +
Member

Sets the compression codec used when writing Parquet files. If either compression or parquet.compression is specified in the table-specific options/properties, the precedence would be compression, ...

Fix scala style
Change the description of spark.sql.parquet.compression
Change description
@SparkQA commented Jan 6, 2018

Test build #85741 has finished for PR 20076 at commit 1a8c654.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 6, 2018

Test build #85739 has finished for PR 20076 at commit 26c1c61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 6, 2018

Test build #85740 has finished for PR 20076 at commit 9466797.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Jan 6, 2018

Author: fjh100456 <[email protected]>

Closes #20076 from fjh100456/ParquetOptionIssue.

(cherry picked from commit 7b78041)
Signed-off-by: gatorsmile <[email protected]>
asfgit closed this in 7b78041 on Jan 6, 2018
8 participants