[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.
#20076
…'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.
## What changes were proposed in this pull request?
1. Also acquire 'compressionCodecClassName' from `parquet.compression`. The precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". `ParquetOptions` already supports "none" as equivalent to "uncompressed", but the configuration did not allow it to be set to "none".
3. Change `compressionCode` to `compressionCodecClassName`.
## How was this patch tested?
Manual test.
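The precedence order described above can be sketched in plain Scala without any Spark dependency. This is only an illustration under stated assumptions: the object `CodecPrecedence`, the method `resolve`, and the `sessionDefault` parameter (standing in for the value of `spark.sql.parquet.compression.codec`) are hypothetical names, not the code in `ParquetOptions`.

```scala
import java.util.Locale

// Hypothetical stand-in for the lookup done in ParquetOptions; not Spark's actual code.
object CodecPrecedence {
  // Returns the first codec name found, in precedence order:
  // `compression` > `parquet.compression` > session-level default.
  def resolve(tableOptions: Map[String, String], sessionDefault: String): String = {
    // Assumption: option keys are matched case-insensitively in this sketch.
    val opts = tableOptions.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }
    opts.get("compression")
      .orElse(opts.get("parquet.compression"))
      .getOrElse(sessionDefault)
      .toLowerCase(Locale.ROOT)
  }

  def main(args: Array[String]): Unit = {
    // Both table-level options set: `compression` wins.
    println(resolve(Map("compression" -> "gzip", "parquet.compression" -> "lzo"), "snappy"))
    // Only `parquet.compression` set: it overrides the session default.
    println(resolve(Map("parquet.compression" -> "lzo"), "snappy"))
    // Nothing set at table level: fall back to the session default.
    println(resolve(Map.empty, "snappy"))
  }
}
```

Chaining `Option.orElse` keeps the order explicit, so adding another source of configuration would be a one-line change.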
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
 * Acceptable values are defined in [[shortParquetCompressionCodecNames]].
 */
val compressionCodecClassName: String = {
Can we change `compressionCodecClassName` to `compressionCodec` instead?
@HyukjinKwon Seems you're right.
@gatorsmile Are we mistaken? Shouldn't we change `ParquetOptions`'s `compressionCodec` to `compressionCodecClassName`? Because `OrcOptions` and `TextOptions` are all using `compressionCodec`.
`compressionCodecClassName` is a better name. We should change all the others to this.
We could alternatively say `compressionCodecName` here. In this case the values are names like `UNCOMPRESSED`, `LZO`, etc. For the text-based sources they are canonical class names, so I am okay with `compressionCodecClassName`, but for ORC and Parquet these are not classes.
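To illustrate the naming point, here is a minimal sketch in plain Scala of a short-name lookup for Parquet, where the resolved values are codec names rather than JVM class names. It assumes a table modelled on the `shortParquetCompressionCodecNames` referenced in the diff; the object `CodecNames`, the use of plain strings for the codec names, and the error message are illustrative assumptions, not Spark's code.

```scala
import java.util.Locale

// Hypothetical sketch; the real ParquetOptions may differ in types and messages.
object CodecNames {
  // Short names accepted per this PR. The values are Parquet codec names,
  // not class names such as org.apache.hadoop.io.compress.GzipCodec.
  val shortParquetNames: Map[String, String] = Map(
    "none" -> "UNCOMPRESSED", // "none" is treated as equivalent to "uncompressed"
    "uncompressed" -> "UNCOMPRESSED",
    "snappy" -> "SNAPPY",
    "gzip" -> "GZIP",
    "lzo" -> "LZO")

  // Resolves a user-supplied short name case-insensitively, or fails with a
  // message listing the accepted names.
  def parquetCodecName(name: String): String =
    shortParquetNames.get(name.toLowerCase(Locale.ROOT)) match {
      case Some(codec) => codec
      case None =>
        val available = shortParquetNames.keys.toSeq.sorted.mkString(", ")
        throw new IllegalArgumentException(
          s"Codec [$name] is not available. Available codecs are $available.")
    }
}
```

Under this reading, a field holding `"GZIP"` is a codec name, which is the argument for `compressionCodecName` over `compressionCodecClassName`.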
`compressionCodecName` is also fine to me.
So, change all `compressionCodecClassName` and `compressionCodec` to `compressionCodecName`? In `TextOptions`, `JSONOptions` and `CSVOptions` too?
@gatorsmile @HyukjinKwon
In `TextOptions`, `JSONOptions` and `CSVOptions`, it's an `Option[String]`, but in `OrcOptions` and `ParquetOptions`, it's a `String`.
Is it OK to just change `compressionCodecClassName` in `OrcOptions` and `ParquetOptions` to `compressionCodecName`?
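The `Option[String]` versus `String` difference mentioned above can be shown with a small plain-Scala sketch. The assumption here, which the thread does not state explicitly, is that the text-based sources may have no codec at all, while Parquet and ORC always resolve to some codec through a default. The object and method names are hypothetical.

```scala
// Hypothetical sketch of the two shapes; not the actual option classes.
object OptionVsString {
  // Text-based sources: the codec may be absent, so it is modelled as Option[String].
  def textCodec(params: Map[String, String]): Option[String] =
    params.get("compression")

  // Parquet/ORC: a default is always available (assumed to come from a
  // session-level conf), so the result is a plain String.
  def columnarCodec(params: Map[String, String], default: String): String =
    params.getOrElse("compression", default)
}
```

Because the two groups already differ in type, renaming only the Parquet and ORC fields does not make the option classes any less uniform than they were.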
Let's do Parquet and ORC ones here for now if that's also fine to @gatorsmile.
cc @gatorsmile
ok to test
Test build #85372 has finished for PR 20076 at commit
update compressionCodecClassName to compressionCodecName
Test build #85377 has finished for PR 20076 at commit
Test build #85379 has finished for PR 20076 at commit
Test build #85378 has finished for PR 20076 at commit
Use ParquetOptions in test
Test build #85380 has finished for PR 20076 at commit
Fix test error
Test build #85381 has finished for PR 20076 at commit
Retest this please
Thanks for the PR. Why are we complicating the PR by doing the rename? Does this actually gain anything other than minor cosmetic changes? It makes the simple PR pretty long ...
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SQLTestUtils

class CompressionCodecSuite extends TestHiveSingleton with SQLTestUtils {
This suite does not need `TestHiveSingleton`.
 * limitations under the License.
 */

package org.apache.spark.sql.hive
Move it to sql/core.
Sure, let's revert back the rename then.
Test build #85388 has finished for PR 20076 at commit
Well, I'll revert back the renaming. Any comments? @gatorsmile
2 Move the test case to sql/core
Rename the test file name and class name
Test build #85394 has finished for PR 20076 at commit
Also add an end-to-end test case? For example, the one used in https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties ?
Do you mean what we do in the test case of another PR, #19218? @gatorsmile
Test build #85400 has finished for PR 20076 at commit
@gatorsmile
Try this?
CREATE TABLE A USING Parquet
OPTIONS('parquet.compression' = 'gzip')
AS SELECT 1 as col1
 * limitations under the License.
 */

package org.apache.spark.sql
Should we move this to `org.apache.spark.sql.execution.datasources.parquet`? Seems this should not be at this package level.
Thank you. I have moved it to `org.apache.spark.sql.execution.datasources.parquet`.
docs/sql-programming-guide.md
Outdated
@@ -953,8 +953,10 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
<td><code>spark.sql.parquet.compression.codec</code></td>
<td>snappy</td>
<td>
Sets the compression codec use when writing Parquet files. Acceptable values include:
uncompressed, snappy, gzip, lzo.
Sets the compression codec use when writing Parquet files. If other compression codec
s/use when/used when
@@ -323,11 +323,13 @@ object SQLConf {
.createWithDefault(false)

val PARQUET_COMPRESSION = buildConf("spark.sql.parquet.compression.codec")
.doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
"uncompressed, snappy, gzip, lzo.")
.doc("Sets the compression codec use when writing Parquet files. If other compression codec " +
s/use when/used when
@@ -364,7 +366,9 @@ object SQLConf {
.createWithDefault(true)

val ORC_COMPRESSION = buildConf("spark.sql.orc.compression.codec")
.doc("Sets the compression codec use when writing ORC files. Acceptable values include: " +
.doc("Sets the compression codec use when writing ORC files. If other compression codec " +
s/use when/used when
Thank you. I have fixed them.
@gatorsmile
Test build #85594 has finished for PR 20076 at commit
Test build #85595 has finished for PR 20076 at commit
@@ -27,7 +28,7 @@ import org.apache.spark.sql.internal.SQLConf
/**
 * Options for the Parquet data source.
 */
private[parquet] class ParquetOptions(
Can we revive `private[parquet]`?
Yes, it should be revived. Thanks.
Test build #85716 has finished for PR 20076 at commit
|'parquet.compression'='$compressionCodec')""".stripMargin
val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
sql(s"""CREATE TABLE $tableName USING Parquet $options $partitionCreate
|as select 1 as col1, 2 as p""".stripMargin)
val options =
s"""
|OPTIONS('path'='${rootDir.toURI.toString.stripSuffix("/")}/$tableName',
|'parquet.compression'='$compressionCodec')
""".stripMargin
val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
sql(
s"""
|CREATE TABLE $tableName USING Parquet $options $partitionCreate
|AS SELECT 1 AS col1, 2 AS p
""".stripMargin)
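For readers who want to see what the suggested layout produces, here is a self-contained sketch that builds the same statement as a string instead of passing it to `sql(...)`. The object `CreateTableSql`, the method `build`, and its parameters are hypothetical; only the statement shape comes from the suggestion above.

```scala
// Hypothetical helper that mirrors the suggested formatting; it only builds the
// SQL text and does not need a SparkSession.
object CreateTableSql {
  def build(tableName: String, path: String, codec: String, isPartitioned: Boolean): String = {
    // stripMargin removes leading whitespace up to and including each `|`.
    val options =
      s"""
         |OPTIONS('path'='$path',
         |'parquet.compression'='$codec')
       """.stripMargin
    val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
    s"""
       |CREATE TABLE $tableName USING Parquet $options $partitionCreate
       |AS SELECT 1 AS col1, 2 AS p
     """.stripMargin
  }

  def main(args: Array[String]): Unit =
    println(build("t", "/tmp/t", "gzip", isPartitioned = true))
}
```

Putting each clause on its own `|` line keeps the statement readable in the test source while leaving the generated SQL free of the source indentation.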
.doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
"uncompressed, snappy, gzip, lzo.")
.doc("Sets the compression codec used when writing Parquet files. If other compression codec " +
"configuration was found through hive or parquet, the precedence would be `compression`, " +
Sets the compression codec used when writing Parquet files. If either `compression` or `parquet.compression` is specified in the table-specific options/properties, the precedence would be `compression`, ...
Fix scala style
Change the description of spark.sql.parquet.compression
Change description
Test build #85741 has finished for PR 20076 at commit
Test build #85739 has finished for PR 20076 at commit
Test build #85740 has finished for PR 20076 at commit
LGTM Thanks! Merged to master/2.3
…quetOptions', `parquet.compression` needs to be considered.
[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.
## What changes were proposed in this pull request?
Since Hive 1.1, Hive allows users to set the Parquet compression codec via the table-level property `parquet.compression`. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858 . We do support `orc.compression` for ORC. Thus, for external users, it is more straightforward to support both. See the Stack Overflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties
On the Spark side, our table-level compression conf `compression` was added by #11464 in Spark 2.0.
We need to support both table-level confs. Users might also use the session-level conf `spark.sql.parquet.compression.codec`. The priority rule is:
If other compression codec configuration was found through hive or parquet, the precedence would be `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, snappy, gzip, lzo.
The rule for Parquet is consistent with ORC after the change.
Changes:
1. Also acquire 'compressionCodecClassName' from `parquet.compression`. The precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". `ParquetOptions` already supports "none" as equivalent to "uncompressed", but the configuration did not allow it to be set to "none".
3. Change `compressionCode` to `compressionCodecClassName`.
## How was this patch tested?
Add test.
Author: fjh100456
Closes #20076 from fjh100456/ParquetOptionIssue.
(cherry picked from commit 7b78041)
Signed-off-by: gatorsmile