[SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option() #11464
cc @rxin
@@ -487,6 +487,12 @@ def parquet(self, path, mode=None, partitionBy=None):
* ``error`` (default case): Throw an exception if data already exists.
:param partitionBy: names of partitioning columns

You can set the following Parquet-specific option(s) for writing Parquet files:
* ``compression`` (default ``None``): compression codec to use when saving to file.
maybe just put this as an argument?
How does this relate to #11408?
Test build #52305 has finished for PR 11464 at commit
|
Test build #52306 has finished for PR 11464 at commit
|
Test build #52307 has finished for PR 11464 at commit
|
I will handle the consistent stuff in a new PR based on #11408.
Test build #52312 has finished for PR 11464 at commit
|
@rxin Could you maybe merge this first, before dealing with the consistent compression name stuff, if it looks good? Let me fix the conflicts first.
There is a conflict. Maybe just make stuff consistent here?
@rxin Sure.
…pressed for test-based datasources
Test build #52349 has finished for PR 11464 at commit
|
Hm.. This will not explicitly set |
@rxin I think this is ready to be reviewed.
Test build #52360 has finished for PR 11464 at commit
|
Test build #52368 has finished for PR 11464 at commit
|
retest this please
@davies can you help review this? Thanks.
LGTM
@@ -396,6 +402,46 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
}
}

test("SPARK-13543 Set explicitly the output as uncompressed") {
what exactly does this test do? it's not super clear here
is it to test whether uncompressed mode would work?
Yes, which one would be better?
Should I just write something like "write the output as uncompressed via option"?
Yea - i think that's better
Test build #52378 has finished for PR 11464 at commit
|
Test build #52385 has finished for PR 11464 at commit
|
Test build #52384 has finished for PR 11464 at commit
|
Thanks - merging this in master.
Changes like this should probably also update the programming guide.
…et/ORC via option()

## What changes were proposed in this pull request?

This PR adds the support to specify compression codecs for both ORC and Parquet.

## How was this patch tested?

unittests within IDE and code style tests with `dev/run_tests`.

Author: hyukjinkwon <[email protected]>

Closes apache#11464 from HyukjinKwon/SPARK-13543.
…quetOptions', `parquet.compression` needs to be considered.

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?

Since Hive 1.1, Hive allows users to set the Parquet compression codec via the table-level property `parquet.compression`. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858. We already support `orc.compression` for ORC, so for external users it is more straightforward to support both. See the Stack Overflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties

On the Spark side, our table-level compression conf `compression` was added by #11464 in Spark 2.0. We need to support both table-level confs. Users might also use the session-level conf `spark.sql.parquet.compression.codec`. The priority rule is: if other compression codec configurations are found through Hive or Parquet, the precedence is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, snappy, gzip, lzo. The rule for Parquet is consistent with ORC after the change.

Changes:
1. Added acquiring 'compressionCodecClassName' from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Changed `spark.sql.parquet.compression.codec` to support "none". `ParquetOptions` already treats "none" as equivalent to "uncompressed", but the conf was not allowed to be set to "none".
3. Changed `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?

Add test.

Author: fjh100456 <[email protected]>

Closes #20076 from fjh100456/ParquetOptionIssue.

(cherry picked from commit 7b78041)
Signed-off-by: gatorsmile <[email protected]>
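The precedence rule described in the SPARK-21786 commit message can be sketched in plain Python. This is only an illustration, not Spark's `ParquetOptions` code: `resolve_parquet_codec` and `ACCEPTED_CODECS` are hypothetical names, and the case-insensitive matching is an assumption made for the sketch.

```python
from typing import Mapping

# Codec names listed as acceptable values in the commit message above.
ACCEPTED_CODECS = {"none", "uncompressed", "snappy", "gzip", "lzo"}


def resolve_parquet_codec(options: Mapping[str, str], session_default: str) -> str:
    """Pick a codec name using the stated precedence:
    `compression`, then `parquet.compression`, then the session-level
    value of `spark.sql.parquet.compression.codec` (passed as session_default).
    """
    chosen = session_default
    for key in ("compression", "parquet.compression"):
        value = options.get(key)
        if value is not None:
            chosen = value
            break

    # Assumption for this sketch: codec names are matched case-insensitively.
    name = chosen.lower()
    if name not in ACCEPTED_CODECS:
        raise ValueError(f"Unsupported Parquet codec: {chosen!r}")

    # "none" is treated as equivalent to "uncompressed".
    return "uncompressed" if name == "none" else name
```

With both table-level keys set, `compression` wins; with neither set, the session-level value is used.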
## What changes were proposed in this pull request?
This PR adds support for specifying compression codecs for both ORC and Parquet.
## How was this patch tested?
Unit tests within an IDE and code style tests with `dev/run_tests`.
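For readers who want to see what the option looks like from the PySpark side, here is a minimal sketch. `write_with_codec` is a hypothetical helper, not part of this PR; it assumes `df` is a PySpark `DataFrame` and relies only on the existing `DataFrameWriter` methods `format`, `option`, and `save`.

```python
def write_with_codec(df, path, fmt="parquet", codec="snappy"):
    """Save `df` at `path`, passing the codec through the `compression` option.

    `fmt` is expected to be a data source name such as "parquet" or "orc".
    Which codec names are accepted depends on the data source.
    """
    (
        df.write
        .format(fmt)                   # choose the output data source
        .option("compression", codec)  # per-write codec, as described in this PR
        .save(path)
    )
```

A call such as `write_with_codec(df, "/tmp/out", fmt="parquet", codec="gzip")` would then write gzip-compressed Parquet files, assuming the path does not already exist.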