[SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option() #11464

HyukjinKwon · 2016-03-02T07:29:51Z

What changes were proposed in this pull request?

This PR adds the support to specify compression codecs for both ORC and Parquet.

How was this patch tested?

Unit tests run within the IDE, and code style tests with dev/run_tests.

HyukjinKwon · 2016-03-02T07:37:46Z

cc @rxin

rxin · 2016-03-02T07:46:52Z

python/pyspark/sql/readwriter.py

@@ -487,6 +487,12 @@ def parquet(self, path, mode=None, partitionBy=None):
            * ``error`` (default case): Throw an exception if data already exists.
        :param partitionBy: names of partitioning columns

+        You can set the following Parquet-specific option(s) for writing Parquet files:
+            * ``compression`` (default ``None``): compression codec to use when saving to file.


maybe just put this as an argument?

rxin · 2016-03-02T07:48:33Z

How does this relate to #11408?

SparkQA · 2016-03-02T09:11:56Z

Test build #52305 has finished for PR 11464 at commit a96e510.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-03-02T09:21:35Z

@rxin I think I should change the codec names to lower case here.

SparkQA · 2016-03-02T09:34:41Z

Test build #52306 has finished for PR 11464 at commit d107d50.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-02T09:36:48Z

Test build #52307 has finished for PR 11464 at commit 9c53d10.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-03-02T10:06:50Z

I will handle the codec-name consistency in a new PR based on #11408.

SparkQA · 2016-03-02T11:46:43Z

Test build #52312 has finished for PR 11464 at commit 12e7275.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-03-02T23:47:13Z

@rxin If this looks good, could you merge it first, before we deal with making the compression codec names consistent?

Also, let me fix the conflicts first.

rxin · 2016-03-03T00:01:17Z

There is a conflict. Maybe just make stuff consistent here?

HyukjinKwon · 2016-03-03T00:13:24Z

@rxin Sure.


SparkQA · 2016-03-03T03:39:49Z

Test build #52349 has finished for PR 11464 at commit baf7a63.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-03-03T04:05:31Z

Hm, this does not yet explicitly set uncompressed or none for JSON, CSV and TEXT. Let me correct that and add some tests.
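For readers following the thread, here is a minimal sketch of the naming idea under discussion: short codec names are matched in lower case, and `none` and `uncompressed` both mean "write without compression". This is plain Python for illustration only, not the code in this PR. The function name `normalize_codec` and the set of known codecs are assumptions.

```python
# Hypothetical illustration; not Spark's implementation.
# Assumption: these short names are the ones a data source accepts.
KNOWN_CODECS = {"snappy", "gzip", "lzo"}
# Both spellings are treated as "no compression".
NO_COMPRESSION = {"none", "uncompressed"}


def normalize_codec(name):
    """Return a lower-case short codec name, or None for no compression."""
    if name is None:
        return None
    key = name.strip().lower()
    if key in NO_COMPRESSION:
        return None
    if key not in KNOWN_CODECS:
        known = ", ".join(sorted(KNOWN_CODECS | NO_COMPRESSION))
        raise ValueError(
            "Codec [%s] is not available. Known codecs are %s." % (name, known)
        )
    return key
```

With this shape, a writer would only need to check whether the result is `None` to decide whether to disable compression explicitly.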

HyukjinKwon · 2016-03-03T04:39:26Z

@rxin I think this is ready to be reviewed.

SparkQA · 2016-03-03T05:48:03Z

Test build #52360 has finished for PR 11464 at commit 4e212b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-03T06:15:41Z

Test build #52368 has finished for PR 11464 at commit 2817c9a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-03-03T06:19:14Z

retest this please

rxin · 2016-03-03T06:27:48Z

@davies can you help review this? Thanks.

davies · 2016-03-03T06:50:29Z

LGTM

rxin · 2016-03-03T07:08:58Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

@@ -396,6 +402,46 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
    }
  }

+  test("SPARK-13543 Set explicitly the output as uncompressed") {


What exactly does this test do? It's not very clear here.

Is it to test whether uncompressed mode works?

HyukjinKwon: Yes. Which name would be better? Should I just call it "write the output as uncompressed via option"?

rxin: Yeah, I think that's better.

SparkQA · 2016-03-03T08:20:43Z

Test build #52378 has finished for PR 11464 at commit 2817c9a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-03T10:04:24Z

Test build #52385 has finished for PR 11464 at commit e4d4cfb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-03T10:07:28Z

Test build #52384 has finished for PR 11464 at commit d2ace37.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-03-03T18:31:07Z

Thanks - merging this in master.

marmbrus · 2016-03-03T22:16:58Z

Changes like this should probably also update the programming guide.

…et/ORC via option()

## What changes were proposed in this pull request?

This PR adds the support to specify compression codecs for both ORC and Parquet.

## How was this patch tested?

unittests within IDE and code style tests with `dev/run_tests`.

Author: hyukjinkwon <[email protected]>

Closes apache#11464 from HyukjinKwon/SPARK-13543.

…quetOptions', `parquet.compression` needs to be considered.

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?

Since Hive 1.1, Hive allows users to set the Parquet compression codec via the table-level property `parquet.compression`. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858 . We already support `orc.compression` for ORC, so for external users it is more straightforward to support both. See the Stack Overflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties

On the Spark side, our table-level compression conf `compression` was added by #11464 in Spark 2.0. We need to support both table-level confs. Users might also use the session-level conf `spark.sql.parquet.compression.codec`. The priority rule is: if more than one compression codec configuration is found, the precedence is `compression`, then `parquet.compression`, then `spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, snappy, gzip, lzo. After the change, the rule for Parquet is consistent with ORC.

Changes:

1. Also acquire 'compressionCodecClassName' from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just as in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". `ParquetOptions` already treats "none" as equivalent to "uncompressed", but it could not be configured as "none".
3. Change `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?

Add test.

Author: fjh100456 <[email protected]>

Closes #20076 from fjh100456/ParquetOptionIssue.

(cherry picked from commit 7b78041)
Signed-off-by: gatorsmile <[email protected]>
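The precedence rule in this commit message can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the Scala code in `ParquetOptions`: the function name `resolve_parquet_codec`, the dictionary inputs, and the `snappy` fallback are all hypothetical choices made for the example.

```python
# Hypothetical illustration of the lookup order described above;
# not the actual ParquetOptions implementation.
TABLE_LEVEL_KEYS = ("compression", "parquet.compression")
SESSION_KEY = "spark.sql.parquet.compression.codec"


def resolve_parquet_codec(table_options, session_conf, default="snappy"):
    """Pick the codec name using the documented precedence.

    table_options: dict of table-level or writer options.
    session_conf:  dict standing in for the session configuration.
    default:       assumed fallback when nothing is configured.
    """
    # Table-level settings win, with `compression` checked first.
    for key in TABLE_LEVEL_KEYS:
        value = table_options.get(key)
        if value is not None:
            return value.lower()
    # Otherwise fall back to the session-level conf.
    return session_conf.get(SESSION_KEY, default).lower()
```

For example, when both `compression` and `parquet.compression` are present in the table options, the value of `compression` is the one returned.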


HyukjinKwon added 6 commits March 2, 2016 09:10

Add the support to specify compression codec for Parquet and Orc

9a43e72

Add some tests for testing compressions

9ce0ab7

Merge upstream

59a77d4

Add tests and some comments

04e4a51

Correct comments

a96e510

Remove the duplicated test

d107d50

Remove unused imports

9c53d10

rxin reviewed Mar 2, 2016

HyukjinKwon mentioned this pull request Mar 2, 2016

[SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation #11408

Closed

Add compression as an argument and make shorten names as lower cases

12e7275

HyukjinKwon added 2 commits March 3, 2016 09:14

Resolve conflicts

2304bfb

Add some comments for consistent compression names and add none/uncom…

baf7a63

…pressed for test-based datasources

HyukjinKwon added 2 commits March 3, 2016 12:46

Use lower-cases for compression codec names

99c39c6

Update error meesages for TextSuite

4e212b1

HyukjinKwon added 3 commits March 3, 2016 13:32

Add some tests and functionality for expliclty setting no compression

4c1ffc5

Remove unused variable

29a8d61

Add removed test

2817c9a

rxin reviewed Mar 3, 2016

HyukjinKwon added 2 commits March 3, 2016 17:06

Update test names

d2ace37

Make the options function

e4d4cfb

asfgit closed this in cf95d72 Mar 3, 2016

HyukjinKwon mentioned this pull request Mar 4, 2016

Support for specifying custom date format for date and timestamp types. databricks/spark-csv#280

Closed

HyukjinKwon deleted the SPARK-13543 branch October 1, 2016 06:42

gatorsmile mentioned this pull request Dec 26, 2017

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', parquet.compression needs to be considered. #20076

Closed
