SPARK-22833 [Improvement] in SparkHive Scala Examples #20018
Conversation
@holdenk @sameeragarwal Please review and do the needful.
Can one of the admins verify this patch?
@@ -104,6 +103,60 @@ object SparkHiveExample {
// ...
// $example off:spark_hive$
Do you not want the code below to render in the docs as part of the example? maybe not, just checking if that's intentional.
@srowen Thank you for the valuable review feedback. I have added that so it can help other developers.
@srowen Can you please review this? cc @holdenk @sameeragarwal
@srowen I have updated the DDL for storing data with partitioning in Hive.
cc @HyukjinKwon @mgaido91 @markgrover @markhamstra
Why do you turn the example listing off then on again? just remove those two lines
@srowen I misunderstood your first comment. I have reverted as suggested. Please check now.
@srowen Can you please review, and if everything seems correct then run the test build?
Adding other contributors of the same file for review. cc
Not sure how much this helps, but if it's just showing correct usage of some APIs, fine. Not as sure about the comments.
/*
Oh, just noticed this. You're using Javadoc-style comments here, but they won't have any effect. Just use the // style for comments that you see above, for consistency.
+1
@srowen Done, changes addressed
hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
  .parquet(hiveExternalTableLocation)
/*
If Data volume is very huge, then every partitions would have many small-small files which may harm
This is more stuff that should go in docs, not comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? that's the only goal right here.
What are small-small files? You have some inconsistent capitalization; Parquet should be capitalized but not file, bandwidth, etc.
@srowen I totally agree with you. I will rephrase the content for the docs; I have removed it from here for now. Please check and do the needful.
/*
You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
Sentences need some cleanup here. What do you mean by 'Int' argument? maybe it's best to point people to the API docs rather than incompletely repeat it.
@srowen done.
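For readers following this thread, the repartition-vs-coalesce distinction discussed above can be sketched as follows. This is a minimal sketch, not code from the PR: the table name and output paths are assumptions for illustration, and it needs a Spark build with Hive support.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumes a Hive metastore is available and a table `records` exists.
val spark = SparkSession.builder()
  .appName("PartitionFileCountSketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("records")

// repartition(n) performs a full shuffle and redistributes rows evenly
// across n partitions; it can increase or decrease the partition count.
df.repartition(8).write.mode(SaveMode.Overwrite).parquet("/tmp/records_repartitioned")

// coalesce(n) only merges existing partitions down to n, avoiding a full
// shuffle; it is the cheaper way to reduce the number of output files.
df.coalesce(2).write.mode(SaveMode.Overwrite).parquet("/tmp/records_coalesced")
```

See the Dataset API docs for the exact semantics; coalesce cannot increase the partition count.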
* 2. Create Hive Managed table with storage format as 'Parquet'
* Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
*/
val hiveTableDF = sql("SELECT * FROM records").toDF() |
.toDF is not needed
actually, I think spark.table("records") is a better example.
@srowen Done, removed toDF(). cc @cloud-fan
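As a side-by-side sketch of the suggestion above (assuming an active SparkSession named spark and an existing Hive table records):

```scala
// Both expressions yield a DataFrame over the Hive table `records`;
// sql(...) already returns a DataFrame, so .toDF() is redundant, and
// spark.table is the more direct form for reading a whole table.
val viaSql   = spark.sql("SELECT * FROM records")
val viaTable = spark.table("records")
```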
/*
* Save DataFrame to Hive External table as compatible parquet format.
* 1. Create Hive External table with storage format as parquet.
* Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; |
It's weird to create an external table without a location. Users may be confused about the difference between a managed table and an external table.
@cloud-fan We'll keep all the descriptive comments in the documentation with user-friendly wording. I have also added the location.
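A sketch of the DDL with an explicit location, as discussed; the path is an assumption for illustration, and a SparkSession named spark with Hive support is assumed:

```scala
// An external table should declare LOCATION explicitly: the data lives
// outside the Hive warehouse directory, and dropping the table later
// removes only the metadata, not the underlying files.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS records(key INT, value STRING)
    |STORED AS PARQUET
    |LOCATION '/user/hive/warehouse/records'""".stripMargin)
```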
on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
*/
hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite) |
This is not a standard usage, let's not put it in the example.
@cloud-fan Removed all comments; as discussed with @srowen, it makes more sense to have them in the docs, with the inconsistency removed.
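For contrast with the non-standard form criticized above, a sketch of the plain partitioned write (assuming the hiveTableDF and hiveExternalTableLocation values from the example):

```scala
// Standard usage: partitionBy alone lays the output out as
// <location>/key=<value>/...; an explicit repartition($"key") first
// is not required for correctness.
hiveTableDF.write.mode(SaveMode.Overwrite)
  .partitionBy("key")
  .parquet(hiveExternalTableLocation)
```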
*/
// coalesce of 10 could create 10 parquet files under each partitions,
// if data is huge and make sense to do partitioning.
hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite) |
ditto
English is not my native language but let's keep it clean and consistent as they are examples.
// Create Hive managed table with parquet |
parquet -> Parquet
@HyukjinKwon Thanks for highlighting; improved the same.
// Create Hive managed table with parquet
sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
// Save DataFrame to Hive Managed table as Parquet format |
Managed -> managed
@HyukjinKwon Thanks for highlighting; improved the same.
// Multiple parquet files could be created accordingly to volume of data under directory given.
val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
// Save DataFrame to Hive External table as compatible parquet format |
parquet -> Parquet
@HyukjinKwon Thanks for highlighting; improved the same.
// Create External Hive table with parquet
sql("CREATE EXTERNAL TABLE records(key int, value string) " +
  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
// to make Hive parquet format compatible with spark parquet format |
parquet -> Parquet
spark -> Spark
@HyukjinKwon Thanks for highlighting; improved the same.
// to make Hive parquet format compatible with spark parquet format
spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
// Multiple parquet files could be created accordingly to volume of data under directory given. |
parquet -> Parquet
@HyukjinKwon Thanks for highlighting; improved the same.
// Save DataFrame to Hive External table as compatible parquet format
hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
// turn on flag for Dynamic Partitioning |
turn -> Turn
@HyukjinKwon Thanks for highlighting; improved the same.
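The flag referred to in the "turn on flag for Dynamic Partitioning" comment above corresponds to the standard Hive dynamic-partitioning settings; a sketch, assuming a SparkSession named spark with Hive support:

```scala
// Enable Hive dynamic partitioning before inserting into a partitioned
// Hive table without listing every partition value explicitly.
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
```

The nonstrict mode allows all partition columns to be determined dynamically from the data.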
hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
  .parquet(hiveExternalTableLocation)
// reduce number of files for each partition by repartition |
reduce -> Reduce
@HyukjinKwon Thanks for highlighting; improved the same.
hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
  .partitionBy("key").parquet(hiveExternalTableLocation)
// Control number of files in each partition by coalesce |
Control number of files -> Control the number of files
@HyukjinKwon Thanks for highlighting; improved the same.
@HyukjinKwon @srowen Kindly review now; if it looks good, please merge. Thanks.
@chetkhatri no need to keep pinging. We intentionally leave these changes open for review for a day or more to make sure everyone has seen it who wants to.
@srowen Apologies, I was not aware that PMC members get an auto notification for the same.
Merged to master |
Seems this did not pass the tests; this causes a build failure: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85343/console
Thanks @HyukjinKwon |
Thank you @wangyum :D. |
Thanks @HyukjinKwon @wangyum |
What changes were proposed in this pull request?
Improvements made to the SparkHive Scala examples:
How was this patch tested?