
SPARK-22833 [Improvement] in SparkHive Scala Examples #20018

Closed
wants to merge 8 commits

Conversation

chetkhatri
Contributor

What changes were proposed in this pull request?

SparkHive Scala Examples Improvement made:

  • Writing a DataFrame/Dataset to a Hive managed table and a Hive external table using different storage formats.
  • Examples of partitioning, repartition, and coalesce.

How was this patch tested?

  • Patch has been tested manually and by running ./dev/run-tests.
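The changes described above can be condensed into a sketch like the following (a hypothetical, minimal reconstruction, not the merged file itself; the table name, database path, and partition counts are illustrative):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkHiveExampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkHiveExampleSketch")
      .enableHiveSupport() // requires a Hive-enabled Spark build
      .getOrCreate()
    import spark.implicits._

    // Hive managed table stored as Parquet
    spark.sql("CREATE TABLE IF NOT EXISTS records(key int, value string) STORED AS PARQUET")
    val hiveTableDF = spark.table("records")

    // Write to an external-table location, partitioned by key
    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    hiveTableDF.write.mode(SaveMode.Overwrite)
      .partitionBy("key")
      .parquet(hiveExternalTableLocation)

    // Reduce the number of files per partition via repartition (full shuffle) ...
    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
      .partitionBy("key").parquet(hiveExternalTableLocation)
    // ... or cap the file count via coalesce (no full shuffle)
    hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
      .parquet(hiveExternalTableLocation)

    spark.stop()
  }
}
```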

@chetkhatri
Contributor Author

@holdenk @sameeragarwal Please review and do the needful.

@AmplabJenkins

Can one of the admins verify this patch?

@@ -104,6 +103,60 @@ object SparkHiveExample {
// ...
// $example off:spark_hive$
Member

Do you not want the code below to render in the docs as part of the example? Maybe not; just checking whether that's intentional.

Contributor Author

@srowen Thank you for the valuable review feedback. I added that so it can help other developers.

Contributor Author

@srowen Can you please review this? cc @holdenk @sameeragarwal

Contributor Author

@srowen I have updated the DDL for storing data with partitioning in Hive.
cc @HyukjinKwon @mgaido91 @markgrover @markhamstra

Member

Why do you turn the example listing off and then on again? Just remove those two lines.

Contributor Author

@srowen I misunderstood your first comment. I have reverted it as suggested. Please check now.

@chetkhatri
Contributor Author

@srowen Can you please review, and if everything looks correct, run the test build?

@chetkhatri
Contributor Author

Adding other contributors to the same file for review. cc
@cloud-fan
@aokolnychyi
@liancheng
@HyukjinKwon

Member

@srowen srowen left a comment


Not sure how much this helps, but if it's just showing correct usage of some APIs, fine. Not as sure about the comments.


/*
Member

Oh, just noticed this. You're using Javadoc-style comments here, but they won't have any effect.
Just use the // block style for comments that you see above, for consistency.

Contributor

+1

Contributor Author

@srowen Done, changes addressed

hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
.parquet(hiveExternalTableLocation)
/*
If Data volume is very huge, then every partitions would have many small-small files which may harm
Member

This is more stuff that should go in docs, not comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? that's the only goal right here.

What are small-small files? You have some inconsistent capitalization; Parquet should be capitalized but not file, bandwidth, etc.

Contributor Author

@srowen I totally agree with you. I will rephrase the content for the docs; I have removed it from here for now. Please check and do the needful.


/*
You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
Member

Sentences need some cleanup here. What do you mean by 'Int' argument? Maybe it's best to point people to the API docs rather than incompletely repeat them.

Contributor Author

@srowen done.
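For reference, the distinction the reviewer is pointing at: repartition performs a full shuffle and redistributes rows evenly across the requested number of partitions, while coalesce only merges existing partitions downward. A minimal sketch, assuming a DataFrame `df` is in scope (the partition counts are illustrative):

```scala
// repartition(n): full shuffle, rows redistributed evenly across n partitions
val shuffled = df.repartition(200)
// coalesce(n): narrow transformation, merges existing partitions down to at
// most n without a full shuffle -- cheaper, but data may stay skewed
val merged = df.coalesce(10)
// When writing, the partition count bounds the number of output files per directory.
```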

* 2. Create Hive Managed table with storage format as 'Parquet'
* Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
*/
val hiveTableDF = sql("SELECT * FROM records").toDF()
Contributor

.toDF is not needed

Contributor

actually, I think spark.table("records") is a better example.

Contributor Author

@srowen done. cc @cloud-fan, removed toDF().
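The change being suggested, sketched (assuming a Hive-enabled SparkSession `spark` is in scope, as in the example file):

```scala
// Before: spark.sql already returns a DataFrame, so .toDF() is redundant
val viaSql = spark.sql("SELECT * FROM records").toDF()
// After: spark.table is the more direct way to read an existing table
val viaTable = spark.table("records")
```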

/*
* Save DataFrame to Hive External table as compatible parquet format.
* 1. Create Hive External table with storage format as parquet.
* Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
Contributor

It's weird to create an external table without a location. Users may be confused about the difference between a managed table and an external table.

Contributor Author

@cloud-fan we'll keep all the descriptive comments in the documentation, with user-friendly wording. I have also added the location.
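An external-table DDL with an explicit LOCATION clause, which makes the managed/external distinction visible, might look like this (a sketch; the path is illustrative, and `spark` is assumed to be a Hive-enabled SparkSession):

```scala
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS records(key int, value string)
    |STORED AS PARQUET
    |LOCATION '/user/hive/warehouse/database_name.db/records'""".stripMargin)
```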

on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
*/
hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
Contributor

This is not a standard usage, let's not put it in the example.

Contributor Author

@cloud-fan removed all comments; as discussed with @srowen, it makes more sense to have them in the docs, with the inconsistencies removed.

*/
// coalesce of 10 could create 10 parquet files under each partitions,
// if data is huge and make sense to do partitioning.
hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
Contributor

ditto

Member

@HyukjinKwon HyukjinKwon left a comment


English is not my native language but let's keep it clean and consistent as they are examples.


// Create Hive managed table with parquet
Member

parquet -> Parquet

Contributor Author

@HyukjinKwon Thanks for highlighting this; fixed.


// Create Hive managed table with parquet
sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
// Save DataFrame to Hive Managed table as Parquet format
Member

Managed -> managed

Contributor Author

@HyukjinKwon Thanks for highlighting this; fixed.

// Multiple parquet files could be created accordingly to volume of data under directory given.
val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"

// Save DataFrame to Hive External table as compatible parquet format
Member

parquet -> Parquet

Contributor Author

@HyukjinKwon Thanks for highlighting this; fixed.

// Create External Hive table with parquet
sql("CREATE EXTERNAL TABLE records(key int, value string) " +
"STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
// to make Hive parquet format compatible with spark parquet format
Member

parquet -> Parquet

Member

spark -> Spark

Contributor Author

@HyukjinKwon Thanks for highlighting this; fixed.

// to make Hive parquet format compatible with spark parquet format
spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")

// Multiple parquet files could be created accordingly to volume of data under directory given.
Member

parquet -> Parquet.

Contributor Author

@HyukjinKwon Thanks for highlighting this; fixed.
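For context on the `spark.sql.parquet.writeLegacyFormat` setting shown in the diff above: it makes Spark write Parquet in the legacy layout that older Hive and Impala readers expect. The `spark.conf` form below is, as far as I know, equivalent to the `sqlContext.setConf` call in the diff:

```scala
// Write Parquet in the legacy, Hive/Impala-compatible format
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
```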

// Save DataFrame to Hive External table as compatible parquet format
hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)

// turn on flag for Dynamic Partitioning
Member

turn -> Turn.

Contributor Author

@HyukjinKwon Thanks for highlighting this; fixed.
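The "flag for Dynamic Partitioning" mentioned in the diff typically refers to the Hive settings below (a sketch; exact settings depend on the Hive version, and `spark` is assumed to be a Hive-enabled SparkSession):

```scala
// Enable Hive dynamic partitioning before a partitioned write
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
```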

hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
.parquet(hiveExternalTableLocation)

// reduce number of files for each partition by repartition
Member

reduce -> Reduce.

Contributor Author

@HyukjinKwon Thanks for highlighting this; fixed.

hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
.partitionBy("key").parquet(hiveExternalTableLocation)

// Control number of files in each partition by coalesce
Member

Control number of files -> Control the number of files

Contributor Author

@HyukjinKwon Thanks for highlighting this; fixed.

@chetkhatri
Contributor Author

@HyukjinKwon @srowen Kindly review now; if it looks good, please merge. Thanks.

@srowen
Member

srowen commented Dec 22, 2017

@chetkhatri no need to keep pinging. We intentionally leave these changes open for review for a day or more to make sure everyone has seen it who wants to.

@chetkhatri
Contributor Author

@srowen Apologies, I was not aware that PMC members get an automatic notification for this.

@srowen
Member

srowen commented Dec 23, 2017

Merged to master

@asfgit asfgit closed this in 86db9b2 Dec 23, 2017
@HyukjinKwon
Member

Seems this did not pass the tests; this causes a build failure:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85343/console

========================================================================
Running Scala style checks
========================================================================
Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala:138:0: Whitespace at end of line
[error] Total time: 13 s, completed Dec 23, 2017 7:34:15 AM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received return code 1

@HyukjinKwon
Member

HyukjinKwon commented Dec 23, 2017

I just opened a quick hotfix, #20065, since I think we don't run the examples in the tests and all we need is the style fix.

Reverting also works fine for me @srowen; I can close mine.

@wangyum
Member

wangyum commented Dec 23, 2017

Thanks @HyukjinKwon

@HyukjinKwon
Member

Thank you @wangyum :D.

@chetkhatri
Contributor Author

Thanks @HyukjinKwon @wangyum
