SPARK-22833 [Improvement] in SparkHive Scala Examples #20018
Conversation
@holdenk @sameeragarwal Please review and do the needful.
Can one of the admins verify this patch?
@@ -104,6 +103,60 @@ object SparkHiveExample {
// ...
// $example off:spark_hive$
Do you not want the code below to render in the docs as part of the example? maybe not, just checking if that's intentional.
@srowen Thank you for the valuable review feedback. I have added that so it can help other developers.
@srowen Can you please review this? cc @holdenk @sameeragarwal
@srowen I have updated the DDL for storing data with partitioning in Hive.
cc @HyukjinKwon @mgaido91 @markgrover @markhamstra
Why do you turn the example listing off then on again? just remove those two lines
@srowen I misunderstood your first comment. I have reverted as suggested. Please check now.
@srowen Can you please review, and if everything seems correct then run the test build?
Adding other contributors of the same file for review. cc
Not sure how much this helps, but if it's just showing correct usage of some APIs, fine. Not as sure about the comments.
/*
Oh, just noticed this. You're using Javadoc-style comments here, but they won't have any effect. Just use the // style for comments that you see above, for consistency.
+1
@srowen Done, changes addressed
hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
  .parquet(hiveExternalTableLocation)
/*
If Data volume is very huge, then every partitions would have many small-small files which may harm
This is more stuff that should go in docs, not comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? that's the only goal right here.
What are small-small files? You have some inconsistent capitalization; Parquet should be capitalized but not file, bandwidth, etc.
@srowen I totally agree with you. I will rephrase the content for the docs; I have removed it from here for now. Please check and do the needful.
/*
You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
Sentences need some cleanup here. What do you mean by 'Int' argument? maybe it's best to point people to the API docs rather than incompletely repeat it.
@srowen done.
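For readers following this thread, the repartition-vs-coalesce distinction discussed above can be sketched as follows. This is a minimal sketch, not code from the PR: the table name and output paths are assumptions for illustration, and it needs a Spark build with Hive support.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumes a Hive metastore is available and a table `records` exists.
val spark = SparkSession.builder()
  .appName("PartitionFileCountSketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("records")

// repartition(n) performs a full shuffle and redistributes rows evenly
// across n partitions; it can increase or decrease the partition count.
df.repartition(8).write.mode(SaveMode.Overwrite).parquet("/tmp/records_repartitioned")

// coalesce(n) only merges existing partitions down to n, avoiding a full
// shuffle; it is the cheaper way to reduce the number of output files.
df.coalesce(2).write.mode(SaveMode.Overwrite).parquet("/tmp/records_coalesced")
```

See the Dataset API docs for the exact semantics; coalesce cannot increase the partition count.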
* 2. Create Hive Managed table with storage format as 'Parquet'
* Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
*/
val hiveTableDF = sql("SELECT * FROM records").toDF() |
.toDF is not needed
actually, I think spark.table("records") is a better example.
@srowen Done, removed toDF(). cc @cloud-fan
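As a side-by-side sketch of the suggestion above (assuming an active SparkSession named spark and an existing Hive table records):

```scala
// Both expressions yield a DataFrame over the Hive table `records`;
// sql(...) already returns a DataFrame, so .toDF() is redundant, and
// spark.table is the more direct form for reading a whole table.
val viaSql   = spark.sql("SELECT * FROM records")
val viaTable = spark.table("records")
```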
/*
* Save DataFrame to Hive External table as compatible parquet format.
* 1. Create Hive External table with storage format as parquet.
* Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; |
It's weird to create an external table without a location. Users may be confused about the difference between a managed table and an external table.
@cloud-fan We'll keep all the descriptive comments in the documentation with user-friendly wording. I have also added the location.
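A sketch of the DDL with an explicit location, as discussed; the path is an assumption for illustration, and a SparkSession named spark with Hive support is assumed:

```scala
// An external table should declare LOCATION explicitly: the data lives
// outside the Hive warehouse directory, and dropping the table later
// removes only the metadata, not the underlying files.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS records(key INT, value STRING)
    |STORED AS PARQUET
    |LOCATION '/user/hive/warehouse/records'""".stripMargin)
```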
on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
*/
hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite) |
This is not a standard usage, let's not put it in the example.
@cloud-fan Removed all comments; as discussed with @srowen, it makes more sense to have them in the docs, with the inconsistency removed.
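For contrast with the non-standard form criticized above, a sketch of the plain partitioned write (assuming the hiveTableDF and hiveExternalTableLocation values from the example):

```scala
// Standard usage: partitionBy alone lays the output out as
// <location>/key=<value>/...; an explicit repartition($"key") first
// is not required for correctness.
hiveTableDF.write.mode(SaveMode.Overwrite)
  .partitionBy("key")
  .parquet(hiveExternalTableLocation)
```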
*/
// coalesce of 10 could create 10 parquet files under each partitions,
// if data is huge and make sense to do partitioning.
hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite) |
ditto
English is not my native language but let's keep it clean and consistent as they are examples.
// Create Hive managed table with parquet |
parquet -> Parquet
@HyukjinKwon Thanks for highlighting; improved the same.
// Create Hive managed table with parquet
sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
// Save DataFrame to Hive Managed table as Parquet format |
Managed -> managed
@HyukjinKwon Thanks for highlighting; improved the same.
// Multiple parquet files could be created accordingly to volume of data under directory given.
val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
// Save DataFrame to Hive External table as compatible parquet format |
parquet -> Parquet
@HyukjinKwon Thanks for highlighting; improved the same.
// Create External Hive table with parquet
sql("CREATE EXTERNAL TABLE records(key int, value string) " +
  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
// to make Hive parquet format compatible with spark parquet format |
parquet -> Parquet
spark -> Spark
@HyukjinKwon Thanks for highlighting; improved the same.
// to make Hive parquet format compatible with spark parquet format
spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
// Multiple parquet files could be created accordingly to volume of data under directory given. |
parquet -> Parquet
@HyukjinKwon Thanks for highlighting; improved the same.
// Save DataFrame to Hive External table as compatible parquet format
hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
// turn on flag for Dynamic Partitioning |
turn -> Turn
@HyukjinKwon Thanks for highlighting; improved the same.
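The flag referred to in the "turn on flag for Dynamic Partitioning" comment above corresponds to the standard Hive dynamic-partitioning settings; a sketch, assuming a SparkSession named spark with Hive support:

```scala
// Enable Hive dynamic partitioning before inserting into a partitioned
// Hive table without listing every partition value explicitly.
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
```

The nonstrict mode allows all partition columns to be determined dynamically from the data.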
hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
  .parquet(hiveExternalTableLocation)
// reduce number of files for each partition by repartition |
reduce -> Reduce
@HyukjinKwon Thanks for highlighting; improved the same.
hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
  .partitionBy("key").parquet(hiveExternalTableLocation)
// Control number of files in each partition by coalesce |
Control number of files -> Control the number of files
@HyukjinKwon Thanks for highlighting; improved the same.
@HyukjinKwon @srowen Kindly review now; if it looks good, please merge. Thanks.
@chetkhatri no need to keep pinging. We intentionally leave these changes open for review for a day or more to make sure everyone has seen it who wants to.
@srowen Apologies, I was not aware that PMC members get an auto notification for the same.
Merged to master |
Seems this did not pass the tests; this causes a build failure: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85343/console
Thanks @HyukjinKwon |
Thank you @wangyum :D. |
Thanks @HyukjinKwon @wangyum |
What changes were proposed in this pull request?
Improvements made to the SparkHive Scala examples:
How was this patch tested?