[SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation #4677
Conversation
Test build #27689 has started for PR 4677 at commit
Force-pushed from 07c8f12 to 7534d14.
@jkbradley I just wanted to know if this is in the right direction.
Test build #27690 has started for PR 4677 at commit
Test FAILed.
Test FAILed.
jenkins, test this please
Test build #27691 has started for PR 4677 at commit
My recommendations:
We had discussed providing a helper method evaluateEachIteration() in the JIRA, but I'd prefer to have that be a separate JIRA and PR. Does that sound good? Thanks!
Test build #27691 has finished for PR 4677 at commit
Test PASSed.
Force-pushed from 7534d14 to e008936.
Test build #27711 has started for PR 4677 at commit
Force-pushed from e008936 to 77549a9.
Test build #27712 has started for PR 4677 at commit
Test build #27711 has finished for PR 4677 at commit
Test PASSed.
Test build #27712 has finished for PR 4677 at commit
Test PASSed.
Test build #27716 has started for PR 4677 at commit
@jkbradley I have fixed up your comments. Btw, why are there both a train and a run method? They seem to do the same thing; would it not be better to have a single way of doing it? I also have a doubt about the classification case. It seems that, for each iteration, the problem is explicitly converted into a regression problem with the labels mapped to {-1, 1}. Is it okay to stop early when this regression error no longer decreases on the validation data for a classification problem (which seems slightly awkward to me)? (https://github.com/apache/spark/pull/4677/files#diff-7b5c1db0b1926a36f418b53fcf807db0R227) Note that I had to explicitly set the algorithm to Regression to make sure that this test passes (https://github.com/apache/spark/pull/4677/files#diff-d3159b88ae0ed6ff096ff8850ecac26eR207). Otherwise, the classification error seems to be the same both with and without validation.
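For reference, a minimal sketch of the label remapping being discussed, assuming binary {0, 1} class labels; the function name and structure here are illustrative, not the exact code in the linked diff:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Remap {0, 1} class labels to {-1, +1} so that each boosting iteration can be
// treated as a regression fit on the signed labels.
def remapToRegressionLabels(input: RDD[LabeledPoint]): RDD[LabeledPoint] =
  input.map(lp => LabeledPoint(if (lp.label == 0.0) -1.0 else 1.0, lp.features))
```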
Test build #27716 has finished for PR 4677 at commit
Test PASSed.
Test build #27720 has started for PR 4677 at commit
Test build #27720 has finished for PR 4677 at commit
Test PASSed.
val gbtValidate = new GradientBoostedTrees(boostingStrategy).runWithValidation(
  trainRdd, validateRdd)
assert(gbtValidate.numTrees != numIterations)
Use !== (handles types better).
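As a side note on the suggestion above, here is a minimal ScalaTest sketch, assuming an older-style org.scalatest.FunSuite as Spark's test code used at the time; the suite name and values are illustrative, not taken from the PR. `!==` comes from ScalaTest's triple-equals assertions; compared with plain `!=`, it can give clearer failure messages and, in its type-checked variant, a compile-time check that both sides have compatible types.

```scala
import org.scalatest.FunSuite

class EarlyStoppingSuite extends FunSuite {
  test("validated run builds fewer trees than the iteration limit") {
    val numIterations = 20
    val numTrees = 12 // stand-in for gbtValidate.numTrees in the PR's test

    // The reviewer's suggestion: prefer ScalaTest's !== over plain !=.
    assert(numTrees !== numIterations)
  }
}
```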
OK, I've made a close pass, so hopefully those are my final comments.
Force-pushed from 52a080b to e4d799b.
Test build #27847 has started for PR 4677 at commit
@jkbradley Addressed all your comments except the inline one.
Test build #27847 has finished for PR 4677 at commit
Test PASSed.
// Test that it stops early.
val gbtValidate = new GradientBoostedTrees(boostingStrategy).
  runWithValidation(trainRdd, validateRdd)
Please put the period (.) on the line with runWithValidation:
val gbtValidate = new GradientBoostedTrees(boostingStrategy)
  .runWithValidation(trainRdd, validateRdd)
Also, what about shortening the code to combine the Classification and Regression tests? Did that not work out?
Test build #27889 has started for PR 4677 at commit
@jkbradley I have fixed up your comments! Hopefully good to go. [off-topic]
Test build #27889 has finished for PR 4677 at commit
Test PASSed.
@MechCoder Thanks for pinging us about GSoC! Please check my reply on the dev list.
LGTM. I'll merge it into master.
Done. Thanks for the PR!
One can stop training early if the decrease in error rate on the validation data is smaller than a certain tolerance, or if the error starts to increase because the training data is being overfit.
This introduces a new method, runWithValidation, which takes in a pair of RDDs: one for the training data and the other for the validation data.
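A minimal usage sketch follows; the data loading, the random split, and the parameter values are assumed setup for illustration, not part of the PR itself:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.rdd.RDD

// `data` is assumed to be an already-loaded RDD[LabeledPoint].
def trainWithEarlyStopping(data: RDD[LabeledPoint]) = {
  // Hold out a slice of the data to validate against.
  val Array(trainRdd, validateRdd) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

  val boostingStrategy = BoostingStrategy.defaultParams("Regression")
  boostingStrategy.numIterations = 100 // upper bound; validation may stop earlier
  boostingStrategy.treeStrategy.maxDepth = 5
  // Empty map: all features are treated as continuous.
  boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

  // Trains on trainRdd and checks the loss on validateRdd after each iteration,
  // stopping once the improvement falls below the tolerance (or the validation
  // error starts to rise), so model.numTrees may be smaller than numIterations.
  new GradientBoostedTrees(boostingStrategy).runWithValidation(trainRdd, validateRdd)
}
```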