[SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation #4677

Closed
wants to merge 8 commits into from

MechCoder
Contributor

Training can stop early if the decrease in error is less than a certain tolerance (tol), or if the error increases because the model is overfitting the training data.

This introduces a new method, runWithValidation, which takes a pair of RDDs: one for the training data and the other for validation.
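
A minimal sketch of the stopping rule described above, in plain Scala with no Spark dependency. The names EarlyStoppingSketch and iterationsToKeep, and the exact comparison against the tolerance, are assumptions made for illustration; this is not the code in the PR.

```scala
object EarlyStoppingSketch {
  /** Given the validation error after each boosting iteration, return how many
    * iterations to keep. Stops at the first iteration whose improvement over
    * the previous one is smaller than `tol` (an increase counts as a negative
    * improvement, so it also stops). Hypothetical helper, not a Spark API.
    */
  def iterationsToKeep(errors: Seq[Double], tol: Double): Int = {
    require(errors.nonEmpty, "need at least one iteration")
    // Index of the first consecutive pair whose improvement is below tol.
    val stopAt = errors.sliding(2).indexWhere {
      case Seq(prev, curr) => prev - curr < tol
      case _ => false // a single iteration has nothing to compare against
    }
    if (stopAt < 0) errors.length else stopAt + 1
  }
}
```

For example, with validation errors 1.0, 0.5, 0.45, 0.6 and a tolerance of 0.1, the sketch keeps 2 iterations, because the third iteration improves the error by only about 0.05.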

@SparkQA

SparkQA commented Feb 18, 2015

Test build #27689 has started for PR 4677 at commit 07c8f12.

  • This patch merges cleanly.

@MechCoder
Contributor Author

@jkbradley I just wanted to know if this is in the right direction.

@SparkQA

SparkQA commented Feb 18, 2015

Test build #27690 has started for PR 4677 at commit 7534d14.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27689/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27690/
Test FAILed.

@shaneknapp
Contributor

jenkins, test this please

@SparkQA

SparkQA commented Feb 18, 2015

Test build #27691 has started for PR 4677 at commit 7534d14.

  • This patch merges cleanly.

@jkbradley
Member

My recommendations:

  • Name the new run() method something more explicit like runWithValidation().
  • Don't add the extra train() methods since users can use run().
  • Rename convergenceTol to something like validationTol (since convergenceTol applies to the training set in general).
  • Generalize the boost() method rather than duplicating it.

We had discussed providing a helper method evaluateEachIteration() in the JIRA, but I'd prefer to have that be a separate JIRA and PR. Does that sound good?

Thanks!

@SparkQA

SparkQA commented Feb 18, 2015

Test build #27691 has finished for PR 4677 at commit 7534d14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27691/
Test PASSed.

@MechCoder changed the title from "[SPARK-5436] [MLlib] Validate GradientBoostedTrees during train" to "[SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation" on Feb 19, 2015

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27711 has started for PR 4677 at commit e008936.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27712 has started for PR 4677 at commit 77549a9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27711 has finished for PR 4677 at commit e008936.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27711/
Test PASSed.

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27712 has finished for PR 4677 at commit 77549a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27712/
Test PASSed.

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27716 has started for PR 4677 at commit 3e74372.

  • This patch merges cleanly.

@MechCoder
Contributor Author

@jkbradley I have addressed your comments.

By the way, why are there both a train and a run method, which seem to me to do the same thing? Wouldn't it be better to have one way of doing things?

I also have a question about the classification case. It seems to me that for each iteration, the problem is explicitly changed to a regression problem with labels mapped to {-1, 1}. Is it okay to stop when this regression error no longer decreases on the validation data for a classification problem (which seems slightly awkward to me)? (https://github.com/apache/spark/pull/4677/files#diff-7b5c1db0b1926a36f418b53fcf807db0R227)

Note that I had to explicitly set it to Regression to make sure that this test passes (https://github.com/apache/spark/pull/4677/files#diff-d3159b88ae0ed6ff096ff8850ecac26eR207). Otherwise, the classification error seems to be the same both with and without validation.
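
To illustrate the question above, here is a small sketch in plain Scala showing how a regression-style error on labels remapped to {-1, 1} can keep decreasing while the classification error stays the same. The object LabelRemapSketch, its helper names, the 0/1 input labels, and the use of squared error are all assumptions for illustration; the loss actually used in the PR may differ.

```scala
object LabelRemapSketch {
  // Map a binary class label in {0, 1} to {-1, +1} (assumed input encoding).
  def remap(label: Double): Double = 2.0 * label - 1.0

  // Mean squared error of raw scores against the remapped labels.
  def squaredError(scores: Seq[Double], labels: Seq[Double]): Double =
    scores.zip(labels).map { case (s, y) => math.pow(s - remap(y), 2) }.sum / scores.length

  // Fraction of examples whose thresholded score disagrees with the 0/1 label.
  def classificationError(scores: Seq[Double], labels: Seq[Double]): Double =
    scores.zip(labels).count { case (s, y) =>
      (if (s > 0) 1.0 else 0.0) != y
    }.toDouble / scores.length
}
```

With labels Seq(1.0, 0.0), the scores Seq(0.2, -0.2) and Seq(0.9, -0.9) both classify every example correctly, yet the second set has a much smaller squared error. A stopping rule based on the regression error would therefore still see progress where one based on the classification error would not.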

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27716 has finished for PR 4677 at commit 3e74372.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Partitioner(object):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27716/
Test PASSed.

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27720 has started for PR 4677 at commit 55e5c3b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27720 has finished for PR 4677 at commit 55e5c3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27720/
Test PASSed.


    val gbtValidate = new GradientBoostedTrees(boostingStrategy).runWithValidation(
      trainRdd, validateRdd)
    assert(gbtValidate.numTrees != numIterations)

Member

Use !== (handles types better)

@jkbradley
Member

OK, I've made a close pass, so hopefully those are my final comments.

@SparkQA

SparkQA commented Feb 23, 2015

Test build #27847 has started for PR 4677 at commit e4d799b.

  • This patch merges cleanly.

@MechCoder
Contributor Author

@jkbradley Addressed all your comments except the inline one.

@SparkQA

SparkQA commented Feb 23, 2015

Test build #27847 has finished for PR 4677 at commit e4d799b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27847/
Test PASSed.


    // Test that it stops early.
    val gbtValidate = new GradientBoostedTrees(boostingStrategy).
      runWithValidation(trainRdd, validateRdd)

Member

Please put the period (.) on the same line as runWithValidation:

    val gbtValidate = new GradientBoostedTrees(boostingStrategy)
      .runWithValidation(trainRdd, validateRdd)

@jkbradley
Member

Also, what about shortening the code to combine the Classification and Regression tests? Did that not work out?

@SparkQA

SparkQA commented Feb 24, 2015

Test build #27889 has started for PR 4677 at commit 1bb21d4.

  • This patch merges cleanly.

@MechCoder
Contributor Author

@jkbradley I have addressed your comments! Hopefully this is good to go.

[off-topic]
It would be really helpful if Spark were interested in taking students under the Apache Software Foundation (https://community.apache.org/gsoc.html) for Google Summer of Code (https://www.google-melange.com/gsoc/homepage/google/gsoc2015) (assuming the ASF gets selected, of course), given that there is some interest right now. I posted about it on the developers list a few days ago, but it seems that all the developers are busy. Would you or @mengxr (or other MLlib developers) be interested in mentoring an MLlib-related project this summer? If so, we could brainstorm on the JIRA issues about what could be worked on this summer, and I could start writing a proposal.
Thanks. :)

@SparkQA

SparkQA commented Feb 24, 2015

Test build #27889 has finished for PR 4677 at commit 1bb21d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27889/
Test PASSed.

@mengxr
Contributor

mengxr commented Feb 24, 2015

@MechCoder Thanks for pinging us about GSoC! Please check my reply on the dev list.

@jkbradley
Member

LGTM. I'll merge it into master.

@asfgit asfgit closed this in 2a0fe34 Feb 24, 2015
@jkbradley
Member

Done---thanks for the PR!

@MechCoder MechCoder deleted the spark-5436 branch February 25, 2015 04:01
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
Iceberg 0.13.0.3 - ADT 1.1.7

2022-05-20

PRs Merged

* Internal: Parquet bloom filter support (apache#594 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/594))
* Internal: AWS Kms Client (apache#630 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/630))
* Internal: Core: Add client-side check of encryption properties (apache#626 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/626))
* Core: Align snapshot summary property names for delete files (apache#4766 (apache/iceberg#4766))
* Core: Add eq and pos delete file counts to snapshot summary (apache#4677 (apache/iceberg#4677))
* Spark 3.2: Clean static vars in SparkTableUtil (apache#4765 (apache/iceberg#4765))
* Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil (apache#4758 (apache/iceberg#4758))
* Core: Fix query failure when using projection on top of partitions metadata table (apache#4720) (apache#619 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/619))

Key Notes

Bloom filter support and client-side encryption features can be used in this release. Both features are enabled only with explicit flags and will not affect existing tables or jobs.
6 participants