[SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation #4677

MechCoder · 2015-02-18T21:36:15Z

Training can stop early if the decrease in validation error is smaller than a given tolerance, or if the error increases because the model has overfit the training data.

This introduces a new method runWithValidation, which takes a pair of RDDs: one for the training data and the other for the validation data.
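The stopping rule described above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions, not the code in this patch: the object name EarlyStoppingSketch, the method numIterationsToKeep, and the parameter tol are hypothetical, and the validation errors are assumed to have already been computed, one per boosting iteration.

```scala
// Illustrative sketch only; names here are hypothetical and not part of MLlib.
object EarlyStoppingSketch {

  /**
   * Given the validation error measured after each boosting iteration,
   * return how many iterations (trees) to keep.
   *
   * Boosting stops at the first iteration whose improvement over the best
   * error seen so far is smaller than `tol`. An increase in error gives a
   * negative improvement, so it also triggers the stop.
   */
  def numIterationsToKeep(validationErrors: Seq[Double], tol: Double): Int = {
    require(validationErrors.nonEmpty, "need at least one validation error")
    require(tol >= 0.0, "tol must be non-negative")

    var bestError = validationErrors.head
    var bestCount = 1
    var i = 1
    var stopped = false
    while (i < validationErrors.length && !stopped) {
      val current = validationErrors(i)
      if (bestError - current < tol) {
        // Improvement too small (or error went up): stop here.
        stopped = true
      } else {
        // Meaningful improvement: keep this iteration and continue.
        bestError = current
        bestCount = i + 1
        i += 1
      }
    }
    bestCount
  }
}
```

For example, with validation errors 1.0, 0.5, 0.3, 0.29, 0.28 and a tolerance of 0.05, the sketch keeps three iterations, because the fourth improves the error by only about 0.01.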

SparkQA · 2015-02-18T21:37:52Z

Test build #27689 has started for PR 4677 at commit 07c8f12.

This patch merges cleanly.

MechCoder · 2015-02-18T21:39:46Z

@jkbradley I just wanted to know if this is in the right direction.

SparkQA · 2015-02-18T21:42:33Z

Test build #27690 has started for PR 4677 at commit 7534d14.

This patch merges cleanly.

AmplabJenkins · 2015-02-18T22:00:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27689/
Test FAILed.

AmplabJenkins · 2015-02-18T22:00:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27690/
Test FAILed.

shaneknapp · 2015-02-18T22:04:29Z

jenkins, test this please

SparkQA · 2015-02-18T22:07:35Z

Test build #27691 has started for PR 4677 at commit 7534d14.

This patch merges cleanly.

jkbradley · 2015-02-18T23:02:47Z

My recommendations:

Name the new run() method something more explicit like runWithValidation().
Don't add the extra train() methods since users can use run().
Rename convergenceTol to something like validationTol (since convergenceTol applies to the training set in general).
Generalize the boost() method rather than duplicating it to avoid code duplication.

We had discussed providing a helper method evaluateEachIteration() in the JIRA, but I'd prefer to have that be a separate JIRA and PR. Does that sound good?

Thanks!

SparkQA · 2015-02-18T23:26:04Z

Test build #27691 has finished for PR 4677 at commit 7534d14.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-18T23:26:08Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27691/
Test PASSed.

SparkQA · 2015-02-19T07:37:32Z

Test build #27711 has started for PR 4677 at commit e008936.

This patch merges cleanly.

SparkQA · 2015-02-19T07:43:00Z

Test build #27712 has started for PR 4677 at commit 77549a9.

This patch merges cleanly.

SparkQA · 2015-02-19T08:55:03Z

Test build #27711 has finished for PR 4677 at commit e008936.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-19T08:55:07Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27711/
Test PASSed.

SparkQA · 2015-02-19T09:02:23Z

Test build #27712 has finished for PR 4677 at commit 77549a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-19T09:02:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27712/
Test PASSed.

SparkQA · 2015-02-19T09:02:35Z

Test build #27716 has started for PR 4677 at commit 3e74372.

This patch merges cleanly.

MechCoder · 2015-02-19T10:07:31Z

@jkbradley I have fixed up your comments.

Btw, why are there both a train and a run, which seem to me to do the same thing? Is it not better to have one way of doing things?

I also have a doubt about the classification case. It seems to me that for each iteration the problem is explicitly changed to a regression problem with labels mapped to {-1, 1}. Is it okay to stop when this regression error no longer decreases on the validation data for a classification problem (which seems slightly awkward to me)? (https://github.com/apache/spark/pull/4677/files#diff-7b5c1db0b1926a36f418b53fcf807db0R227)

Note that I had to explicitly set it to Regression to make sure that this test passes (https://github.com/apache/spark/pull/4677/files#diff-d3159b88ae0ed6ff096ff8850ecac26eR207). Otherwise, the classification error seems to be the same both with and without validation.

SparkQA · 2015-02-19T10:29:09Z

Test build #27716 has finished for PR 4677 at commit 3e74372.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Partitioner(object):

AmplabJenkins · 2015-02-19T10:29:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27716/
Test PASSed.

SparkQA · 2015-02-19T12:12:34Z

Test build #27720 has started for PR 4677 at commit 55e5c3b.

This patch merges cleanly.

SparkQA · 2015-02-19T13:36:27Z

Test build #27720 has finished for PR 4677 at commit 55e5c3b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-19T13:36:31Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27720/
Test PASSed.

jkbradley · 2015-02-23T01:39:01Z

mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala

+
+      val gbtValidate = new GradientBoostedTrees(boostingStrategy).runWithValidation(
+        trainRdd, validateRdd)
+      assert(gbtValidate.numTrees != numIterations)


Use !== (handles types better)

jkbradley · 2015-02-23T01:39:34Z

OK, I've made a close pass, so hopefully those are my final comments.

SparkQA · 2015-02-23T04:17:38Z

Test build #27847 has started for PR 4677 at commit e4d799b.

This patch merges cleanly.

MechCoder · 2015-02-23T04:22:39Z

@jkbradley Addressed all your comments except the inline one.

SparkQA · 2015-02-23T05:40:27Z

Test build #27847 has finished for PR 4677 at commit e4d799b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-23T05:40:31Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27847/
Test PASSed.

jkbradley · 2015-02-23T21:55:12Z

mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala

+
+    // Test that it stops early.
+    val gbtValidate = new GradientBoostedTrees(boostingStrategy).
+      runWithValidation(trainRdd, validateRdd)


Please put the period (.) on the same line as runWithValidation:

val gbtValidate = new GradientBoostedTrees(boostingStrategy)
  .runWithValidation(trainRdd, validateRdd)

jkbradley · 2015-02-23T21:58:26Z

Also, what about shortening the code to combine the Classification and Regression tests? Did that not work out?

SparkQA · 2015-02-24T11:17:36Z

Test build #27889 has started for PR 4677 at commit 1bb21d4.

This patch merges cleanly.

MechCoder · 2015-02-24T12:07:07Z

@jkbradley I have fixed up your comments! Hopefully good to go.

[off-topic]
It would be really helpful if Spark were interested in taking students under the Apache Software Foundation (https://community.apache.org/gsoc.html) for Google Summer of Code (https://www.google-melange.com/gsoc/homepage/google/gsoc2015) (assuming ASF gets selected, of course), considering that there is some interest right now. I posted about it on the developers list a few days ago, but it seems that all the developers are busy. Would you or @mengxr (or other MLlib developers) be interested in mentoring an MLlib-related project this summer? If yes, then we could brainstorm on the JIRA issues about what could be worked on this summer, and I could start writing a proposal.
Thanks. :)

SparkQA · 2015-02-24T12:35:00Z

Test build #27889 has finished for PR 4677 at commit 1bb21d4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-24T12:35:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27889/
Test PASSed.

mengxr · 2015-02-24T21:35:27Z

@MechCoder Thanks for pinging us about GSoC! Please check my reply on the dev list.

jkbradley · 2015-02-24T23:12:38Z

LGTM. I'll merge it into master.

jkbradley · 2015-02-24T23:14:44Z

Done---thanks for the PR!


MechCoder force-pushed the spark-5436 branch from 07c8f12 to 7534d14 Compare February 18, 2015 21:38

MechCoder force-pushed the spark-5436 branch from 7534d14 to e008936 Compare February 19, 2015 07:36

MechCoder changed the title ~~[SPARK-5436] [MLlib] Validate GradientBoostedTrees during train~~ [SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation Feb 19, 2015

[SPARK-5436] Validate GradientBoostedTrees using runWithValidation

77549a9

MechCoder force-pushed the spark-5436 branch from e008936 to 77549a9 Compare February 19, 2015 07:41

TST: Add test for classification

3e74372

One liner for prevValidateError

55e5c3b

jkbradley reviewed Feb 23, 2015

Addresses indentation and doc comments

e4d799b

MechCoder force-pushed the spark-5436 branch from 52a080b to e4d799b Compare February 23, 2015 04:15

jkbradley reviewed Feb 23, 2015

Combine regression and classification tests into a single one

1bb21d4

asfgit closed this in 2a0fe34 Feb 24, 2015

MechCoder deleted the spark-5436 branch February 25, 2015 04:01
