[SPARK-6025] [MLlib] Add helper method evaluateEachIteration to extract learning curve #4906

Closed
MechCoder wants to merge 5 commits

Conversation

MechCoder
Contributor

Added evaluateEachIteration to allow the user to manually extract the error for each iteration of GradientBoosting. The internal optimisation can be dealt with later.
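
For context, a minimal usage sketch of the helper being added (assuming it ends up on the fitted model as evaluateEachIteration(data, loss), returning one error value per boosting iteration; the exact signature was still under review):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.loss.SquaredError
import org.apache.spark.rdd.RDD

def learningCurve(data: RDD[LabeledPoint]): Array[Double] = {
  val boostingStrategy = BoostingStrategy.defaultParams("Regression")
  boostingStrategy.numIterations = 50
  val model = GradientBoostedTrees.train(data, boostingStrategy)
  // One error value per boosting iteration: the learning curve.
  model.evaluateEachIteration(data, SquaredError)
}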

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28285 has started for PR 4906 at commit 26b2e91.

  • This patch merges cleanly.

@MechCoder
Contributor Author

cc @jkbradley Sorry for the mess! This should be what we want.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28286 has started for PR 4906 at commit dbda033.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28285 has finished for PR 4906 at commit 26b2e91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28285/

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28286 has finished for PR 4906 at commit dbda033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28286/
Test PASSed.

var predictionRDD = remappedData.map(i => initialTree.predict(i.features))
evaluationArray(0) = loss.computeError(remappedData, predictionRDD)

(1 until numIterations).map { nTree =>
Member

This does numIterations maps, broadcasting the model numIterations times. I'd recommend using a broadcast variable for the model to make sure it's only sent once.

You could keep the current approach pretty much as-is, but it does numIterations actions, so it's a bit inefficient. You could optimize it by using only 1 map, but that would require modifying the computeError method as follows:

  • computeError could be overloaded to take (prediction: Double, datum: LabeledPoint). This could replace the computeError method you implemented.
  • Here, in evaluateEachIteration, you could call predictionRDD.map and, within the map, for each data point, evaluate each tree on the data point, compute the prediction from each iteration via a cumulative sum, and then call computeError on each prediction (see the sketch after this list).
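
A minimal sketch of that single-pass approach, assuming a per-point computeError(prediction: Double, datum: LabeledPoint) overload as suggested above (a hypothetical helper, not the PR's final code):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

def evaluateEachIteration(
    data: RDD[LabeledPoint],
    trees: Array[DecisionTreeModel],
    treeWeights: Array[Double],
    computeError: (Double, LabeledPoint) => Double): Array[Double] = {
  val numIterations = trees.length
  val broadcastTrees = data.sparkContext.broadcast(trees)
  val broadcastWeights = data.sparkContext.broadcast(treeWeights)

  // One pass over the data: for each point, accumulate the prediction
  // iteration by iteration and record the error after each iteration.
  val errorSums = data.map { point =>
    var cumulativePrediction = 0.0
    Array.tabulate(numIterations) { i =>
      cumulativePrediction +=
        broadcastTrees.value(i).predict(point.features) * broadcastWeights.value(i)
      computeError(cumulativePrediction, point)
    }
  }.reduce { (a, b) => a.zip(b).map { case (x, y) => x + y } }

  val count = data.count()
  errorSums.map(_ / count)
}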

@MechCoder
Contributor Author

@jkbradley Fixed! Should look better now.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28435 has started for PR 4906 at commit 035f78f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28436 has started for PR 4906 at commit bc99ac6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28435 has finished for PR 4906 at commit 035f78f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28435/

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28436 has finished for PR 4906 at commit bc99ac6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28436/

@@ -61,4 +61,18 @@ object AbsoluteError extends Loss {
      math.abs(err)
    }.mean()
  }

/**
Member

No need for doc; it will be inherited from the overridden method (here and in the other 2 loss classes)

Contributor Author

But the doc for the return value is different, no?

Member

I think it's OK for the doc for gradient() and computeError() to be generic as long as the doc for the loss classes describes the specific loss function.

Contributor Author

ok so should I remove it?

Member

Yes please

@SparkQA

SparkQA commented Mar 14, 2015

Test build #28611 has finished for PR 4906 at commit c04a430.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28611/

val broadcastWeights = sc.broadcast(treeWeights)

(1 until numIterations).map { nTree =>
  val currentTree = broadcastTrees.value(nTree)
Member

OOPS! I didn't even notice that this and the next line were outside of the mapPartitions. They need to be inside the closure (before "iter.map") for broadcasting to accomplish anything.
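
A minimal sketch of the fix as a hypothetical standalone helper (names follow the diff above; the computeError(prediction, label) function parameter is an assumption for illustration):

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

def updateOneIteration(
    predErrorRDD: RDD[(LabeledPoint, (Double, Double))],
    broadcastTrees: Broadcast[Array[DecisionTreeModel]],
    broadcastWeights: Broadcast[Array[Double]],
    computeError: (Double, Double) => Double,
    nTree: Int): RDD[(LabeledPoint, (Double, Double))] = {
  predErrorRDD.mapPartitions { iter =>
    // Reading the broadcast values *inside* the closure: each task ships
    // only a small broadcast handle, and each executor fetches the arrays
    // once instead of receiving them with every task.
    val currentTree = broadcastTrees.value(nTree)
    val currentTreeWeight = broadcastWeights.value(nTree)
    iter.map { case (point, (pred, _)) =>
      val newPred = pred + currentTree.predict(point.features) * currentTreeWeight
      (point, (newPred, computeError(newPred, point.label)))
    }
  }
}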

@jkbradley
Member

I think that's the last to-do item.

@MechCoder
Contributor Author

@jkbradley done

@SparkQA

SparkQA commented Mar 15, 2015

Test build #28621 has started for PR 4906 at commit 352001f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 15, 2015

Test build #28621 has finished for PR 4906 at commit 352001f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28621/

@MechCoder
Contributor Author

But is accessing values from the broadcast variables for the decision trees and tree weights expensive enough that placing these lines inside mapPartitions gives enough benefit?

  val currentTreeWeight = broadcastWeights.value(nTree)
  iter.map {
    case (point, (pred, error)) => {
      val newPred = pred + currentTree.predict(point.features) * currentTreeWeight
Member

I just realized: This is correct for regression but not for classification. For classification, it should threshold as in https://github.com/apache/spark/blob/e3f315ac358dfe4f5b9705c3eac76e8b1e24f82a/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L194

It's also a problem that the test suite didn't find this error. Could you please first fix the test suite so that it fails because of this error and then fix it here?

Thanks! Sorry I didn't realize it before.

Contributor Author

I think this is more of a design problem. Do we want evaluateEachIteration to do the same thing that the boost in GradientBoostingModel does internally (since the algo is set to Regression explicitly)? I also think it might be confusing if users see that, for classification problems, this method behaves differently from the internal behaviour.

Contributor Author

There is also the fact that runWithValidation stops based on the Regression loss and not the Classification loss. This might lead to different results when runWithValidation and evaluateEachIteration are used. I suggest we keep this as it is and maybe add a comment?

Member

You're right; I was getting confused. It's correct to use the raw prediction for classification, as you're doing.

@jkbradley
Member

It's hard to give a firm cutoff for task size, but the Spark programming guide recommends that "tasks larger than about 20 KB are probably worth optimizing [by broadcasting]." I think it's reasonable here.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28705 has started for PR 4906 at commit 67146ab.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28705 has finished for PR 4906 at commit 67146ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28705/

@MechCoder
Contributor Author

@jkbradley I have a silly query.

If a point is predicted correctly, then the contribution to the log loss from that point should be zero, right?
However, if we take the log loss function log(1 + exp(-2y * F(x))) and say that y is 1 and is predicted correctly (i.e. F(x) = 1), then it contributes log(1 + exp(-2)) ≈ 0.127 to the loss.

With the log loss that I am familiar with, i.e. log(1 + exp(-y * w' * x)) (where w is the weight vector and x is the feature vector), the contribution to the log loss approaches zero as y * w' * x approaches infinity.

However, here w' * x is continuous, whereas F(x) is discrete.

@jkbradley
Member

Your query may be caused by my confusion above. To answer it: the prediction F(x) should be the raw prediction, not the discrete -1/+1 value.

@jkbradley
Member

LGTM; I'll merge this into master. Thanks!

asfgit closed this in 25e271d Mar 21, 2015
MechCoder deleted the spark-6025 branch March 21, 2015 04:57
@MechCoder
Contributor Author

Just to clarify: the raw prediction value (w' * x) should vary from -infinity to +infinity, so that when it is as large as possible, the contribution to the log loss is zero?

@MechCoder
Contributor Author

Also, I could wait for the internal optimization until the tree API PR is finished. And let me know if you need help with reviewing the tree API PR (though admittedly, reading the code will help me more than my review helps you :P).

@jkbradley
Member

When y * w' * x (including multiplying by the -1/+1 label y) is very large, the contribution is close to zero.

It might be good to wait for the internal optimization. Thanks!

I'll ping you on the PR once I update it again.
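
To make this concrete, a small standalone illustration (not from the PR) of the per-point log loss log(1 + exp(-2 * margin)) as the raw margin y * F(x) grows:

// Per-point log loss as a function of the raw margin y * F(x);
// the factor 2 matches the 2y * F(x) form quoted above.
def logLoss(margin: Double): Double = math.log1p(math.exp(-2.0 * margin))

Seq(0.0, 1.0, 5.0, 20.0).foreach { m =>
  println(f"margin = $m%5.1f  loss = ${logLoss(m)}%.6f")
}
// margin =   0.0  loss = 0.693147
// margin =   1.0  loss = 0.126928  (the ~0.127 from the question above)
// margin =   5.0  loss = 0.000045
// margin =  20.0  loss = 0.000000  (the contribution vanishes as the margin grows)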

asfgit pushed a commit that referenced this pull request Apr 13, 2015
[SPARK-5972] Cache residuals and gradient in GBT during training and validation

The previous PR #4906 helped to extract the learning curve by giving the error for each iteration. This continues the work by refactoring some code and extending the same logic to training and validation.

Author: MechCoder <[email protected]>

Closes #5330 from MechCoder/spark-5972 and squashes the following commits:

0b5d659 [MechCoder] minor
32d409d [MechCoder] EvaluateeachIteration and training cache should follow different paths
d542bb0 [MechCoder] Remove unused imports and docs
58f4932 [MechCoder] Remove unpersist
70d3b4c [MechCoder] Broadcast for each tree
5869533 [MechCoder] Access broadcasted values locally and other minor changes
923dbf6 [MechCoder] [SPARK-5972] Cache residuals and gradient in GBT during training and validation