[SPARK-5972] [MLlib] Cache residuals and gradient in GBT during training and validation #5330
Conversation
ping @jkbradley
Test build #29605 has finished for PR 5330 at commit
Test build #29609 has finished for PR 5330 at commit
This unrelated YARN test failure keeps recurring for me.
Test build #29612 has finished for PR 5330 at commit
@MechCoder Thanks! I'll make a pass through this soon; at first glance, it looks good. Have you tested this vs. the old implementation? I'm wondering how big a difference there is, and also how big the problem has to be for that difference to be evident.
I do not have access to a cluster, as I said before. It would be great if you had some old benchmarks. It seems it should not matter much, at least for a small number of iterations, but it would still be good to avoid unnecessary recomputation (trivial or not), just as before.
logDebug("error of gbt = " + loss.computeError(startingModel, input)) | ||
|
||
var predError: RDD[(Double, Double)] = GradientBoostedTreesModel. | ||
computeInitialPredictionAndError(input, 1.0, firstTreeModel, loss) |
Use "baseLearnerWeights(0)" instead of "1.0"
@MechCoder For this, I feel like local tests might be sufficient, since they should show the speedup and since this isn't changing the communication that much. My main worry is about RDDs having long lineages; I made a JIRA today about that, but it can be addressed later on: https://issues.apache.org/jira/browse/SPARK-6684
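For reference on the lineage concern (not part of this PR), a common way to keep an iteratively updated RDD's lineage short is to checkpoint it every few iterations; a minimal sketch with illustrative names:

// Illustrative only: requires sc.setCheckpointDir(...) to have been called.
val checkpointInterval = 10  // hypothetical interval
if (iteration % checkpointInterval == 0) {
  predError.cache()
  predError.checkpoint()
  predError.count()  // force an action so the checkpoint is actually written
}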
I've fixed up your comments. The local tests seem to run in about the same time (plus or minus 1s); this may be because numIterations and the data size are too small to show the benefit. I can work on the other issue after this is merged.
Test build #29660 has finished for PR 5330 at commit
After that one small change, I think this will be ready to merge. Thanks!
@jkbradley I've fixed up your comment! Thanks for the info. Just to clarify, does the previous code work because a copy of the broadcast variable persists on the driver node even after unpersisting, and it gets broadcast again for each action?
Test build #30083 has finished for PR 5330 at commit
@@ -27,6 +27,7 @@ import org.json4s.jackson.JsonMethods._
 import org.apache.spark.{Logging, SparkContext}
 import org.apache.spark.annotation.Experimental
 import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.broadcast.Broadcast
no longer needed
Your explanation for why the previous code worked is correct. It's just doing extra communication. I just noticed a few things when opening up this PR in IntelliJ (yay error highlighting). That really should be it, though. I'm running a speed test to see if I can tell a difference between this and the previous code. I'll post again later today.
@jkbradley fixed! hopefully that should be it :P
That is it, but I ran some timing tests locally and found essentially no difference between the two implementations, as you reported. I think the issue is the overhead of broadcast variables. I tried broadcasting the full arrays for evaluateEachIteration(), rather than each element separately, and it made evaluateEachIteration() take about 2/3 of the original time. This was with depth-2 trees and 100 iterations of regression on a tiny test dataset ("abalone" from libsvm's copy of the UCI dataset). Based on this bit of testing, I would guess the best solution will be to handle learning and evaluateEachIteration separately:
I'm OK with merging this PR for now and making those items a future to-do. But if you'd prefer to make these updates to this PR, that works too. Do you agree with this assessment? What path would you prefer?
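A rough sketch of the "broadcast the full arrays" idea described above; trees, treeWeights, remappedData, loss, and sc are assumed to be in scope, and this is not the exact code in the PR:

// Sketch: broadcast the whole ensemble once, rather than one tree/weight per
// iteration, so evaluateEachIteration pays the broadcast overhead a single time.
val broadcastTrees = sc.broadcast(trees)
val broadcastWeights = sc.broadcast(treeWeights)
val localLoss = loss
val numTrees = trees.length
val numPoints = remappedData.count()

// For each point, accumulate the prediction tree by tree and record the error after
// each iteration; then average the per-iteration errors over the dataset.
val errorPerIteration: Array[Double] = remappedData.map { point =>
  var pred = 0.0
  Array.tabulate(numTrees) { i =>
    pred += broadcastTrees.value(i).predict(point.features) * broadcastWeights.value(i)
    localLoss.computeError(pred, point.label)
  }
}.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
 .map(_ / numPoints)

broadcastTrees.unpersist()
broadcastWeights.unpersist()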
Thanks for the tests; I get the gist of what you mean. I'd be happy to merge this PR and work on this as a future JIRA. If I have any queries, I shall comment there.
Never mind, I figured it out; it should not take much effort. I'm working on an update to this PR itself.
Test build #30137 has finished for PR 5330 at commit
@jkbradley I pushed an update to the same PR. I agree with the observation that it would have a much higher impact on evaluateEachIteration, because during training, prediction (and computing the residuals) is not really the bottleneck.
Test build #30142 has finished for PR 5330 at commit
  val newError = loss.computeError(newPred, point.label)
  (newPred, newError)
}
val currentTreeWeight = treeWeights(nTree)
We should make a local (shallow) copy of treeWeights before the map, within this method:
val localTreeWeights = treeWeights
Referencing treeWeights, a member of the class, will actually make the entire class get serialized by the ClosureCleaner. Assigning it to a local val fixes that.
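A small, self-contained illustration of the closure-capture issue being described here (not the PR's code; the class and method names are made up):

import org.apache.spark.rdd.RDD

// Referencing the member `treeWeights` directly inside the map closure would capture
// `this`, forcing the whole object to be serialized with each task.
class EnsembleLike(val treeWeights: Array[Double]) {
  def scaledErrors(errors: RDD[Double], nTree: Int): RDD[Double] = {
    // Copy the member into a local val so the closure only captures the array.
    val localTreeWeights = treeWeights
    errors.map(e => e * localTreeWeights(nTree))
  }
}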
@MechCoder Thanks a lot for working through all of these tweaks with me! The updates look good except for those two items.
@jkbradley fixed!
LGTM once tests pass. Thanks!
Test build #30185 has finished for PR 5330 at commit
Merged into master
The previous PR #4906 helped extract the learning curve, giving the error for each iteration. This continues that work, refactoring some code and extending the same logic to training and validation.
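A simplified sketch of the training-time caching this PR introduces, reusing computeInitialPredictionAndError from the snippet above; fitNextTree and updatePredictionError are illustrative helper names, not confirmed API:

// Keep a cached RDD of (prediction, error) per point and fold each new tree's
// contribution into it, instead of re-predicting with the whole ensemble each iteration.
var predError: RDD[(Double, Double)] = GradientBoostedTreesModel.
  computeInitialPredictionAndError(input, baseLearnerWeights(0), firstTreeModel, loss)

var m = 1
while (m < numIterations) {
  // Fit the next tree on the current pseudo-residuals (details omitted).
  val tree = fitNextTree(input, predError, loss)  // hypothetical helper
  baseLearners(m) = tree
  // Incrementally update the cached (prediction, error) pairs with the new tree.
  predError = updatePredictionError(input, predError, baseLearnerWeights(m), tree, loss)  // assumed helper
  m += 1
}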