[SPARK-6025] [MLlib] Add helper method evaluateEachIteration to extract learning curve #4906
Conversation
Test build #28285 has started for PR 4906 at commit
cc @jkbradley Sorry for the mess! This should be what we want.
Test build #28286 has started for PR 4906 at commit
Test build #28285 has finished for PR 4906 at commit
Test PASSed.
Test build #28286 has finished for PR 4906 at commit
Test PASSed.
var predictionRDD = remappedData.map(i => initialTree.predict(i.features))
evaluationArray(0) = loss.computeError(remappedData, predictionRDD)

(1 until numIterations).map { nTree =>
This does numIterations maps, broadcasting the model numIterations times. I'd recommend using a broadcast variable for the model to make sure it's only sent once.
You could keep the current approach pretty much as-is, but it does numIterations actions, so it's a bit inefficient. You could optimize it by using only 1 map, but that would require modifying the computeError method as follows (see the sketch after this list):
- computeError could be overloaded to take (prediction: Double, datum: LabeledPoint). This could replace the computeError method you implemented.
- Here, in evaluateEachIteration, you could call predictionRDD.map, and within the map, for each data point, you could evaluate each tree on the data point, compute the prediction from each iteration via a cumulative sum, and then call computeError on each prediction.
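To make the suggestion concrete, here is a minimal sketch of the single-pass approach, assuming an overloaded computeError(prediction: Double, datum: LabeledPoint) and hypothetical names (broadcastTrees, broadcastWeights); it is illustrative, not the PR's actual code:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

def evaluateEachIterationSketch(
    data: RDD[LabeledPoint],
    broadcastTrees: Broadcast[Array[DecisionTreeModel]],
    broadcastWeights: Broadcast[Array[Double]],
    computeError: (Double, LabeledPoint) => Double): Array[Double] = {
  val numIterations = broadcastWeights.value.length
  // Single pass over the data: for each point, accumulate the ensemble prediction
  // tree by tree (cumulative sum) and record the error after each iteration.
  val perPointErrors: RDD[Array[Double]] = data.map { point =>
    var pred = 0.0
    Array.tabulate(numIterations) { i =>
      pred += broadcastTrees.value(i).predict(point.features) * broadcastWeights.value(i)
      computeError(pred, point)
    }
  }
  // Average the per-iteration errors across the dataset.
  val numPoints = data.count().toDouble
  perPointErrors
    .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    .map(_ / numPoints)
}
```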
@jkbradley Fixed! Should look better now.
Test build #28435 has started for PR 4906 at commit
Test build #28436 has started for PR 4906 at commit
Test build #28435 has finished for PR 4906 at commit
Test PASSed.
Test build #28436 has finished for PR 4906 at commit
Test PASSed.
@@ -61,4 +61,18 @@ object AbsoluteError extends Loss {
      math.abs(err)
    }.mean()
  }

  /**
No need for doc; it will be inherited from the overridden method (here and in the other 2 loss classes)
But the doc for the return is different, no?
I think it's OK for the doc for gradient() and computeError() to be generic as long as the doc for the loss classes describes the specific loss function.
ok so should I remove it?
Yes please
Test build #28611 has finished for PR 4906 at commit
Test PASSed.
val broadcastWeights = sc.broadcast(treeWeights)

(1 until numIterations).map { nTree =>
  val currentTree = broadcastTrees.value(nTree)
OOPS! I didn't even notice that this and the next line were outside of the mapPartitions. They need to be inside the closure (before "iter.map") for broadcasting to accomplish anything.
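A minimal sketch of the pattern being asked for here, reusing the broadcast variable names from the diff above (broadcastTrees, broadcastWeights) purely for illustration; the point is that .value is read inside the closure handed to mapPartitions, so tasks only carry the lightweight broadcast handles:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

def predictWithSingleTree(
    data: RDD[LabeledPoint],
    broadcastTrees: Broadcast[Array[DecisionTreeModel]],
    broadcastWeights: Broadcast[Array[Double]],
    nTree: Int): RDD[Double] = {
  data.mapPartitions { iter =>
    // Dereferenced inside the closure: each executor reads its local broadcast copy
    // instead of having the tree serialized into every task closure on the driver.
    val currentTree = broadcastTrees.value(nTree)
    val currentTreeWeight = broadcastWeights.value(nTree)
    iter.map(point => currentTree.predict(point.features) * currentTreeWeight)
  }
}
```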
I think that's the last to-do item.
@jkbradley done
Test build #28621 has started for PR 4906 at commit
Test build #28621 has finished for PR 4906 at commit
Test PASSed.
But is accessing values from the broadcasted variables from DecisionTree and DecisionTreeWeights expensive enough that placing the lines under `iter.map` makes a real difference?
  val currentTreeWeight = broadcastWeights.value(nTree)
  iter.map {
    case (point, (pred, error)) => {
      val newPred = pred + currentTree.predict(point.features) * currentTreeWeight
I just realized: This is correct for regression but not for classification. For classification, it should threshold as in [https://github.com/apache/spark/blob/e3f315ac358dfe4f5b9705c3eac76e8b1e24f82a/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L194]
It's also a problem that the test suite didn't find this error. Could you please first fix the test suite so that it fails because of this error and then fix it here?
Thanks! Sorry I didn't realize it before.
I think this is more of a design problem. Do we want evaluateEachIteration to do the same thing as what boost in GradientBoostingModel does internally (since the algo is set to Regression explicitly)? I also think it might be confusing if users see that, for classification problems, this method behaves differently from the internal behaviour.
There's also the fact that runWithValidation breaks according to the Regression loss and not the Classification loss. This might lead to different solutions when runWithValidation and evaluateEachIteration are used. I suggest we keep this as it is and maybe add a comment?
You're right; I was getting confused. It's correct to use the raw prediction for classification, as you're doing.
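As a side note on the trade-off discussed above, here is a rough sketch contrasting the two paths: validation-based early stopping via runWithValidation versus computing the full learning curve afterwards with evaluateEachIteration. The parameter values and the comparison itself are illustrative assumptions, not part of this PR:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.rdd.RDD

def compareStoppingBehaviour(train: RDD[LabeledPoint], validation: RDD[LabeledPoint]): Unit = {
  val strategy = BoostingStrategy.defaultParams("Classification")
  strategy.numIterations = 100
  // Stops adding trees once the validation error stops improving.
  val earlyStopped = new GradientBoostedTrees(strategy).runWithValidation(train, validation)
  // Trains all iterations, then reports the validation error after each one.
  val full = GradientBoostedTrees.train(train, strategy)
  val errors = full.evaluateEachIteration(validation, strategy.loss)
  println(s"Early-stopped at ${earlyStopped.numTrees} trees; full curve: ${errors.mkString(", ")}")
}
```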
It's hard to state a hard cutoff for task size, but the Spark programming guide recommends "tasks larger than about 20 KB are probably worth optimizing [by broadcasting]." I think it's reasonable here.
Test build #28705 has started for PR 4906 at commit
Test build #28705 has finished for PR 4906 at commit
Test PASSed.
@jkbradley I have a silly query. If a point is predicted correctly, then the contribution of log loss due to that point should be zero, right? The log loss that I am familiar with, i.e. …. However, here w * x is continuous as compared to F(x), which is discrete.
Your query may be caused by my confusion above. To answer it: the prediction F(x) should be the raw prediction, not the discrete -1/+1 value.
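For reference, here is a sketch of the two-class log loss as it is usually written for boosting, with labels remapped to {-1, +1} and F(x) the raw ensemble prediction; I believe this matches MLlib's LogLoss, but treat the exact constants as an assumption:

```scala
// Per-point log loss for label y in {-1, +1} and raw (unthresholded) prediction F(x):
// 2 * log(1 + exp(-2 * y * F(x))). As y * F(x) grows towards +infinity the loss tends
// to 0, which is why the raw, unbounded prediction (not the discrete -1/+1 value)
// is the right input here.
def logLoss(label: Double, rawPrediction: Double): Double = {
  val z = -2.0 * label * rawPrediction
  // Numerically stable evaluation of log(1 + exp(z)).
  val log1pExp = if (z > 0) z + math.log1p(math.exp(-z)) else math.log1p(math.exp(z))
  2.0 * log1pExp
}
```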
LGTM; I'll merge this into master. Thanks!
Just to clarify, the raw prediction value (w' * x) should vary from -infinity to +infinity? So that when it's as large as possible, the contribution to log loss is zero?
Also, I could wait for the internal optimization till the tree API PR is finished. And let me know if you need help with reviewing the tree API PR (though admittedly, reading the code will help me more than it helps you :P).
It might be good to wait for the internal optimization. Thanks! I'll ping you on the PR once I update it again.
[SPARK-5972] Cache residuals and gradient in GBT during training and validation

The previous PR #4906 helped to extract the learning curve giving the error for each iteration. This continues the work, refactoring some code and extending the same logic during training and validation.

Author: MechCoder <[email protected]>

Closes #5330 from MechCoder/spark-5972 and squashes the following commits:

0b5d659 [MechCoder] minor
32d409d [MechCoder] EvaluateeachIteration and training cache should follow different paths
d542bb0 [MechCoder] Remove unused imports and docs
58f4932 [MechCoder] Remove unpersist
70d3b4c [MechCoder] Broadcast for each tree
5869533 [MechCoder] Access broadcasted values locally and other minor changes
923dbf6 [MechCoder] [SPARK-5972] Cache residuals and gradient in GBT during training and validation
Added evaluateEachIteration to allow the user to manually extract the error for each iteration of GradientBoosting. The internal optimisation can be dealt with later.
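A hedged usage sketch of the helper added here: train a GBT model and then extract the error after each boosting iteration to plot a learning curve (the parameter values are illustrative assumptions):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.rdd.RDD

def learningCurve(data: RDD[LabeledPoint]): Array[Double] = {
  val boostingStrategy = BoostingStrategy.defaultParams("Regression")
  boostingStrategy.numIterations = 50
  val model = GradientBoostedTrees.train(data, boostingStrategy)
  // One error value per boosting iteration, computed with the same loss used for training.
  model.evaluateEachIteration(data, boostingStrategy.loss)
}
```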