[SPARK-6025] [MLlib] Add helper method evaluateEachIteration to extract learning curve #4906

Closed
MechCoder wants to merge 5 commits

Conversation

MechCoder
Contributor

Added evaluateEachIteration to allow the user to manually extract the error for each iteration of GradientBoosting. The internal optimisation can be dealt with later.
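
For context, a minimal usage sketch of the helper being added (assuming it ends up on the fitted model as evaluateEachIteration(data, loss), returning one error value per boosting iteration; the exact signature was still under review):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.loss.SquaredError
import org.apache.spark.rdd.RDD

def learningCurve(data: RDD[LabeledPoint]): Array[Double] = {
  val boostingStrategy = BoostingStrategy.defaultParams("Regression")
  boostingStrategy.numIterations = 50
  val model = GradientBoostedTrees.train(data, boostingStrategy)
  // One error value per boosting iteration: the learning curve.
  model.evaluateEachIteration(data, SquaredError)
}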

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28285 has started for PR 4906 at commit 26b2e91.

  • This patch merges cleanly.

@MechCoder
Contributor Author

cc @jkbradley Sorry for the mess! This should be what we want.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28286 has started for PR 4906 at commit dbda033.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28285 has finished for PR 4906 at commit 26b2e91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28285/

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28286 has finished for PR 4906 at commit dbda033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28286/
Test PASSed.

var predictionRDD = remappedData.map(i => initialTree.predict(i.features))
evaluationArray(0) = loss.computeError(remappedData, predictionRDD)

(1 until numIterations).map { nTree =>
Member

This does numIterations maps, broadcasting the model numIterations times. I'd recommend using a broadcast variable for the model to make sure it's only sent once.

You could keep the current approach pretty much as-is, but it does numIterations actions, so it's a bit inefficient. You could optimize it by using only 1 map, but that would require modifying the computeError method as follows:

  • computeError could be overloaded to take (prediction: Double, datum: LabeledPoint). This could replace the computeError method you implemented.
  • Here, in evaluateEachIteration, you could call predictionRDD.map and, within the map, for each data point, evaluate each tree on the data point, compute the prediction from each iteration via a cumulative sum, and then call computeError on each prediction (see the sketch after this list).
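
A minimal sketch of that single-pass approach, assuming a per-point computeError(prediction: Double, datum: LabeledPoint) overload as suggested above (a hypothetical helper, not the PR's final code):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

def evaluateEachIteration(
    data: RDD[LabeledPoint],
    trees: Array[DecisionTreeModel],
    treeWeights: Array[Double],
    computeError: (Double, LabeledPoint) => Double): Array[Double] = {
  val numIterations = trees.length
  val broadcastTrees = data.sparkContext.broadcast(trees)
  val broadcastWeights = data.sparkContext.broadcast(treeWeights)

  // One pass over the data: for each point, accumulate the prediction
  // iteration by iteration and record the error after each iteration.
  val errorSums = data.map { point =>
    var cumulativePrediction = 0.0
    Array.tabulate(numIterations) { i =>
      cumulativePrediction +=
        broadcastTrees.value(i).predict(point.features) * broadcastWeights.value(i)
      computeError(cumulativePrediction, point)
    }
  }.reduce { (a, b) => a.zip(b).map { case (x, y) => x + y } }

  val count = data.count()
  errorSums.map(_ / count)
}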

@MechCoder
Contributor Author

@jkbradley Fixed! Should look better now.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28435 has started for PR 4906 at commit 035f78f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28436 has started for PR 4906 at commit bc99ac6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28435 has finished for PR 4906 at commit 035f78f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28435/

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28436 has finished for PR 4906 at commit bc99ac6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28436/

@@ -61,4 +61,18 @@ object AbsoluteError extends Loss {
      math.abs(err)
    }.mean()
  }

/**
Member

No need for doc; it will be inherited from the overridden method (here and in the other 2 loss classes)

Contributor Author

But the doc for the return value is different, no?

Member

I think it's OK for the doc for gradient() and computeError() to be generic as long as the doc for the loss classes describes the specific loss function.

Contributor Author

ok so should I remove it?

Member

Yes please

@SparkQA

SparkQA commented Mar 14, 2015

Test build #28611 has finished for PR 4906 at commit c04a430.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28611/

val broadcastWeights = sc.broadcast(treeWeights)

(1 until numIterations).map { nTree =>
  val currentTree = broadcastTrees.value(nTree)
Member

OOPS! I didn't even notice that this and the next line were outside of the mapPartitions. They need to be inside the closure (before "iter.map") for broadcasting to accomplish anything.
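
A minimal sketch of the fix as a hypothetical standalone helper (names follow the diff above; the computeError(prediction, label) function parameter is an assumption for illustration):

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

def updateOneIteration(
    predErrorRDD: RDD[(LabeledPoint, (Double, Double))],
    broadcastTrees: Broadcast[Array[DecisionTreeModel]],
    broadcastWeights: Broadcast[Array[Double]],
    computeError: (Double, Double) => Double,
    nTree: Int): RDD[(LabeledPoint, (Double, Double))] = {
  predErrorRDD.mapPartitions { iter =>
    // Reading the broadcast values *inside* the closure: each task ships
    // only a small broadcast handle, and each executor fetches the arrays
    // once instead of receiving them with every task.
    val currentTree = broadcastTrees.value(nTree)
    val currentTreeWeight = broadcastWeights.value(nTree)
    iter.map { case (point, (pred, _)) =>
      val newPred = pred + currentTree.predict(point.features) * currentTreeWeight
      (point, (newPred, computeError(newPred, point.label)))
    }
  }
}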

@jkbradley
Member

I think that's the last to-do item.

@MechCoder
Contributor Author

@jkbradley done

@SparkQA

SparkQA commented Mar 15, 2015

Test build #28621 has started for PR 4906 at commit 352001f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 15, 2015

Test build #28621 has finished for PR 4906 at commit 352001f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28621/

@MechCoder
Contributor Author

But is accessing values from the broadcast variables for the decision trees and tree weights expensive enough that placing these lines inside mapPartitions gives enough benefit?

  val currentTreeWeight = broadcastWeights.value(nTree)
  iter.map {
    case (point, (pred, error)) => {
      val newPred = pred + currentTree.predict(point.features) * currentTreeWeight
Member

I just realized: This is correct for regression but not for classification. For classification, it should threshold as in https://github.com/apache/spark/blob/e3f315ac358dfe4f5b9705c3eac76e8b1e24f82a/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L194

It's also a problem that the test suite didn't find this error. Could you please first fix the test suite so that it fails because of this error and then fix it here?

Thanks! Sorry I didn't realize it before.

Contributor Author

I think this is more of a design problem. Do we want evaluateEachIteration to do the same thing that the boost in GradientBoostingModel does internally (since the algo is set to Regression explicitly)? I also think it might be confusing if users see that, for classification problems, this method behaves differently from the internal behaviour.

Contributor Author

There is also the fact that runWithValidation stops based on the Regression loss and not the Classification loss. This might lead to different results when runWithValidation and evaluateEachIteration are used. I suggest we keep this as it is and maybe add a comment?

Member

You're right; I was getting confused. It's correct to use the raw prediction for classification, as you're doing.

@jkbradley
Member

It's hard to give a firm cutoff for task size, but the Spark programming guide recommends that "tasks larger than about 20 KB are probably worth optimizing [by broadcasting]." I think it's reasonable here.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28705 has started for PR 4906 at commit 67146ab.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28705 has finished for PR 4906 at commit 67146ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28705/

@MechCoder
Contributor Author

@jkbradley I have a silly query.

If a point is predicted correctly, then the contribution to the log loss from that point should be zero, right?
However, if we take the log loss function log(1 + exp(-2y * F(x))) and say that y is 1 and is predicted correctly (i.e. F(x) = 1), then it contributes log(1 + exp(-2)) ≈ 0.127 to the loss.

With the log loss that I am familiar with, i.e. log(1 + exp(-y * w' * x)) (where w is the weight vector and x is the feature vector), the contribution to the log loss approaches zero as y * w' * x approaches infinity.

However, here w' * x is continuous, whereas F(x) is discrete.

@jkbradley
Member

Your query may be caused by my confusion above. To answer it: the prediction F(x) should be the raw prediction, not the discrete -1/+1 value.

@jkbradley
Member

LGTM; I'll merge this into master. Thanks!

asfgit closed this in 25e271d Mar 21, 2015
MechCoder deleted the spark-6025 branch March 21, 2015 04:57
@MechCoder
Contributor Author

Just to clarify: the raw prediction value (w' * x) should vary from -infinity to +infinity, so that when it is as large as possible, the contribution to the log loss is zero?

@MechCoder
Contributor Author

Also, I could wait for the internal optimization until the tree API PR is finished. And let me know if you need help with reviewing the tree API PR (though admittedly, reading the code will help me more than my review helps you :P).

@jkbradley
Member

When y * w' * x (including multiplying by the -1/+1 label y) is very large, the contribution is close to zero.

It might be good to wait for the internal optimization. Thanks!

I'll ping you on the PR once I update it again.
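
To make this concrete, a small standalone illustration (not from the PR) of the per-point log loss log(1 + exp(-2 * margin)) as the raw margin y * F(x) grows:

// Per-point log loss as a function of the raw margin y * F(x);
// the factor 2 matches the 2y * F(x) form quoted above.
def logLoss(margin: Double): Double = math.log1p(math.exp(-2.0 * margin))

Seq(0.0, 1.0, 5.0, 20.0).foreach { m =>
  println(f"margin = $m%5.1f  loss = ${logLoss(m)}%.6f")
}
// margin =   0.0  loss = 0.693147
// margin =   1.0  loss = 0.126928  (the ~0.127 from the question above)
// margin =   5.0  loss = 0.000045
// margin =  20.0  loss = 0.000000  (the contribution vanishes as the margin grows)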

asfgit pushed a commit that referenced this pull request Apr 13, 2015
[SPARK-5972] Cache residuals and gradient in GBT during training and validation

The previous PR #4906 helped to extract the learning curve by giving the error for each iteration. This continues the work by refactoring some code and extending the same logic to training and validation.

Author: MechCoder <[email protected]>

Closes #5330 from MechCoder/spark-5972 and squashes the following commits:

0b5d659 [MechCoder] minor
32d409d [MechCoder] EvaluateeachIteration and training cache should follow different paths
d542bb0 [MechCoder] Remove unused imports and docs
58f4932 [MechCoder] Remove unpersist
70d3b4c [MechCoder] Broadcast for each tree
5869533 [MechCoder] Access broadcasted values locally and other minor changes
923dbf6 [MechCoder] [SPARK-5972] Cache residuals and gradient in GBT during training and validation