[SPARK-16718][MLlib] gbm-style treeboost #14547

Closed · wants to merge 26 commits

Conversation

@vlad17 commented Aug 8, 2016

What changes were proposed in this pull request?

TreeBoost

This change adds TreeBoost functionality to GBTClassifier and GBTRegressor. The main change is that leaf nodes now make a prediction that optimizes the loss function, rather than always using the mean label (which is only optimal in the case of variance-based impurity).
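
For intuition, here is a minimal sketch (in Scala, not the PR's actual code) of what a "loss-optimal leaf prediction" means in the simplest case. For L2 loss the minimizer has a closed form, the mean residual, which is why variance-based impurity is already optimal there:

    // Sketch only: the constant gamma minimizing sum_i (y_i - (p_i + gamma))^2
    // over the examples in a leaf is the mean residual y_i - p_i.
    def l2LeafPrediction(labels: Array[Double], preds: Array[Double]): Double = {
      require(labels.nonEmpty && labels.length == preds.length)
      labels.zip(preds).map { case (y, p) => y - p }.sum / labels.length
    }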

Changes: L2 and Logistic loss-based impurities

This changes the default to use the loss-based impurity rather than the previously required variance impurity.

I made this change only for L2 loss and logistic loss (also adding some aliases to the names for parity with R's GBM implementation). These two functions have leaf predictions that can be computed within the framework of the current impurity API. Other loss functions will require API modification, which should be its own change, SPARK-16728.

Note that because loss-based impurity with L1 loss is NOT supported, the default behavior for GBTRegressor is to use the variance-based impurity (since the aforementioned combination throws).
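
As a usage sketch under the PR's proposed API (hedged: the "loss-based"/"variance" impurity strings come from the scaladoc snippet quoted later in this thread; the rest is the standard GBTClassifier API):

    import org.apache.spark.ml.classification.GBTClassifier

    // TreeBoost behavior (the proposed default for the classifier):
    val treeBoost = new GBTClassifier()
      .setLossType("logistic")
      .setImpurity("loss-based")

    // Plain stochastic gradient boosting (the previous behavior):
    val sgb = new GBTClassifier()
      .setLossType("logistic")
      .setImpurity("variance")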

How was this patch tested?

Correctness

I tested the default parameter values and the new parameter settings with new unit tests.

Accuracy

Example prediction problem with the UCI half-million-song dataset.

This code is a mostly aesthetic change: the only algorithm that differed is the logistic-loss one, and its accuracy is identical to that of variance. The difference would only be visible in the raw prediction (and thus AUC), since leaf predictions minimizing logistic loss and mean-based ones produced equivalent outcomes after thresholding.

With 700 trees and 0.001 shrinkage, runtime is nearly equivalent (within 2%); script and output here.

GBM run for comparison. Everything was run on my machine.

@jkbradley (Member):

ok to test

@vlad17 (Author) commented Aug 8, 2016

@hhbyyh Would you mind reviewing this?

@SparkQA commented Aug 8, 2016

Test build #63388 has finished for PR 14547 at commit 06fc4a9.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Aug 8, 2016

@vlad17 FYI, I do not get alerted when you comment on the squashed PR. I was using the Databricks spark-perf package for performance testing.

I'd be interested to see that the TreeBoost algorithm provides "better" results than the non-TreeBoost version, if that's possible. I think we need some demonstrable improvement before we proceed with merging this patch. (It sounds like you are working on that currently.)

Thanks for the PR! I'll try to have a look sometime, but it may not be immediately.

@vlad17 (Author) commented Aug 9, 2016

@sethah Thanks for the FYI. I'm pretty confident that it'll help the actual loss, since we're now directly optimizing the loss function. However, this is only going to help significantly if we, e.g., use MAE for L1 loss (not implemented) or Bernoulli for logistic (we automatically threshold, so I can't do that). For most datasets, accuracy won't demonstrate the difference between Bernoulli-based leaf predictions and mean-based ones.

The only estimator whose behavior changed is GBTClassifier (the Bernoulli predictions now use a Newton-Raphson step rather than guessing the mean; see the sketch below). And since the raw prediction column is unavailable for GBTClassifier, I can't really compare the classifiers sensibly on skewed datasets; AUC is out of the question.
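
To make the two strategies concrete, a hypothetical sketch of the leaf computations being compared (assuming labels y_i in {-1, +1} and current model predictions p_i; these helpers are illustrative, not the PR's code):

    // Pre-PR behavior: the leaf predicts the mean pseudo-residual
    // (negative gradient) of Spark's LogLoss, 2*log(1 + exp(-2*y*p)).
    def meanLeaf(y: Array[Double], p: Array[Double]): Double = {
      val negGrad = y.zip(p).map { case (yi, pi) =>
        4.0 * yi / (1.0 + math.exp(2.0 * yi * pi))
      }
      negGrad.sum / negGrad.length
    }

    // TreeBoost: a single Newton-Raphson step from gamma = 0 for
    // L(gamma) = sum_i 2*log(1 + exp(-2*y_i*(p_i + gamma))) (Friedman, Eq. 23).
    def newtonLeaf(y: Array[Double], p: Array[Double]): Double = {
      val t = y.zip(p).map { case (yi, pi) =>
        2.0 * yi / (1.0 + math.exp(2.0 * yi * pi))
      }
      t.sum / t.map(ti => math.abs(ti) * (2.0 - math.abs(ti))).sum
    }

Note both leaf values share the sign of the summed pseudo-residuals, consistent with the thresholded predictions coinciding while the raw scores differ.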

I'm going to have to spend some time trying to find a "real" dataset that's not skewed but is large enough to be meaningful, or just make an artificial one. spark-perf will also need to be re-run.

So it looks like for now I have my work cut out for me. A couple of questions in the meantime (@jkbradley), though:
(1) Regarding the binary incompatibility failure: part of that was my fault, and part was due to an incompatibility with a package-private method. I added an exception for the binary incompatibility of the package-private method (a sketch of such an exclusion follows below); is that OK?
(2) This failure was due precisely to the lack of L1 loss-based impurity support. I suppose I should take that as a hint that the default for GBTRegressor should be variance until L1 loss-based impurity is supported? -> OK, done
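
For context on (1), Spark tracks MiMa exclusions in project/MimaExcludes.scala; a hypothetical entry for a package-private method might look like the following (the problem type and method name here are illustrative, not necessarily the ones in this PR):

    import com.typesafe.tools.mima.core._

    // Silences a binary-compatibility report for a member MiMa cannot
    // tell is package-private.
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.ml.tree.impl.GradientBoostedTrees.boost")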

@SparkQA commented Aug 9, 2016

Test build #63407 has finished for PR 14547 at commit cfaee0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2016

Test build #63428 has finished for PR 14547 at commit b4e5e6c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2016

Test build #63447 has finished for PR 14547 at commit fe256f7.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member):

CC: @hhbyyh Would you mind taking a look at this since you're familiar with GBTs? Thanks in advance! This should be one of the most important improvements in terms of accuracy, especially once we get soft predictions (for AUC measurements) from GBTs.

@SparkQA commented Aug 10, 2016

Test build #3210 has finished for PR 14547 at commit fe256f7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 10, 2016

Test build #63500 has finished for PR 14547 at commit 233c6cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 16, 2016

Test build #63875 has finished for PR 14547 at commit a040da5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 17, 2016

Test build #3224 has finished for PR 14547 at commit a040da5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vlad17 changed the title from "[SPARK-16718][MLlib] gbm-style treeboost" to "[WIP] [SPARK-16718][MLlib] gbm-style treeboost" on Aug 18, 2016
*
* [[GBTClassifier]] will use the usual `"loss-based"` impurity by default, conforming to
* TreeBoost behavior. For SGB, set impurity to `"variance"`.
* use of TreeBoost, set impurity to `"loss-based"`.
Member:

typo

Author:

done

@SparkQA commented Nov 1, 2016

Test build #67858 has finished for PR 14547 at commit 5f54f4d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vlad17 (Author) commented Nov 1, 2016

@jkbradley it seems I can only deprecate setImpurity: the value can't be deprecated since it's used internally (which triggers a fatal warning), and getImpurity has scaladoc shared with other classes where its use is valid. In any case, setImpurity is the only one that needs the warning.

@SparkQA commented Nov 1, 2016

Test build #67908 has finished for PR 14547 at commit 4e20a70.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vlad17 (Author) commented Nov 1, 2016

@jkbradley There seem to be more issues with deprecating impurity:

[error] [warn] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala:114: method setImpurity overrides concrete, non-deprecated symbol(s):    setImpurity
[error] [warn]   override def setImpurity(value: String): this.type = super.setImpurity(value)
[error] [warn] 
[error] [warn] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala:111: method setImpurity overrides concrete, non-deprecated symbol(s):    setImpurity
[error] [warn]   override def setImpurity(value: String): this.type = super.setImpurity(value)
[error] [warn]

The shared superclass for GBT* (Tree*Params) can't have setImpurity deprecated, because it is also shared with derived classes that should still allow impurity-setting. I find it weird that a derived class can't add a deprecation, though. Why is that rule there? Can I disable it?

@jkbradley (Member):

I'd recommend overriding setImpurity in the relevant concrete classes. In those, you can add warnings in the Scala doc and also add logWarning messages about deprecation. That's almost as good as deprecation annotations.
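
A minimal sketch of that pattern (the override shape matches the ones quoted in the build log above; the scaladoc and warning text are illustrative, and logWarning is assumed available via Spark's Logging trait):

    /**
     * Sets the impurity. Note that for GBTs, impurities other than the
     * loss-based default are discouraged and this setter may be removed.
     */
    override def setImpurity(value: String): this.type = {
      logWarning("setImpurity is deprecated for GBTs and may be removed " +
        "in a future release.")
      super.setImpurity(value)
    }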

} else {
// Per Friedman 1999, we use a single Newton-Raphson step from gamma = 0 to find the
// optimal leaf prediction, the solution gamma to the minimization problem:
// L = sum((p_i, y_i) in leaf) 2 log(1 + exp(-2 y_i (p_i + gamma)))
@facaiy (Contributor) commented Mar 13, 2017:

Hi,
sum((p_i, y_i) in leaf) is confusing, as it is not valid LaTeX format.

How about:
L = sum_{x_i in leaf} 2 log(1 + exp(-2 y_i (p_i + gamma))), where p_i = F(x_i)?

@facaiy (Contributor) commented Mar 14, 2017:

By the way, how about an explanation that does not require knowing Newton's optimization? For example:

gamma = \argmin_gamma L
      = \argmin_gamma sum_{x_i in leaf} 2 log(1 + exp(-2 y_i (p_i + gamma)))
      = \argmin_gamma sum_{x_i in leaf} log(1 + exp(-2 y_i (p_i + gamma)))
      = original formula (Eq. 23) in Friedman's paper

namely, the optimal value of gamma is not affected by the factor 2 in our LogLoss definition.

However, since our LogLoss gradient y' is -2 times \tilde{y} in (Eq. 22), i.e. y' = -2 \tilde{y}, the final formula needs to be modified accordingly:

r_jm = -\sum y' / (2 \sum |y'| - \sum y'^2 / 2)
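
For readers following the thread, here is the worked single Newton-Raphson step that both comments reference, written out under the thread's definitions (a sketch assuming y_i in {-1, +1} and p_i = F(x_i)):

    L(\gamma) = \sum_{x_i \in \mathrm{leaf}} 2 \log\left(1 + e^{-2 y_i (p_i + \gamma)}\right)

    \gamma^{*} = -\frac{L'(0)}{L''(0)}
               = \frac{\sum_i \tilde{y}_i}{\sum_i |\tilde{y}_i| \, (2 - |\tilde{y}_i|)},
    \qquad \tilde{y}_i = \frac{2 y_i}{1 + e^{2 y_i p_i}}

This is exactly Friedman's Eq. 23; substituting y'_i = -2 \tilde{y}_i recovers the rescaled formula above.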

@vlad17 (Author):

I'm a bit confused. Are you saying I should use LaTeX formatting or not? Either way, it isn't clear to me which is most lucid. Should it all be LaTeX (i.e., use \exp rather than just exp)? That might get too confusing. I tried to find a middle ground.

Deferring to Friedman might also lead to issues. During implementation I had a very annoying bug due to a mistake in the math here. That's why I was being very explicit in the comments, since it reduces the chance that the math is wrong.

@facaiy (Contributor):

Hi, vlad17.

  1. Finding a middle ground is a better solution; I agree with you. LaTeX format is not required. As for the sum operation, perhaps sum_{} is a little clearer than sum(()). Anyway, it's up to you.

  2. A simpler (and still correct) explanation might be easier to understand and verify.

Seriously, all the work is pretty good; I shouldn't nitpick.

@HyukjinKwon (Member):

@vlad17 any update or opinion on the last review comment?

@vlad17 (Author) commented Jul 17, 2017

@HyukjinKwon sorry for the inactivity (I have some free time now). @jkbradley, is SPARK-4240 still on the roadmap? I can resume work on this (and the subsequent GBT work).

@thesuperzapper:

@vlad17 sorry to bump, but what is the status of this PR, and by proxy of:

https://issues.apache.org/jira/browse/SPARK-4240
and
https://issues.apache.org/jira/browse/SPARK-16718

For some time we have suggested to the community that TreeBoost (Friedman, 1999), which this effectively implements, will be added to Spark ML.

@vlad17 (Author) commented May 10, 2018

@thesuperzapper unfortunately I haven't been able to keep up to date with Spark over the past year (my first year of grad school has been occupying me). I don't think I can make any contributions right now or for a while. Are you thinking about taking over?
