[SPARK-16718][MLlib] gbm-style treeboost #14547
Conversation
ok to test
@hhbyyh Would you mind reviewing this?
@vlad17 As an FYI, I do not get alerted when you comment on the squashed PR. I was using the Databricks spark-perf package for performance testing. I'd be interested to see that the TreeBoost algorithm provides "better" results than the non-TreeBoost version, if that's possible. I think we need some provable improvement to show before we proceed with merging this patch. (It sounds like you are working on that currently.) Thanks for the PR! I'll try to have a look sometime, but it may not be immediately.
@sethah Thanks for the FYI. I'm pretty confident that it'll help the actual loss, since now we're directly optimizing the loss function. However, this is only going to help significantly if we, e.g., use MAE for L1 loss (not implemented) or bernoulli for logistic (we automatically threshold, so I can't do that). For most datasets, accuracy won't demonstrate the difference between bernoulli-based leaf predictions and mean ones. The only estimator whose behavior changed is GBTClassifier (the bernoulli predictions now use an NR step rather than guessing the mean). And since the raw prediction column is unavailable for GBTClassifier, I can't really compare the classifiers sensibly on skewed datasets, since AUC is out of the question. I'm going to have to spend some time trying to find a "real" dataset that's not skewed but is large enough to be meaningful, or just make an artificial one. Spark-perf will also need to be re-run. So it looks like for now I have my work cut out for me. A couple of questions in the meantime (@jkbradley), though:
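A minimal sketch of the accuracy-only comparison described above, assuming a binary-labeled libsvm file at a placeholder path; since GBTClassifier exposes no raw prediction column here, only label-based metrics such as accuracy are available:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gbt-accuracy-check").getOrCreate()

// Placeholder path: any binary-labeled libsvm dataset works for this sketch.
val data = spark.read.format("libsvm").load("data/binary_dataset.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val model = new GBTClassifier().setMaxIter(100).setStepSize(0.1).fit(train)

// Without rawPrediction, AUC is out; accuracy is the fallback metric.
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(model.transform(test))
println(s"accuracy = $accuracy")
```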
CC: @hhbyyh Would you mind taking a look at this since you're familiar with GBTs? Thanks in advance! This should be one of the most important improvements in terms of accuracy, especially once we get soft predictions (for AUC measurements) from GBTs.
 *
 * [[GBTClassifier]] will use the usual `"loss-based"` impurity by default, conforming to
 * TreeBoost behavior. For SGB, set impurity to `"variance"`.
 * use of TreeBoost, set impurity to `"loss-based"`.
typo
done
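For reference, a usage sketch of the impurity settings described in the doc above; `"loss-based"` is the value proposed by this PR, not an existing Spark option:

```scala
import org.apache.spark.ml.regression.GBTRegressor

// Classic SGB behavior: mean-label leaf predictions via variance impurity.
val sgb = new GBTRegressor()
  .setLossType("squared")
  .setImpurity("variance")

// TreeBoost behavior (proposed in this PR): leaf predictions optimize the loss.
val treeBoost = new GBTRegressor()
  .setLossType("squared")
  .setImpurity("loss-based")
```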
@jkbradley it seems I can only deprecate
@jkbradley There seem to be more issues with deprecating impurity:
The shared superclass for GBT* (Tree*Params) can't have setImpurity deprecated, because it's shared with derived classes that should still allow setting impurity. I find it weird that a derived class can't add a deprecation, though. Why is that rule there? Can I disable it?
I'd recommend overriding setImpurity in the relevant concrete classes. In those, you can add warnings in the Scala doc and also add logWarning messages about deprecation. That's almost as good as deprecation annotations.
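A minimal, self-contained sketch of the suggested pattern; the class and member names here are illustrative stand-ins, not Spark's actual internals:

```scala
// Base trait shared by tree learners; setImpurity must stay non-deprecated here.
trait TreeEnsembleParams {
  protected var impurity: String = "variance"
  def setImpurity(value: String): this.type = { impurity = value; this }
}

class GBTClassifier extends TreeEnsembleParams {
  /**
   * Deprecation warning lives in the Scala doc and at runtime, instead of
   * an @deprecated annotation on the shared base-class method.
   */
  override def setImpurity(value: String): this.type = {
    Console.err.println("WARN: setImpurity is deprecated for GBTClassifier; " +
      "the impurity should be determined by the loss.")
    super.setImpurity(value)
  }
}
```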
} else {
  // Per Friedman 1999, we use a single Newton-Raphson step from gamma = 0 to find the
  // optimal leaf prediction, the solution gamma to the minimization problem:
  //   L = sum((p_i, y_i) in leaf) 2 log(1 + exp(-2 y_i (p_i + gamma)))
Hi,
sum((p_i, y_i) in leaf) is confusing, as it is not proper LaTeX format.
How about:
L = sum_{x_i in leaf} 2 log(1 + exp(-2 y_i (p_i + gamma))), where p_i = F(x_i)?
By the way, how about an explanation that doesn't require knowing Newton's optimization? As:

gamma = \argmin L
      = \argmin sum_{x_i in leaf} 2 log(1 + exp(-2 y_i (p_i + gamma)))
      = \argmin sum_{x_i in leaf} log(1 + exp(-2 y_i (p_i + gamma)))   (argmin is unchanged by the positive factor 2)
      = original formula (Eq. 23) in the Friedman paper

Namely, the optimal value of gamma is not affected by the 2 in our LogLoss definition. However, since our gradient y' of LogLoss is -2 times \tilde{y} in (Eq. 22), i.e. y' = -2 \tilde{y}, the final formula needs to be modified as:

r_jm = -\sum y' / (2 \sum |y'| - \sum y'^2 / 2)
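For reference, a hedged LaTeX reconstruction of the single Newton-Raphson step under discussion, consistent with Friedman (1999), Eq. 23; here \tilde{y}_i is Friedman's pseudo-response and p_i = F(x_i):

```latex
L(\gamma) = \sum_{x_i \in \mathrm{leaf}} 2 \log\!\left(1 + e^{-2 y_i (p_i + \gamma)}\right),
\qquad y_i \in \{-1, 1\}

\gamma^\ast = -\frac{L'(0)}{L''(0)}
            = \frac{\sum_i \tilde{y}_i}{\sum_i |\tilde{y}_i| \left(2 - |\tilde{y}_i|\right)},
\qquad \tilde{y}_i = \frac{2 y_i}{1 + e^{2 y_i p_i}}
```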
I'm a bit confused. Are you saying I should use LaTeX formatting or not? Either way, it isn't clear to me what's most lucid. Should it all be LaTeX (i.e., use `\exp`, not just `exp`, as well)? That might get too confusing. I tried to find a middle ground.

Deferring to Friedman might also lead to issues. During implementation I had a very annoying bug due to a mistake in the math here. That's why I was being very explicit in the comments, since it reduces the chance that the math is wrong.
Hi, vlad17.

- Finding a middle ground is a better solution; I agree with you. LaTeX format is not required. As for the sum operation, perhaps sum_{} is a little clearer than sum(()). Anyway, it's up to you.
- A simpler (also correct) explanation might be easier to understand and verify.

Seriously, all the work is pretty good; I shouldn't nitpick.
@vlad17 Any update or opinion on the last review comment?
@HyukjinKwon sorry for the inactivity (I have some free time now). @jkbradley, is SPARK-4240 still on the roadmap? I can resume work on this (and the subsequent GBT work).
@vlad17 Sorry to bump, but what is the status of this and, by proxy, of https://issues.apache.org/jira/browse/SPARK-4240? We have suggested to the community for some time that TreeBoost (Friedman, 1999), which this effectively implements, will be added to Spark ML.
@thesuperzapper unfortunately I haven't been able to keep up-to-date with Spark over the past year (first year of grad school has been occupying me). I don't think I can make any contributions right now or for a while. Are you thinking about taking over? |
What changes were proposed in this pull request?
TreeBoost
This change adds TreeBoost functionality to `GBTClassifier` and `GBTRegressor`. The main change is that leaf nodes now make a prediction which optimizes the loss function, rather than always using the mean label (which is only optimal in the case of variance-based impurity).

Changes: L2 and logistic loss-based impurities

This changes the defaults to use the loss-based impurity rather than the previously required variance. I made this change only for L2 loss and logistic loss (adding some aliases to the names as well, for parity with R's implementation, GBM). These two loss functions have leaf predictions that can be computed within the framework of the current impurity API. Other loss functions will require API modification, which should be its own change, SPARK-16728.

Note that because loss-based impurity with L1 loss is NOT supported, the default behavior for GBTRegressor is to use the variance-based impurity (since the aforementioned combination throws).
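To make the leaf-prediction difference concrete, here is a self-contained sketch (not Spark's actual code) of the two leaf values discussed above, for labels y_i in {-1, 1} and current model scores p_i:

```scala
object LeafPredictions {
  // L2 loss: the loss-minimizing leaf value is simply the mean residual,
  // which coincides with the variance-impurity (mean-label) prediction.
  def l2Leaf(residuals: Seq[Double]): Double =
    residuals.sum / residuals.size

  // Logistic loss: one Newton-Raphson step from gamma = 0 (Friedman 1999,
  // Eq. 23), with pseudo-response ytilde_i = 2 y_i / (1 + exp(2 y_i p_i)).
  def logisticLeaf(labels: Seq[Double], scores: Seq[Double]): Double = {
    val ytilde = labels.zip(scores).map { case (y, p) =>
      2.0 * y / (1.0 + math.exp(2.0 * y * p))
    }
    ytilde.sum / ytilde.map(t => math.abs(t) * (2.0 - math.abs(t))).sum
  }
}
```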
How was this patch tested?
Correctness
I tested default parameter values and the new parameter settings with new unit tests.
Accuracy
Example prediction problem with the UCI half-million-song dataset.
This change is mostly aesthetic here: the only algorithm whose behavior differed is the logistic-loss one, and the accuracy is identical to that of variance. The difference would only be visible in the raw prediction (and thus AUC), since leaf predictions for minimal logistic loss and the mean produce equivalent outcomes after thresholding.
With 700 trees and 0.001 shrinkage, runtime is nearly equivalent (within 2%); script and output here.
GBM run for comparison. Everything was run on my machine.