[SPARK-16718][MLlib] gbm-style treeboost #14547
Conversation
ok to test
@hhbyyh Would you mind reviewing this?
@vlad17 As an FYI, I do not get alerted when you comment on the squashed PR. I was using the Databricks spark-perf package for performance testing. I'd be interested to see that the TreeBoost algorithm provides "better" results than the non-TreeBoost version, if that's possible. I think we need some provable improvement to show before we proceed with merging this patch. (It sounds like you are working on that currently.) Thanks for the PR! I'll try to have a look sometime, but it may not be immediately.
@sethah Thanks for the FYI. I'm pretty confident that it'll help the actual loss, since now we're directly optimizing the loss function. However, this is only going to help significantly if we, e.g., use MAE for L1 loss (not implemented) or bernoulli for logistic (we automatically threshold, so I can't do that). For most datasets, accuracy won't demonstrate the difference between bernoulli-based leaf predictions and mean ones. The only estimator whose behavior changed is GBTClassifier (the bernoulli predictions now use an NR step rather than guessing the mean). And since the raw prediction column is unavailable for GBTClassifier, I can't really compare the classifiers sensibly on skewed datasets, since AUC is out of the question. I'm going to have to spend some time trying to find a "real" dataset that's not skewed but is large enough to be meaningful, or just make an artificial one. Spark-perf will also need to be re-run. So it looks like for now I have my work cut out for me. A couple of questions in the meantime (@jkbradley), though:
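A minimal sketch of the accuracy-only comparison described above, assuming a binary-labeled libsvm file at a placeholder path; since GBTClassifier exposes no raw prediction column here, only label-based metrics such as accuracy are available:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gbt-accuracy-check").getOrCreate()

// Placeholder path: any binary-labeled libsvm dataset works for this sketch.
val data = spark.read.format("libsvm").load("data/binary_dataset.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val model = new GBTClassifier().setMaxIter(100).setStepSize(0.1).fit(train)

// Without rawPrediction, AUC is out; accuracy is the fallback metric.
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(model.transform(test))
println(s"accuracy = $accuracy")
```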
CC: @hhbyyh Would you mind taking a look at this since you're familiar with GBTs? Thanks in advance! This should be one of the most important improvements in terms of accuracy, especially once we get soft predictions (for AUC measurements) from GBTs.
 *
 * [[GBTClassifier]] will use the usual `"loss-based"` impurity by default, conforming to
 * TreeBoost behavior. For SGB, set impurity to `"variance"`.
 * use of TreeBoost, set impurity to `"loss-based"`.
typo
done
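For reference, a usage sketch of the impurity settings described in the doc above; `"loss-based"` is the value proposed by this PR, not an existing Spark option:

```scala
import org.apache.spark.ml.regression.GBTRegressor

// Classic SGB behavior: mean-label leaf predictions via variance impurity.
val sgb = new GBTRegressor()
  .setLossType("squared")
  .setImpurity("variance")

// TreeBoost behavior (proposed in this PR): leaf predictions optimize the loss.
val treeBoost = new GBTRegressor()
  .setLossType("squared")
  .setImpurity("loss-based")
```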
@jkbradley it seems I can only deprecate
@jkbradley There seem to be more issues with deprecating impurity:
The shared superclass for GBT* (Tree*Params) can't have setImpurity deprecated, because it's shared with derived classes that should still allow setting impurity. I find it weird that a derived class can't add a deprecation, though. Why is that rule there? Can I disable it?
I'd recommend overriding setImpurity in the relevant concrete classes. In those, you can add warnings in the Scala doc and also add logWarning messages about deprecation. That's almost as good as deprecation annotations.
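A minimal, self-contained sketch of the suggested pattern; the class and member names here are illustrative stand-ins, not Spark's actual internals:

```scala
// Base trait shared by tree learners; setImpurity must stay non-deprecated here.
trait TreeEnsembleParams {
  protected var impurity: String = "variance"
  def setImpurity(value: String): this.type = { impurity = value; this }
}

class GBTClassifier extends TreeEnsembleParams {
  /**
   * Deprecation warning lives in the Scala doc and at runtime, instead of
   * an @deprecated annotation on the shared base-class method.
   */
  override def setImpurity(value: String): this.type = {
    Console.err.println("WARN: setImpurity is deprecated for GBTClassifier; " +
      "the impurity should be determined by the loss.")
    super.setImpurity(value)
  }
}
```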
} else {
  // Per Friedman 1999, we use a single Newton-Raphson step from gamma = 0 to find the
  // optimal leaf prediction, the solution gamma to the minimization problem:
  //   L = sum((p_i, y_i) in leaf) 2 log(1 + exp(-2 y_i (p_i + gamma)))
Hi,
sum((p_i, y_i) in leaf) is confusing, as it is not proper LaTeX format.
How about:
L = sum_{x_i in leaf} 2 log(1 + exp(-2 y_i (p_i + gamma))), where p_i = F(x_i)?
By the way, how about an explanation that doesn't require knowing Newton's optimization? As:

gamma = \argmin L
      = \argmin sum_{x_i in leaf} 2 log(1 + exp(-2 y_i (p_i + gamma)))
      = \argmin sum_{x_i in leaf} log(1 + exp(-2 y_i (p_i + gamma)))   (argmin is unchanged by the positive factor 2)
      = original formula (Eq. 23) in the Friedman paper

Namely, the optimal value of gamma is not affected by the 2 in our LogLoss definition. However, since our gradient y' of LogLoss is -2 times \tilde{y} in (Eq. 22), i.e. y' = -2 \tilde{y}, the final formula needs to be modified as:

r_jm = -\sum y' / (2 \sum |y'| - \sum y'^2 / 2)
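For reference, a hedged LaTeX reconstruction of the single Newton-Raphson step under discussion, consistent with Friedman (1999), Eq. 23; here \tilde{y}_i is Friedman's pseudo-response and p_i = F(x_i):

```latex
L(\gamma) = \sum_{x_i \in \mathrm{leaf}} 2 \log\!\left(1 + e^{-2 y_i (p_i + \gamma)}\right),
\qquad y_i \in \{-1, 1\}

\gamma^\ast = -\frac{L'(0)}{L''(0)}
            = \frac{\sum_i \tilde{y}_i}{\sum_i |\tilde{y}_i| \left(2 - |\tilde{y}_i|\right)},
\qquad \tilde{y}_i = \frac{2 y_i}{1 + e^{2 y_i p_i}}
```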
I'm a bit confused. Are you saying I should use LaTeX formatting or not? Either way, it isn't clear to me what's most lucid. Should it all be LaTeX (i.e., use `\exp`, not just `exp`, as well)? That might get too confusing. I tried to find a middle ground.

Deferring to Friedman might also lead to issues. During implementation I had a very annoying bug due to a mistake in the math here. That's why I was being very explicit in the comments, since it reduces the chance that the math is wrong.
Hi, vlad17.

- Finding a middle ground is a better solution; I agree with you. LaTeX format is not required. As for the sum operation, perhaps sum_{} is a little clearer than sum(()). Anyway, it's up to you.
- A simpler (also correct) explanation might be easier to understand and verify.

Seriously, all the work is pretty good; I shouldn't nitpick.
@vlad17 Any update or opinion on the last review comment?
@HyukjinKwon sorry for the inactivity (I have some free time now). @jkbradley, is SPARK-4240 still on the roadmap? I can resume work on this (and the subsequent GBT work).
@vlad17 Sorry to bump, but what is the status of this and, by proxy, of https://issues.apache.org/jira/browse/SPARK-4240? We have suggested to the community for some time that TreeBoost (Friedman, 1999), which this effectively implements, will be added to Spark ML.
@thesuperzapper unfortunately I haven't been able to keep up-to-date with Spark over the past year (first year of grad school has been occupying me). I don't think I can make any contributions right now or for a while. Are you thinking about taking over? |
What changes were proposed in this pull request?
TreeBoost
This change adds TreeBoost functionality to `GBTClassifier` and `GBTRegressor`. The main change is that leaf nodes now make a prediction which optimizes the loss function, rather than always using the mean label (which is only optimal in the case of variance-based impurity).

Changes: L2 and logistic loss-based impurities

This changes the defaults to use the loss-based impurity rather than the previously required variance. I made this change only for L2 loss and logistic loss (adding some aliases to the names as well, for parity with R's implementation, GBM). These two loss functions have leaf predictions that can be computed within the framework of the current impurity API. Other loss functions will require API modification, which should be its own change, SPARK-16728.

Note that because loss-based impurity with L1 loss is NOT supported, the default behavior for GBTRegressor is to use the variance-based impurity (since the aforementioned combination throws).
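To make the leaf-prediction difference concrete, here is a self-contained sketch (not Spark's actual code) of the two leaf values discussed above, for labels y_i in {-1, 1} and current model scores p_i:

```scala
object LeafPredictions {
  // L2 loss: the loss-minimizing leaf value is simply the mean residual,
  // which coincides with the variance-impurity (mean-label) prediction.
  def l2Leaf(residuals: Seq[Double]): Double =
    residuals.sum / residuals.size

  // Logistic loss: one Newton-Raphson step from gamma = 0 (Friedman 1999,
  // Eq. 23), with pseudo-response ytilde_i = 2 y_i / (1 + exp(2 y_i p_i)).
  def logisticLeaf(labels: Seq[Double], scores: Seq[Double]): Double = {
    val ytilde = labels.zip(scores).map { case (y, p) =>
      2.0 * y / (1.0 + math.exp(2.0 * y * p))
    }
    ytilde.sum / ytilde.map(t => math.abs(t) * (2.0 - math.abs(t))).sum
  }
}
```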
How was this patch tested?
Correctness
I tested default parameter values and the new parameter settings with new unit tests.
Accuracy
Example prediction problem with the UCI half-million-song dataset.
This change is mostly aesthetic here: the only algorithm whose behavior differed is the logistic-loss one, and the accuracy is identical to that of variance. The difference would only be visible in the raw prediction (and thus AUC), since leaf predictions for minimal logistic loss and the mean produce equivalent outcomes after thresholding.
With 700 trees and 0.001 shrinkage, runtime is nearly equivalent (within 2%); script and output here.
GBM run for comparison. Everything was run on my machine.