-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-package] Use type
argument to control prediction types
#5133
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for this proposal!
I'm open to considering this breaking change, but since the goal is just consistency with other projects, I need some more information.
Can you please tell me specifically which packages you're referring to in this statement?
Some popular packages do take booleans
R-package/R/lgb.Booster.R
Outdated
, header = header | ||
, params = params | ||
) | ||
type <- head(type, 1L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of having type
default to a vector with 5 elements, and then always taking only the first thing provided? It seems to me that it would be simpler to just default to "link"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose is to have the allowed values visible in the function signature so that they are easily seen by the user and easy to autocomplete, in the same way as for example base R's predict.glm
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thank you. This project has documentation for the purpose of describing which values are supported, and I believe the pattern of having a default value which is not the value that will be directly used will be confusing for both developers and users of the package. Please remove this and set the default to "link"
.
Among others, |
Cross-reference: dmlc/xgboost#7947. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I'm generally supportive of this proposal, but left a few suggestions on the implementation.
Ideally we'd do a round of deprecation warnings first (as {xgboost}
is planning to do, dmlc/xgboost#7947 (comment)), but the reality in this project right now is that due to a lack of maintainer attention, we're only doing 1-2 releases a year. In addition, {lightgbm}
has very few reverse dependencies on CRAN, and the next release will be a large major-version release with breaking changes (#5153).
@jmoralez @StrikerRUS what do you think?
@mayer79 we'd also welcome your opinion on this if you have one.
R-package/R/lgb.Booster.R
Outdated
#' \code{objective="binary"}, this corresponds to log-odds. For many objectives such as "regression", | ||
#' since no transformation is applied, the output will be the same as for "link". | ||
#' \item \code{"leaf"}: will output the index of the terminal node / leaf at which each observations falls | ||
#' in each tree in the model, outputted as as integers, with one column per tree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#' in each tree in the model, outputted as as integers, with one column per tree. | |
#' in each tree in the model, outputted as integers, with one column per tree. |
R-package/R/lgb.Booster.R
Outdated
, header = header | ||
, params = params | ||
) | ||
type <- head(type, 1L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thank you. This project has documentation for the purpose of describing which values are supported, and I believe the pattern of having a default value which is not the value that will be directly used will be confusing for both developers and users of the package. Please remove this and set the default to "link"
.
R-package/R/lgb.Booster.R
Outdated
if (type == "response") { | ||
if (object$params$objective == "binary") { | ||
pred <- as.integer(pred >= 0.5) | ||
} else if (object$params$objective %in% c("multiclass", "multiclassova")) { | ||
pred <- max.col(pred) - 1L | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is new behavior being added to the package, please add unit tests confirming that it works as expected.
@@ -713,6 +713,23 @@ Booster <- R6::R6Class( | |||
#' @param object Object of class \code{lgb.Booster} | |||
#' @param newdata a \code{matrix} object, a \code{dgCMatrix} object or | |||
#' a character representing a path to a text file (CSV, TSV, or LibSVM) | |||
#' @param type Type of prediction to output. Allowed types are:\itemize{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a note to this documentation that when choosing "link"
and "response"
, if you're using a custom objective function they'll be ignored and "raw"
predictions will be returned?
On the Python side, lightgbm
raises a warning in such situations.
LightGBM/python-package/lightgbm/sklearn.py
Lines 1064 to 1067 in f715645
if callable(self._objective) and not (raw_score or pred_leaf or pred_contrib): | |
_log_warning("Cannot compute class probabilities or labels " | |
"due to the usage of customized objective function.\n" | |
"Returning raw scores instead.") |
The R package should probably do that to, but that could be deferred to a later PR.
rawscore = FALSE, | ||
predleaf = FALSE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Down in the section where this function catches arguments that fall into ...
LightGBM/R-package/R/lgb.Booster.R
Lines 822 to 824 in f715645
if ("reshape" %in% names(additional_params)) { | |
stop("'reshape' argument is no longer supported.") | |
} |
Please add the following:
if (isTRUE(additional_params[["rawscore"]])) {
stop("Argument 'rawscore' is no longer supported. Use type = 'raw' instead.")
}
if (isTRUE(additional_params[["predleaf"]])) {
stop("Argument 'predleaf' is no longer supported. Use type = 'leaf' instead.")
}
if (isTRUE(additional_params[["predcontrib"]])) {
stop("Argument 'predcontrib' is no longer supported. Use type = 'contrib' instead.")
}
I'm ok with breaking users' code in the next release in exchange for making the package's interface more compatible with other packages for modeling in R, but I think we should provide specific, actionable error messages when possible to reduce the effort required for affected users to alter their code.
@jameslamb : the proposal is as it should have been from the beginning. Unfortunately, now it seems too late to change key arguments of a key function. It will break multiple R packages (SHAPforxgboost, fastshap, plus one I am currently writing) and a lot of existing code, especially in the field of XAI. For what Version is the change planned? |
Updated. |
I think we can omit deprecation cycle in the upcoming |
@mayer79 thanks very much for your feedback! Hearing you say that this is "how it should have been from the beginning" echoes how I was feeling and I think what @david-cortes is thinking as well. I understand that the breaking change will cause some pain, but I think it is necessary to get the package to the state we that we want. The reality we're facing in LightGBM right now is a significant lack of maintainer attention in this project + the fact that it has been almost 8 months since the previous substantial release (v3.3.1, in October 2021). We simply don't have the development velocity in this project to do deprecation cycles the way that healthier projects like scikit-learn can. These changes are planned for a v4.0.0 major release (#5153), after which we will be much more skeptical of breaking changes anywhere in the project.
I'm willing to go submit PRs to all of those projects listed as reverse dependencies of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more very small suggestions, otherwise I am good with these changes. Thanks very much for the work!
Since this is such a significant breaking change, @jmoralez would you also please help with a review?
expect_true(is.matrix(pred)) | ||
expect_equal(nrow(pred), nrow(X)) | ||
expect_equal(ncol(pred), 3L) | ||
}) | ||
|
||
test_that("predict type='response' returns integers for classification objectives", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_that("predict type='response' returns integers for classification objectives", { | |
test_that("predict type='response' returns predicted class for classification objectives", { |
The significant thing here is that the result is the predicted classes, not just that they're integers. Would you please consider this rewording?
expect_true(all(pred %in% c(0L, 1L, 2L))) | ||
}) | ||
|
||
test_that("predict type='response' returns decimals for regression objectives", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_that("predict type='response' returns decimals for regression objectives", { | |
test_that("predict type='response' returns values in the target's range for regression objectives", { |
Sure. I'll add my review tomorrow |
Thanks for your detailed answer, James. The packages I have mentioned use |
R-package/R/lgb.Booster.R
Outdated
#' in each tree in the model, outputted as integers, with one column per tree. | ||
#' \item \code{"contrib"}: will return the per-feature contributions for each prediction, including an | ||
#' intercept (each feature will produce one column). If there are multiple classes, each class will | ||
#' have separate feature contributions (thus the number of columns is feaures+1 multiplied by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#' have separate feature contributions (thus the number of columns is feaures+1 multiplied by the | |
#' have separate feature contributions (thus the number of columns is features+1 multiplied by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jameslamb @david-cortes @jmoralez I like the new "type" argument. However, if I correctly understood the intended changes, they are in gross contrast what usual prediction functions involving link functions do. Sorry for this late input. I think it is important to consider it anyway.
According to predict.glm()
, type "response" returns values on the scale of the response. A Poisson regression with log link would return positive values, a binary logistic regression would return probabilities.
Type "link" means, on the other hand, that predictions are made in the link space. For a logistic regression, it would be the logit of the probabilities, for a Poisson regression the log expected counts etc.
LightGBM seems to use "link" in the exact opposite, i.e., in the sense that the inverse link is being applied. But this is what is usually be called "response"...
My suggestions:
- The default type is "response". Binary logistic will return probabilities.
- There is either "raw" or "link" but not both. For historic reasons, I tend to "raw", but "link" is fine as well.
- There should be a type = "class" in order to get the majority class for those objectives where it makes sense.
glm.predict()
does not have this, but for instance "predict.glmnet()". - Type "leaf" and "contrib" are okay.
Put differently: "link" should be called "response", "response" should be called "class", and "raw" could be called "link".
Side note: predict.glm()
uses type = "link"
as the default, i.e., it predicts log odds instead of probabilities for a logistic regression. We should not do it similarly because then every logistic/Poisson/Gamma/Tweedie model would be affected from the change.
@mayer79 Thanks for spotting the mismatch w.r.t. base R. I've updated it by changing the names to "response" for what's obtained after applying the link function and "class" for the type that predicts class in classification objectives. |
The linter check doesn't say what exactly failed, don't know what it is complaining about: https://github.com/microsoft/LightGBM/runs/6887268528?check_suite_focus=true |
@david-cortes that error you're facing comes from
There was a new major release of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much to this, and to @mayer79 for the excellent review comments!
I re-reviewed tonight. Please see a few more suggestions that I believe should be addressed before this is merged.
preds_raw_r6_param <- bst$predict(X, params = params) | ||
expect_equal(preds_raw_s3_keyword, preds_raw_s3_param) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I just noticed this. @david-cortes , why are you proposing removing support for passing something like "predict_raw_score"
through params?
That effectively reverts part of #5122, which was a fix for an issue you opened (#4670).
I feel that {lightgbm}
should continue to support changing the type of prediction via params
entries, and that params
entries should take precedence over the type
keyword argument in predict.lgb.Booster()
, unless you can provide a compelling reason to remove that support as part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restored them.
Co-authored-by: James Lamb <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me, thanks very much! If you can resolve the merge conflicts with master
and get this updated, we'll merge it.
Updated. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
The
predict
function for lightgbm model objects has three mutually-exclusive parameters (rawscore
,predcontrib
,predleaf
) which control the type of output to produce in the predictions.The convention in R is to make these types controllable through a
type
parameter, as done in base R'spredict.glm
and in many core packages for decision trees such asrpart
,randomForest
,gbm
, and other popular decision tree packages such asranger
,C50
, etc.Some popular packages do take booleans, but mostly when they output more than one prediction type per function call, which lightgbm doesn't do.
In order to make the R interface more similar to base R and to other packages, this PR replaces those boolean arguments with a
type
argument.Additionally, it adds a
"response"
type which will output the predicted class in classification objectives (something that's present in the Python interface but missing from the R interface).