[R-package] Use type argument to control prediction types #5133

Merged
merged 18 commits into from
Jun 27, 2022
Changes from 3 commits
1 change: 1 addition & 0 deletions R-package/NAMESPACE
@@ -54,6 +54,7 @@ importFrom(jsonlite,fromJSON)
importFrom(methods,is)
importFrom(parallel,detectCores)
importFrom(stats,quantile)
importFrom(utils,head)
importFrom(utils,modifyList)
importFrom(utils,read.delim)
useDynLib(lib_lightgbm , .registration = TRUE)
78 changes: 54 additions & 24 deletions R-package/R/lgb.Booster.R
@@ -713,6 +713,23 @@ Booster <- R6::R6Class(
#' @param object Object of class \code{lgb.Booster}
#' @param newdata a \code{matrix} object, a \code{dgCMatrix} object or
#' a character representing a path to a text file (CSV, TSV, or LibSVM)
#' @param type Type of prediction to output. Allowed types are:\itemize{
Collaborator

Can you please add a note to this documentation that when choosing "link" and "response", if you're using a custom objective function they'll be ignored and "raw" predictions will be returned?

On the Python side, lightgbm raises a warning in such situations.

if callable(self._objective) and not (raw_score or pred_leaf or pred_contrib):
    _log_warning("Cannot compute class probabilities or labels "
                 "due to the usage of customized objective function.\n"
                 "Returning raw scores instead.")

The R package should probably do that too, but that could be deferred to a later PR.
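
For illustration only, the fallback described in this comment could look roughly like the following in R. This is a minimal sketch, not lightgbm code: `resolve_prediction_type()` is a hypothetical helper, and it assumes the objective is available either as a string (built-in objective) or as an R function (custom objective).

```r
# Hypothetical helper, not part of lightgbm: decide which prediction type can
# actually be produced, and warn when falling back to raw scores.
# Assumption: `objective` is a string for built-in objectives and an R
# function for custom ones.
resolve_prediction_type <- function(type, objective) {
  uses_custom_objective <- is.function(objective)
  if (uses_custom_objective && type %in% c("link", "response")) {
    warning(
      "Cannot compute class probabilities or labels due to the usage of ",
      "customized objective function. Returning raw scores instead."
    )
    return("raw")
  }
  type
}

# A built-in objective name leaves the requested type untouched.
builtin_result <- resolve_prediction_type("response", "binary")

# A custom objective (any function) triggers the warning and falls back to "raw".
custom_obj <- function(preds, dtrain) list(grad = preds, hess = preds)
custom_result <- suppressWarnings(resolve_prediction_type("link", custom_obj))
```

The types "raw", "leaf" and "contrib" pass through unchanged in this sketch, matching the condition in the Python snippet above.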

#' \item \code{"link"}: will output the predicted score according to the objective function being
#' optimized (depending on the link function that the objective uses), after applying any necessary
#' transformations - for example, for \code{objective="binary"}, it will output class probabilities.
#' \item \code{"response"}: for classification objectives, will output the class with the highest predicted
#' probability. For other objectives, will output the same as "link".
#' \item \code{"raw"}: will output the non-transformed numbers (sum of predictions from boosting iterations'
#' results) from which the "link" number is produced for a given objective function - for example, for
#' \code{objective="binary"}, this corresponds to log-odds. For many objectives such as "regression",
#' since no transformation is applied, the output will be the same as for "link".
#' \item \code{"leaf"}: will output the index of the terminal node / leaf at which each observations falls
#' in each tree in the model, outputted as as integers, with one column per tree.
Collaborator

Suggested change
#' in each tree in the model, outputted as as integers, with one column per tree.
#' in each tree in the model, outputted as integers, with one column per tree.

#' \item \code{"contrib"}: will return the per-feature contributions for each prediction, including an
#' intercept (each feature will produce one column). If there are multiple classes, each class will
#' have separate feature contributions (thus the number of columns is feaures+1 multiplied by the
Contributor

Suggested change
#' have separate feature contributions (thus the number of columns is feaures+1 multiplied by the
#' have separate feature contributions (thus the number of columns is features+1 multiplied by the

#' number of classes).
#' }
#' @param start_iteration int or None, optional (default=None)
#' Start index of the iteration to predict.
#' If None or <= 0, starts from the first iteration.
@@ -721,22 +738,19 @@ Booster <- R6::R6Class(
#' If None, if the best iteration exists and start_iteration is None or <= 0, the
#' best iteration is used; otherwise, all iterations from start_iteration are used.
#' If <= 0, all iterations from start_iteration are used (no limits).
#' @param rawscore whether the prediction should be returned in the for of original untransformed
#' sum of predictions from boosting iterations' results. E.g., setting \code{rawscore=TRUE}
#' for logistic regression would result in predictions for log-odds instead of probabilities.
#' @param predleaf whether predict leaf index instead.
#' @param predcontrib return per-feature contributions for each record.
#' @param header only used for prediction for text file. True if text file has header
#' @param params a list of additional named parameters. See
#' \href{https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict-parameters}{
#' the "Predict Parameters" section of the documentation} for a list of parameters and
#' valid values.
#' @param ... ignored
#' @return For regression or binary classification, it returns a vector of length \code{nrows(data)}.
#' For multiclass classification, it returns a matrix of dimensions \code{(nrows(data), num_class)}.
#' @return For prediction types that are meant to always return one output per observation (e.g. when predicting
#' \code{type="link"} on a binary classification or regression objective), will return a vector with one
#' row per observation in \code{newdata}.
#'
#' When passing \code{predleaf=TRUE} or \code{predcontrib=TRUE}, the output will always be
#' returned as a matrix.
#' For prediction types that are meant to return more than one output per observation (e.g. when predicting
#' \code{type="link"} on a multi-class objective, or when predicting \code{type="leaf"}, regardless of
#' objective), will return a matrix with one row per observation in \code{newdata} and one column per output.
#'
#' @examples
#' \donttest{
@@ -770,15 +784,13 @@ Booster <- R6::R6Class(
#' )
#' )
#' }
#' @importFrom utils modifyList
#' @importFrom utils modifyList head
#' @export
predict.lgb.Booster <- function(object,
newdata,
type = c("link", "response", "raw", "leaf", "contrib"),
start_iteration = NULL,
num_iteration = NULL,
rawscore = FALSE,
predleaf = FALSE,
Comment on lines -809 to -810
Collaborator

Down in the section where this function catches arguments that fall into ...

if ("reshape" %in% names(additional_params)) {
    stop("'reshape' argument is no longer supported.")
}

Please add the following:

if (isTRUE(additional_params[["rawscore"]])) {
    stop("Argument 'rawscore' is no longer supported. Use type = 'raw' instead.")
}
if (isTRUE(additional_params[["predleaf"]])) {
    stop("Argument 'predleaf' is no longer supported. Use type = 'leaf' instead.")
}
if (isTRUE(additional_params[["predcontrib"]])) {
    stop("Argument 'predcontrib' is no longer supported. Use type = 'contrib' instead.")
}

I'm ok with breaking users' code in the next release in exchange for making the package's interface more compatible with other packages for modeling in R, but I think we should provide specific, actionable error messages when possible to reduce the effort required for affected users to alter their code.
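
As a hedged sketch of the same idea, the three checks can be driven from a lookup table so that each removed argument maps to its replacement in one place. `removed_args` and `check_removed_args()` are hypothetical names used only for illustration; they are not part of the package.

```r
# Hypothetical lookup table: removed argument -> replacement value for `type`.
removed_args <- c(rawscore = "raw", predleaf = "leaf", predcontrib = "contrib")

# Hypothetical helper: stop with an actionable message if a removed argument
# was passed as TRUE through `...`.
check_removed_args <- function(...) {
  additional_params <- list(...)
  for (arg in names(removed_args)) {
    if (isTRUE(additional_params[[arg]])) {
      stop(sprintf(
        "Argument '%s' is no longer supported. Use type = '%s' instead.",
        arg, removed_args[[arg]]
      ))
    }
  }
  invisible(TRUE)
}

# Capture the error message instead of aborting the script.
msg <- tryCatch(check_removed_args(rawscore = TRUE), error = conditionMessage)
print(msg)
```

Like the snippet in the comment, this only reacts to a value of `TRUE`; a call such as `rawscore = FALSE` passes silently, which may or may not be the desired behavior.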

predcontrib = FALSE,
header = FALSE,
params = list(),
...) {
@@ -799,18 +811,36 @@ predict.lgb.Booster <- function(object,
))
}

return(
object$predict(
data = newdata
, start_iteration = start_iteration
, num_iteration = num_iteration
, rawscore = rawscore
, predleaf = predleaf
, predcontrib = predcontrib
, header = header
, params = params
)
type <- head(type, 1L)
Collaborator

What is the purpose of having type default to a vector with 5 elements, and then always taking only the first thing provided? It seems to me that it would be simpler to just default to "link".

Contributor Author

The purpose is to have the allowed values visible in the function signature so that they are easily seen by the user and easy to autocomplete, in the same way as for example base R's predict.glm.
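
For context, the base R idiom that usually accompanies a vector-valued default like this is `match.arg()`, which is what `predict.glm` uses. The sketch below is not the code in this PR (which uses `head(type, 1L)`); `predict_sketch()` is a hypothetical stand-in that only shows how `type` would be resolved.

```r
# Hypothetical stand-in for a predict method; only the handling of `type` matters.
predict_sketch <- function(type = c("link", "response", "raw", "leaf", "contrib")) {
  # match.arg() picks the first choice when `type` is not supplied and raises
  # an error for values that are not among the listed choices.
  type <- match.arg(type)
  type
}

default_type <- predict_sketch()
leaf_type <- predict_sketch("leaf")
bad_type <- tryCatch(predict_sketch("probability"), error = function(e) "rejected")
```

One difference worth noting: `head(type, 1L)` accepts any string, so with the branching in this diff an unrecognized value falls through to the default prediction path without an error, whereas `match.arg()` rejects it. `match.arg()` also allows partial matching of the choices.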

Collaborator

Ok thank you. This project has documentation for the purpose of describing which values are supported, and I believe the pattern of having a default value which is not the value that will be directly used will be confusing for both developers and users of the package. Please remove this and set the default to "link".

rawscore <- FALSE
predleaf <- FALSE
predcontrib <- FALSE
if (type == "raw") {
rawscore <- TRUE
} else if (type == "leaf") {
predleaf <- TRUE
} else if (type == "contrib") {
predcontrib <- TRUE
}

pred <- object$predict(
data = newdata
, start_iteration = start_iteration
, num_iteration = num_iteration
, rawscore = rawscore
, predleaf = predleaf
, predcontrib = predcontrib
, header = header
, params = params
)
if (type == "response") {
if (object$params$objective == "binary") {
pred <- as.integer(pred >= 0.5)
} else if (object$params$objective %in% c("multiclass", "multiclassova")) {
pred <- max.col(pred) - 1L
}
}
Collaborator

Since this is new behavior being added to the package, please add unit tests confirming that it works as expected.
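
A minimal sketch of what such checks might cover, written with base R `stopifnot()` so it runs without a trained model; in the package itself these would presumably be `testthat` expectations against real boosters. `to_response()` is a hypothetical helper that mirrors the conversion in the diff above.

```r
# Hypothetical helper mirroring the type = "response" conversion in the diff.
to_response <- function(pred, objective) {
  if (objective == "binary") {
    return(as.integer(pred >= 0.5))
  }
  if (objective %in% c("multiclass", "multiclassova")) {
    # ties.method = "first" keeps the result deterministic; max.col() breaks
    # ties at random by default.
    return(max.col(pred, ties.method = "first") - 1L)
  }
  pred
}

binary_probs <- c(0.1, 0.5, 0.9)
multi_probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2L, byrow = TRUE
)

# Binary: probabilities at or above 0.5 map to class 1.
stopifnot(identical(to_response(binary_probs, "binary"), c(0L, 1L, 1L)))
# Multiclass: zero-based index of the most probable class per row.
stopifnot(all(to_response(multi_probs, "multiclass") == c(0L, 2L)))
# Other objectives: predictions are returned unchanged.
stopifnot(identical(to_response(c(1.5, 2.5), "regression"), c(1.5, 2.5)))
```

The diff calls `max.col(pred)` with its default tie handling, so rows with exactly tied probabilities could map to different classes across runs; whether that matters in practice is a judgment call for the tests.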

return(pred)
}

#' @name print.lgb.Booster
4 changes: 2 additions & 2 deletions R-package/demo/boost_from_prediction.R
@@ -22,8 +22,8 @@ param <- list(
bst <- lgb.train(param, dtrain, 1L, valids = valids)

# Note: we need the margin value instead of transformed prediction in set_init_score
ptrain <- predict(bst, agaricus.train$data, rawscore = TRUE)
ptest <- predict(bst, agaricus.test$data, rawscore = TRUE)
ptrain <- predict(bst, agaricus.train$data, type = "raw")
ptest <- predict(bst, agaricus.test$data, type = "raw")

# set the init_score property of dtrain and dtest
# base margin is the base prediction we will boost from
6 changes: 3 additions & 3 deletions R-package/demo/leaf_stability.R
@@ -111,7 +111,7 @@ new_data <- data.frame(
X = rowMeans(predict(
model
, agaricus.test$data
, predleaf = TRUE
, type = "leaf"
))
, Y = pmin(
pmax(
@@ -162,7 +162,7 @@ new_data2 <- data.frame(
X = rowMeans(predict(
model2
, agaricus.test$data
, predleaf = TRUE
, type = "leaf"
))
, Y = pmin(
pmax(
@@ -218,7 +218,7 @@ new_data3 <- data.frame(
X = rowMeans(predict(
model3
, agaricus.test$data
, predleaf = TRUE
, type = "leaf"
))
, Y = pmin(
pmax(
4 changes: 2 additions & 2 deletions R-package/demo/multiclass.R
@@ -64,7 +64,7 @@ my_preds <- predict(model, test[, 1L:4L])
my_preds <- predict(model, test[, 1L:4L])

# We can also get the predicted scores before the Sigmoid/Softmax application
my_preds <- predict(model, test[, 1L:4L], rawscore = TRUE)
my_preds <- predict(model, test[, 1L:4L], type = "raw")

# We can also get the leaf index
my_preds <- predict(model, test[, 1L:4L], predleaf = TRUE)
my_preds <- predict(model, test[, 1L:4L], type = "leaf")
4 changes: 2 additions & 2 deletions R-package/demo/multiclass_custom_objective.R
@@ -36,7 +36,7 @@ model_builtin <- lgb.train(
, obj = "multiclass"
)

preds_builtin <- predict(model_builtin, test[, 1L:4L], rawscore = TRUE)
preds_builtin <- predict(model_builtin, test[, 1L:4L], type = "raw")
probs_builtin <- exp(preds_builtin) / rowSums(exp(preds_builtin))

# Method 2 of training with custom objective function
@@ -109,7 +109,7 @@ model_custom <- lgb.train(
, eval = custom_multiclass_metric
)

preds_custom <- predict(model_custom, test[, 1L:4L], rawscore = TRUE)
preds_custom <- predict(model_custom, test[, 1L:4L], type = "raw")
probs_custom <- exp(preds_custom) / rowSums(exp(preds_custom))

# compare predictions
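
Both calls in this demo turn raw multiclass scores into probabilities with `exp(x) / rowSums(exp(x))`. The sketch below shows the same softmax with the row maximum subtracted first, a common way to avoid overflow for large scores. `softmax_rows()` is a hypothetical helper and the input matrix is made up for illustration; it is not output from a real model.

```r
# Hypothetical helper: row-wise softmax over a matrix of raw scores
# (one row per observation, one column per class).
softmax_rows <- function(raw_scores) {
  # Subtracting each row's maximum leaves the result unchanged mathematically
  # but keeps exp() from overflowing.
  shifted <- raw_scores - apply(raw_scores, 1L, max)
  exp(shifted) / rowSums(exp(shifted))
}

# Made-up raw scores; the second row would overflow in the naive formula.
raw <- matrix(
  c(1, 2, 3,
    1000, 1001, 1002),
  nrow = 2L, byrow = TRUE
)
probs <- softmax_rows(raw)
naive <- exp(raw) / rowSums(exp(raw))
```

For moderate scores, such as those in the demo, the naive formula and the shifted version agree; the difference only shows up when `exp()` overflows.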
40 changes: 25 additions & 15 deletions R-package/man/predict.lgb.Booster.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

28 changes: 14 additions & 14 deletions R-package/tests/testthat/test_Predictor.R
@@ -81,8 +81,8 @@ test_that("start_iteration works correctly", {
, early_stopping_rounds = 2L
)
expect_true(lgb.is.Booster(bst))
pred1 <- predict(bst, newdata = test$data, rawscore = TRUE)
pred_contrib1 <- predict(bst, test$data, predcontrib = TRUE)
pred1 <- predict(bst, newdata = test$data, type = "raw")
pred_contrib1 <- predict(bst, test$data, type = "contrib")
pred2 <- rep(0.0, length(pred1))
pred_contrib2 <- rep(0.0, length(pred2))
step <- 11L
@@ -96,7 +96,7 @@ test_that("start_iteration works correctly", {
inc_pred <- predict(bst, test$data
, start_iteration = start_iter
, num_iteration = n_iter
, rawscore = TRUE
, type = "raw"
)
inc_pred_contrib <- bst$predict(test$data
, start_iteration = start_iter
@@ -109,8 +109,8 @@ test_that("start_iteration works correctly", {
expect_equal(pred2, pred1)
expect_equal(pred_contrib2, pred_contrib1)

pred_leaf1 <- predict(bst, test$data, predleaf = TRUE)
pred_leaf2 <- predict(bst, test$data, start_iteration = 0L, num_iteration = end_iter + 1L, predleaf = TRUE)
pred_leaf1 <- predict(bst, test$data, type = "leaf")
pred_leaf2 <- predict(bst, test$data, start_iteration = 0L, num_iteration = end_iter + 1L, type = "leaf")
expect_equal(pred_leaf1, pred_leaf2)
})

@@ -139,11 +139,11 @@ test_that("start_iteration works correctly", {
# dense matrix with row names
pred <- predict(bst, X)
.expect_has_row_names(pred, X)
pred <- predict(bst, X, rawscore = TRUE)
pred <- predict(bst, X, type = "raw")
.expect_has_row_names(pred, X)
pred <- predict(bst, X, predleaf = TRUE)
pred <- predict(bst, X, type = "leaf")
.expect_has_row_names(pred, X)
pred <- predict(bst, X, predcontrib = TRUE)
pred <- predict(bst, X, type = "contrib")
.expect_has_row_names(pred, X)

# dense matrix without row names
@@ -156,11 +156,11 @@ test_that("start_iteration works correctly", {
Xcsc <- as(X, "CsparseMatrix")
pred <- predict(bst, Xcsc)
.expect_has_row_names(pred, Xcsc)
pred <- predict(bst, Xcsc, rawscore = TRUE)
pred <- predict(bst, Xcsc, type = "raw")
.expect_has_row_names(pred, Xcsc)
pred <- predict(bst, Xcsc, predleaf = TRUE)
pred <- predict(bst, Xcsc, type = "leaf")
.expect_has_row_names(pred, Xcsc)
pred <- predict(bst, Xcsc, predcontrib = TRUE)
pred <- predict(bst, Xcsc, type = "contrib")
.expect_has_row_names(pred, Xcsc)

# sparse matrix without row names
@@ -245,7 +245,7 @@ test_that("predictions for regression and binary classification are returned as
pred <- predict(model, X)
expect_true(is.vector(pred))
expect_equal(length(pred), nrow(X))
pred <- predict(model, X, rawscore = TRUE)
pred <- predict(model, X, type = "raw")
expect_true(is.vector(pred))
expect_equal(length(pred), nrow(X))

@@ -262,7 +262,7 @@ test_that("predictions for regression and binary classification are returned as
pred <- predict(model, X)
expect_true(is.vector(pred))
expect_equal(length(pred), nrow(X))
pred <- predict(model, X, rawscore = TRUE)
pred <- predict(model, X, type = "raw")
expect_true(is.vector(pred))
expect_equal(length(pred), nrow(X))
})
@@ -283,7 +283,7 @@ test_that("predictions for multiclass classification are returned as matrix", {
expect_true(is.matrix(pred))
expect_equal(nrow(pred), nrow(X))
expect_equal(ncol(pred), 3L)
pred <- predict(model, X, rawscore = TRUE)
pred <- predict(model, X, type = "raw")
expect_true(is.matrix(pred))
expect_equal(nrow(pred), nrow(X))
expect_equal(ncol(pred), 3L)
4 changes: 2 additions & 2 deletions R-package/tests/testthat/test_basic.R
@@ -2898,7 +2898,7 @@ test_that("lightgbm() accepts init_score as function argument", {
, nrounds = 5L
, verbose = -1L
)
pred1 <- predict(bst1, train$data, rawscore = TRUE)
pred1 <- predict(bst1, train$data, type = "raw")

bst2 <- lightgbm(
data = train$data
@@ -2908,7 +2908,7 @@ test_that("lightgbm() accepts init_score as function argument", {
, nrounds = 5L
, verbose = -1L
)
pred2 <- predict(bst2, train$data, rawscore = TRUE)
pred2 <- predict(bst2, train$data, type = "raw")

expect_true(any(pred1 != pred2))
})
8 changes: 4 additions & 4 deletions R-package/tests/testthat/test_lgb.Booster.R
@@ -293,8 +293,8 @@ test_that("Saving a large model to string should work", {
)

pred <- predict(bst, train$data)
pred_leaf_indx <- predict(bst, train$data, predleaf = TRUE)
pred_raw_score <- predict(bst, train$data, rawscore = TRUE)
pred_leaf_indx <- predict(bst, train$data, type = "leaf")
pred_raw_score <- predict(bst, train$data, type = "raw")
model_string <- bst$save_model_to_string()

# make sure this test is still producing a model bigger than the default
@@ -312,8 +312,8 @@ test_that("Saving a large model to string should work", {
model_str = model_string
)
pred2 <- predict(bst2, train$data)
pred2_leaf_indx <- predict(bst2, train$data, predleaf = TRUE)
pred2_raw_score <- predict(bst2, train$data, rawscore = TRUE)
pred2_leaf_indx <- predict(bst2, train$data, type = "leaf")
pred2_raw_score <- predict(bst2, train$data, type = "raw")
expect_identical(pred, pred2)
expect_identical(pred_leaf_indx, pred2_leaf_indx)
expect_identical(pred_raw_score, pred2_raw_score)