[R] Remove `reshape` argument in `predict` #10330

david-cortes · 2024-05-25T20:13:42Z

By default, the predict method will return outputs in a vector having row-major order. This is quite unintuitive in R, since matrices and arrays follow column-major order, thus it's rather unlikely that a user would want to use the prediction output as-is if it comes as row-major.

This PR changes the default instead to reshape=TRUE, which in the case of multi-column objectives, will return an R matrix with the dimensions that a user would expect.

What's more, if a user were looking to avoid computational inefficiencies by avoiding the data transpose from row-major to column-major, xgboost also provides a strict_shape parameter which has that effect with a more clearly signaled output.

I am nevertheless not sure that the current mix between reshape and strict_shape as mutually exclusive parameters is ideal - perhaps there could be a parameter strict_shape that would determine whether vectors are reshaped to matrices when they are 1D but have no effect for 2D and higher, and a parameter rows_as_first_dim or similar that would control whether the output shape should have rows as the first or last dimension.

trivialfis · 2024-05-26T06:12:52Z

Maybe we should consolidate them into a single parameter and pick a default?

david-cortes · 2024-05-26T15:37:10Z

Thing is, they have rather different usages:

A user might want to ensure that the output from predict will always be the same class, so that predictions from a unidimensional objective would still be a matrix but with only one column, in order to pass that to some further framework that would require this logic.
A user might want to make the predictions as fast as possible, for which outputting the data in inverse order of dimensions would be the solution in order to accommodate R's column-major order.

But I'm not sure what should be the ideal arguments to offer to control this sort of functionalities.

trivialfis · 2024-05-27T02:25:22Z

Hmm, my personal preference is something that aligns with R and is consistent. Performance is great, but not more important than consistency. I don't know what are the best practices established by other frameworks. I would choose the default prediction to return a consistent shape as stated in https://github.com/dmlc/xgboost/blob/master/doc/prediction.rst , in a format that aligns with R (say, column-major). The type is a bit of a headache, maybe we use a matrix when possible, an array when the dimension is higher than 2?

david-cortes · 2024-05-27T16:38:06Z

The type is a bit of a headache, maybe we use a matrix when possible, an array when the dimension is higher than 2?

The object class for predictions with more than one entry per row should not be an issue, because all 2D arrays in R are both arrays and matrices:

is.matrix(array(dim=c(2,3)))

[1] TRUE

class(matrix(0, nrow=2, ncol=3))

[1] "matrix" "array"

What I'd say is an issue is to return a 2D object as a 1D vector instead of giving it dimensions, which is the current situation in the master branch.

I don't know what are the best practices established by other frameworks.

For the most part, other frameworks (including base R's stats module) behave the same as when passing reshape=TRUE here. Some packages might return the equivalent of strict_shape=TRUE but they are in the minority. I'm not aware of any package that would allow configuring the output shape like this though.

trivialfis · 2024-05-28T07:30:24Z

I'm not aware of any package that would allow configuring the output shape like this though.

We can remove it as well. Simplicity seems important.

What I'd say is an issue is to return a 2D object as a 1D vector instead of giving it dimensions

If we can avoid doing that with this PR, we can always return a matrix with dims instead of a numeric vector. Thank you for sharing that matrix and array are not that different in R. I think that's good news for us, we can use it consistently and avoid other types.

david-cortes · 2024-05-28T19:53:00Z

@trivialfis After a more thorough reading, I'm thinking the following could be a better alternative:

Pass the strict_shape parameter from the user to the C-level prediction function instead of always setting it to TRUE, and use the dimensions as they come from it (in reverse order, since they are row-major).
Add an option to output the result as-is from the step above.
If the result has more than one dimension, transpose (or permute dimensions in reverse order if more than 2D) the output as it comes from the C-level function in order to translate that into an R array (column-major) of the same original dimensions.
Potentially, keep a reshape parameter (FALSE by default) to control whether to split multi-class and multi-target into lists or not. Otherwise output the result from the previous step.

@mayer79 what are your thoughts on this?

mayer79 · 2024-05-29T06:52:02Z

@david-cortes : I would support this, very good idea.

With predcontrib = TRUE or predinteraction = TRUE, I think it is convenient to allow the user to split output dimensions into a list. Maybe even keep this behavior as the default?

david-cortes · 2024-06-02T19:08:48Z

Redid the PR so as to let the C++ core library handle the shape, allowing the user to pass strict_shape and avoid_transpose. The shapes should now always match with the python interface, as long as one doesn't pass avoid_transpose.

As part of the changes, I realized now most of the custom R prediction code is no longer needed so I removed it, alongside with removing the reshape argument. Also added a warning when the user supplies unused prediction parameters under ..., which catched a few tests that were passing additional arguments from previous code that was also removed.

There's one thing that I wasn't very sure about though: there's a function xgb.shap.data which does some custom reshaping inside it. It seems to be an internal-only function, but it appears it is meant to take on objects that could have come from outside XGBoost. Not sure if it needs additional changes after this PR.

mayer79 · 2024-06-03T12:22:01Z

Looks great! Not sure about xgb.shap.data() as well. I don't think it is heavily used though.

One comment: Why do we use .Call(XGSetArrayDimNamesInplace_R, arr, dim_names) instead of a simple

dimnames(arra) <- dim_names

?

According to Hadley's book http://adv-r.had.co.nz/Functions.html:

"(primitive replacement functions don’t have to make copies)"

In our case:

Instead of

if ((predcontrib || predinteraction) && !is.null(colnames(newdata))) {
  cnames <- c(colnames(newdata), "(Intercept)")
  dim_names <- vector(mode = "list", length = length(dim(arr)))
  dim_names[[1L]] <- cnames
  if (predinteraction) dim_names[[2L]] <- cnames
  .Call(XGSetArrayDimNamesInplace_R, arr, dim_names)
}

we could write

if ((predcontrib || predinteraction) && !is.null(colnames(newdata))) {
  rownames(arr) <- cnames <- c(colnames(newdata), "(Intercept)")
  if (predinteraction) 
    colnames(arr) <- cnames
}

david-cortes · 2024-06-03T16:03:30Z

@mayer79 That tutorial is outdated.

mayer79 · 2024-06-04T07:44:10Z

@mayer79 That tutorial is outdated.

Here is the new edition:

https://adv-r.hadley.nz/functions.html

According to that text, "replacement functions actually create a modified copy". Which shows that my comment was wrong!

trivialfis

Thank you for the work on the prediction function. Could you please help rebase the PR?

david-cortes · 2024-08-18T08:45:10Z

Updated.

remove 'reshape' argument, let shapes be handled by core cpp library

ca74400

david-cortes force-pushed the reshape_by_default branch from 61d6e6a to ca74400 Compare June 2, 2024 19:03

david-cortes changed the title ~~[R] Default to reshape=TRUE in predictions~~ [R] Remove reshape argument in predict Jun 2, 2024

david-cortes mentioned this pull request Aug 2, 2024

[R] Working on Roxygen documentation #10674

Merged

trivialfis reviewed Aug 16, 2024

View reviewed changes

solver merge conflicts

165bcc9

trivialfis approved these changes Aug 18, 2024

View reviewed changes

trivialfis merged commit caabee2 into dmlc:master Aug 18, 2024
25 of 30 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R] Remove `reshape` argument in `predict` #10330

[R] Remove `reshape` argument in `predict` #10330

david-cortes commented May 25, 2024

trivialfis commented May 26, 2024

david-cortes commented May 26, 2024

trivialfis commented May 27, 2024

david-cortes commented May 27, 2024

trivialfis commented May 28, 2024

david-cortes commented May 28, 2024 •

edited

Loading

mayer79 commented May 29, 2024

david-cortes commented Jun 2, 2024

mayer79 commented Jun 3, 2024

david-cortes commented Jun 3, 2024

mayer79 commented Jun 4, 2024

trivialfis left a comment

david-cortes commented Aug 18, 2024

[R] Remove reshape argument in predict #10330

[R] Remove reshape argument in predict #10330

Conversation

david-cortes commented May 25, 2024

trivialfis commented May 26, 2024

david-cortes commented May 26, 2024

trivialfis commented May 27, 2024

david-cortes commented May 27, 2024

trivialfis commented May 28, 2024

david-cortes commented May 28, 2024 • edited Loading

mayer79 commented May 29, 2024

david-cortes commented Jun 2, 2024

mayer79 commented Jun 3, 2024

david-cortes commented Jun 3, 2024

mayer79 commented Jun 4, 2024

trivialfis left a comment

Choose a reason for hiding this comment

david-cortes commented Aug 18, 2024

[R] Remove `reshape` argument in `predict` #10330

[R] Remove `reshape` argument in `predict` #10330

david-cortes commented May 28, 2024 •

edited

Loading