Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Remove reshape argument in predict #10330

Merged
merged 2 commits into from
Aug 18, 2024

Conversation

david-cortes
Copy link
Contributor

ref #9810

By default, the predict method will return outputs in a vector having row-major order. This is quite unintuitive in R, since matrices and arrays follow column-major order, thus it's rather unlikely that a user would want to use the prediction output as-is if it comes as row-major.

This PR changes the default instead to reshape=TRUE, which in the case of multi-column objectives, will return an R matrix with the dimensions that a user would expect.

What's more, if a user were looking to avoid computational inefficiencies by avoiding the data transpose from row-major to column-major, xgboost also provides a strict_shape parameter which has that effect with a more clearly signaled output.

I am nevertheless not sure that the current mix between reshape and strict_shape as mutually exclusive parameters is ideal - perhaps there could be a parameter strict_shape that would determine whether vectors are reshaped to matrices when they are 1D but have no effect for 2D and higher, and a parameter rows_as_first_dim or similar that would control whether the output shape should have rows as the first or last dimension.

@trivialfis
Copy link
Member

Maybe we should consolidate them into a single parameter and pick a default?

@david-cortes
Copy link
Contributor Author

Thing is, they have rather different usages:

  • A user might want to ensure that the output from predict will always be the same class, so that predictions from a unidimensional objective would still be a matrix but with only one column, in order to pass that to some further framework that would require this logic.
  • A user might want to make the predictions as fast as possible, for which outputting the data in inverse order of dimensions would be the solution in order to accommodate R's column-major order.

But I'm not sure what should be the ideal arguments to offer to control this sort of functionalities.

@trivialfis
Copy link
Member

Hmm, my personal preference is something that aligns with R and is consistent. Performance is great, but not more important than consistency. I don't know what are the best practices established by other frameworks. I would choose the default prediction to return a consistent shape as stated in https://github.com/dmlc/xgboost/blob/master/doc/prediction.rst , in a format that aligns with R (say, column-major). The type is a bit of a headache, maybe we use a matrix when possible, an array when the dimension is higher than 2?

@david-cortes
Copy link
Contributor Author

The type is a bit of a headache, maybe we use a matrix when possible, an array when the dimension is higher than 2?

The object class for predictions with more than one entry per row should not be an issue, because all 2D arrays in R are both arrays and matrices:

is.matrix(array(dim=c(2,3)))
[1] TRUE
class(matrix(0, nrow=2, ncol=3))
[1] "matrix" "array"

What I'd say is an issue is to return a 2D object as a 1D vector instead of giving it dimensions, which is the current situation in the master branch.

I don't know what are the best practices established by other frameworks.

For the most part, other frameworks (including base R's stats module) behave the same as when passing reshape=TRUE here. Some packages might return the equivalent of strict_shape=TRUE but they are in the minority. I'm not aware of any package that would allow configuring the output shape like this though.

@trivialfis
Copy link
Member

I'm not aware of any package that would allow configuring the output shape like this though.

We can remove it as well. Simplicity seems important.

What I'd say is an issue is to return a 2D object as a 1D vector instead of giving it dimensions

If we can avoid doing that with this PR, we can always return a matrix with dims instead of a numeric vector. Thank you for sharing that matrix and array are not that different in R. I think that's good news for us, we can use it consistently and avoid other types.

@david-cortes
Copy link
Contributor Author

david-cortes commented May 28, 2024

@trivialfis After a more thorough reading, I'm thinking the following could be a better alternative:

  • Pass the strict_shape parameter from the user to the C-level prediction function instead of always setting it to TRUE, and use the dimensions as they come from it (in reverse order, since they are row-major).
  • Add an option to output the result as-is from the step above.
  • If the result has more than one dimension, transpose (or permute dimensions in reverse order if more than 2D) the output as it comes from the C-level function in order to translate that into an R array (column-major) of the same original dimensions.
  • Potentially, keep a reshape parameter (FALSE by default) to control whether to split multi-class and multi-target into lists or not. Otherwise output the result from the previous step.

@mayer79 what are your thoughts on this?

@mayer79
Copy link
Contributor

mayer79 commented May 29, 2024

@david-cortes : I would support this, very good idea.

With predcontrib = TRUE or predinteraction = TRUE, I think it is convenient to allow the user to split output dimensions into a list. Maybe even keep this behavior as the default?

@david-cortes
Copy link
Contributor Author

Redid the PR so as to let the C++ core library handle the shape, allowing the user to pass strict_shape and avoid_transpose. The shapes should now always match with the python interface, as long as one doesn't pass avoid_transpose.

As part of the changes, I realized now most of the custom R prediction code is no longer needed so I removed it, alongside with removing the reshape argument. Also added a warning when the user supplies unused prediction parameters under ..., which catched a few tests that were passing additional arguments from previous code that was also removed.

There's one thing that I wasn't very sure about though: there's a function xgb.shap.data which does some custom reshaping inside it. It seems to be an internal-only function, but it appears it is meant to take on objects that could have come from outside XGBoost. Not sure if it needs additional changes after this PR.

@david-cortes david-cortes changed the title [R] Default to reshape=TRUE in predictions [R] Remove reshape argument in predict Jun 2, 2024
@mayer79
Copy link
Contributor

mayer79 commented Jun 3, 2024

Looks great! Not sure about xgb.shap.data() as well. I don't think it is heavily used though.

One comment: Why do we use .Call(XGSetArrayDimNamesInplace_R, arr, dim_names) instead of a simple

dimnames(arra) <- dim_names

?

According to Hadley's book http://adv-r.had.co.nz/Functions.html:

"(primitive replacement functions don’t have to make copies)"

In our case:

Instead of

if ((predcontrib || predinteraction) && !is.null(colnames(newdata))) {
  cnames <- c(colnames(newdata), "(Intercept)")
  dim_names <- vector(mode = "list", length = length(dim(arr)))
  dim_names[[1L]] <- cnames
  if (predinteraction) dim_names[[2L]] <- cnames
  .Call(XGSetArrayDimNamesInplace_R, arr, dim_names)
}

we could write

if ((predcontrib || predinteraction) && !is.null(colnames(newdata))) {
  rownames(arr) <- cnames <- c(colnames(newdata), "(Intercept)")
  if (predinteraction) 
    colnames(arr) <- cnames
}

@david-cortes
Copy link
Contributor Author

@mayer79 That tutorial is outdated.

@mayer79
Copy link
Contributor

mayer79 commented Jun 4, 2024

@mayer79 That tutorial is outdated.

Here is the new edition:

https://adv-r.hadley.nz/functions.html

According to that text, "replacement functions actually create a modified copy". Which shows that my comment was wrong!

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the work on the prediction function. Could you please help rebase the PR?

@david-cortes
Copy link
Contributor Author

Updated.

@trivialfis trivialfis merged commit caabee2 into dmlc:master Aug 18, 2024
25 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants