-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package][R-package] allow using feature names when retrieving number of bins #5116
Conversation
/gha run r-solaris Workflow Solaris CRAN check has been triggered! 🚀 solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.3.2.99.tar.gz-2a6822f9b9014b7aaffdba4b55bc65cb |
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took me a few days to provide a review. I left a few suggestions for your consideration.
R-package/R/lgb.Dataset.R
Outdated
# If the data didn't have feature names we take the ones defined at cpp side | ||
# otherwise we just overwrite them | ||
self$get_colnames() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand the discussion in https://github.com/microsoft/LightGBM/pull/5116/files#r843276288 that led to this change, but I don't understand why adding this call here is necessary for the R package and I don't think it's achieving what you want (avoiding a call to the C++ side for performance reasons in Dataset$get_feature_num_bin()
).
The use of colnames(self)
down in this PR's proposed change to Dataset$get_feature_num_bin()
will still make a call to the C++ side.
colnames()
called on a Dataset
calls dimnames.lgb.Dataset()
, which calls self$get_colnames()
, which gets feature names from the C++ side for a constructed Dataset.
LightGBM/R-package/R/lgb.Dataset.R
Line 983 in b462d0a
return(list(NULL, x$get_colnames())) |
If you want to avoid that call to the C++ side, I think the code in Dataset$get_feature_num_bin()
could just reference private$colnames
directly.
Consider the following example:
library(lightgbm)
data("EuStockMarkets")
feature_names <- c("SMI", "CAC", "FTSE")
target_name <- "DAX"
stockDF <- as.data.frame(EuStockMarkets)
X_mat <- data.matrix(stockDF[, feature_names])
y <- stockDF[[target_name]]
# colnames are stored in private$colnames when initializing Dataset
dtrain <- lightgbm::lgb.Dataset(
data = X_mat
, label = y
, colnames = feature_names
)
print(dtrain$.__enclos_env__$private$colnames)
# colnames are in private$colnames after construction
dtrain$construct()
print(dtrain$.__enclos_env__$private$colnames)
# set_colnames() updates the C++ side AND private$colnames
dtrain$set_colnames(c("A", "B", "C"))
print(dtrain$.__enclos_env__$private$colnames)
.Call(
lightgbm:::LGBM_DatasetGetFeatureNames_R
, dtrain$.__enclos_env__$private$handle
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. You're right, the call in the Dataset$get_feature_num_bin()
should use private$colnames
. The line here however tries to mimic the one on the python side when after constructing the dataset calls self.feature_name = self.get_feature_name()
to make sure that the feature names on the python side are the same as the ones on the C++ side. This is for the case when no colnames
are provided in the constructor but the default names are assigned on the C++ side and we want to retrieve the feature bins by the default names, because in that case private$colnames
is NULL
until we call self$get_colnames()
once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make sure that the feature names on the python side are the same as the ones on the C++ side
Ok, I see. I still think it is not obvious, when looking at this line, that that's what is happening. It just looks like a call that got left behind, whose result isn't assigned.
Since you already know with certainty that at this point in Dataset$construct()
that Dataset
has been constructed, I think it would be clearer to directly set private$colnames
like this:
private$colnames <- .Call(
LGBM_DatasetGetFeatureNames_R
, private$handle
)
Instead of relying on the fact that self$get_colnames()
happens to do that while it's also getting the features from the C++ side.
Looking at this again, I also recommend reformatting this comment to the following:
# Ensure that private$colnames matches the feature names on the C++ side. This line is necessary
# in cases like constructing from a file or from a matrix with no column names.
So that it's a bit clearer when this can happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 302a6a1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks a lot for adding this feature!
Just one suggestion: how about adding a test with categorical features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-aproving. Thanks for adding the test!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! Please just see my one suggestion.
R-package/R/lgb.Dataset.R
Outdated
# If the data didn't have feature names we take the ones defined at cpp side | ||
# otherwise we just overwrite them | ||
self$get_colnames() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make sure that the feature names on the python side are the same as the ones on the C++ side
Ok, I see. I still think it is not obvious, when looking at this line, that that's what is happening. It just looks like a call that got left behind, whose result isn't assigned.
Since you already know with certainty that at this point in Dataset$construct()
that Dataset
has been constructed, I think it would be clearer to directly set private$colnames
like this:
private$colnames <- .Call(
LGBM_DatasetGetFeatureNames_R
, private$handle
)
Instead of relying on the fact that self$get_colnames()
happens to do that while it's also getting the features from the C++ side.
Looking at this again, I also recommend reformatting this comment to the following:
# Ensure that private$colnames matches the feature names on the C++ side. This line is necessary
# in cases like constructing from a file or from a matrix with no column names.
So that it's a bit clearer when this can happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me, thanks for working on this!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Improves on #5048 by allowing to use the feature names as well as the feature indices when retrieving the constructed number of bins. This follows the proposal described #5048 (review).