Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package][R-package] allow using feature names when retrieving number of bins #5116

Merged
merged 10 commits into from
May 17, 2022

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented Apr 1, 2022

Improves on #5048 by allowing to use the feature names as well as the feature indices when retrieving the constructed number of bins. This follows the proposal described #5048 (review).

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Apr 1, 2022

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/2080279172

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.3.2.99.tar.gz-2a6822f9b9014b7aaffdba4b55bc65cb
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.3.2.99.tar.gz-f5203778faf94b419de748090cfe20ab
Reports also have been sent to LightGBM public e-mail: https://yopmail.com?lightgbm_rhub_checks
Status: success ✔️.

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Apr 1, 2022

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/2080279502

Status: success ✔️.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took me a few days to provide a review. I left a few suggestions for your consideration.

R-package/R/lgb.Dataset.R Outdated Show resolved Hide resolved
Comment on lines 291 to 293
# If the data didn't have feature names we take the ones defined at cpp side
# otherwise we just overwrite them
self$get_colnames()
Copy link
Collaborator

@jameslamb jameslamb Apr 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the discussion in https://github.com/microsoft/LightGBM/pull/5116/files#r843276288 that led to this change, but I don't understand why adding this call here is necessary for the R package and I don't think it's achieving what you want (avoiding a call to the C++ side for performance reasons in Dataset$get_feature_num_bin()).

The use of colnames(self) down in this PR's proposed change to Dataset$get_feature_num_bin() will still make a call to the C++ side.

colnames() called on a Dataset calls dimnames.lgb.Dataset(), which calls self$get_colnames(), which gets feature names from the C++ side for a constructed Dataset.

return(list(NULL, x$get_colnames()))

If you want to avoid that call to the C++ side, I think the code in Dataset$get_feature_num_bin() could just reference private$colnames directly.

Consider the following example:

library(lightgbm)
data("EuStockMarkets")

feature_names <- c("SMI", "CAC", "FTSE")
target_name <- "DAX"
stockDF <- as.data.frame(EuStockMarkets)
X_mat <- data.matrix(stockDF[, feature_names])
y <- stockDF[[target_name]]

# colnames are stored in private$colnames when initializing Dataset
dtrain <- lightgbm::lgb.Dataset(
    data = X_mat
    , label = y
    , colnames = feature_names
)
print(dtrain$.__enclos_env__$private$colnames)

# colnames are in private$colnames after construction
dtrain$construct()
print(dtrain$.__enclos_env__$private$colnames)

# set_colnames() updates the C++ side AND private$colnames
dtrain$set_colnames(c("A", "B", "C"))
print(dtrain$.__enclos_env__$private$colnames)
.Call(
    lightgbm:::LGBM_DatasetGetFeatureNames_R
    , dtrain$.__enclos_env__$private$handle
)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. You're right, the call in the Dataset$get_feature_num_bin() should use private$colnames. The line here however tries to mimic the one on the python side when after constructing the dataset calls self.feature_name = self.get_feature_name() to make sure that the feature names on the python side are the same as the ones on the C++ side. This is for the case when no colnames are provided in the constructor but the default names are assigned on the C++ side and we want to retrieve the feature bins by the default names, because in that case private$colnames is NULL until we call self$get_colnames() once.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make sure that the feature names on the python side are the same as the ones on the C++ side

Ok, I see. I still think it is not obvious, when looking at this line, that that's what is happening. It just looks like a call that got left behind, whose result isn't assigned.

Since you already know with certainty that at this point in Dataset$construct() that Dataset has been constructed, I think it would be clearer to directly set private$colnames like this:

private$colnames <- .Call(
    LGBM_DatasetGetFeatureNames_R
    , private$handle
)

Instead of relying on the fact that self$get_colnames() happens to do that while it's also getting the features from the C++ side.

Looking at this again, I also recommend reformatting this comment to the following:

# Ensure that private$colnames matches the feature names on the C++ side. This line is necessary
# in cases like constructing from a file or from a matrix with no column names.

So that it's a bit clearer when this can happen.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 302a6a1

R-package/tests/testthat/test_dataset.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks a lot for adding this feature!

Just one suggestion: how about adding a test with categorical features?

@jameslamb jameslamb mentioned this pull request Apr 14, 2022
60 tasks
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-aproving. Thanks for adding the test!

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Please just see my one suggestion.

Comment on lines 291 to 293
# If the data didn't have feature names we take the ones defined at cpp side
# otherwise we just overwrite them
self$get_colnames()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make sure that the feature names on the python side are the same as the ones on the C++ side

Ok, I see. I still think it is not obvious, when looking at this line, that that's what is happening. It just looks like a call that got left behind, whose result isn't assigned.

Since you already know with certainty that at this point in Dataset$construct() that Dataset has been constructed, I think it would be clearer to directly set private$colnames like this:

private$colnames <- .Call(
    LGBM_DatasetGetFeatureNames_R
    , private$handle
)

Instead of relying on the fact that self$get_colnames() happens to do that while it's also getting the features from the C++ side.

Looking at this again, I also recommend reformatting this comment to the following:

# Ensure that private$colnames matches the feature names on the C++ side. This line is necessary
# in cases like constructing from a file or from a matrix with no column names.

So that it's a bit clearer when this can happen.

@jmoralez jmoralez requested a review from jameslamb May 16, 2022 16:40
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, thanks for working on this!

@jameslamb jameslamb merged commit 5b664b6 into microsoft:master May 17, 2022
@jmoralez jmoralez deleted the feature-bin-by-name branch May 17, 2022 04:16
@jameslamb jameslamb mentioned this pull request Oct 7, 2022
40 tasks
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants