Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package][R-package] allow using feature names when retrieving number of bins #5116

Merged
merged 10 commits into from
May 17, 2022
10 changes: 10 additions & 0 deletions R-package/R/lgb.Dataset.R
Original file line number Diff line number Diff line change
Expand Up @@ -288,6 +288,10 @@ Dataset <- R6::R6Class(
self$set_colnames(colnames = private$colnames)
}

# If the data didn't have feature names we take the ones defined at cpp side
# otherwise we just overwrite them
self$get_colnames()
Copy link
Collaborator

@jameslamb jameslamb Apr 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the discussion in https://github.com/microsoft/LightGBM/pull/5116/files#r843276288 that led to this change, but I don't understand why adding this call here is necessary for the R package and I don't think it's achieving what you want (avoiding a call to the C++ side for performance reasons in Dataset$get_feature_num_bin()).

The use of colnames(self) down in this PR's proposed change to Dataset$get_feature_num_bin() will still make a call to the C++ side.

colnames() called on a Dataset calls dimnames.lgb.Dataset(), which calls self$get_colnames(), which gets feature names from the C++ side for a constructed Dataset.

return(list(NULL, x$get_colnames()))

If you want to avoid that call to the C++ side, I think the code in Dataset$get_feature_num_bin() could just reference private$colnames directly.

Consider the following example:

library(lightgbm)
data("EuStockMarkets")

feature_names <- c("SMI", "CAC", "FTSE")
target_name <- "DAX"
stockDF <- as.data.frame(EuStockMarkets)
X_mat <- data.matrix(stockDF[, feature_names])
y <- stockDF[[target_name]]

# colnames are stored in private$colnames when initializing Dataset
dtrain <- lightgbm::lgb.Dataset(
    data = X_mat
    , label = y
    , colnames = feature_names
)
print(dtrain$.__enclos_env__$private$colnames)

# colnames are in private$colnames after construction
dtrain$construct()
print(dtrain$.__enclos_env__$private$colnames)

# set_colnames() updates the C++ side AND private$colnames
dtrain$set_colnames(c("A", "B", "C"))
print(dtrain$.__enclos_env__$private$colnames)
.Call(
    lightgbm:::LGBM_DatasetGetFeatureNames_R
    , dtrain$.__enclos_env__$private$handle
)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. You're right, the call in the Dataset$get_feature_num_bin() should use private$colnames. The line here however tries to mimic the one on the python side when after constructing the dataset calls self.feature_name = self.get_feature_name() to make sure that the feature names on the python side are the same as the ones on the C++ side. This is for the case when no colnames are provided in the constructor but the default names are assigned on the C++ side and we want to retrieve the feature bins by the default names, because in that case private$colnames is NULL until we call self$get_colnames() once.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make sure that the feature names on the python side are the same as the ones on the C++ side

Ok, I see. I still think it is not obvious, when looking at this line, that that's what is happening. It just looks like a call that got left behind, whose result isn't assigned.

Since you already know with certainty that at this point in Dataset$construct() that Dataset has been constructed, I think it would be clearer to directly set private$colnames like this:

private$colnames <- .Call(
    LGBM_DatasetGetFeatureNames_R
    , private$handle
)

Instead of relying on the fact that self$get_colnames() happens to do that while it's also getting the features from the C++ side.

Looking at this again, I also recommend reformatting this comment to the following:

# Ensure that private$colnames matches the feature names on the C++ side. This line is necessary
# in cases like constructing from a file or from a matrix with no column names.

So that it's a bit clearer when this can happen.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 302a6a1


# Load init score if requested
if (!is.null(private$predictor) && is.null(private$used_indices)) {

Expand Down Expand Up @@ -381,6 +385,12 @@ Dataset <- R6::R6Class(
if (lgb.is.null.handle(x = private$handle)) {
stop("Cannot get number of bins in feature before constructing Dataset.")
}
if (is.character(feature)) {
feature <- which(colnames(self) == feature)
if (length(feature) == 0L) {
stop("feature not found")
}
jameslamb marked this conversation as resolved.
Show resolved Hide resolved
}
num_bin <- integer(1L)
.Call(
LGBM_DatasetGetFeatureNumBin_R
Expand Down
11 changes: 11 additions & 0 deletions R-package/tests/testthat/test_dataset.R
Original file line number Diff line number Diff line change
Expand Up @@ -547,4 +547,15 @@ test_that("lgb.Dataset$get_feature_num_bin() works", {
)
actual_num_bins <- sapply(1L:5L, ds$get_feature_num_bin)
expect_identical(actual_num_bins, expected_num_bins)
# test using defined feature names
bins_by_name <- sapply(colnames(raw_mat), ds$get_feature_num_bin)
expect_identical(unname(bins_by_name), expected_num_bins)
# test using default feature names
no_names_mat <- raw_mat
colnames(no_names_mat) <- NULL
ds_no_names <- lgb.Dataset(no_names_mat, params = list(min_data_in_bin = min_data_in_bin))
ds_no_names$construct()
default_names <- lapply(seq(1L, ncol(raw_mat)), function(i) sprintf("Column_%d", i - 1L))
jameslamb marked this conversation as resolved.
Show resolved Hide resolved
bins_by_default_name <- sapply(default_names, ds_no_names$get_feature_num_bin)
expect_identical(bins_by_default_name, expected_num_bins)
})
9 changes: 6 additions & 3 deletions python-package/lightgbm/basic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1821,6 +1821,7 @@ def construct(self):
feature_name=self.feature_name, categorical_feature=self.categorical_feature, params=self.params)
if self.free_raw_data:
self.data = None
self.feature_name = self.get_feature_name()
return self

def create_valid(self, data, label=None, weight=None, group=None, init_score=None, params=None):
Expand Down Expand Up @@ -2386,20 +2387,22 @@ def num_feature(self):
else:
raise LightGBMError("Cannot get num_feature before construct dataset")

def feature_num_bin(self, feature: int) -> int:
def feature_num_bin(self, feature: Union[int, str]) -> int:
"""Get the number of bins for a feature.

Parameters
----------
feature : int
Index of the feature.
feature : int or str
Index or name of the feature.

Returns
-------
number_of_bins : int
The number of constructed bins for the feature in the Dataset.
"""
if self.handle is not None:
if isinstance(feature, str):
feature = self.feature_name.index(feature)
jmoralez marked this conversation as resolved.
Show resolved Hide resolved
ret = ctypes.c_int(0)
_safe_call(_LIB.LGBM_DatasetGetFeatureNumBin(self.handle,
ctypes.c_int(feature),
Expand Down
13 changes: 12 additions & 1 deletion tests/python_package_test/test_basic.py
Original file line number Diff line number Diff line change
Expand Up @@ -634,7 +634,9 @@ def test_feature_num_bin(min_data_in_bin):
np.array([1, 2] * 49 + 2 * [np.nan]),
np.zeros(100),
]).T
ds = lgb.Dataset(X, params={'min_data_in_bin': min_data_in_bin}).construct()
feature_name = [f'x{i}' for i in range(X.shape[1])]
ds = lgb.Dataset(X, params={'min_data_in_bin': min_data_in_bin}, feature_name=feature_name)
ds.construct()
expected_num_bins = [
100 // min_data_in_bin + 1, # extra bin for zero
3, # 0, 1, 2
Expand All @@ -644,6 +646,15 @@ def test_feature_num_bin(min_data_in_bin):
]
actual_num_bins = [ds.feature_num_bin(i) for i in range(X.shape[1])]
assert actual_num_bins == expected_num_bins
# test using defined feature names
bins_by_name = [ds.feature_num_bin(name) for name in feature_name]
assert bins_by_name == expected_num_bins
# test using default feature names
ds_no_names = lgb.Dataset(X, params={'min_data_in_bin': min_data_in_bin})
ds_no_names.construct()
default_names = [f'Column_{i}' for i in range(X.shape[1])]
bins_by_default_name = [ds_no_names.feature_num_bin(name) for name in default_names]
assert bins_by_default_name == expected_num_bins


def test_feature_num_bin_with_max_bin_by_feature():
Expand Down