[R-package] Dataset construction fails if data does not have feature names and categorical_features is specified #4374

Closed
jameslamb opened this issue Jun 14, 2021 · 4 comments · Fixed by #5184


@jameslamb
Collaborator

Description

If you specify categorical_feature as integer indices and the training data does not have feature names (column names), Dataset construction fails.

Reproducible example

library(lightgbm)
X <- matrix(
    data = c(
        rnorm(100L)
        , rnorm(100L)
        , as.integer(runif(100L, min = 0.0, max = 1.0) / 0.02)
    )
    , ncol = 3L
)
y <- rnorm(100L)

dtrain <- lgb.Dataset(
    data = X
    , label = y
    , categorical_feature = 3L
)
dtrain$construct()

# note that X does not have column names
colnames(X)

Results in the following error message:

Error in dtrain$construct() :
lgb.self.get.handle: supplied a too large value in categorical_feature: 3 but only 0 features
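
A possible workaround on affected versions is to give the matrix column names before building the Dataset. The sketch below is only an illustration: the names `feature_1`, `feature_2`, `feature_3` are placeholders, and the guarded `lgb.Dataset()` call assumes lightgbm is installed and that the check behaves as described in this issue.

```r
# Same shape as the matrix in the example above: 3 columns, no column names.
set.seed(42L)
X <- matrix(
    data = c(
        rnorm(100L)
        , rnorm(100L)
        , as.integer(runif(100L, min = 0.0, max = 1.0) / 0.02)
    )
    , ncol = 3L
)
y <- rnorm(100L)

# Without column names, a count based on length(colnames(X)) is 0.
n_before <- length(colnames(X))

# Workaround: assign placeholder names (chosen only for illustration).
colnames(X) <- paste0("feature_", seq_len(ncol(X)))
n_after <- length(colnames(X))

print(c(before = n_before, after = n_after))

# Optional: this part only runs if lightgbm is installed.
if (requireNamespace("lightgbm", quietly = TRUE)) {
    dtrain <- lightgbm::lgb.Dataset(
        data = X
        , label = y
        , categorical_feature = 3L
    )
    dtrain$construct()
}
```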

Environment info

LightGBM version or commit hash: latest master (53ffba7)

Command(s) you used to install LightGBM

sh build-cran-package.sh
R CMD INSTALL lightgbm_*.tar.gz

Additional Comments

For anyone reading this who is not familiar with LightGBM's internals...this bug will also affect lightgbm(), lgb.cv() and lgb.train().

This bug is caused by the fact that the R package uses length(colnames) to compute "number of features".

if (max(private$categorical_feature) > length(private$colnames)) {

This is not reliable, since creating a Dataset without feature names is supported.
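
To make the difference concrete, here is a minimal base-R sketch. `check_categorical_index()` is a hypothetical helper written for this illustration, not a function in the package and not necessarily what the eventual fix looks like; it only shows that counting columns with `ncol()` does not depend on names being present.

```r
# A matrix without column names, like the one in the reproducible example.
X <- matrix(rnorm(6L), ncol = 3L)

# colnames(X) is NULL here, so its length is 0 even though X has 3 columns.
print(length(colnames(X)))
print(ncol(X))

# Hypothetical helper: validate integer categorical indices against the
# actual number of columns rather than the number of column names.
check_categorical_index <- function(categorical_feature, data) {
    num_features <- ncol(data)
    largest <- as.integer(max(categorical_feature))
    if (largest > num_features) {
        stop(sprintf(
            "supplied a too large value in categorical_feature: %d but only %d features"
            , largest
            , num_features
        ))
    }
    invisible(TRUE)
}

# Index 3 is valid for a 3-column matrix, with or without column names.
check_categorical_index(categorical_feature = 3L, data = X)
```

With this approach an index of 4L would still be rejected for `X`, which is the behavior the original check was trying to enforce.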

@LimpEmu

LimpEmu commented Apr 28, 2022

This problem also happens when the data does have feature names. When I look at length(colnames) for my data, it returns 10, yet I get this message for the last column in the categorical feature list. This package is really difficult to use.

@jameslamb
Collaborator Author

> This problem also happens when the data does have feature names. When I look at length(colnames) for my data, it returns 10, yet I get this message for the last column in the categorical feature list

Are you able to provide a reproducible example of the behavior you're describing? That would help us understand exactly what you mean by "get this message for the last column".

> This package is really difficult to use.

Specific comments indicating surprising or incorrect behavior you run into using LightGBM are very helpful and welcome.

Sweeping complaints like "this package is difficult to use" do not help to improve this project and are very much not welcomed. Please keep your comments in this repo polite and focused on improving the project or getting more information about it.

@LimpEmu

LimpEmu commented Apr 28, 2022

I apologize for my sweeping comment.

I was able to fix my problem by converting my data to a matrix when reading it: mydata <- as.matrix(read_sas("[...]/xxxxxxxx.sas7bdat"))
Without the as.matrix() call, I got the result below. The problem persisted even when I changed the categorical columns to integers. From the documentation, it appears that lightgbm does not work with tibbles.

> cld <- read_sas("[...]/xxxxxx.sas7bdat")
> # split data into training and testing set based on cld36fnl
> cldsplit = sample.split(cld[,1], SplitRatio=0.7)                                            
> cldTrain = subset(cld, cldsplit == TRUE)
> cldTest = subset(cld, cldsplit == FALSE)
> cldTrainX = subset(cldTrain,select=c(3:12))
> cldTrainY = subset(cldTrain,select=c(1))
> # lgb needs 0 1 2 for multiclass values!
> cldTrainY =  replace(cldTrainY, cldTrainY == 3, 2)
> cldTrainX
# A tibble: 14,083 x 10
   surfcat  dret caffeine vitamina indometh ventdays srgany nitrico   ap5 gaweeks
     <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>   <dbl> <dbl>   <dbl>
 1       0     0        1        0        0        0      0       0     9      30
 2       1     1        1        0        0      103      1       1     5      28
 3       3     1        1        0        0        2      0       0     6      26
 4       0     0        1        0        0       10      1       0     5      29
 5       0     0        1        0        1        0      0       0     9      29
 6       0     0        1        0        0        3      1       0     6      25
 7       0     0        1        0        0        0      0       0     8      25
 8       2     0        1        0        0       10      0       0     7      28
 9       1     1        1        0        1        6      0       0     2      28
10       0     0        1        0        0        0      0       0     9      31
# ... with 14,073 more rows
> lgbtrain <- lgb.Dataset(cldTrainX, label=cldTrainY, categorical_feature = c(1L, 2L, 3L, 4L, 5L, 7L, 8L), 
+             params=list(feature_pre_filter = FALSE))
> lgb.Dataset.construct(lgbtrain)
Error in dataset$construct() : 
  lgb.self.get.handle: supplied a too large value in categorical_feature: 8 but only 0 features
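
A small base-R sketch of why the as.matrix() conversion can help. A plain data.frame stands in for the tibble here, on the assumption that the relevant property is the same (neither is a matrix); the values are the first three rows and columns of the table above. The "0 features" in the error is consistent with column names not being picked up from non-matrix input.

```r
# Stand-in for the tibble (assumption: like a tibble, a data.frame is not a matrix).
df <- data.frame(
    surfcat = c(0, 1, 3)
    , dret = c(0, 1, 1)
    , caffeine = c(1, 1, 1)
)

# The data frame has column names but is not a matrix.
print(is.matrix(df))

# as.matrix() yields a numeric matrix and keeps the column names.
m <- as.matrix(df)
print(is.matrix(m))
print(colnames(m))

# A largest categorical index of 3 now fits within the matrix's columns.
max_index <- 3L
print(max_index <= ncol(m))
```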

StrikerRUS pushed a commit that referenced this issue Apr 30, 2022
…ata does not have column names (fixes #4374) (#5184)

* check for number of columns if data is matrix for categorical indices check

* check for error when using a greater index than the number of columns

* apply suggestion

Co-authored-by: James Lamb

* revert whitespace change

* check if is filename instead of matrix

Co-authored-by: James Lamb
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023