[R-package] Very large l2 when training model #4305

jfouyang · 2021-05-20T16:29:04Z

Description

Hi, I am using lightGBM to determine feature importances from an in-house dataset that is very sparse in nature. When training the model on this sparse dataset, I noticed that the training l2 error is very large in the order of 10^73 and the feature importance results do not agree with my domain knowledge.

I also tried running the same dataset using xgboost and the training RMSE is much smaller in the range of 0.4-0.6. Furthermore, the feature importance results make a lot more sense to me. Finally, I also compared the Gain computed from lightGBM and xgboost (see the scatter plot below) and they do not agree very well with each other. I wonder if lightGBM does any manipulation/preprocessing to the dataset which resulted in the spurious large training l2 error?

As an additional note, I ran the same feature importance code previously on the older version of lightGBM (v2.3.4) and got results that are similar to xgboost. I only started getting this weird phenomenon when I upgraded to version3+ of lightGBM.

Reproducible example

The in-house dataset testData.rds can be downloaded from here

And here is the R code:

library(Matrix)
library(ggplot2)
library(xgboost)
library(lightgbm)

# LGB portion
testData = readRDS("testData.rds")
lgbParams = list(boosting_type = "gbdt", objective = "regression",
                 learning_rate = 0.01)
inp = lgb.Dataset(data = testData[, -1], 
                  label = testData[, 1]) 
set.seed(42)
model = lightgbm(data = inp, params = lgbParams, nrounds = 1000, 
                 eval_freq = 100, verbose = 1, num_threads = 20)
oup1 = lgb.importance(model)

# XGB portion
xgbParams = list(booster = "gbtree", objective = "reg:squarederror",
                 eta = 0.01, tree_method = "hist")
inp = xgb.DMatrix(data = testData[, -1], 
                  label = testData[, 1]) 
set.seed(42)
model = xgboost(data = inp, params = xgbParams, print_every_n = 100,
                nrounds = 1000, verbose = 1, nthread = 20)
oup2 = xgb.importance(model = model)

# Compare LGB and XGB Gain and plot 
oupCompare = oup1[oup2, on = "Feature"]
ggplot(oupCompare, aes(Gain, i.Gain)) +
  geom_point() + xlab("LGB Gain") + ylab("XGB Gain") + 
  theme_classic(base_size = 24) + scale_x_log10() + scale_y_log10()

Output from lightGBM:

[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.081095 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 104040
[LightGBM] [Info] Number of data points in the train set: 43791, number of used features: 408
[LightGBM] [Info] Start training from score 114245549891896526252176302135050240.000000
[1] "[1]:  train's l2:1.36747e+73"
[1] "[101]:  train's l2:1.2623e+73"
[1] "[201]:  train's l2:1.18071e+73"
[1] "[301]:  train's l2:1.10743e+73"
[1] "[401]:  train's l2:1.04192e+73"
[1] "[501]:  train's l2:9.82944e+72"
[1] "[601]:  train's l2:9.29686e+72"
[1] "[701]:  train's l2:8.79174e+72"
[1] "[801]:  train's l2:8.3274e+72"
[1] "[901]:  train's l2:7.92462e+72"
[1] "[1000]:  train's l2:7.55389e+72"

Output from xgboost:

[1]	train-rmse:0.663716 
[101]	train-rmse:0.504467 
[201]	train-rmse:0.462098 
[301]	train-rmse:0.442670 
[401]	train-rmse:0.429328 
[501]	train-rmse:0.419056 
[601]	train-rmse:0.410899 
[701]	train-rmse:0.403912 
[801]	train-rmse:0.397756 
[901]	train-rmse:0.391985 
[1000]	train-rmse:0.386239

Comparison of Gain feature importance from xgboost vs lightGBM:

Environment info

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /home/john/miniconda3/lib/libopenblasp-r0.3.15.so

locale:
 [1] LC_CTYPE=en_SG.UTF-8       LC_NUMERIC=C               LC_TIME=en_SG.UTF-8        LC_COLLATE=en_SG.UTF-8    
 [5] LC_MONETARY=en_SG.UTF-8    LC_MESSAGES=en_SG.UTF-8    LC_PAPER=en_SG.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_SG.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.3.3   lightgbm_3.2.1  R6_2.5.0        xgboost_1.4.1.1 Matrix_1.3-2   

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1    tidyselect_1.1.0  munsell_0.5.0     colorspace_2.0-0  lattice_0.20-41   rlang_0.4.10     
 [7] fansi_0.4.2       dplyr_1.0.5       tools_4.0.3       grid_4.0.3        data.table_1.14.0 gtable_0.3.0     
[13] utf8_1.2.1        DBI_1.1.1         withr_2.4.1       ellipsis_0.3.1    digest_0.6.27     assertthat_0.2.1 
[19] tibble_3.1.0      lifecycle_1.0.0   crayon_1.4.1      farver_2.1.0      purrr_0.3.4       vctrs_0.3.6      
[25] glue_1.4.2        compiler_4.0.3    pillar_1.5.1      generics_0.1.0    scales_1.1.1      jsonlite_1.7.2   
[31] pkgconfig_2.0.3

The text was updated successfully, but these errors were encountered:

jameslamb · 2021-05-20T16:37:15Z

Thanks very much for using {lightgbm} and for the detailed write-up with a reproducible example! If no other maintainers get to it sooner, I will take a look in the next day or two.

jameslamb · 2021-05-20T19:41:35Z

Ok, I took a look.

I was able to reproduce this behavior on my system ({lightgbm} 3.2.1 installed from CRAN, R 4.05, macOS).

[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.212046 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 104040
[LightGBM] [Info] Number of data points in the train set: 43791, number of used features: 408
[LightGBM] [Info] Start training from score 5786324437318345225632202477835124736.000000
[1] "[1]:  train's l2:7.61155e+74"
[1] "[101]:  train's l2:7.44365e+74"
[1] "[201]:  train's l2:7.2954e+74"
[1] "[301]:  train's l2:7.15143e+74"
[1] "[401]:  train's l2:7.0124e+74"
[1] "[501]:  train's l2:6.87951e+74"
[1] "[601]:  train's l2:6.75179e+74"
[1] "[701]:  train's l2:6.62947e+74"
[1] "[801]:  train's l2:6.51303e+74"
[1] "[901]:  train's l2:6.39489e+74"
[1] "[1000]:  train's l2:6.28377e+74"

I then tried building {lightgbm} from latest master, and found that the problem seems to have been fixed.

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.378029 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 104040
[LightGBM] [Info] Number of data points in the train set: 43791, number of used features: 408
[LightGBM] [Info] Start training from score 0.408969
[1] "[1]:  train's l2:0.432521"
[1] "[101]:  train's l2:0.262002"
[1] "[201]:  train's l2:0.226265"
[1] "[301]:  train's l2:0.211681"
[1] "[401]:  train's l2:0.202313"
[1] "[501]:  train's l2:0.194707"
[1] "[601]:  train's l2:0.188175"
[1] "[701]:  train's l2:0.18244"
[1] "[801]:  train's l2:0.177234"
[1] "[901]:  train's l2:0.172524"
[1] "[1000]:  train's l2:0.168078"

So I'm not sure what the root cause is, but I suspect that one of the stability fixes we've made recently for the R package fixed this. Maybe one or all of these:

I'm very sorry for the inconvenience, but could you try building {lightgbm} from source on latest master and see if that solves the problem for you as well?

git clone --recursive [email protected]:microsoft/LightGBM.git
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

I'll start a separate conversation with other maintainers about doing a new release to CRAN soon.

jfouyang · 2021-05-24T13:38:57Z

Hi @jameslamb, I followed your code to install the latest version of lightGBM and I am getting exactly the same l2 training error as you posted. Thanks so much for the help and looking forward to lightGBM v3.3.0 on CRAN soon!

jameslamb · 2021-05-24T14:00:13Z

Ok great! Very sorry for the inconvenience.

Thanks again for the excellent bug report with a detailed reproducible example. Made it easy for me to test fixes.

You can subscribe to #4310 to be notified when the next release is out.

github-actions · 2023-08-23T14:37:53Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jameslamb added question r-package labels May 20, 2021

jameslamb added bug awaiting response and removed question labels May 20, 2021

jameslamb mentioned this issue May 20, 2021

release 3.3.0 #4310

Closed

21 tasks

no-response bot removed the awaiting response label May 24, 2021

jameslamb closed this as completed May 24, 2021

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R-package] Very large l2 when training model #4305

[R-package] Very large l2 when training model #4305

jfouyang commented May 20, 2021

jameslamb commented May 20, 2021

jameslamb commented May 20, 2021

jfouyang commented May 24, 2021

jameslamb commented May 24, 2021

github-actions bot commented Aug 23, 2023

[R-package] Very large l2 when training model #4305

[R-package] Very large l2 when training model #4305

Comments

jfouyang commented May 20, 2021

Description

Reproducible example

Environment info

jameslamb commented May 20, 2021

jameslamb commented May 20, 2021

jfouyang commented May 24, 2021

jameslamb commented May 24, 2021

github-actions bot commented Aug 23, 2023