[R-package] Using forcedsplits parameter causes wild inaccuracies and crashes #4591
Comments
Thanks very much for using LightGBM. I've edited the formatting of your original post to make it a bit easier to read. If you are new to GitHub, please consider reading through https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax to learn how to use GitHub-flavored markdown to format posts here.
@Sinnombre can you please try updating to the latest development version of the R package, built from master?

```sh
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz
```

I tried your reproducible example (thanks very much for providing that!!) and found that for lightgbm 3.2.1 from CRAN, I could reproduce the problem you described.
When I use the version of the R package built from latest master, your example runs without that error. There have been a lot of fixes to LightGBM since 3.2.1 was released in April. I recommend subscribing to #4310 to be notified when release 3.3.0 comes out. We apologize for the inconvenience.
This warning is raised whenever LightGBM stops growing a tree before other tree-specific stopping conditions (like num_leaves or max_depth) are met.
Since you weren't able to produce a reproducible example for this, it's difficult for me to say with confidence what is happening, unfortunately. But could you please try running the code that produced that result using the R package built from latest master?
Hi James, thanks for your quick reply. I installed the latest version from master, and while it did fix the simple scenario, I still get crashes with the larger test case. It turns out I can share the data though (since it's anonymized), so please see the attached zip. It works fine when line 60 (the ForcedSplits parameter) is commented out, but with it the crash still occurs.

drive link: https://drive.google.com/file/d/1JsP7uEx09d2JQxiST6byk9YxdnKNOAsY/view?usp=sharing

Also, in cases that work I frequently get the message:
Please also provide the exact code you're using to train on this data if you'd like me to test it.
There is not a parameter you can use to suppress this warning. It comes from this point in the source code: LightGBM/src/treelearner/feature_histogram.hpp, lines 605 to 611 (at commit 346f883).
That is part of a method that is eventually called at the beginning of training for each tree.
So I believe (@shiyu1994 or @StrikerRUS please correct me if I'm wrong) that it would be more accurate to say that that warning means
If I'm right about that, it means that the
I believe the code file on the Google Drive works entirely on its own (with the two common libraries), does it not?

I would also like clarification on your last point there; it seems improbable to me that, given the number of features and the fact that 'improvement' splits keep being found for tens of thousands of iterations without the forced splits, adding the forced split would result in NO candidates that improve gain at all after only a couple of iterations. I can definitely see it not being optimal, or even being the case that simply taking out that initial forced split would improve the gain, but that's kinda the point; 'improved gain' in this case is likely coming from overfitting. Requiring a forced split is basically saying 'treat these as separate problems based on this feature'; presumably the user has a reason for doing so? I guess I see two use cases for ForcedSplits: either telling the learner 'hey, I have insight into the features and I think you will get the best results trying this first,' or telling it 'hey, I know my data is biased, so you will overfit if you don't do this.'

Anyway, thanks again for looking into this, and for your insight into how the learner works!
Ah, I wasn't clear; this is not what I mean or what I think the code is doing. I think it's very possible to see this warning and behavior in situations where I believe LightGBM is saying "if the tree stopped growing after this forced split (its nodes became leaf nodes), would the gain compared to a tree which stopped growing before this split be greater than the minimum gain required for a split?"

There isn't a pruning process in LightGBM where all combinations of splits are tried and then LightGBM picks the best complete sequence. Splits are added one at a time, based on which split provides the best gain. (https://lightgbm.readthedocs.io/en/latest/Features.html#leaf-wise-best-first-tree-growth)

My interpretation of the code path generating that warning above is that tree growth in LightGBM works roughly like the sketch below:
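Here is a rough, illustrative base-R sketch of leaf-wise ("best-first") growth as described in the linked docs. It is not LightGBM's actual implementation: histogram binning, hessians, regularization, categorical features, and forced splits are all omitted, and the gain is just the reduction in squared error.

```r
set.seed(1)
x <- runif(200)
y <- sin(6 * x) + rnorm(200, sd = 0.1)

# Best squared-error split of a single leaf (a leaf is the set of row indices it owns).
best_split <- function(rows) {
  xs <- sort(unique(x[rows]))
  if (length(xs) < 2) return(list(gain = -Inf))
  base_err <- sum((y[rows] - mean(y[rows]))^2)
  best <- list(gain = -Inf)
  for (t in head(xs, -1) + diff(xs) / 2) {   # candidate thresholds between adjacent values
    l <- rows[x[rows] <= t]
    r <- rows[x[rows] > t]
    gain <- base_err - sum((y[l] - mean(y[l]))^2) - sum((y[r] - mean(y[r]))^2)
    if (gain > best$gain) best <- list(gain = gain, left = l, right = r)
  }
  best
}

num_leaves <- 8
min_gain_to_split <- 0
leaves <- list(seq_along(x))                  # start from a single root leaf
while (length(leaves) < num_leaves) {
  candidates <- lapply(leaves, best_split)    # best candidate split for every current leaf
  i <- which.max(vapply(candidates, function(s) s$gain, numeric(1)))
  if (candidates[[i]]$gain <= min_gain_to_split) break  # stop growing early (the warning scenario)
  leaves <- c(leaves[-i], list(candidates[[i]]$left), list(candidates[[i]]$right))
}
length(leaves)  # can be smaller than num_leaves if growth stopped early
```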
If I'm right about that (let's see if another maintainer confirms that, I'm not as knowledgeable as some others here 😬 ), then I think there's definitely an opportunity to improve the documentation on this!
Totally makes sense to me! But I think that the fact that LightGBM uses leaf-wise growth (what XGBoost refers to as lossguide growth) complicates that second use case.

Imagine, for example, that you have a forced split which always sends 90% of samples to the left of the first split and 10% to the right. For a wide range of loss functions and depending on the distribution of the target, I think LightGBM is going to tend to prefer splits on the left side, because they'll offer a larger total gain. And as a result, tree growth might hit tree-specific stopping conditions (like num_leaves) before it has explored the right side very much.

If you want to try to train a LightGBM model to work on two problems, you might find that you have greater control by writing a custom objective function. You can see the sketch below for one way this might look in the R package.
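A minimal sketch of a custom objective, assuming the lightgbm 3.x R API (getinfo() to read the label, lgb.train()'s `obj` argument for the objective). The weighting rule, threshold, and dataset are placeholders, not a recommendation.

```r
library(lightgbm)

# Weighted squared-error objective: up-weight the rare high-valued rows so the
# model pays more attention to them (the threshold and weights are placeholders).
weighted_l2 <- function(preds, dtrain) {
  y <- getinfo(dtrain, "label")
  w <- ifelse(y > 20, 10, 1)
  grad <- w * (preds - y)   # gradient of 0.5 * w * (preds - y)^2
  hess <- w                 # hessian of the same loss
  list(grad = grad, hess = hess)
}

dtrain <- lgb.Dataset(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
model <- lgb.train(
  params = list(num_leaves = 7L, learning_rate = 0.1, min_data_in_leaf = 5L)
  , data = dtrain
  , nrounds = 50L
  , obj = weighted_l2
)
```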
If you just want to control overfitting generally, you can try some of the other suggestions at https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#deal-with-over-fitting
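For instance, a hedged sketch of the kinds of parameters that page suggests tuning; the values below are placeholders, not recommendations.

```r
# Placeholder values only -- see the linked Parameters-Tuning page for guidance.
params <- list(
  objective = "regression"
  , num_leaves = 31L          # cap tree complexity (31 is the default)
  , min_data_in_leaf = 100L   # require more samples per leaf
  , feature_fraction = 0.8    # subsample features for each tree
  , bagging_fraction = 0.8    # subsample rows
  , bagging_freq = 1L         # re-sample rows every iteration
  , lambda_l2 = 1.0           # L2 regularization on leaf values
)
```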
Oh sorry! I just expected that file to contain data. I'll take a look later tonight or tomorrow and see if I can reproduce the crash you ran into. Is it ok for me to re-post the code you've provided here? We have a strong preference for posting code in plaintext (not links to external services) so it's usable by others who find this issue from search engines in the future.
Yeah, it's fine to repost the code. Thanks for the detailed explanation, that does make sense. The part that confuses me, though, is that forced splits are the first splits the tree makes. So either my forced split to two leaves improves the gain vs. a tree consisting of just one leaf, in which case I should never get the warning, or my forced split results in a two-leaf tree with worse gain than the initial one-leaf tree, in which case I should always get the warning. But in practice I don't get the warning for the first several iterations; then it starts showing up somewhere down the line. If my understanding of what you were saying is correct, I don't see how this makes sense?

Also, at least for me, it's not a problem that the learner focuses more on the more populated side of the tree; that's exactly what I would want: to use the time and memory budget I allot it to optimize the most impactful sections.
I think you're missing an important point that I didn't include in previous posts because it's implicit in the use of LightGBM. Because LightGBM is a gradient boosting library, you can't safely assume that a specific split's gain will be the same across all iterations. Each additional tree is fit to explain the errors of the model up to that point (something like "residuals between the true value of the target and the predicted value you'd get from the model in its current state"). If you haven't seen it, XGBoost's docs have an excellent tutorial on how the boosting process works: https://xgboost.readthedocs.io/en/latest/tutorials/model.html#.

That is how you can get the behavior of "I don't see this warning for the first few iterations but then it shows up in later iterations".

Also, I want to be sure it's clear...I'm not saying that using a single forced split means all your trees will be either one leaf (0 splits) or two leaves (1 split). Just explaining that in the way LightGBM grows trees, it adds splits one at a time.
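To illustrate, here is a tiny self-contained base-R sketch of the boosting idea (stumps fit to residuals, not LightGBM itself): because each round fits the residuals of the model so far, the best split and its gain change from iteration to iteration.

```r
set.seed(42)
x <- runif(500)
y <- ifelse(x > 0.5, 10, 1) + rnorm(500)

pred <- rep(mean(y), length(y))   # round 0: a constant prediction
for (iter in 1:5) {
  resid <- y - pred               # pseudo-residuals for squared-error loss
  # fit a one-split "tree" (a stump) to the residuals
  thresholds <- seq(0.05, 0.95, by = 0.05)
  sse <- sapply(thresholds, function(t) {
    sum((resid[x <= t] - mean(resid[x <= t]))^2) +
      sum((resid[x > t] - mean(resid[x > t]))^2)
  })
  t_best <- thresholds[which.min(sse)]
  update <- ifelse(x <= t_best,
                   mean(resid[x <= t_best]),
                   mean(resid[x > t_best]))
  pred <- pred + 0.5 * update     # shrinkage, analogous to learning_rate
  cat(sprintf("iter %d: best split at x <= %.2f, RMSE = %.3f\n",
              iter, t_best, sqrt(mean((y - pred)^2))))
}
```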
If you aren't trying to achieve the behavior of "train one model which performs similarly well on different parts of my training data's distribution" and just want to produce a model that provides the most accurate predictions of the target overall, then you shouldn't use forced splits at all.
Ok @Sinnombre, I was able to reproduce the errors you saw running the code you provided in #4591 (comment). Thanks very, very much for that! Specifically, running your provided code with your provided data, training regularly failed with the following error:
I ran this using R 4.1.0 on my Mac, with the R package built from latest master.

It looks like you've uncovered a pretty challenging bug in LightGBM! And I suspect it affects LightGBM's core library, not only the R package. I was able to create a reproducible example for it in R using only a built-in dataset.

For now, if you want to use forced splits, the only reliable way I've found to avoid the error you hit is to set
By the way (unrelated to this issue), I noticed this in your sample code:

```r
train_dt = data.table(read.csv("TrainingData.csv"))
```

I think you'll find that it's much faster to use

```r
train_dt = data.table::fread("TrainingData.csv")
```
Dang sounds like a fix will be a while out then. Thanks for looking into it, and for answering my other questions!
The issue with this is the bias. The training data has hundreds of thousands of examples with sales between 0-5 per week, and a few dozen with weekly sales in the thousands. If I weight towards the high-performing items, the model massively overestimates the sales of the low performers. If I don't weight aggressively towards the high numbers, then the model massively underestimates sales of the high performers; this resulted in a very good RMSE, but when I compare sum(predictions) to sum(labels) it's like 70%, due to the underestimation of the high sellers.

My goal with all this is to split the data into effectively separate trees, trained on just high- or low-performing data, but preserving the settings such that total training time and model size are constrained by one set of parameters.
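For reference, a hedged sketch of how per-row weights can be passed through the R package, assuming lgb.Dataset() accepts label/weight this way as in lightgbm 3.x; the data and the weighting rule are placeholders, not a recommendation.

```r
library(lightgbm)

set.seed(1)
n <- 1000
X <- matrix(rnorm(n * 5), ncol = 5)    # random-noise features; only the weight plumbing matters here
y <- c(rep(1, 950), rep(500, 50))      # toy target: mostly low sellers, a few high sellers
w <- ifelse(y > 20, 5, 1)              # placeholder weighting rule

dtrain <- lgb.Dataset(data = X, label = y, weight = w)
model <- lgb.train(
  params = list(objective = "regression", metric = "rmse", num_leaves = 15L)
  , data = dtrain
  , nrounds = 20L
)
```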
Makes sense, makes sense. That type of use-case was in fact even mentioned as the reason for adding the
But I think the key point there is "combined with setting appropriate weights". As I think you're seeing, it can be difficult to set up the right combination of weights + tree growth parameters like
Going to close this, as it seems there are no remaining open questions. Thanks very much for raising this issue and providing some sample code, as it helped us to identify a bug (#4601)!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Description
Maybe I don't understand the function of this parameter, but I am having a great deal of trouble using it.
I'm working on forecasting sales using LightGBM in R. In the data I have (which unfortunately I am unable to share), the overwhelming majority of items sell 0-1 per week, with about 0.3% outliers with weekly sales averaging >20, some going into the 1000s. I observed that separating the data into three training runs, for high, medium, and low performers, resulted in substantially better accuracy. From my understanding of LightGBM, training three separate models based on one feature like this should be equivalent to forcing the first two steps of the decision trees to split based on that feature, so I looked into this and found the forcedsplits_filename parameter.
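For reference, a hedged sketch of what such a forced-splits setup might look like: forcedsplits_filename points at a JSON file whose nodes give a (zero-based) feature index and a threshold, with nested "left"/"right" entries forcing further splits. The feature index and thresholds below are placeholders, not the values from my actual data.

```r
# Root split on (zero-based) feature index 3 at threshold 20; the higher side is
# split again at 100. Placeholder numbers only.
writeLines('
{
  "feature": 3,
  "threshold": 20.0,
  "right": {
    "feature": 3,
    "threshold": 100.0
  }
}
', con = "forced_splits.json")

params <- list(
  objective = "regression"
  , forcedsplits_filename = "forced_splits.json"
)
```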
However, whenever I use forcedsplits_filename, I get a huge number of warnings, frequent crashes and, even when it works, incredibly inaccurate results.
I've reproduced the crash with the example code below. The error message is:
I have determined that the crashes only occur when I include a categorical feature with high cardinality which tracks closely with the feature I'm forcing splits on (specifically the item ID). My theory is that certain splits of this feature contain no data with values above the threshold, so a tree split that takes both the ID feature and the forced-split feature into account ends up with a branch containing no samples.
Searching this forum I've found several other issues around this error, but those seem to have been resolved with the latest version and were unconnected to forcedsplits.
Reproducible example
OUTPUTS:
Environment info
R version 4.0.4 (2021-02-15)
RStudio version 1.4.1717
lightgbm version 3.2.1
Additional Comments
In addition to the crash, in my main code (the data for which I cannot share) I also get frequent instances of the warnings:
Even with the forced split, I'm not sure how it's finding best gains of -inf? I'm pretty sure the other two warnings follow from this issue.
And finally, even when the crash doesn't occur, I find results which are orders of magnitude off. I have not been able to replicate this with a simple example, but in one run I found:
-- WITH forcedsplits_filename parameter
-- WITHOUT forcedsplits_filename parameter
I don't understand why the errors are so huge. The largest label in the training data is 1729 and none are negative, so even if every leaf returned 1729 the worst RMSE should be somewhat less than that; how can any leaf in a decision tree have a value 8 orders of magnitude higher than any actual label? And why is this happening when I simply add a single forced split?