lightgbm crash #2820
Comments
Hi @frank-dong-ms, |
@guolinke Thanks for your reply. I tried the latest NuGet version and the issue still exists.
@guolinke Thanks. Is the DLL you provided a Debug build? This crash only appears in the Release build; can you provide a Release build as well?
The release version. Does the crash only appear in the Release version? That is strange... |
@guolinke Thanks so much for the quick response. Yes, the crash only appears in the Release build; when I build in Debug there is no crash. I got several repros with the DLL you provided and pasted the call stacks below; they seem to be different issues: Issue1: Exception: Issue2: Exception: Issue3: Exception: All these exceptions happen in multi-threaded scenarios, so are these APIs thread-safe?
did you mean the One possible reason is, the BTW, for the |
I updated Issue1; I think I pasted the wrong call stack. For Issue1, LGBM_DatasetPushRowsByCSR is called from multiple threads in C# by different test cases; is this likely to cause the issue? How do I set num_threads for LightGBM? I can't find where ML.NET sets it. And yes, I set both ML.NET and LightGBM to Debug mode. We have only observed the crash on Release builds in our past CI runs.
Could you try to use the release mode for ml.net, but debug mode for LightGBM? |
the error in |
I check the code of which has a lock outside. |
num_threads parameter: https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L150 You can also check whether the failing code is being called from multiple threads on the caller's side.
Yes, I tried, no repro. |
The number of threads is set to 1 at the start of the tests and is not changed while they run.
@frank-dong-ms are these tests run by multi-threading? And can you ensure If there are many tests run at the same time, and some of them use the different
Do you mean that it can pass? |
I hard-coded the number of threads to 1 in the ML.NET code and reran the tests; Issue1 still reproduces, with a slightly different call stack:
Yes, the tests pass as long as I use the Debug version of the LightGBM DLL.
There seem to be several issues here. I suspect memory is being corrupted during the test run, which then causes failures when LightGBM tries to access some internal data structure.
@frank-dong-ms did you lock the PushRows in dataset like https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L956-L958 ? |
And setting the number of threads to 1 eliminates the other issues, right?
Yes, I checked the current codebase and all PushRows calls appear to hold the lock.
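For readers following the thread-safety question, the locking pattern being discussed can be sketched as below. This is only an illustration under stated assumptions: `RowSink` and its methods are hypothetical stand-ins, not the LightGBM C API or the ML.NET wrapper, and the sketch simply shows one mutex serializing every call that mutates shared native state.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a native dataset handle. This is NOT the LightGBM
// C API; it is only an object that would be unsafe to mutate from two threads
// at the same time without a lock.
class RowSink {
 public:
  // Appends values while holding the mutex, so concurrent callers are
  // serialized instead of interleaving their writes.
  void PushRows(const std::vector<float>& rows) {
    std::lock_guard<std::mutex> guard(mutex_);
    data_.insert(data_.end(), rows.begin(), rows.end());
  }

  // Reads also take the lock so they never observe a half-finished insert.
  std::size_t Size() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return data_.size();
  }

 private:
  mutable std::mutex mutex_;  // guards data_
  std::vector<float> data_;
};
```

A lock like this only protects the state it wraps. It does not help if the crash comes from state shared across the whole process, such as a global thread-count setting.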
No, I hard-coded the number of threads to 1 in our source code and the crash still occurs. Can you tell me which commit the LightGBM DLL you gave me was built from, or how I can build the LightGBM DLL myself? I'm using Application Verifier to troubleshoot the crash, and I need the matching version of the LightGBM source code to view the correct call stack.
for the |
Can I get a DLL from the Visual Studio build? I only see an .exe being generated.
use |
@guolinke I have the finding below; please check whether this is a bug in LightGBM. We set the number of threads to 1 when running tests, so share_state_->num_threads should be 1 in https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L414. I added the assertion below at line 438, and it fails during the test run: Could you please take a look, or tell me if I'm misunderstanding something? Thanks.
@frank-dong-ms I don't think this is a bug in LightGBM, as we have never seen it before. The did you use other modules that rely on OpenMP too?
Yes, I can see num_threads is 1, which is correct, but how many threads will be used in the line below? And is it possible to get 1 for tid (const int tid = omp_get_thread_num())? That is what I'm seeing during my test.
Yes, I believe the OnnxRuntime native library that ML.NET references also uses OpenMP during its training. I'm not familiar with OpenMP; will its use inside OnnxRuntime affect LightGBM?
Yes, I will try as you suggested as well. |
If the OpenMP number of threads is set, that number of threads is always used globally.
Yes, you are right. When this crash happens, omp_get_num_threads() returns 2, which causes an out-of-range vector index. Thanks.
No, for the fastest speed we use many per-thread buffers in multi-threaded code.
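To make the failure mode concrete: a buffer with one slot per configured thread is indexed by omp_get_thread_num(), so if the OpenMP runtime runs the region with more threads than were configured (for example because another library changed the process-wide setting), the index runs past the end of the buffer. The following is a minimal sketch, not LightGBM's code; `SumWithPerThreadBuffers` and `configured_threads` are hypothetical names chosen for illustration. It pins the team size with the `num_threads` clause so the thread id stays in range, and it falls back to a single thread when compiled without OpenMP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not LightGBM code): sums `values` using one
// partial-sum slot per thread. `configured_threads` plays the role of the
// num_threads setting discussed in this thread.
double SumWithPerThreadBuffers(const std::vector<double>& values,
                               int configured_threads) {
  assert(configured_threads >= 1);
  // One slot per configured thread, analogous to a per-thread buffer.
  std::vector<double> partial(static_cast<std::size_t>(configured_threads), 0.0);
  const long n = static_cast<long>(values.size());

  // The num_threads clause caps the team size for this region, so the thread
  // id is always below partial.size(), even if some other library called
  // omp_set_num_threads() with a larger value.
  #pragma omp parallel for schedule(static) num_threads(configured_threads)
  for (long i = 0; i < n; ++i) {
#ifdef _OPENMP
    const int tid = omp_get_thread_num();
#else
    const int tid = 0;  // compiled without OpenMP: the loop runs serially
#endif
    partial[static_cast<std::size_t>(tid)] += values[static_cast<std::size_t>(i)];
  }

  // Combine the per-thread partial sums on the calling thread.
  double total = 0.0;
  for (double p : partial) total += p;
  return total;
}
```

Without the `num_threads` clause, the region would use whatever thread count is currently set for the process, which is the situation described above where the runtime reported 2 threads while the buffers were sized for 1.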
I see. At least some protection could be added, for example at https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L355, changing it to something like the below: Does that make sense?
@frank-dong-ms good suggestion! I think it can work |
@frank-dong-ms, sorry, I benchmarked that solution and it seems to be much slower. updated:
@guolinke Do you know when a new version of LightGBM with this fix will be released?
I think it needs about one or two months. We have made many breaking changes recently and need more time to ensure there are no bugs.
Got it, thanks! |
Our team (https://github.com/dotnet/machinelearning) is using LightGBM (version 2.2.3), and sometimes LightGBM crashes the test process in the LGBM_BoosterUpdateOneIter method call when running on CI, mainly on Windows 10 with .NET Core 2.1.
I tried to capture a memory dump, but the LightGBM DLL does not seem to have symbols (Binary was not built with debug information). How can I build the LightGBM DLL with debug information so I can investigate further? Also, would upgrading to the latest version resolve the issue?
This issue happens randomly, so I have not been able to get a stable repro.
Exception looks like below:
The error call stack looks like below:
Any help will be appreciated, thanks