Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lightgbm crash #2820

Closed
frank-dong-ms-zz opened this issue Feb 25, 2020 · 37 comments · Fixed by #2876
Closed

lightgbm crash #2820

frank-dong-ms-zz opened this issue Feb 25, 2020 · 37 comments · Fixed by #2876

Comments

@frank-dong-ms-zz
Copy link

frank-dong-ms-zz commented Feb 25, 2020

Our team (https://github.com/dotnet/machinelearning) is using lightgbm (version 2.2.3) and sometimes lightgbm crashes the test process to LGBM_BoosterUpdateOneIter method call when running on CI, mainly on windows 10 dotnet core 2.1.

I tried to capture memory dump but seems lightgbm dll don't have symbol built (Binary was not built with debug information), so I'm wondering how to build lightgbm dll with debug information so I can investigate further, also will upgrade to the latest version resolve the issue?

This issue is happening randomly so I'm not able to reproduce a stable repro.

Exception looks like below:
image

The error call stack looks like below:
image

Any help will be appreciated, thanks

@guolinke
Copy link
Collaborator

Hi @frank-dong-ms,
Firstly, you can try the latest master branch, and see what happens.
In addition, I can create a PR to easily compile the DLL with debug flags, so that we can get meaningful information.

@guolinke guolinke mentioned this issue Feb 26, 2020
1 task
@frank-dong-ms-zz
Copy link
Author

@guolinke Thanks for your reply, I tried to use latest nuget version and the issue still exists.
Can you please provide dll with debug flags so I can get more information, thanks

@guolinke
Copy link
Collaborator

DEBUG_DLL.zip

@frank-dong-ms-zz
Copy link
Author

@guolinke thanks, is this dll you provided debug version? This crash only appears in Release version, can you provide Release version as well?

@guolinke
Copy link
Collaborator

DLL.zip

The release version.

Does the crash only appear in the Release version? That is strange...

@frank-dong-ms-zz
Copy link
Author

frank-dong-ms-zz commented Feb 28, 2020

@guolinke Thanks so much for quick response. Yes, the crash only appear in Release build, I tried to build in Debug build there is no crash..

I got several repro with the dll you provided and I paste the call stack below, they seems different issues:

Issue1:

Exception:
Unhandled exception at 0x00007FFEDBC9C0DD (lib_lightgbm.DLL) in dotnet.exe.13236.dmp: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF. occurred

Call Stack:
image

Issue2:

Exception:
Unhandled exception at 0x00007FFF3878B049 (ntdll.dll) in dotnet.exe.7588.dmp: 0xC0000374: A heap has been corrupted (parameters: 0x00007FFF387F27F0). occurred

Call Stack:
image

Issue3:

Exception:
Unhandled exception at 0x00007FFEE9223B8D (coreclr.dll) in dotnet.exe.3768.dmp: 0xC0000005: Access violation. occurred

Call Stack:
image

All these exception happens in multi-thread scenario, so are these API thread safe?

@guolinke
Copy link
Collaborator

guolinke commented Feb 28, 2020

did you mean the lib_lightgbm.dll is called by multi-thread in C#?
I check the lines that throw the error, and don't think there are problems.
The SplitInfo is a simple C struct, no contains any pointers.

One possible reason is, the num_threads of LightGBM is changed (become larger) during training. Can you check this?

BTW, for the Debug mode, did you set Both ml.net and LightGBM to debug mode? or only LightGBM?

@frank-dong-ms-zz
Copy link
Author

I made update to #1 issue, I think I pasted wrong call stack.

For #1 issue, LGBM_DatasetPushRowsByCSR will be called multi-thread in C# by different test case, will this likely to cause the issue?

How to set num_threads of LightGBM? I don't find a place ml.net set this.

And yes, I set both ml.net and lightGBM to Debug mode. We only observing crash on Release build from our past CI builds.

@guolinke
Copy link
Collaborator

guolinke commented Feb 28, 2020

Could you try to use the release mode for ml.net, but debug mode for LightGBM?

@guolinke
Copy link
Collaborator

the error in LGBM_DatasetPushRowsByCSR seems related to the num_threads changed in traning as well.

@guolinke
Copy link
Collaborator

guolinke commented Feb 28, 2020

I check the code of LGBM_DatasetPushRowsByCSR, I think it is not thread-safe, for one Dataset.
And most Dataset related APIs are not thread-safe.

Also refer to https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L956-L958 ,

which has a lock outside.

@guolinke
Copy link
Collaborator

Num_threads parameter: https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L150

You can also check whether the error code is called by multi-threading outside or not.

@frank-dong-ms-zz
Copy link
Author

Could you try to use the release mode for ml.net, but debug mode for LightGBM?

Yes, I tried, no repro.

@frank-dong-ms-zz
Copy link
Author

the error in LGBM_DatasetPushRowsByCSR seems related to the num_threads changed in traning as well.

num of threads is set at start of tests as 1 and not changed during test running.

@guolinke
Copy link
Collaborator

@frank-dong-ms are these tests run by multi-threading? And can you ensure num_threads is 1 in all test cases? (the default is all cores, not 1)

If there are many tests run at the same time, and some of them use the different num_threads, the num_threads could be changed, and cause the error.

Yes, I tried, no repro

Do you mean that it can pass?

@frank-dong-ms-zz
Copy link
Author

If there are many tests run at the same time, and some of them use the different num_threads, the num_threads could be changed, and cause the error.

I tried to hard code num of threads to 1 in ml.net code and rerun tests, the #1 issue is also reproduced with slightly different call stack:
image

Do you mean that it can pass?

Yes, pass as long as I use debug version of lightgbm dll

@frank-dong-ms-zz
Copy link
Author

This seems like various issue exists here, I feel like memory is corrupted during test running and cause issue when lightgbm try to access some internal data structure

@guolinke
Copy link
Collaborator

And set num of threads to 1 eliminates other issues, right?

@frank-dong-ms-zz
Copy link
Author

@frank-dong-ms did you lock the PushRows in dataset like https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L956-L958 ?

Yes, I checked the current codebase seems all PushRows has lock.

And set num of threads to 1 eliminates other issues, right?

No, I tried to hard code num of threads to 1 in our source code and crash still exists.

Can I know which commit you are building lightgbm dll from (the one you give me) or how can I build lightgbm dll myself? I'm using application verifier to troubleshoot the crash and I need correct version of lightgbm source code to view the correct call stack

@guolinke
Copy link
Collaborator

guolinke commented Mar 3, 2020

@guolinke
Copy link
Collaborator

guolinke commented Mar 3, 2020

for the dll, you can choose DLL for Release, and Debug_DLL for debug.

@frank-dong-ms-zz
Copy link
Author

@frank-dong-ms refer to https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#with-gui

Can I get dll from visual studio build? I can only see exe generated

@guolinke
Copy link
Collaborator

guolinke commented Mar 3, 2020

use DLL or Debug_DLL config, and then build, the dll is in the corresponding folder.

@frank-dong-ms-zz
Copy link
Author

frank-dong-ms-zz commented Mar 4, 2020

@guolinke I have below finding, please see whether this is bug in lightgbm.

We are setting num of threads to 1 when running test, so share_state_->num_threads should be 1 in https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L414.
In this https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L424 the number of threads should be number of core on vm that running (in our case number of core is 2) so tid can be 0 or 1 from https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L430.
when tid equals 1, we are accessing index that is out of boundary at line https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L441.

I have tried to add below assertion at line 438 and this assertion fails during test run:
assert(tid < share_state_->num_threads);

Could you please take a look or I'm misunderstanding anything, thanks.

@guolinke
Copy link
Collaborator

guolinke commented Mar 4, 2020

@frank-dong-ms I don't think this is a bug in LightGBM, as we never saw it before.

The num_threads is got by omp_get_num_threads(), it should be correct if there are no changes during tests.
https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/utils/openmp_wrapper.h#L22

did you use other modules that relies on open_mp too?
I am not sure what happens, but maybe the omp_num_threads is set by other modules.
Maybe you can try only run tests for LightGBM, without the effect of other modules.

@frank-dong-ms-zz
Copy link
Author

frank-dong-ms-zz commented Mar 4, 2020

The num_threads is got by omp_get_num_threads(), it should be correct if there are no changes during tests.
https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/utils/openmp_wrapper.h#L22

Yes, I can see num_threads is 1 and is correct, but how many threads will be use in below line? And is it possible we get 1 for tid (const int tid = omp_get_thread_num())? This is what I'm seeing during my test.
https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L424

did you use other modules that relies on open_mp too?
I am not sure what happens, but maybe the omp_num_threads is set by other modules.

Yes, I believe OnnxRuntime native library that ML.NET is referencing also use open_mp during their training. I'm not familiar with open_mp lib use this lib inside OnnxRuntime will affect light gbm?

Maybe you can try only run tests for LightGBM, without the effect of other modules.

Yes, I will try as you suggested as well.

@guolinke
Copy link
Collaborator

guolinke commented Mar 4, 2020

If omp number of threads is set, it will always use that number of threads globally.

@frank-dong-ms-zz
Copy link
Author

If omp number of threads is set, it will always use that number of threads globally.

Yes, you are right, when this crash happens, omp_get_num_threads() return 2 and cause vector access index out of range, thanks

@frank-dong-ms-zz
Copy link
Author

@guolinke it is possible LightGBM not to rely on number of threads to do indexing?
If this setting of OpenMP (number of threads) is global then it is likely this setting will be override in other native library and cause issue in LightGBM code.

Also CC @harishsk

@guolinke
Copy link
Collaborator

guolinke commented Mar 5, 2020

not, for fasterst speed, we use many per-thread buffers in multi-threading.

@frank-dong-ms-zz
Copy link
Author

I see, at lease some protection can be added like to https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L355, change to something like below:
#pragma omp parallel for schedule(static) num_threads(share_state_->num_threads)

does it make sense?

@guolinke
Copy link
Collaborator

guolinke commented Mar 6, 2020

@frank-dong-ms good suggestion! I think it can work

@guolinke
Copy link
Collaborator

guolinke commented Mar 6, 2020

@frank-dong-ms , sorry, I benchmark for that solution, it seems will be much slower.
I guess there are some overheads when specific num_threads.
So we cannot adopt that fix.

updated:
never mind, it seems to keeping num_threads with the same value in all loops solve the problem.

@frank-dong-ms-zz
Copy link
Author

@guolinke Do you know when new version of lightgbm will be released with this fix?

@guolinke
Copy link
Collaborator

I think it needs about one or two months. we have many breaking changes recently, needs more time to ensure there are no bugs.

@frank-dong-ms-zz
Copy link
Author

Got it, thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants