Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mmlspark] seeing memory corruption error on LGBM_BoosterFree with latest version #2981

Closed
imatiach-msft opened this issue Apr 7, 2020 · 14 comments

Comments

@imatiach-msft
Copy link
Contributor

when trying to upgrade lightgbm to latest version on master I am seeing a random memory corruption error in LGBM_BoosterFree inside SerialTreeLearner, pasting stack trace below:

[LightGBM] [Info] Finished linking network in 0.006790 seconds

*** Error in `java': double free or corruption (!prev): 0x00007fb0ac9342a0 ***

======= Backtrace: =========

/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fb0c91467e5]

/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fb0c914f37a]

/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fb0c915353c]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM23DataParallelTreeLearnerINS_17SerialTreeLearnerEED2Ev+0x78)[0x7fb05c3915f8]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM23DataParallelTreeLearnerINS_17SerialTreeLearnerEED0Ev+0x9)[0x7fb05c391639]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM4GBDTD1Ev+0xa30)[0x7fb05c1c8c50]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM4GBDTD0Ev+0x9)[0x7fb05c1c9219]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM7BoosterD2Ev+0x5f9)[0x7fb05c4401c9]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(LGBM_BoosterFree+0xe)[0x7fb05c42ec1e]

@guolinke
Copy link
Collaborator

guolinke commented Apr 7, 2020

@imatiach-msft is that possible to locate which function/code block causes the memory corruption?

@imatiach-msft
Copy link
Contributor Author

imatiach-msft commented Apr 13, 2020

I'm currently trying to locate the function/code block. Removing the call to LGBM_BoosterFree, StringArrayHandle_free and LGBM_DatasetFree still seems to result in segfault or sigabort errors in other places, and they seem to vary from run to run. Using a single partition (instead of two) seems to fix all issues, so I suspect it's something that only happens for distributed case.

Interestingly out of all of the tests that are run I am only seeing this issue on just one test that uses the flare dataset.

@imatiach-msft
Copy link
Contributor Author

imatiach-msft commented Apr 15, 2020

Tried to look at this again this evening. This time I modified the code to use LGBM_DatasetCreateFromCSR instead of the mmlspark-optimized method LGBM_DatasetCreateFromCSRSpark. I had my hopes high that this may fix it, as I've isolated the issue to training-only APIs (not scoring), and what seems to be sparse-only code - which automatically made me think this would be caused by the relatively new method that was added for sparse code. Unfortunately, I still ran into the issue, albeit with a different stack trace, in SplitInner method:

# C  [lib_lightgbm.so+0x258491]  LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0x581

Specifically with the stack trace:

Stack: [0x00007f7592dee000,0x00007f7592eef000],  sp=0x00007f7592eebff0,  free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [lib_lightgbm.so+0x258491]  LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0x581
C  [lib_lightgbm.so+0x2b0e60]  LightGBM::DataParallelTreeLearner<LightGBM::SerialTreeLearner>::Split(LightGBM::Tree*, int, int*, int*)+0x20

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 9131  com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_BoosterUpdateOneIter(JJ)I (0 bytes) @ 0x00007f761dd41884 [0x00007f761dd41840+0x44]
J 9130 C1 com.microsoft.ml.lightgbm.lightgbmlib.LGBM_BoosterUpdateOneIter(Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;Lcom/microsoft/ml/light\
gbm/SWIGTYPE_p_int;)I (12 bytes) @ 0x00007f761d47b044 [0x00007f761d47aea0+0x1a4]
j  com.microsoft.ml.spark.lightgbm.TrainUtils$.trainCore(Lcom/microsoft/ml/spark/lightgbm/TrainParams;Lscala/Option;Lorg/slf4j/Logger;Z)V+78\
5
...

Interestingly, as I was looking through source and various commits, I stumbled on a recent issue that may be related to the memory corruption errors I am seeing:
#2820

However, the fact that the memory corruption is random, and sometimes the test works when I run, makes me wonder if it is a different issue. I've also seen the test get stuck as well.

@guolinke
Copy link
Collaborator

maybe we can try to add #pragma omp parallel for schedule(static) num_threads(share_state_->num_threads) in the omp loop in parallel tree learners?

@guolinke
Copy link
Collaborator

guolinke commented Apr 19, 2020

as our CI in the serial version didn't catch such an error, I think the bug is very possible in parallel tree learner: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/data_parallel_tree_learner.cpp . Maybe some edge cases cause the out-of-the-boundaries array access.
@imatiach-msft could you help to add some assert/check (like check the index range) to locate the bugs?

@imatiach-msft
Copy link
Contributor Author

"could you help to add some assert/check"
yes definitely
" try to add #pragma omp parallel for schedule(static) num_threads(share_state_->num_threads) in the omp loop"
added in both omp loops:
image
unfortunately it did not seem to fix the issue yet

@imatiach-msft
Copy link
Contributor Author

@guolinke I think I've made some progress on this, the issue seems to occur much earlier, as early as in the memset calls here:
https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L1241
https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L1342
I think it is related to constructing histogram for sparse data in parallel case.
Although the error can change from run to run that is the earliest I've seen it occur so far.

@guolinke
Copy link
Collaborator

@imatiach-msft , yeah, when call the std::memset, it will check the memory states, and will throw the exception if found memory errors.

BTW, maybe using the dll with debug flags will be easier to locate the problem?

@imatiach-msft
Copy link
Contributor Author

@guolinke I did add a lot of debug in this branch to track down the error:
https://github.com/imatiach-msft/LightGBM/tree/ilmat/debug-free-err
I am currently building with USE_DEBUG flag:

mkdir build ; cd build
cmake -DUSE_SWIG=ON -DUSE_DEBUG=ON ..
make -j4

I tried to modify the CMakeLists.txt by adding:
-D_GLIBCXX_ASSERTIONS
and I also tried:
add_compile_definitions(_GLIBCXX_ASSERTIONS)
but it doesn't seem to be working or it's just not catching the errors.
I tried to run valgrind on sbt (scala build/test tool) but I couldn't enter the test command due to the output from valgrind, will try to refactor the test code to a java file and run valgrind on that.
I've attached 3 sample runs in case it helps - note the errors highly vary from run to run, and are almost never the same. The earliest I've seen them was in the histogram code for sparse data so I assume it must be coming somewhere around there, even before any training, but I'm not sure.

LightGBMRun1.txt
LightGBMRun2.txt
LightGBMRun3.txt

@guolinke
Copy link
Collaborator

@imatiach-msft
I find the out of range in your log:

[LightGBM] [Warning] vector::_M_range_check: __n (which is 1) >= this->size() (which is 0)
[LightGBM] [Warning] vector::_M_range_check: __n (which is 2) >= this->size() (which is 0)
[LightGBM] [Warning] vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 1) >= this->size() (which is 0)

where did this happen?

@imatiach-msft
Copy link
Contributor Author

@guolinke yes, that is the first time I've seen that specific error, the errors change every time when I run the tests - I suspect this is caused by memory corruption further upstream. Unfortunately I don't have the stack for that error anymore, but I can send one when/if I see a similar error.

@imatiach-msft
Copy link
Contributor Author

imatiach-msft commented May 22, 2020

@guolinke I have some great news! I was able to resolve this issue, but only after pivoting in my debugging strategy several times. I had a lot of trouble narrowing down the cause of the issue - I added a lot of std::cout debug but it did not help, partly because the errors that appeared were always very random. After struggling with getting valgrind to work with Java (and spark), I gave up on this approach. I then struggled to use address sanitizer with mmlspark to narrow down the problem, but still could not get it to work. I came up with a new strategy to reproduce the issue in native-only mode, without using mmlspark at all. After a lot of parameter tweaking, exporting data from the test in mmlspark and intense debugging, I was able to reproduce the issue! Not only that, I was able to run valgrind (without the issues I had with running it against Java), and I was able to see the following invalid write:

IM%% input_buffer_ size before SyncUpGlobalBestSplit: 96
%%IM SplitInfo size: 222
==4422== Invalid write of size 4
==4422==    at 0x63B1E2: memcpy (string3.h:53)
==4422==    by 0x63B1E2: CopyTo (split_info.hpp:56)
==4422==    by 0x63B1E2: SyncUpGlobalBestSplit (parallel_tree_learner.h:195)
==4422==    by 0x63B1E2: LightGBM::DataParallelTreeLearner<LightGBM::SerialTreeLearner>::FindBestSplitsFromHistograms(std::vector<signed char, std::allocator<signed char> > const&, bool) (data_parallel_tree_learner.cpp:242)
==4422==    by 0x631AD4: LightGBM::DataParallelTreeLearner<LightGBM::SerialTreeLearner>::FindBestSplits() (data_parallel_tree_learner.cpp:175)
==4422==    by 0x6543A2: LightGBM::SerialTreeLearner::Train(float const*, float const*) (serial_tree_learner.cpp:184)
==4422==    by 0x493247: LightGBM::GBDT::TrainOneIter(float const*, float const*) (gbdt.cpp:373)
==4422==    by 0x48D6CF: LightGBM::GBDT::Train(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (gbdt.cpp:252)
==4422==    by 0x45C572: LightGBM::Application::Train() (application.cpp:203)
==4422==    by 0x45B9B0: Run (application.h:86)
==4422==    by 0x45B9B0: main (main.cpp:15)

I found out that input_buffer_ was too small before calling SyncUpGlobalBestSplit, which caused the memory overwrite. Inserting the following code prior to calling SyncUpGlobalBestSplit fixed the memory issues for me:

  auto max_cat_threshold = this->config_->max_cat_threshold;
  int splitInfoSize = SplitInfo::Size(max_cat_threshold);
  if (input_buffer_.size() < splitInfoSize * 2) {
      input_buffer_.resize(splitInfoSize * 2);
  }
  SyncUpGlobalBestSplit(input_buffer_.data(), input_buffer_.data(), &smaller_best_split, &larger_best_split, this->config_->max_cat_threshold);

I validated this fixed both the native example and, after exporting the code to mmlspark, the mmlspark test. It is very satisfying to finally be able to resolve this memory corruption issue!

You can find the branch here:
https://github.com/imatiach-msft/LightGBM/commits/ilmat/debug-free-err

The native LightGBM example that reproduced the issue is here:
https://github.com/imatiach-msft/LightGBM/tree/ilmat/debug-free-err/examples/memory_debug

To reproduce, I ran the following commands on two terminals within the memory_debug folder:
valgrind ./lightgbm config=train.conf1
valgrind ./lightgbm config=train.conf2

Make sure to remove the fix I added (mentioned above) and recompile in order to reproduce the error.

@StrikerRUS
Copy link
Collaborator

Closed via #3110.

Thank you very @imatiach-msft !

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants