[mmlspark] seeing memory corruption error on LGBM_BoosterFree with latest version #2981
@imatiach-msft is it possible to locate which function/code block causes the memory corruption?
I'm currently trying to locate the function/code block. Removing the calls to LGBM_BoosterFree, StringArrayHandle_free and LGBM_DatasetFree still results in segfault or SIGABRT errors in other places, and they vary from run to run. Using a single partition (instead of two) seems to fix all issues, so I suspect it only happens in the distributed case. Interestingly, out of all the tests that are run, I am seeing this issue on just one test, which uses the flare dataset.
I looked at this again this evening. This time I modified the code to use LGBM_DatasetCreateFromCSR instead of the mmlspark-optimized method LGBM_DatasetCreateFromCSRSpark. I had high hopes that this would fix it, as I've isolated the issue to training-only APIs (not scoring) and to what seems to be sparse-only code, which made me think it was caused by the relatively new method that was added for sparse data. Unfortunately, I still ran into the issue, albeit with a different stack trace, in the SplitInner method:
Specifically with the stack trace:
Interestingly, as I was looking through the source and various commits, I stumbled on a recent issue that may be related to the memory corruption errors I am seeing:
However, the fact that the memory corruption is random, and that the test sometimes passes when I run it, makes me wonder if it is a different issue. I've also seen the test hang.
Maybe we can try to add
As our CI for the serial version didn't catch such an error, I think the bug is quite possibly in the parallel tree learner: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/data_parallel_tree_learner.cpp . Maybe some edge cases cause an out-of-bounds array access.
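To illustrate the kind of edge case suggested above, here is a minimal sketch of a bounds-checked write. It is not LightGBM code: `TryWrite` is a hypothetical helper, and the only library behavior relied on is that `std::vector::at()` throws `std::out_of_range` for an invalid index. The point is that a checked access fails at the faulty index, whereas an unchecked out-of-bounds write may only surface later, for example when the memory is freed.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of LightGBM): write `value` at `index`,
// reporting failure instead of silently writing past the end of the array.
inline bool TryWrite(std::vector<int>* values, std::size_t index, int value) {
  try {
    values->at(index) = value;  // at() throws std::out_of_range on a bad index
    return true;
  } catch (const std::out_of_range&) {
    return false;  // the vector is left untouched
  }
}
```

With a three-element vector, `TryWrite(&v, 2, 7)` succeeds and `TryWrite(&v, 3, 9)` returns `false` without touching memory outside the vector.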
@guolinke I think I've made some progress on this. The issue seems to occur much earlier, as early as the memset calls here:
@imatiach-msft, yeah, when std::memset is called, it checks the memory state and throws an exception if it finds memory errors. BTW, maybe using the DLL built with debug flags will make it easier to locate the problem?
@guolinke I did add a lot of debug in this branch to track down the error:
I tried modifying CMakeLists.txt by adding:
@imatiach-msft
Where did this happen?
@guolinke yes, that is the first time I've seen that specific error. The errors change every time I run the tests, so I suspect this is caused by memory corruption further upstream. Unfortunately I don't have the stack for that error anymore, but I can send one if I see a similar error again.
@guolinke I have some great news! I was able to resolve this issue, but only after changing my debugging strategy several times. I had a lot of trouble narrowing down the cause: I added a lot of std::cout debug output, but it did not help, partly because the errors that appeared were different on every run. After struggling to get valgrind to work with Java (and Spark), I gave up on that approach. I then tried to use AddressSanitizer with mmlspark to narrow down the problem, but could not get it to work either. Finally, I came up with a new strategy: reproduce the issue in native-only mode, without using mmlspark at all. After a lot of parameter tweaking, exporting data from the mmlspark test, and intense debugging, I was able to reproduce the issue! Not only that, I was able to run valgrind (without the problems I had running it against Java), and I saw the following invalid write:
I found out that input_buffer_ was too small before calling SyncUpGlobalBestSplit, which caused the memory overwrite. Inserting the following code prior to calling SyncUpGlobalBestSplit fixed the memory issues for me:
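The snippet referred to above is not reproduced in this thread, so the following is only a generic sketch of the pattern described: make sure a byte buffer is large enough before copying serialized records into it. Everything here is an assumption for illustration. `SplitRecord`, `EnsureCapacity` and `PackRecords` are hypothetical names, not LightGBM's `SplitInfo`, `input_buffer_` handling, or the actual fix merged upstream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a serialized split record (not LightGBM's SplitInfo).
struct SplitRecord {
  int feature;
  double gain;
};

// Grow `buffer` so that it can hold `count` records. Never shrinks it.
inline void EnsureCapacity(std::vector<char>* buffer, std::size_t count) {
  const std::size_t needed = count * sizeof(SplitRecord);
  if (buffer->size() < needed) {
    buffer->resize(needed);
  }
}

// Copy the records into the buffer, resizing first so that the memcpy
// cannot write past the end of the allocation.
inline void PackRecords(const std::vector<SplitRecord>& records,
                        std::vector<char>* buffer) {
  EnsureCapacity(buffer, records.size());
  if (!records.empty()) {
    std::memcpy(buffer->data(), records.data(),
                records.size() * sizeof(SplitRecord));
  }
}
```

Without the `EnsureCapacity` call, a buffer sized for an earlier, smaller payload would be overrun by the `memcpy`, which is the class of invalid write valgrind reports.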
I validated that this fixed both the native example and, after porting the change to mmlspark, the mmlspark test. It is very satisfying to finally resolve this memory corruption issue!
You can find the branch here:
The native LightGBM example that reproduced the issue is here:
To reproduce, I ran the following commands on two terminals within the memory_debug folder:
Make sure to remove the fix I added (mentioned above) and recompile in order to reproduce the error.
Closed via #3110. Thank you very much, @imatiach-msft!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
When trying to upgrade LightGBM to the latest version on master, I am seeing a random memory corruption error in LGBM_BoosterFree inside SerialTreeLearner. Pasting the stack trace below:
[LightGBM] [Info] Finished linking network in 0.006790 seconds
*** Error in `java': double free or corruption (!prev): 0x00007fb0ac9342a0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fb0c91467e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fb0c914f37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fb0c915353c]
/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM23DataParallelTreeLearnerINS_17SerialTreeLearnerEED2Ev+0x78)[0x7fb05c3915f8]
/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM23DataParallelTreeLearnerINS_17SerialTreeLearnerEED0Ev+0x9)[0x7fb05c391639]
/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM4GBDTD1Ev+0xa30)[0x7fb05c1c8c50]
/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM4GBDTD0Ev+0x9)[0x7fb05c1c9219]
/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM7BoosterD2Ev+0x5f9)[0x7fb05c4401c9]
/tmp/mml-natives484132555220552210/lib_lightgbm.so(LGBM_BoosterFree+0xe)[0x7fb05c42ec1e]
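The mangled symbols in the backtrace are easier to read once demangled. The sketch below assumes a GCC or Clang toolchain, where `abi::__cxa_demangle` from `<cxxabi.h>` is available; `Demangle` is a hypothetical wrapper name. Applied to the first lib_lightgbm.so frame, it shows that the crash is in the destructor of `DataParallelTreeLearner` instantiated with `SerialTreeLearner`, in other words in the data-parallel learner rather than the plain serial one.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical wrapper: demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if demangling fails.
inline std::string Demangle(const std::string& mangled) {
  int status = 0;
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) {
    return mangled;
  }
  std::string result(out);
  std::free(out);  // the returned buffer is allocated with malloc
  return result;
}
```

For example, `Demangle("_ZN8LightGBM23DataParallelTreeLearnerINS_17SerialTreeLearnerEED2Ev")` yields a name containing `DataParallelTreeLearner<LightGBM::SerialTreeLearner>` followed by its destructor. The `c++filt` command-line tool does the same job for a whole pasted trace.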