This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix coredumps #18315

Merged
merged 1 commit into from
May 19, 2020


leezu
Contributor

@leezu leezu commented May 14, 2020

Stop MXNet from trapping the segfault signal, which prevented coredumps from being generated.
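The difference between a handler that swallows a fatal signal and one that lets the process die from a signal can be sketched in plain Python. This is not MXNet's handler (which lives in its C++ backend and is not shown in this conversation); the child script, the mode names trap and abort, and the run_child helper are all hypothetical. The trap branch is an assumption about one way a handler can suppress coredumps, namely by exiting normally. The abort branch mirrors the "zsh: abort" result reported in this PR.

```python
import subprocess
import sys
import textwrap

# Hypothetical child program: installs a SIGSEGV handler, then raises SIGSEGV.
CHILD = textwrap.dedent("""
    import os, resource, signal, sys

    # Demo only: forbid core files so the sketch leaves nothing behind.
    # Only the exit status is inspected here.
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

    def handler(signum, frame):
        sys.stderr.write("Segmentation fault: %d\\n" % signum)
        if sys.argv[1] == "trap":
            # Assumed 'trapping' behaviour: log, then exit normally.
            # The process is never killed by a signal, so no core dump.
            os._exit(1)
        # Log, then die from SIGABRT so the kernel can write a core dump
        # (when the core size limit allows it).
        os.abort()

    signal.signal(signal.SIGSEGV, handler)
    signal.raise_signal(signal.SIGSEGV)
""")


def run_child(mode):
    """Run the child in 'trap' or 'abort' mode and return its returncode."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode
```

Under these assumptions, run_child("trap") reports an ordinary exit code, while run_child("abort") reports a negative return code, which is how subprocess marks a process that was terminated by a signal.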

Edit: Comparison of user experience during segfault with and without this PR:

[ins] In [1]: import gc
         ...: import objgraph
         ...: gc.set_debug(gc.DEBUG_SAVEALL)
         ...: import mxnet as mx
         ...: net = mx.gluon.nn.Dense(10, in_units=10)
         ...: net.initialize()
         ...: objgraph.show_refs([net.weight._data[0]], filename='weight_array.png')
         ...: del net
         ...: print(gc.collect())
Graph written to /tmp/objgraph-2ypakr86.dot (28 nodes)
Image generated as weight_array.png
38

[ins] In [2]: gc.garbage
Out[2]:
Segmentation fault: 11

%  

IPython crashes and the user is dropped back to the shell. With this PR:

[ins] In [1]: import gc
         ...: import objgraph
         ...: gc.set_debug(gc.DEBUG_SAVEALL)
         ...: import mxnet as mx
         ...: net = mx.gluon.nn.Dense(10, in_units=10)
         ...: net.initialize()
         ...: objgraph.show_refs([net.weight._data[0]], filename='weight_array.png')
         ...: del net
         ...: print(gc.collect())
Graph written to /tmp/objgraph-b0i9rl27.dot (28 nodes)
Image generated as weight_array.png
38

[ins] In [2]: gc.garbage
Out[2]:
Segmentation fault: 11

zsh: abort      ipython
% echo $?                                                                                                                                                          9s ~/src/mxnet-master/build fixcoredumps
134

After this PR, if the user sets ulimit -c unlimited before running the segfault-inducing program, they will get a coredump that can be analyzed in gdb.
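For scripted reproduction, the same setup can be approximated from Python. The sketch below is only an illustration: run_child and shell_exit_status are hypothetical helper names, and raising the soft limit to the hard limit matches ulimit -c unlimited only when the hard limit is itself unlimited. It also shows where the 134 in the session above comes from: a POSIX shell reports 128 plus the signal number, and SIGABRT is signal 6.

```python
import resource
import signal
import subprocess

# Status a POSIX shell shows in $? for a process killed by SIGABRT (128 + 6).
ABORT_STATUS = 128 + signal.SIGABRT


def shell_exit_status(returncode):
    """Translate a subprocess returncode into the value a shell shows in $?."""
    # subprocess encodes "killed by signal N" as -N; shells report 128 + N.
    return 128 - returncode if returncode < 0 else returncode


def run_child(argv, enable_core=True):
    """Run argv with the soft core-size limit raised to the hard limit, or set to 0."""
    def set_core_limit():
        # Runs in the child just before exec, so the parent's limits are untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_CORE)
        soft = hard if enable_core else 0
        resource.setrlimit(resource.RLIMIT_CORE, (soft, hard))

    return subprocess.run(argv, preexec_fn=set_core_limit).returncode
```

Calling run_child on a command that aborts and passing the result to shell_exit_status should give ABORT_STATUS. Whether a core file actually appears, and where, also depends on the system's core pattern configuration, which this sketch does not inspect.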

@mxnet-bot

Hey @leezu, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [miscellaneous, clang, website, unix-gpu, windows-gpu, edge, unix-cpu, centos-gpu, windows-cpu, sanity, centos-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@leezu leezu mentioned this pull request May 14, 2020
@leezu
Contributor Author

leezu commented May 14, 2020

@mxnet-bot run ci [unix-cpu, centos-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, centos-cpu]

@leezu
Contributor Author

leezu commented May 15, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@leezu
Contributor Author

leezu commented May 15, 2020

@mxnet-bot run ci [unix-cpu, centos-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-cpu, unix-cpu]

@leezu
Contributor Author

leezu commented May 16, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@leezu
Contributor Author

leezu commented May 18, 2020

@mxnet-bot run ci [unix-cpu, centos-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-gpu, unix-cpu]

@leezu leezu merged commit 5dfbaa6 into apache:master May 19, 2020
@leezu leezu deleted the fixcoredumps branch May 19, 2020 15:49
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020