This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Flaky Test] Segmentation fault in memory profiler tests #18564

Open
mseth10 opened this issue Jun 15, 2020 · 6 comments

@mseth10
Contributor

mseth10 commented Jun 15, 2020

Description

test_gpu_memory_profiler_gluon fails intermittently for different cu* flavors in nightly CD pipelines.

Occurrences

  1. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1257/pipeline
  2. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1245/pipeline

Error log

[2020-06-14T15:23:01.268Z] ________________________ test_gpu_memory_profiler_gluon ________________________
[2020-06-14T15:23:01.268Z] [gw1] linux -- Python 3.6.9 /opt/rh/rh-python36/root/usr/bin/python3
[2020-06-14T15:23:01.268Z] 
[2020-06-14T15:23:01.268Z]     @pytest.mark.skipif(mx.context.num_gpus() == 0, reason="GPU memory profiler records allocation on GPUs only")
[2020-06-14T15:23:01.268Z]     def test_gpu_memory_profiler_gluon():
[2020-06-14T15:23:01.268Z]         enable_profiler(profile_filename='test_profiler.json',
[2020-06-14T15:23:01.268Z] >                       run=True, continuous_dump=True)
[2020-06-14T15:23:01.268Z] 
[2020-06-14T15:23:01.268Z] tests/python/unittest/test_profiler.py:537: 
[2020-06-14T15:23:01.268Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2020-06-14T15:23:01.268Z] tests/python/unittest/test_profiler.py:40: in enable_profiler
[2020-06-14T15:23:01.268Z]     aggregate_stats=aggregate_stats)
[2020-06-14T15:23:01.268Z] python/mxnet/profiler.py:69: in set_config
[2020-06-14T15:23:01.268Z]     profiler_kvstore_handle))
[2020-06-14T15:23:01.268Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2020-06-14T15:23:01.268Z] 
[2020-06-14T15:23:01.268Z] ret = -1
[2020-06-14T15:23:01.268Z] 
[2020-06-14T15:23:01.268Z]     def check_call(ret):
[2020-06-14T15:23:01.268Z]         """Check the return value of C API call.
[2020-06-14T15:23:01.268Z]     
[2020-06-14T15:23:01.268Z]         This function will raise an exception when an error occurs.
[2020-06-14T15:23:01.268Z]         Wrap every API call with this function.
[2020-06-14T15:23:01.268Z]     
[2020-06-14T15:23:01.268Z]         Parameters
[2020-06-14T15:23:01.268Z]         ----------
[2020-06-14T15:23:01.268Z]         ret : int
[2020-06-14T15:23:01.268Z]             return value from API calls.
[2020-06-14T15:23:01.268Z]         """
[2020-06-14T15:23:01.268Z]         if ret != 0:
[2020-06-14T15:23:01.268Z] >           raise get_last_ffi_error()
[2020-06-14T15:23:01.268Z] E           mxnet.base.MXNetError: Traceback (most recent call last):
[2020-06-14T15:23:01.268Z] E             [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXSetProcessProfilerConfig+0x1bb) [0x7f937ce083eb]
[2020-06-14T15:23:01.268Z] E             [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::profiler::Profiler::SetConfig(int, std::string, bool, float, bool)+0x85) [0x7f93824c4ba5]
[2020-06-14T15:23:01.268Z] E             [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::profiler::Profiler::SetContinuousProfileDump(bool, float)+0x8b8) [0x7f93824c4428]
[2020-06-14T15:23:01.268Z] E             [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::ThreadGroup::Thread::joinable() const+0xbf) [0x7f93824c637f]
[2020-06-14T15:23:01.268Z] E             [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6d) [0x7f937cc7df3d]
[2020-06-14T15:23:01.268Z] E             File "../include/dmlc/thread_group.h", line 226
[2020-06-14T15:23:01.268Z] E           MXNetError: Check failed: auto_remove_ == false (1 vs. 0) :
[2020-06-14T15:23:01.268Z] 
[2020-06-14T15:23:01.268Z] python/mxnet/base.py:246: MXNetError
[2020-06-14T15:23:01.268Z] ---------------------------- Captured stderr setup -----------------------------
[2020-06-14T15:23:01.268Z] DEBUG:root:np/mx/python random seeds are set to 794738585, use MXNET_TEST_SEED=794738585 to reproduce.
[2020-06-14T15:23:01.268Z] ------------------------------ Captured log setup ------------------------------
[2020-06-14T15:23:01.268Z] DEBUG    root:conftest.py:193 np/mx/python random seeds are set to 794738585, use MXNET_TEST_SEED=794738585 to reproduce.
[2020-06-14T15:23:01.268Z] --------------------------- Captured stderr teardown ---------------------------
[2020-06-14T15:23:01.268Z] INFO:root:np/mx/python random seeds are set to 794738585, use MXNET_TEST_SEED=794738585 to reproduce.
[2020-06-14T15:23:01.268Z] ---------------------------- Captured log teardown -----------------------------
[2020-06-14T15:23:01.268Z] INFO     root:conftest.py:210 np/mx/python random seeds are set to 794738585, use MXNET_TEST_SEED=794738585 to reproduce.
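The log suggests rerunning with `MXNET_TEST_SEED=794738585`. A minimal sketch of how one might repeat the test with that seed to gauge how often it fails is below. The seed and test path are copied from the log; the helper names `build_repro` and `count_failures` and the run count are hypothetical. Since the check failure is in thread handling (`dmlc::ThreadGroup::Thread::joinable()`), pinning the seed may not be enough to reproduce it deterministically.

```python
import os
import subprocess
import sys

# Seed and test path are copied from the log above.
SEED = "794738585"
TEST_ID = "tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon"


def build_repro(seed=SEED, test_id=TEST_ID):
    """Return (command, environment) for one pytest run pinned to the logged seed."""
    env = dict(os.environ, MXNET_TEST_SEED=seed)
    cmd = [sys.executable, "-m", "pytest", "-x", test_id]
    return cmd, env


def count_failures(runs=20):
    """Run the test `runs` times in fresh processes and count non-zero exits."""
    cmd, env = build_repro()
    failures = 0
    for _ in range(runs):
        # Each run gets its own interpreter so profiler state is not shared.
        if subprocess.run(cmd, env=env).returncode != 0:
            failures += 1
    return failures
```

Calling `count_failures()` from the MXNet source root on a GPU machine would give a rough failure rate; the default of 20 runs is arbitrary.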
@leezu
Contributor

leezu commented Jul 13, 2020

This probably fails only on CD and not on CI because the test is never run on CI. See #18701

@leezu changed the title from "[Flaky Test] test_gpu_memory_profiler_gluon fails for cu* flavors on CD pipeline" to "[Flaky Test] test_gpu_memory_profiler_gluon fails" on Sep 29, 2020
@leezu added the v2.0 label on Sep 29, 2020
@leezu
Contributor

leezu commented Sep 29, 2020

Also flaky on CI. @ArmageddonKnight can you take a look at why the test is causing a segfault?

[2020-09-29T17:47:32.445Z] tests/python/gpu/test_profiler_gpu.py::test_gpu_memory_profiler_gluon 
[2020-09-29T17:47:32.445Z] Fatal Python error: Segmentation fault
[2020-09-29T17:47:32.445Z] 
[2020-09-29T17:47:32.445Z] Thread 0x00007f1161e1c700 (most recent call first):
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 400 in read
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 432 in from_io
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 967 in _thread_receiver
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 220 in run
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 285 in _perform_spawn
[2020-09-29T17:47:32.445Z] 
[2020-09-29T17:47:32.445Z] Current thread 0x00007f11633a4740 (most recent call first):
[2020-09-29T17:47:32.445Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2907 in backward
[2020-09-29T17:47:32.445Z]   File "/work/mxnet/tests/python/gpu/test_profiler_gpu.py", line 129 in test_gpu_memory_profiler_gluon
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 167 in pytest_pyfunc_call
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 1445 in runtest
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 134 in pytest_runtest_call
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in <lambda>
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 237 in from_call
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in call_runtest_hook
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_pytest_plugin.py", line 129 in call_and_report
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 99 in runtestprotocol
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 84 in pytest_runtest_protocol
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_pytest_plugin.py", line 92 in pytest_runtest_protocol
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/xdist/remote.py", line 87 in run_one_test
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/xdist/remote.py", line 70 in pytest_runtestloop
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 247 in _main
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 197 in wrap_session
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 240 in pytest_cmdline_main
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/xdist/remote.py", line 258 in <module>
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 1084 in executetask
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 220 in run
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 285 in _perform_spawn
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 267 in integrate_as_primary_thread
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 1060 in serve
[2020-09-29T17:47:32.445Z]   File "/usr/local/lib/python3.6/dist-packages/execnet/gateway_base.py", line 1554 in serve
[2020-09-29T17:47:32.445Z]   File "<string>", line 8 in <module>
[2020-09-29T17:47:32.445Z]   File "<string>", line 1 in <module>
[2020-09-29T17:47:32.445Z] tests/python/gpu/test_profiler_gpu.py::test_gpu_memory_profiler_symbolic 
[2020-09-29T17:47:32.699Z] [gw0] [ 90%] PASSED tests/python/gpu/test_profiler_gpu.py::test_gpu_memory_profiler_symbolic 
[2020-09-29T17:47:32.699Z] tests/python/gpu/test_profiler_gpu.py::test_profile_create_domain 
[2020-09-29T17:47:32.699Z] [gw0] [ 90%] PASSED tests/python/gpu/test_profiler_gpu.py::test_profile_create_domain 
[2020-09-29T17:47:32.699Z] [gw3] [ 90%] PASSED tests/python/gpu/test_gluon_gpu.py::test_cosine_loss[False] 
[2020-09-29T17:47:32.699Z] [gw1] node down: Not properly terminated
[2020-09-29T17:47:32.699Z] [gw1] [ 91%] FAILED tests/python/gpu/test_profiler_gpu.py::test_gpu_memory_profiler_gluon 
[2020-09-29T17:47:32.699Z] 
[2020-09-29T17:47:32.699Z] replacing crashed worker gw1

https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-19185/runs/2/nodes/277/steps/307/log/?start=0
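In the log above the segfault takes down the whole xdist worker (`node down: Not properly terminated`), so the crash shows up as a generic FAILED entry. A rough sketch of running a single test in its own interpreter and telling a signal death apart from an ordinary test failure follows. The helper names `classify_exit` and `run_isolated` are hypothetical; the sketch relies only on the standard `subprocess` convention that a negative return code means the child was killed by that signal (POSIX only).

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Map a subprocess return code to 'passed', 'failed', or 'killed by <signal>'."""
    if returncode == 0:
        return "passed"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return "failed"


def run_isolated(test_id):
    """Run one test in a fresh interpreter with faulthandler enabled."""
    cmd = [sys.executable, "-X", "faulthandler", "-m", "pytest", test_id]
    return classify_exit(subprocess.run(cmd).returncode)
```

For example, `run_isolated("tests/python/gpu/test_profiler_gpu.py::test_gpu_memory_profiler_gluon")` would return `"killed by SIGSEGV"` if the interpreter crashes the way it does in the log above.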

@ArmageddonKnight
Contributor

@leezu I am currently working on the GPU memory profile visualization. I will have a look after I am done with that.

@leezu
Contributor

leezu commented Sep 29, 2020

Thank you!

@ptrendx
Member

ptrendx commented Oct 13, 2020

I also saw a segfault in the test_gpu_memory_profiler_symbolic test: https://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/unix-gpu/job/PR-19269/6/display/redirect - it seems related.

@leezu changed the title from "[Flaky Test] test_gpu_memory_profiler_gluon fails" to "[Flaky Test] Segmentation fault in memory profiler tests" on Nov 5, 2020