This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

flaky test: check leak ndarray #18400

Open
eric-haibin-lin opened this issue May 24, 2020 · 8 comments · Fixed by #18407 or #18595
@eric-haibin-lin
Member

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18394/1/pipeline

[2020-05-24T09:53:03.464Z] ==================================== ERRORS ====================================
[2020-05-24T09:53:03.464Z] _____________________ ERROR at teardown of test_function1 ______________________
[2020-05-24T09:53:03.464Z] 
[2020-05-24T09:53:03.464Z] request = <SubRequest 'check_leak_ndarray' for <Function test_function1>>
[2020-05-24T09:53:03.464Z] 
[2020-05-24T09:53:03.464Z]     @pytest.fixture(autouse=True)
[2020-05-24T09:53:03.464Z]     def check_leak_ndarray(request):
[2020-05-24T09:53:03.464Z]         garbage_expected = request.node.get_closest_marker('garbage_expected')
[2020-05-24T09:53:03.464Z]         if garbage_expected:  # Some tests leak references. They should be fixed.
[2020-05-24T09:53:03.464Z]             yield  # run test
[2020-05-24T09:53:03.464Z]             return
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         if 'centos' in platform.platform():
[2020-05-24T09:53:03.464Z]             # Multiple tests are failing due to reference leaks on CentOS. It's not
[2020-05-24T09:53:03.464Z]             # yet known why there are more memory leaks in the Python 3.6.9 version
[2020-05-24T09:53:03.464Z]             # shipped on CentOS compared to the Python 3.6.9 version shipped in
[2020-05-24T09:53:03.464Z]             # Ubuntu.
[2020-05-24T09:53:03.464Z]             yield
[2020-05-24T09:53:03.464Z]             return
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         del gc.garbage[:]
[2020-05-24T09:53:03.464Z]         # Collect garbage prior to running the next test
[2020-05-24T09:53:03.464Z]         gc.collect()
[2020-05-24T09:53:03.464Z]         # Enable gc debug mode to check if the test leaks any arrays
[2020-05-24T09:53:03.464Z]         gc_flags = gc.get_debug()
[2020-05-24T09:53:03.464Z]         gc.set_debug(gc.DEBUG_SAVEALL)
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         # Run the test
[2020-05-24T09:53:03.464Z]         yield
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         # Check for leaked NDArrays
[2020-05-24T09:53:03.464Z]         gc.collect()
[2020-05-24T09:53:03.464Z]         gc.set_debug(gc_flags)  # reset gc flags
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         seen = set()
[2020-05-24T09:53:03.464Z]         def has_array(element):
[2020-05-24T09:53:03.464Z]             try:
[2020-05-24T09:53:03.464Z]                 if element in seen:
[2020-05-24T09:53:03.464Z]                     return False
[2020-05-24T09:53:03.464Z]                 seen.add(element)
[2020-05-24T09:53:03.464Z]             except (TypeError, ValueError):  # unhashable
[2020-05-24T09:53:03.464Z]                 pass
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]             if isinstance(element, mx.nd._internal.NDArrayBase):
[2020-05-24T09:53:03.464Z]                 return True
[2020-05-24T09:53:03.464Z]             elif isinstance(element, mx.sym._internal.SymbolBase):
[2020-05-24T09:53:03.464Z]                 return False
[2020-05-24T09:53:03.464Z]             elif hasattr(element, '__dict__'):
[2020-05-24T09:53:03.464Z]                 return any(has_array(x) for x in vars(element))
[2020-05-24T09:53:03.464Z]             elif isinstance(element, dict):
[2020-05-24T09:53:03.464Z]                 return any(has_array(x) for x in element.items())
[2020-05-24T09:53:03.464Z]             else:
[2020-05-24T09:53:03.464Z]                 try:
[2020-05-24T09:53:03.464Z]                     return any(has_array(x) for x in element)
[2020-05-24T09:53:03.464Z]                 except (TypeError, KeyError, RecursionError):
[2020-05-24T09:53:03.464Z]                     return False
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z] >       assert not any(has_array(x) for x in gc.garbage), 'Found leaked NDArrays due to reference cycles'
[2020-05-24T09:53:03.464Z] E       AssertionError: Found leaked NDArrays due to reference cycles
[2020-05-24T09:53:03.464Z] E       assert not True
[2020-05-24T09:53:03.464Z] E        +  where True = any(<generator object check_leak_ndarray.<locals>.<genexpr> at 0x7f96c07802b0>)
[2020-05-24T09:53:03.464Z] 
[2020-05-24T09:53:03.464Z] tests/python/conftest.py:78: AssertionError
[2020-05-24T09:53:03.464Z] ---------------------------- Captured stderr setup -----------------------------
[2020-05-24T09:53:03.464Z] DEBUG:root:np/mx/python random seeds are set to 135663639, use MXNET_TEST_SEED=135663639 to reproduce.
[2020-05-24T09:53:03.464Z] ------------------------------ Captured log setup ------------------------------
[2020-05-24T09:53:03.464Z] DEBUG    root:conftest.py:193 np/mx/python random seeds are set to 135663639, use MXNET_TEST_SEED=135663639 to reproduce.
[2020-05-24T09:53:03.464Z] ----------------------------- Captured stderr call -----------------------------
[2020-05-24T09:53:03.464Z] [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1404816900 to reproduce.
[2020-05-24T09:53:03.465Z] DEBUG:common:Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1404816900 to reproduce.
[2020-05-24T09:53:03.465Z] ------------------------------ Captured log call -------------------------------
[2020-05-24T09:53:03.465Z] DEBUG    common:common.py:221 Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1404816900 to reproduce.

@leezu
Contributor

leezu commented May 26, 2020

As the flakiness occurs with mx.autograd.Function, which is "known to leak" (cf. test_function in the same file), I suggest marking the flaky test_function1 as "known to leak" as well. I'm not yet sure why test_function1 leaks only sometimes.
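A minimal sketch of what such a marking could look like, assuming it is done with the `garbage_expected` marker that the `check_leak_ndarray` fixture in the log above looks up via `request.node.get_closest_marker`. The test body here is a placeholder, not the real test, and whether the marker also needs to be registered in the pytest configuration is not shown in this thread.

```python
import pytest


# The fixture skips its leak check when it finds this marker on the test.
@pytest.mark.garbage_expected
def test_function1():
    # Placeholder body (assumption): the real test exercises
    # mx.autograd.Function and is not reproduced here.
    pass
```

With the marker applied, the fixture takes its early `yield` / `return` branch, so the assertion on `gc.garbage` never runs for this test.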

@leezu
Contributor

leezu commented Jun 5, 2020

 _____________________ ERROR at teardown of test_get_symbol _____________________
[2020-06-04T22:33:35.745Z] 
[2020-06-04T22:33:35.745Z] request = <SubRequest 'check_leak_ndarray' for <Function test_get_symbol>>
[2020-06-04T22:33:35.745Z] 
[2020-06-04T22:33:35.745Z]     @pytest.fixture(autouse=True)
[2020-06-04T22:33:35.745Z]     def check_leak_ndarray(request):
[2020-06-04T22:33:35.745Z]         garbage_expected = request.node.get_closest_marker('garbage_expected')
[2020-06-04T22:33:35.745Z]         if garbage_expected:  # Some tests leak references. They should be fixed.
[2020-06-04T22:33:35.745Z]             yield  # run test
[2020-06-04T22:33:35.745Z]             return
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         if 'centos' in platform.platform():
[2020-06-04T22:33:35.745Z]             # Multiple tests are failing due to reference leaks on CentOS. It's not
[2020-06-04T22:33:35.745Z]             # yet known why there are more memory leaks in the Python 3.6.9 version
[2020-06-04T22:33:35.745Z]             # shipped on CentOS compared to the Python 3.6.9 version shipped in
[2020-06-04T22:33:35.745Z]             # Ubuntu.
[2020-06-04T22:33:35.745Z]             yield
[2020-06-04T22:33:35.745Z]             return
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         del gc.garbage[:]
[2020-06-04T22:33:35.745Z]         # Collect garbage prior to running the next test
[2020-06-04T22:33:35.745Z]         gc.collect()
[2020-06-04T22:33:35.745Z]         # Enable gc debug mode to check if the test leaks any arrays
[2020-06-04T22:33:35.745Z]         gc_flags = gc.get_debug()
[2020-06-04T22:33:35.745Z]         gc.set_debug(gc.DEBUG_SAVEALL)
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         # Run the test
[2020-06-04T22:33:35.745Z]         yield
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         # Check for leaked NDArrays
[2020-06-04T22:33:35.745Z]         gc.collect()
[2020-06-04T22:33:35.745Z]         gc.set_debug(gc_flags)  # reset gc flags
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         seen = set()
[2020-06-04T22:33:35.745Z]         def has_array(element):
[2020-06-04T22:33:35.745Z]             try:
[2020-06-04T22:33:35.745Z]                 if element in seen:
[2020-06-04T22:33:35.745Z]                     return False
[2020-06-04T22:33:35.745Z]                 seen.add(element)
[2020-06-04T22:33:35.745Z]             except (TypeError, ValueError):  # unhashable
[2020-06-04T22:33:35.745Z]                 pass
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]             if isinstance(element, mx.nd._internal.NDArrayBase):
[2020-06-04T22:33:35.745Z]                 return True
[2020-06-04T22:33:35.745Z]             elif isinstance(element, mx.sym._internal.SymbolBase):
[2020-06-04T22:33:35.745Z]                 return False
[2020-06-04T22:33:35.745Z]             elif hasattr(element, '__dict__'):
[2020-06-04T22:33:35.745Z]                 return any(has_array(x) for x in vars(element))
[2020-06-04T22:33:35.745Z]             elif isinstance(element, dict):
[2020-06-04T22:33:35.745Z]                 return any(has_array(x) for x in element.items())
[2020-06-04T22:33:35.745Z]             else:
[2020-06-04T22:33:35.745Z]                 try:
[2020-06-04T22:33:35.745Z]                     return any(has_array(x) for x in element)
[2020-06-04T22:33:35.745Z]                 except (TypeError, KeyError, RecursionError):
[2020-06-04T22:33:35.745Z]                     return False
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z] >       assert not any(has_array(x) for x in gc.garbage), 'Found leaked NDArrays due to reference cycles'
[2020-06-04T22:33:35.745Z] E       AssertionError: Found leaked NDArrays due to reference cycles
[2020-06-04T22:33:35.745Z] E       assert not True
[2020-06-04T22:33:35.745Z] E        +  where True = any(<generator object check_leak_ndarray.<locals>.<genexpr> at 0x7f8a046fb0a0>)
[2020-06-04T22:33:35.745Z] 

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-18485/runs/2/nodes/365/steps/570/log/?start=0

@leezu leezu reopened this Jun 5, 2020
@eric-haibin-lin
Member Author

@leezu
Contributor

leezu commented Jun 19, 2020

And a third time. I'm not sure why this happens from time to time or why it only affects test_get_symbol, but let's disable the check for test_get_symbol in favor of CI stability: #18595

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-18589/runs/1/nodes/364/steps/755/log/?start=0
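For anyone trying to narrow down which objects end up in `gc.garbage`, the mechanism the fixture relies on can be reproduced with the standard library alone. The following is only a sketch: `Node`, `make_cycle` and `collect_cyclic_garbage` are hypothetical names, and `Node` merely stands in for an object holding an NDArray. It uses the same `gc.DEBUG_SAVEALL` steps as the fixture in the logs above.

```python
import gc


class Node:
    """Hypothetical stand-in for an object that holds an array."""

    def __init__(self, name):
        self.name = name
        self.other = None


def make_cycle():
    # Build an a <-> b reference cycle, then drop all outside references.
    a, b = Node("a"), Node("b")
    a.other, b.other = b, a


def collect_cyclic_garbage(func):
    """Run func and return the objects that only the cycle collector found."""
    gc.collect()                    # clear out pre-existing cycles
    del gc.garbage[:]
    old_flags = gc.get_debug()
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects in gc.garbage
    try:
        func()
        gc.collect()
        found = list(gc.garbage)
    finally:
        gc.set_debug(old_flags)     # restore the previous debug flags
        del gc.garbage[:]
    return found


leaked = collect_cyclic_garbage(make_cycle)
print(sorted(o.name for o in leaked if isinstance(o, Node)))  # ['a', 'b']
```

Objects freed by plain reference counting never reach `gc.garbage`, so only objects caught in a reference cycle show up in the result. Printing the types of the returned objects is one way to see what forms the cycle in an intermittently failing test.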

AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this issue Jul 6, 2020
@szha
Member

szha commented Jul 9, 2020

@szha szha reopened this Jul 9, 2020