Unknown non-recurrent failures in //bindings/pydrake/systems:py/general_test #19335

Open
svenevs opened this issue May 2, 2023 · 35 comments
Labels: component: continuous integration (Jenkins, CDash, mirroring of externals, website infrastructure), type: bug

@svenevs
Contributor

svenevs commented May 2, 2023

First occurrence came up in continuous macOS x86:

[4:48:13 PM]  FAIL: //bindings/pydrake/systems:py/general_test (see /Users/monterey/workspace/mac-x86-monterey-clang-bazel-continuous-release/_bazel_monterey/56870061588957414ef418ee351da9fe/execroot/drake/bazel-out/darwin-opt/testlogs/bindings/pydrake/systems/py/general_test/test.log)
[4:48:13 PM]  INFO: From Testing //bindings/pydrake/systems:py/general_test:
[4:48:13 PM]  ==================== Test output for //bindings/pydrake/systems:py/general_test:
[4:48:13 PM]  
[4:48:13 PM]  Running tests...
[4:48:13 PM]  ----------------------------------------------------------------------
[4:48:13 PM]  ..............General stats regarding discrete updates:
[4:48:13 PM]  Number of time steps taken (simulator stats) = 17
[4:48:13 PM]  Simulator publishes every time step: false
[4:48:13 PM]  Number of publishes = 0
[4:48:13 PM]  Number of discrete updates = 0
[4:48:13 PM]  Number of "unrestricted" updates = 0
[4:48:13 PM]  
[4:48:13 PM]  Stats for integrator RungeKutta3Integrator with error control:
[4:48:13 PM]  Number of time steps taken (integrator stats) = 17
[4:48:13 PM]  Initial time step taken =          0 s
[4:48:13 PM]  Largest time step taken =        0.1 s
[4:48:13 PM]  Smallest adapted step size =          0 s
[4:48:13 PM]  Number of steps shrunk due to error control = 0
[4:48:13 PM]  Number of derivative evaluations = 51
[4:48:13 PM]  Number of steps shrunk due to convergence-based failure = 0
[4:48:13 PM]  Number of convergence-based step failures (should match) = 0
[4:48:13 PM]  ............F....
[4:48:13 PM]  ======================================================================
[4:48:13 PM]  FAIL [0.002s]: test_system_base_api (general_test.TestGeneral.test_system_base_api)
[4:48:13 PM]  ----------------------------------------------------------------------
[4:48:13 PM]  Traceback (most recent call last):
[4:48:13 PM]    File "/Users/monterey/workspace/mac-x86-monterey-clang-bazel-continuous-release/_bazel_monterey/56870061588957414ef418ee351da9fe/sandbox/darwin-sandbox/8448/execroot/drake/bazel-out/darwin-opt/bin/bindings/pydrake/systems/py/general_test.runfiles/drake/bindings/pydrake/systems/test/general_test.py", line 122, in test_system_base_api
[4:48:13 PM]      self.assertIs(u1.get_system(), system)
[4:48:13 PM]  AssertionError: <pydrake.systems.primitives.Adder object at 0x116343d30> is not <pydrake.systems.primitives.Adder object at 0x10eaa1330>
[4:48:13 PM]  
[4:48:13 PM]  ----------------------------------------------------------------------
[4:48:13 PM]  Ran 31 tests in 0.165s
[4:48:13 PM]  
[4:48:13 PM]  FAILED (failures=1)
[4:48:13 PM]  
[4:48:13 PM]  Generating XML reports...
[4:48:13 PM]  ================================================================================

We booted a CI machine to triage this, suspecting the workspace upgrades (#19332), but they are not the cause. The failure also appears under df13567, although CI does not always pick it up. It may not be limited to macOS.

Running the test many times can sometimes reproduce it. On macOS, we used this command line:

$ bazel test --cache_test_results=no --runs_per_test=150 --config=clang --compilation_mode=opt --test_timeout=300,1500,4500,-1 //bindings/pydrake/systems:py/general_test
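
As a rough guide to why a high --runs_per_test helps, here is a minimal sketch of the chance of seeing at least one failure in a batch of runs. It assumes runs fail independently at a fixed rate. The rates below are hypothetical, since the thread does not report a measured failure rate, and the function name is illustrative only.

```python
def chance_of_at_least_one_failure(per_run_failure_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


# Hypothetical per-run failure rates (not measured in this issue).
for rate in (0.001, 0.01, 0.05):
    probability = chance_of_at_least_one_failure(rate, runs=150)
    print(f"per-run rate {rate}: P(at least one failure in 150 runs) = {probability:.3f}")
```

If the real failure rate is very low, even 150 runs can pass cleanly, which would be consistent with CI only occasionally picking the failure up.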

For now we label this "buildcop noise" and will log future occurrences, while silently ignoring it otherwise.

@svenevs added the type: bug and component: continuous integration labels May 2, 2023
@jwnimmer-tri
Collaborator

The first time an unexplained failure occurs, close the issue immediately – there is not much value in keeping an open issue for a failure that only ever happened once. If the issue occurs a second time, reopen it.

https://drake.mit.edu/buildcop.html#process

@jwnimmer-tri closed this as not planned May 3, 2023
@ggould-tri
Contributor

Happened again. Also monterey. Reopening -- there's something in the monterey toolchain that's tickling pybind object identity semantics.
https://drake-jenkins.csail.mit.edu/view/Production/job/mac-x86-monterey-unprovisioned-clang-bazel-nightly-release/178/consoleFull
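
To make "object identity semantics" concrete, here is a pure-Python sketch of one generic way an assertIs check like the failing one can break. A binding layer hands out Python wrappers for native objects, and identity only holds while it keeps returning the same wrapper. Every class here (NativeObject, Wrapper, Registry) is hypothetical and written for illustration. This is not pybind11's implementation, and the thread does not establish that this is the root cause.

```python
class NativeObject:
    """Stands in for a C++ object that lives outside Python."""


class Wrapper:
    """Stands in for the Python object a binding layer hands back."""

    def __init__(self, native):
        self.native = native


class Registry:
    """Hypothetical cache mapping a native object to its one Python wrapper."""

    def __init__(self):
        self._wrappers = {}

    def wrap(self, native):
        # Reuse the existing wrapper if one is registered, else create one.
        key = id(native)
        if key not in self._wrappers:
            self._wrappers[key] = Wrapper(native)
        return self._wrappers[key]

    def forget(self, native):
        # Simulates the registration being lost for any reason.
        self._wrappers.pop(id(native), None)


registry = Registry()
system = NativeObject()

first = registry.wrap(system)
print(registry.wrap(system) is first)  # same wrapper, so identity holds

registry.forget(system)
second = registry.wrap(system)
print(second is first)                 # a fresh wrapper, so assertIs would fail
print(second.native is first.native)   # yet both refer to the same native object
```

The last two checks mirror the failure message: two distinct Adder objects at different addresses where the test expects a single object.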

@ggould-tri
Contributor

@jwnimmer-tri speculates that this may share a common source with #19394: https://drakedevelopers.slack.com/archives/C270MN28G/p1683817207919069?thread_ts=1683810179.404429&cid=C270MN28G
(Mentioning this so that the two bugs are cross-linked, for future convenience if one of them gets found and fixed.)

@DamrongGuoy
Contributor

It happened again last night (5/16/23) in mac-arm-monterey-clang-bazel-nightly-debug/196.
The repeated run, mac-arm-monterey-clang-bazel-nightly-debug/197, passed.

@liangfok
Contributor

@BetsyMcPhail
Contributor

@BetsyMcPhail
Contributor

@BetsyMcPhail
Contributor

@DamrongGuoy
Contributor

DamrongGuoy commented Jul 24, 2023

@ggould-tri
Contributor

Again: https://drake-jenkins.csail.mit.edu/view/Production/job/mac-arm-monterey-clang-bazel-nightly-debug/375/consoleFull

[5:07:41 AM]  INFO: From Testing //bindings/pydrake/systems:py/general_test:
[5:07:41 AM]  ==================== Test output for //bindings/pydrake/systems:py/general_test:
[5:07:41 AM]  
[5:07:41 AM]  Running tests...
[5:07:41 AM]  ----------------------------------------------------------------------
[5:07:41 AM]  ..............General stats regarding discrete updates:
[5:07:41 AM]  Number of time steps taken (simulator stats) = 17
[5:07:41 AM]  Simulator publishes every time step: false
[5:07:41 AM]  Number of publishes = 0
[5:07:41 AM]  Number of discrete updates = 0
[5:07:41 AM]  Number of "unrestricted" updates = 0
[5:07:41 AM]  
[5:07:41 AM]  Stats for integrator RungeKutta3Integrator with error control:
[5:07:41 AM]  Number of time steps taken (integrator stats) = 17
[5:07:41 AM]  Initial time step taken =          0 s
[5:07:41 AM]  Largest time step taken =        0.1 s
[5:07:41 AM]  Smallest adapted step size =          0 s
[5:07:41 AM]  Number of steps shrunk due to error control = 0
[5:07:41 AM]  Number of derivative evaluations = 51
[5:07:41 AM]  Number of steps shrunk due to convergence-based failure = 0
[5:07:41 AM]  Number of convergence-based step failures (should match) = 0
[5:07:41 AM]  ............F....
[5:07:41 AM]  ======================================================================
[5:07:41 AM]  FAIL [0.004s]: test_system_base_api (general_test.TestGeneral.test_system_base_api)
[5:07:41 AM]  ----------------------------------------------------------------------
[5:07:41 AM]  Traceback (most recent call last):
[5:07:41 AM]    File "/Users/admin/workspace/mac-arm-monterey-clang-bazel-nightly-debug/_bazel_admin/c321c5db3bfe1b41e5a3a7639d47261f/sandbox/darwin-sandbox/8451/execroot/drake/bazel-out/darwin_arm64-dbg/bin/bindings/pydrake/systems/py/general_test.runfiles/drake/bindings/pydrake/systems/test/general_test.py", line 125, in test_system_base_api
[5:07:41 AM]      self.assertIs(u1.get_system(), system)
[5:07:41 AM]  AssertionError: <pydrake.systems.primitives.Adder object at 0x10c836db0> is not <pydrake.systems.primitives.Adder object at 0x1045fe170>
[5:07:41 AM]  
[5:07:41 AM]  ----------------------------------------------------------------------
[5:07:41 AM]  Ran 31 tests in 0.195s

@BetsyMcPhail
Contributor

@jwnimmer-tri
Collaborator

The test has a runtime of like 2 seconds. We should mark it flaky = True to have it be slightly less noisy in CI.
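
For a sense of how much quieter this would be: as far as I understand Bazel's documented behaviour, a test marked flaky is run up to three times by default and is reported as failing only if every attempt fails. A minimal sketch of that arithmetic follows, assuming attempts fail independently. The rates are hypothetical and the function name is illustrative only.

```python
def chance_of_reported_failure(per_run_failure_rate: float, attempts: int = 3) -> float:
    """Probability that every attempt fails, so the test is reported as failed."""
    return per_run_failure_rate ** attempts


# Hypothetical per-run failure rates (not measured in this issue).
for rate in (0.01, 0.05, 0.2):
    reported = chance_of_reported_failure(rate)
    print(f"per-run rate {rate}: reported failure rate with 3 attempts = {reported:.6f}")
```

Retrying only hides the noise. The underlying identity failure would still occur on individual attempts, so the issue itself stays relevant.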

@BetsyMcPhail
Contributor

@SeanCurtis-TRI
Contributor

@BetsyMcPhail
Contributor

@williamjallen
Contributor

10 participants