Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

👩‍🌾 get_node_names test failing on macOS #750

Closed
chapulina opened this issue Aug 18, 2020 · 5 comments
Closed

👩‍🌾 get_node_names test failing on macOS #750

chapulina opened this issue Aug 18, 2020 · 5 comments

Comments

@chapulina
Copy link

Bug report

Required Info:

  • Operating System:
    • macOS
  • Installation type:
    • source
  • Version or commit hash:
    • master since 2020-08-16
  • DDS implementation:
    • FastRTPS
  • Client library (if applicable):
    • N/A

Steps to reproduce issue

Do the same setup as CI:

https://ci.ros2.org/job/nightly_osx_release/1754/consoleFull

Expected behavior

All tests pass

Actual behavior

Three get_node_names tests fail:

  • rcl.TestGetNodeNames__rmw_fastrtps_cpp.test_rcl_get_node_names_with_enclave
  • rcl.TestGetNodeNames__rmw_fastrtps_cpp.test_rcl_get_node_names
  • projectroot.test.test_get_node_names__rmw_fastrtps_cpp

Additional information

These tests started failing on CI 2 days ago:

/Users/osrf/jenkins-agent/workspace/nightly_osx_release/ws/src/ros2/rcl/rcl/test/rcl/test_get_node_names.cpp:138
Expected equality of these values:
  discovered_nodes
    Which is: { ("launch_ros_12244", "/"), ("mock_component_container", "/"), ("mock_component_container", "/"), ("mock_component_container", "/"), ("mock_component_container", "/"), ("node1", "/"), ("node1", "/"), ("node2", "/"), ("node2", "/ns/ns"), ("node3", "/ns") }
  expected_nodes
    Which is: { ("node1", "/"), ("node1", "/"), ("node2", "/"), ("node2", "/ns/ns"), ("node3", "/ns") }

The 2 PRs merged to this repository since don't seem to be related to the failure (#746, #734 )

@clalancette
Copy link
Contributor

clalancette commented Aug 21, 2020

So, this looks like cross-talk between this test and previous tests that ran, but didn't properly cleanup after themselves. It also looks like it is a flake, since we've had no failures in the last 4 nights after this.

There are 2 things I can think about doing here:

  1. Relax the test a bit so that discovered_nodes is a superset of expected_nodes. This would reduce the flake, and insulate this test against previous tests failing in one way or another in the future.
  2. Try to find out why the previous tests didn't cleanup. I'm not entirely sure how we would go about this, but we could try to reproduce somehow and pursue it.

@ros2/team any thoughts?

@sloretz
Copy link
Contributor

sloretz commented Aug 21, 2020

Relax the test a bit so that discovered_nodes is a superset of expected_nodes. This would reduce the flake, and insulate this test against previous tests failing in one way or another in the future.

A while back one of the node names tests was failing because the code to set ROS_DOMAIN_ID on the CI machines wasn't working correctly, and this test's strictness is what caught that. I would be hesitant to relax the restriction, though I can see why that's attractive if there's a hard to locate cleanup issue.

Since this test failed on an OSX machine and those are all on the same network, is it possible this is showing a regression in the code setting ROS_DOMAIN_ID?

@dirk-thomas
Copy link
Member

Try to find out why the previous tests didn't cleanup. I'm not entirely sure how we would go about this, but we could try to reproduce somehow and pursue it.

Imo we should definitely do this.

any thoughts?

If we don't want to relax the test (which we could do too) we could also set the localhost-only option and use the domain coordinator to pick a different domain ID (which is not the one set in the environment variable).

Relax the test a bit so that discovered_nodes is a superset of expected_nodes. This would reduce the flake, and insulate this test against previous tests failing in one way or another in the future.

@jacobperron
Copy link
Member

Try to find out why the previous tests didn't cleanup

I'm pretty sure this is my fault. I ran CI for some broken tests related to composable nodes that left some nodes running (hence the node name "mock_component_container"). AFAIK, this only affected our macOS machines since they are not containerized. I noticed these zombie nodes about a day after running the buggy test and since killed them.

It's still possible to run into a similar issue in the future if a buggy test is run leaving behind stray nodes. So, it's probably still worth relaxing test or changing up the domain ID as @dirk-thomas suggests.

@clalancette
Copy link
Contributor

We haven't seen this in a while, and we know the root cause. It can still happen in the future, but for now I'm going to close this bug out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants