👩‍🌾 get_node_names test failing on macOS #750

chapulina · 2020-08-18T02:42:25Z

Bug report

Required Info:

Operating System:
- macOS
Installation type:
- source
Version or commit hash:
- master since 2020-08-16
DDS implementation:
- FastRTPS
Client library (if applicable):
- N/A

Steps to reproduce issue

Do the same setup as CI:

https://ci.ros2.org/job/nightly_osx_release/1754/consoleFull

Expected behavior

All tests pass

Actual behavior

Three get_node_names tests fail:

rcl.TestGetNodeNames__rmw_fastrtps_cpp.test_rcl_get_node_names_with_enclave
rcl.TestGetNodeNames__rmw_fastrtps_cpp.test_rcl_get_node_names
projectroot.test.test_get_node_names__rmw_fastrtps_cpp

Additional information

These tests started failing on CI 2 days ago:

/Users/osrf/jenkins-agent/workspace/nightly_osx_release/ws/src/ros2/rcl/rcl/test/rcl/test_get_node_names.cpp:138
Expected equality of these values:
  discovered_nodes
    Which is: { ("launch_ros_12244", "/"), ("mock_component_container", "/"), ("mock_component_container", "/"), ("mock_component_container", "/"), ("mock_component_container", "/"), ("node1", "/"), ("node1", "/"), ("node2", "/"), ("node2", "/ns/ns"), ("node3", "/ns") }
  expected_nodes
    Which is: { ("node1", "/"), ("node1", "/"), ("node2", "/"), ("node2", "/ns/ns"), ("node3", "/ns") }

The 2 PRs merged to this repository since then (#746, #734) don't seem to be related to the failure.
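The unexpected entries in the output above can be isolated mechanically. This is a minimal sketch, assuming the nodes are held as a `std::multiset` of (name, namespace) pairs, which the printed output suggests but does not confirm. The `NodeId` alias and the `unexpected_nodes` helper are hypothetical names, not part of rcl.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias: a (name, namespace) pair, mirroring how the test prints nodes.
using NodeId = std::pair<std::string, std::string>;

// Returns the nodes in `discovered` that are not accounted for by `expected`.
// Duplicates are respected: a node seen 4 times but expected 0 times is
// reported 4 times. Both multisets are already sorted, as set_difference needs.
std::vector<NodeId> unexpected_nodes(
  const std::multiset<NodeId> & discovered,
  const std::multiset<NodeId> & expected)
{
  std::vector<NodeId> extras;
  std::set_difference(
    discovered.begin(), discovered.end(),
    expected.begin(), expected.end(),
    std::back_inserter(extras));
  return extras;
}
```

Applied to the two sets printed above, this yields `launch_ros_12244` once and `mock_component_container` four times, all in namespace `/`.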


clalancette · 2020-08-21T15:11:29Z

So, this looks like cross-talk between this test and previous tests that ran but didn't properly clean up after themselves. It also looks like a flake, since we've had no failures in the 4 nights since.

There are 2 things I can think of doing here:

1. Relax the test a bit so that discovered_nodes is a superset of expected_nodes. This would reduce the flake, and insulate this test against previous tests failing in one way or another in the future.
2. Try to find out why the previous tests didn't clean up. I'm not entirely sure how we would go about this, but we could try to reproduce somehow and pursue it.
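To make the first option concrete, here is a rough sketch of what a superset check could look like. It assumes the test keeps both sets as `std::multiset` of (name, namespace) pairs. The `NodeId` alias and the `contains_all` helper are hypothetical and are not taken from the existing test.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <utility>

// Hypothetical alias: a (name, namespace) pair identifying one node.
using NodeId = std::pair<std::string, std::string>;

// True when every expected node appears in `discovered` at least as many
// times as it appears in `expected`. Extra discovered nodes are tolerated.
// std::includes requires sorted ranges, which std::multiset provides.
bool contains_all(
  const std::multiset<NodeId> & discovered,
  const std::multiset<NodeId> & expected)
{
  return std::includes(
    discovered.begin(), discovered.end(),
    expected.begin(), expected.end());
}
```

In the test, the strict equality assertion would then become something like `EXPECT_TRUE(contains_all(discovered_nodes, expected_nodes))`. The trade-off is that stray nodes from other tests or machines would no longer be reported.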

@ros2/team any thoughts?

sloretz · 2020-08-21T15:24:25Z

> Relax the test a bit so that discovered_nodes is a superset of expected_nodes. This would reduce the flake, and insulate this test against previous tests failing in one way or another in the future.

A while back one of the node names tests was failing because the code to set ROS_DOMAIN_ID on the CI machines wasn't working correctly, and this test's strictness is what caught that. I would be hesitant to relax the restriction, though I can see why that's attractive if there's a hard-to-locate cleanup issue.

Since this test failed on an OSX machine and those are all on the same network, is it possible this is showing a regression in the code setting ROS_DOMAIN_ID?

dirk-thomas · 2020-08-21T15:36:21Z

> Try to find out why the previous tests didn't clean up. I'm not entirely sure how we would go about this, but we could try to reproduce somehow and pursue it.

Imo we should definitely do this.

> any thoughts?

If we don't want to relax the test (which we could do too) we could also set the localhost-only option and use the domain coordinator to pick a different domain ID (which is not the one set in the environment variable).
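A rough sketch of the environment side of that idea follows. It assumes the isolation is applied through the `ROS_DOMAIN_ID` and `ROS_LOCALHOST_ONLY` environment variables before rcl is initialized, and that the domain ID has already been obtained elsewhere; how the domain coordinator hands out an ID is not shown here. The helper name `isolate_test_environment` is hypothetical, and `setenv` is POSIX-only, so this would not build on Windows as written.

```cpp
#include <stdlib.h>  // setenv (POSIX)

#include <string>

// Hypothetical helper: overrides the domain ID inherited from the CI
// environment and restricts discovery to the local machine.
// Must run before the rcl context is initialized to have any effect.
// Returns false if either variable could not be set.
bool isolate_test_environment(unsigned int domain_id)
{
  const std::string id = std::to_string(domain_id);
  // Third argument 1 means: overwrite any value already in the environment.
  if (setenv("ROS_DOMAIN_ID", id.c_str(), 1) != 0) {
    return false;
  }
  if (setenv("ROS_LOCALHOST_ONLY", "1", 1) != 0) {
    return false;
  }
  return true;
}
```

Localhost-only alone would not hide stray nodes left running on the same machine, so picking a domain ID different from the one in the environment is the part that addresses that case.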

> Relax the test a bit so that discovered_nodes is a superset of expected_nodes. This would reduce the flake, and insulate this test against previous tests failing in one way or another in the future.

jacobperron · 2020-08-25T00:41:30Z

> Try to find out why the previous tests didn't clean up

I'm pretty sure this is my fault. I ran CI for some broken tests related to composable nodes that left some nodes running (hence the node name "mock_component_container"). AFAIK, this only affected our macOS machines since they are not containerized. I noticed these zombie nodes about a day after running the buggy test and have since killed them.

It's still possible to run into a similar issue in the future if a buggy test is run and leaves stray nodes behind. So, it's probably still worth relaxing the test or changing the domain ID as @dirk-thomas suggests.

clalancette · 2020-09-10T19:09:42Z

We haven't seen this in a while, and we know the root cause. It can still happen in the future, but for now I'm going to close this bug out.

clalancette closed this as completed Sep 10, 2020

This was referenced Nov 2, 2021

Add setter and getter for domain_id in rcl_init_options_t (backport #678) #946

Merged

[env] Add set_env_var function ros2/rcpputils#150

Merged
