This repository has been archived by the owner on Oct 7, 2021. It is now read-only.

various parameter tests failing #264

Closed
dirk-thomas opened this issue Mar 21, 2019 · 13 comments
Labels
backlog bug Something isn't working help wanted Extra attention is needed

Comments

@dirk-thomas
Member

Since the transition to the IDL pipeline, various parameter tests have been failing, but only with OpenSplice. See ros2/rosidl#346.

@jacobperron
Member

I did some debugging after running across failures in the demos repo. See ros2/demos#337 (comment) for details.

@cwyark
Contributor

cwyark commented Jun 24, 2019

@dirk-thomas It seems the build has failed since , but passed since
ADLink would like to help solve this issue. How can we help?

@jacobperron
Member

jacobperron commented Jun 24, 2019

A rebuild of the CI from ros2/demos#337 (testing demo_nodes_cpp) shows parameter service tests are still failing on macOS: Build Status

I've triggered a fresh build on all platforms to see where we stand (OpenSplice only):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@jacobperron
Member

Apparently there's a build requirement on Fast-RTPS at the moment. Rebuilding with both Fast-RTPS and OpenSplice:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@cwyark
Contributor

cwyark commented Jun 28, 2019

Some updates on this issue, which is connected to ros2/demos#337. I can reproduce the segmentation fault when I do not pass a CMake build type.

Download ros2.repos and run vcs import src < ros2.repos, then build and run the demo:

$> colcon build --packages-up-to demo_nodes_cpp 
$> ./build/demo_nodes_cpp/list_parameters
[INFO] [list_parameters]: Setting parameters...
Segmentation fault: 11

I then set up another clean environment and rebuilt ROS 2, this time passing CMAKE_BUILD_TYPE=RelWithDebInfo:

$> colcon build --cmake-args \ -DCMAKE_BUILD_TYPE=RelWithDebInfo --packages-up-to demo_nodes_cpp 
$> ./build/demo_nodes_cpp/list_parameters
[INFO] [list_parameters]: Setting parameters...
[INFO] [list_parameters]: Listing parameters...
[INFO] [list_parameters]: 
Parameter names:
 bar
 foo
 foo.first
 foo.second
Parameter prefixes:
 foo

This always happens on my MacBook, where the default compiler is clang:

Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

I believe it has something to do with clang's optimization.

@cwyark
Contributor

cwyark commented Jul 10, 2019

Following up on my previous update, I wrote a server/client example:
(param_client.cpp) https://gist.github.com/cwyark/7d032d3a0672d4418505474d90819900
(param_server.cpp) https://gist.github.com/cwyark/ecec75775e20451b4d496599937808b0
The segmentation fault only occurs when I use the .srv files from rcl_interfaces:

  • DescribeParameters.srv
  • GetParameterTypes.srv
  • GetParameters.srv
  • ListParameters.srv
  • SetParameters.srv
  • SetParametersAtomically.srv

If I use another .srv, such as one from lifecycle_msgs, no segmentation fault occurs.
This only happens on OSX, and only with a default build or a debug build without optimization (e.g. colcon build --cmake-args \ -DCMAKE_BUILD_TYPE=Debug --packages-select rcl_interfaces or colcon build --packages-select rcl_interfaces).

@Karsten1987
Contributor

So I think we just ran into this situation again with the rclcpp timesource test: https://ci.ros2.org/job/ci_osx/6892/testReport/(root)/rclcpp/test_time_source_gtest_missing_result/

While debugging locally, I could only get this traceback:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000012287ea91 librcl_interfaces__rosidl_typesupport_opensplice_c.dylib`rosidl_typesupport_opensplice_cpp::TemplateDataWriter<rcl_interfaces::srv::dds_::Sample_SetParameters_Request_>::write_sample(datawriter=0x0000000103dc8550, sample=0x00007ffeefbe8f00) at set_parameters__type_support.cpp:1068:50
   1065      {
   1066        rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter * typed_datawriter = _narrow(datawriter);
   1067
-> 1068        DDS::ReturnCode_t status = typed_datawriter->write(sample, DDS::HANDLE_NIL);

Here the datawriter instance is valid, but the call to _narrow returns a nullptr.

@e-hndrks

e-hndrks commented Oct 24, 2019

Hi @Karsten1987, this is interesting. Usually the _narrow just performs a downcast from an untyped DDS::DataWriter_ptr to in this case a rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter, but I don't see where in your case the _narrow function you use is coming from. Could you show me the code of this _narrow function? Furthermore, it could be that the "datawriter" pointer is currently not holding a valid pointer anymore, causing the cast to fail, but it might also be possible that it represents an object of a different datatype than rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter causing the cast to fail.
Could you print the state of *datawriter in your debugger and post the result? This should give us at least an answer whether we are looking at a valid DataWriter object or not.
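To make the downcast described here concrete, below is a minimal sketch of a _narrow-style cast and a null check before the write. All type and function names are hypothetical stand-ins, not the OpenSplice API, and returning nullptr on success is an assumption made for the sketch. It only shows the "different type" case: a dynamic_cast on a valid object of another type yields nullptr, which can be reported as an error instead of being dereferenced.

```cpp
#include <string>

// Hypothetical stand-ins for the DDS writer types (not the OpenSplice API).
struct UntypedWriter { virtual ~UntypedWriter() = default; };
struct TypedWriter : UntypedWriter {
  int write(const std::string &) { return 0; }  // 0 stands in for "OK"
};
struct OtherWriter : UntypedWriter {};

// Stand-in for a generated _narrow: yields nullptr when the downcast fails.
TypedWriter * narrow(UntypedWriter * writer) {
  return dynamic_cast<TypedWriter *>(writer);
}

// Assumption for this sketch: nullptr means success, otherwise an error string.
const char * write_sample(UntypedWriter * datawriter, const std::string & sample) {
  if (datawriter == nullptr) {
    return "datawriter handle is null";
  }
  TypedWriter * typed = narrow(datawriter);
  if (typed == nullptr) {
    // The object exists but is not a TypedWriter: report it, do not dereference.
    return "datawriter could not be narrowed to the expected type";
  }
  return typed->write(sample) == 0 ? nullptr : "write failed";
}
```

With a check like this, a failed narrow would surface as an error message rather than EXC_BAD_ACCESS at address 0x0. It cannot detect a dangling pointer, which is the other possibility mentioned above.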

Regards,
Erik Hendriks.

@e-hndrks

@Karsten1987: I did some more research into this, and I see that the _narrow is inherited by the TemplateDataWriter class in rosidl_typesupport_opensplice/rosidl_typesupport_opensplice_cpp/resource/srv__type_support.cpp.em, which contains the infamous line 1066 in your gdb window. The problem here is that the _narrow function call is ambiguous: not only does the direct parent of TemplateDataWriter specify a static _narrow operation, but so do all other parent classes higher up in the inheritance hierarchy. Please note that the _narrow function is static, and not virtual. I don't know what the compiler will do when it picks an _narrow from its inheritance hierarchy here, since all _narrow operations differ only in the return type, which does not play a role in their name mangling schema. I could see how one compiler would make different choices than another compiler in this particular scenario.
Could you try to disambiguate the _narrow function in line 1066, by prefixing it with the scope of the Writer class that we are targeting, as in:
rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter * typed_datawriter = rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter::_narrow(datawriter);
and see if that solves this particular crash? If so, then we can update the template in rosidl_typesupport_opensplice/rosidl_typesupport_opensplice_cpp/resource/srv__type_support.cpp.em accordingly.
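For reference, here is a small sketch of how standard C++ name lookup treats same-named static functions, assuming a plain single-inheritance chain. The class names are hypothetical and the real generated hierarchy may look different. In this setup the declaration in the nearest base class hides the ones further up, so the unqualified call and the qualified call resolve to the same function.

```cpp
#include <string>

// Hypothetical hierarchy: each level declares its own static _narrow,
// differing only in return type.
struct Base {
  virtual ~Base() = default;
  static Base * _narrow(Base * p) { return p; }
};

struct Typed : Base {
  static Typed * _narrow(Base * p) { return dynamic_cast<Typed *>(p); }
};

struct Impl : Typed {
  // Overloads that report which _narrow was selected, based on its return type.
  static std::string describe(Typed *) { return "Typed::_narrow"; }
  static std::string describe(Base *) { return "Base::_narrow"; }

  // Unqualified lookup searches Impl, then Typed, and stops at the first match,
  // so Typed::_narrow hides Base::_narrow.
  static std::string unqualified(Base * p) { return describe(_narrow(p)); }

  // Explicitly qualified call, as in the proposed change.
  static std::string qualified(Base * p) { return describe(Typed::_narrow(p)); }
};
```

Calling Impl::unqualified and Impl::qualified on the same object reports the same function in this sketch. With multiple inheritance, a name found in two unrelated bases would be rejected at compile time as ambiguous rather than resolved silently.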
Regards,
Erik Hendriks.

@Karsten1987
Contributor

@e-hndrks I just did what you proposed. However, it didn't change the output much.

Process 68778 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000012007eb61 librcl_interfaces__rosidl_typesupport_opensplice_c.dylib`rosidl_typesupport_opensplice_cpp::TemplateDataWriter<rcl_interfaces::srv::dds_::Sample_SetParameters_Request_>::write_sample(datawriter=0x0000000105aa32b0, sample=0x00007ffeefbed820) at set_parameters__type_support.cpp:1070:50
   1067	    rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter * typed_datawriter =
   1068	      rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter::_narrow(datawriter);
   1069
-> 1070	    DDS::ReturnCode_t status = typed_datawriter->write(sample, DDS::HANDLE_NIL);
   1071	    switch (status) {
   1072	      case DDS::RETCODE_ERROR:
   1073	        return "rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter.write: "
Target 0: (test_time_source) stopped.
(lldb) p datawriter
(rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter_impl *) $0 = 0x0000000105aa32b0
(lldb) p typed_datawriter
(rcl_interfaces::srv::dds_::Sample_SetParameters_Request_DataWriter *) $1 = 0x0000000000000000

Maybe it's just me wrongly interpreting lldb's output, but I am a bit confused why we had to change things in the opensplice typesupport c++ when the failing library in question is librcl_interfaces__rosidl_typesupport_opensplice_c.dylib

@Karsten1987
Contributor

So basically, if I grep for narrow within the rosidl_typesupport_opensplice repository, I get the following:

~/workspace/ros2/ros2_master/src/ros2/rosidl_typesupport_opensplice $ rg -i narrow
rosidl_typesupport_opensplice_cpp/resource/msg__type_support.cpp.em
226:    @(__dds_msg_type_prefix)DataWriter::_narrow(topic_writer);
335:    @(__dds_msg_type_prefix)DataReader::_narrow(topic_reader);

rosidl_typesupport_opensplice_c/resource/msg__type_support_c.cpp.em
351:    @(__dds_msg_type_prefix)DataWriter::_narrow(topic_writer);
570:    @(__dds_msg_type_prefix)DataReader::_narrow(topic_reader);

rosidl_typesupport_opensplice_cpp/resource/srv__type_support.cpp.em
101:      @(__dds_sample_type_prefix)@(suffix)_DataReader::_narrow(datareader);
196:      @(__dds_sample_type_prefix)@(suffix)_DataWriter::_narrow(datawriter);

@e-hndrks

e-hndrks commented Oct 25, 2019

@Karsten1987 OK, thanks for looking into this. So the issue is not that an ambiguous _narrow function is invoked. That leaves only two other options:

  • The datawriter pointer is no longer pointing to a valid object.
    • Could you print the state of datawriter? (in gdb I would do a "p *datawriter") to check the contents of its attributes and see if that makes sense? And can you post the output of the debugger to this thread?
  • Somehow the Writer was instantiated as a different type than the one it is now being cast to.
    • If the Writer object itself is valid, then we would need to figure out what is the real type of this Writer. Some of its attributes might give us some clue in that case.
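If the object turns out to be valid, one way to find its real type from code is to query its dynamic type with typeid. This is a minimal sketch with hypothetical class names, assuming the writer classes are polymorphic and the pointer refers to a live object; it is not taken from the typesupport sources.

```cpp
#include <string>
#include <typeinfo>

// Hypothetical stand-ins: an untyped base and two concrete writer types.
struct Writer { virtual ~Writer() = default; };
struct RequestWriter : Writer {};
struct ResponseWriter : Writer {};

// True when the object's dynamic type is exactly Expected.
template <typename Expected>
bool has_dynamic_type(const Writer & w) {
  return typeid(w) == typeid(Expected);
}

// Implementation-defined (often mangled) name of the dynamic type, for logging.
std::string dynamic_type_name(const Writer & w) {
  return typeid(w).name();
}
```

Logging dynamic_type_name(*datawriter) just before the cast would show which concrete class the object was created as. The returned string is implementation-defined and usually mangled, so it may need a demangler such as c++filt to read.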

Please let me know the results of your findings.

@clalancette
Contributor

Closing since we no longer support OpenSplice in any active ROS 2 distribution.
