Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

abort() in test_subscription__rmw_cyclonedds_cpp #279

Closed
clalancette opened this issue Feb 1, 2021 · 5 comments
Closed

abort() in test_subscription__rmw_cyclonedds_cpp #279

clalancette opened this issue Feb 1, 2021 · 5 comments

Comments

@clalancette
Copy link
Contributor

We're seeing occasional failures in test_subscription__rmw_cyclonedds_cpp::take_sequence: https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_repeated/1491/testReport/junit/(root)/projectroot/test_subscription__rmw_cyclonedds_cpp/

I finally managed to capture a stack trace, and it looks like this:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffff81ca9d68 in __GI_abort () at abort.c:79
#2  0x0000ffff81af782c in sertype_rmw_realloc_samples (ptrs=0xaaaaca346c10, d=0xaaaaca027250, old=0x0, oldcount=0, count=1)
    at /home/ubuntu/ros2_ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/serdata.cpp:458
#3  0x0000ffff818c7290 in ddsi_sertype_realloc_samples (ptrs=0xaaaaca346c10, tp=0xaaaaca027250, old=0x0, oldcount=0, count=1)
    at /home/ubuntu/ros2_ws/src/eclipse-cyclonedds/cyclonedds/src/security/api/../../core/ddsi/include/dds/ddsi/ddsi_sertype.h:160
#4  0x0000ffff8195959c in dds_read_impl (take=true, reader_or_condition=682686299, buf=0xaaaaca346c10, bufsz=1, maxs=1, si=0xaaaaca3446b0, mask=128, hand=0, 
    lock=true, only_reader=false) at /home/ubuntu/ros2_ws/src/eclipse-cyclonedds/cyclonedds/src/core/ddsc/src/dds_read.c:93
#5  0x0000ffff8195a0d8 in dds_take (rd_or_cnd=682686299, buf=0xaaaaca346c10, si=0xaaaaca3446b0, bufsz=1, maxs=1)
    at /home/ubuntu/ros2_ws/src/eclipse-cyclonedds/cyclonedds/src/core/ddsc/src/dds_read.c:354
#6  0x0000ffff81a8b358 in rmw_take_seq (subscription=0xaaaaca02bdd0, count=1, message_sequence=0xfffff6cfa170, message_info_sequence=0xfffff6cfa190, 
    taken=0xfffff6cfa138) at /home/ubuntu/ros2_ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2660
#7  0x0000ffff81a8ba7c in rmw_take_sequence (subscription=0xaaaaca02bdd0, count=1, message_sequence=0xfffff6cfa170, message_info_sequence=0xfffff6cfa190, 
    taken=0xfffff6cfa138, allocation=0x0) at /home/ubuntu/ros2_ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2780
#8  0x0000ffff821c113c in rmw_take_sequence (v6=0xaaaaca02bdd0, v5=1, v4=0xfffff6cfa170, v3=0xfffff6cfa190, v2=0xfffff6cfa138, v1=0x0)
    at /home/ubuntu/ros2_ws/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:369
#9  0x0000aaaabeda4fcc in TestSubscriptionUse__rmw_cyclonedds_cpp_take_sequence_Test::TestBody (this=0xaaaaca03a150)
    at /home/ubuntu/ros2_ws/src/ros2/rmw_implementation/test_rmw_implementation/test/test_subscription.cpp:538
#10 0x0000aaaabedebaf4 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0xaaaaca03a150, 
    method=&virtual testing::Test::TestBody(), location=0xaaaabee02ee8 "the test body")
    at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:2433
#11 0x0000aaaabede41ac in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0xaaaaca03a150, 
    method=&virtual testing::Test::TestBody(), location=0xaaaabee02ee8 "the test body")
    at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:2469
#12 0x0000aaaabedc19b0 in testing::Test::Run (this=0xaaaaca03a150) at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:2508
#13 0x0000aaaabedc22b0 in testing::TestInfo::Run (this=0xaaaaca011860) at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:2684
#14 0x0000aaaabedc299c in testing::TestSuite::Run (this=0xaaaaca010a50) at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:2816
#15 0x0000aaaabedcd530 in testing::internal::UnitTestImpl::RunAllTests (this=0xaaaaca00f640)
    at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:5338
#16 0x0000aaaabeded0d8 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0xaaaaca00f640, 
    method=(bool (testing::internal::UnitTestImpl::*)(class testing::internal::UnitTestImpl * const)) 0xaaaabedcd154 <testing::internal::UnitTestImpl::RunAllTests()>, location=0xaaaabee039d8 "auxiliary test code (environments or event listeners)")
    at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:2433
#17 0x0000aaaabede5414 in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0xaaaaca00f640, 
    method=(bool (testing::internal::UnitTestImpl::*)(class testing::internal::UnitTestImpl * const)) 0xaaaabedcd154 <testing::internal::UnitTestImpl::RunAllTests()>, location=0xaaaabee039d8 "auxiliary test code (environments or event listeners)")
    at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:2469
#18 0x0000aaaabedcbf98 in testing::UnitTest::Run (this=0xaaaabee3f268 <testing::UnitTest::GetInstance()::instance>)
    at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/./src/gtest.cc:4925
#19 0x0000aaaabedb9964 in RUN_ALL_TESTS () at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:2473
#20 0x0000aaaabedb98cc in main (argc=1, argv=0xfffff6cfab18) at /home/ubuntu/ros2_ws/install/gtest_vendor/src/gtest_vendor/src/gtest_main.cc:45

Digging through the trace, we can see that the real code starts in frame 9, inside the test, calling rmw_take_sequence(). Via a series of other calls, it ends up inside of cyclonedds itself, in dds_read_impl. What happens in there is that if buf[0] == NULL, we end up attempting to ddsi_sertype_realloc_samples. However, rmw_cyclonedds_cpp sets the realloc callbacks to abort() here: https://github.com/ros2/rmw_cyclonedds/blob/master/rmw_cyclonedds_cpp/src/serdata.cpp#L447

Backing up a bit, we can see why this is failing. The test calls rmw_message_sequence_init to initialize the message. The size parameter is 1, so we allocate one void * pointer in there. However, we do not initialize it. This is then void **buf that eventually gets passed into dds_read_impl. In most cases, the sequence of bytes stored in that pointer is not zero, so we never end up trying to call ddsi_sertype_realloc_samples (and hence we don't abort()). However, if that pointer just happens to be NULL, then we will attempt to call the ddsi_sertype_realloc_samples and abort.

This leads to a couple of questions:

  1. Is that take_sequence test correct? That is, should we be initializing the returned data pointers before passing them into rmw_take_sequence?
  2. The use of buf[0] == NULL inside of dds_read_impl seems somewhat fragile. Should that be using a buffer size or something?
  3. Should we really always abort if we try to realloc?

@eboasson looking for any feedback from you on what the correct path forward here is.

@eboasson
Copy link
Collaborator

eboasson commented Feb 2, 2021

This leads to a couple of questions:

Is that take_sequence test correct? That is, should we be initializing the returned data pointers before passing them into rmw_take_sequence?

The original serdata.cpp code never needed to allocate anything, and I think that should still be the case. If I am reading the rmw_message_sequence code correctly, data points to an array of messages, but in rmw_take_seq (at

auto ret = dds_take(sub->enth, message_sequence->data, infos.data(), count, maxsamples);
) it is passed to dds_take that is expecting an array of pointers to messages. That gets one into trouble.

Secondly, I do think the messages themselves ought to be initialised, because I think the deserialiser assumes it is storing into a valid sample. (See

auto & vector = *reinterpret_cast<std::vector<T> *>(field);
or #228). Either my understanding or the code may be in error, of course ...

The use of buf[0] == NULL inside of dds_read_impl seems somewhat fragile. Should that be using a buffer size or something?

Totally fragile. I should've replaced that read/take interface when I had the chance of freely doing so when I discovered it, shortly after it was open-sourced. Now changing those interfaces needs some effort to make the transition as easy as possible for users (other than ROS 2).

It currently takes an array of pointers to samples (it is emulating a std::vector<T*> as it were with a void **). If the first one is a null pointer, it will allocate memory on the fly and store the corresponding addresses in the buf array. It is absolutely horrid.

Should we really always abort if we try to realloc?

There are two lines of reasoning:

  1. We control the inputs and know that it should never call realloc. By definition, if one ever ends up on a path that calls realloc, panicking makes the most sense.
  2. A change in the RMW layer could unwittingly cause it to legitimately end up in realloc, given that there are perfectly valid sequences of operations and inputs that get you there. It is thus better if it all works as advertised.

The abort originally entered the code as a shortcut (no time to waste on unused things), but I think (2) is actually the wiser route.

@eboasson looking for any feedback from you on what the correct path forward here is.

Within the boundaries of the current dds_take interface, there are not that many options. The most sensible alternative is probably to dynamically allocate an array of pointers. It is definitely unattractive and, indeed, against the spirit of the interface, but it should fix this crash.

The only proper solution is to fix the interface of Cyclone. That'll either be a backwards incompatible change (the desire to change this interface is one of the reasons I have been holding the major version at 0) or it'll be a second interface sitting next to it (with the problem of finding a good name). So that route simply takes more time.

My suggestion would be to take the ugly fix in the RMW layer, then fix Cyclone's interface and finally switch to using the fixed interface in the RMW layer.

@clalancette
Copy link
Contributor Author

My suggestion would be to take the ugly fix in the RMW layer, then fix Cyclone's interface and finally switch to using the fixed interface in the RMW layer.

OK, so if I'm understanding properly, you are suggesting that we actually implement sertype_rmw_zero_samples, sertype_rmw_realloc_samples, and sertype_rmw_free_samples in https://github.com/ros2/rmw_cyclonedds/blob/master/rmw_cyclonedds_cpp/src/serdata.cpp. Is that correct?

Additionally, I think we probably want to augment the test in https://github.com/ros2/rmw_implementation/blob/master/test_rmw_implementation/test/test_subscription.cpp#L526 . In particular, I think we should add a case that exercises this exact problem (when the pointer is NULL), and a case where the pointer is properly setup (which is probably the original intent of this test).

@eboasson I'm going to work on the second one (the test) tomorrow. If you have time to work on the first one (implementation of sertype_rmw* APIs), that would be appreciated.

@eboasson
Copy link
Collaborator

eboasson commented Feb 3, 2021

@clalancette

OK, so if I'm understanding properly, you are suggesting that we actually implement sertype_rmw_zero_samples, sertype_rmw_realloc_samples, and sertype_rmw_free_samples in https://github.com/ros2/rmw_cyclonedds/blob/master/rmw_cyclonedds_cpp/src/serdata.cpp. Is that correct?

Eventually yes. But this:

My suggestion would be to take the ugly fix in the RMW layer

was actually meant as a reference to:

[...] The most sensible alternative is probably to dynamically allocate an array of pointers. It is definitely unattractive [...]

If I am not mistaken, the type confusion I mentioned (made oh so easy by the weak type system of C) is the real bug. A functionally correct (but ugly) fix is easy and I'll try to do it today. The most uncertain part of the timing is that I'd like to reliably reproduce it locally to verify my reading of the code and the fix.

@clalancette
Copy link
Contributor Author

clalancette commented Feb 3, 2021

ros2/rmw_implementation#175 should fix the test. I didn't add the nullptr test because all of the RMWs seem to have the same problem. Also, the documentation for rmw_take_sequence says:

 * \pre Given `message_sequence` must be a valid message sequence, initialized
 *   by rmw_message_sequence_init() and populated with ROS messages whose
 *   type matches the message type support registered with the `subscription`
 *   on creation.

So I think that between the fix to rmw_cyclonedds and the fix to test_rmw_implementation, we can consider this issue fixed. I'm going to review the rmw_cyclonedds PR shortly.

@clalancette
Copy link
Contributor Author

Closing this as we fixed it years ago.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants