
Publishing is slow in Docker with MultiThreadedExecutor #1487

Closed
buschbapti opened this issue Dec 8, 2020 · 22 comments · Fixed by #1516
Labels
bug Something isn't working

Comments

@buschbapti

buschbapti commented Dec 8, 2020

I am using ROS 2 Foxy from the official ros2 Docker images to build ros2_control and test its new implementations. I have noticed that when starting a joint state publisher set to publish at 200 Hz, monitoring the topic with ros2 topic hz /joint_states gives at best 30 Hz.

I know this has nothing to do specifically with ros2_control, because I had a similar issue when trying my own lifecycle publisher nodes. Basically, attaching the node to a MultiThreadedExecutor produces similar behavior where the topic is published at a much slower rate than expected, sometimes by a factor of 10. Changing to a SingleThreadedExecutor solves the issue. The problem is that ros2_control relies on the MultiThreadedExecutor.

I believe this comes from the combination of Docker and the MultiThreadedExecutor. I tested it on multiple computers and got similar behavior. I haven't been able to test on a non-Docker installation, as that requires Ubuntu 20.04, which I don't have, but I will try it just in case.

Steps to reproduce:

  • Use a Docker Foxy image
  • Create a LifecycleNode with a publisher and attach it to a MultiThreadedExecutor
  • Monitor the topic with ros2 topic hz

Alternatively, follow the ros2_control_demo instructions on a Docker Foxy image and monitor /joint_states.
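
For reference, a minimal sketch of such a lifecycle node is below. It is not taken from ros2_control; the class name, topic, and 5 ms timer period are made up for illustration, assuming the Foxy rclcpp / rclcpp_lifecycle API.

#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"
#include "rclcpp_lifecycle/lifecycle_publisher.hpp"
#include "std_msgs/msg/string.hpp"

using namespace std::chrono_literals;

// Minimal lifecycle node that publishes at 200 Hz (5 ms period).
class HzTestNode : public rclcpp_lifecycle::LifecycleNode
{
public:
  HzTestNode()
  : rclcpp_lifecycle::LifecycleNode("hz_test_node")
  {
    publisher_ = this->create_publisher<std_msgs::msg::String>("hz_test_topic", 10);
    timer_ = this->create_wall_timer(
      5ms,
      [this]() {
        std_msgs::msg::String msg;
        msg.data = "tick";
        publisher_->publish(msg);
      });
  }

private:
  std::shared_ptr<rclcpp_lifecycle::LifecyclePublisher<std_msgs::msg::String>> publisher_;
  rclcpp::TimerBase::SharedPtr timer_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  auto node = std::make_shared<HzTestNode>();
  node->configure();
  node->activate();  // a lifecycle publisher only publishes while the node is active

  rclcpp::executors::MultiThreadedExecutor executor;
  executor.add_node(node->get_node_base_interface());
  executor.spin();

  rclcpp::shutdown();
  return 0;
}

Monitoring with ros2 topic hz /hz_test_topic, and then switching the executor in main() to a SingleThreadedExecutor, should show the rate difference described above.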

@clalancette
Contributor

So if I understand this bug report right, this is happening with some combination of LifecycleNode, MultiThreadedExecutor, and Docker. You've already swapped in a SingleThreadedExecutor, and that seems to make it work better.

One other test that would be interesting to do is to write your own very simple version of the ros2 hz command in C++. That might tell us if the problem is in the Python side or the C++ side (though the switch from MultiThreadedExecutor to SingleThreadedExecutor suggests it is on the publisher side).
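
As an example, a rough C++ rate checker could look like the sketch below. This is only an illustration (the node and topic names are placeholders, and it simply averages over one-second windows rather than replicating ros2 topic hz exactly).

#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

// Counts messages on a topic and prints the average rate once per second.
class HzChecker : public rclcpp::Node
{
public:
  HzChecker()
  : Node("hz_checker")
  {
    window_start_ = std::chrono::steady_clock::now();
    sub_ = this->create_subscription<std_msgs::msg::String>(
      "topic", 10,
      [this](const std_msgs::msg::String::SharedPtr) {
        ++count_;
        const auto now = std::chrono::steady_clock::now();
        const double elapsed = std::chrono::duration<double>(now - window_start_).count();
        if (elapsed >= 1.0) {
          RCLCPP_INFO(this->get_logger(), "average rate: %.3f", count_ / elapsed);
          count_ = 0;
          window_start_ = now;
        }
      });
  }

private:
  rclcpp::Subscription<std_msgs::msg::String>::SharedPtr sub_;
  std::chrono::steady_clock::time_point window_start_;
  size_t count_ = 0;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<HzChecker>());
  rclcpp::shutdown();
  return 0;
}

If a C++ subscriber like this reports the same low rate as ros2 topic hz, the Python tooling can be ruled out.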

Another test that would be interesting would be to switch from a LifecycleNode to a regular Node. That might also narrow it down a bit.

@buschbapti
Author

That is right. I will run some more tests to narrow it down and post the results here.

@buschbapti
Author

Ok so here is the code for the simple publisher I am using:

#include "std_msgs/msg/string.hpp"

using namespace std::chrono_literals;

/* This example creates a subclass of Node and uses std::bind() to register a
* member function as a callback from the timer. */

class MinimalPublisher : public rclcpp::Node
{
  public:
    MinimalPublisher()
    : Node("minimal_publisher"), count_(0)
    {
      publisher_ = this->create_publisher<std_msgs::msg::String>("topic", 10);
      timer_ = this->create_wall_timer(
      1ms, std::bind(&MinimalPublisher::timer_callback, this));
    }

  private:
    void timer_callback()
    {
      auto message = std_msgs::msg::String();
      message.data = "Hello, world! " + std::to_string(count_++);
      RCLCPP_INFO(this->get_logger(), "Publishing: '%s'", message.data.c_str());
      publisher_->publish(message);
    }
    rclcpp::TimerBase::SharedPtr timer_;
    rclcpp::Publisher<std_msgs::msg::String>::SharedPtr publisher_;
    size_t count_;
  };

  int main(int argc, char * argv[])
  {
    rclcpp::init(argc, argv);

    rclcpp::executors::MultiThreadedExecutor exe;

    std::shared_ptr<MinimalPublisher> pub = std::make_shared<MinimalPublisher>();
    exe.add_node(pub->get_node_base_interface());

    exe.spin();
    rclcpp::shutdown();
    return 0;
  }

Results:

  • SingleThreadedExecutor:
average rate: 1005.051
	min: 0.000s max: 0.002s std dev: 0.00009s window: 1006
average rate: 1002.526
	min: 0.000s max: 0.002s std dev: 0.00008s window: 2006
  • MultiThreadedExecutor:
average rate: 86.153
	min: 0.000s max: 0.099s std dev: 0.01464s window: 350
average rate: 85.509
	min: 0.000s max: 0.099s std dev: 0.01433s window: 434

Sometimes a bit higher (at most around 300 Hz).

Using a plain Node instead of a LifecycleNode doesn't change the issue, so we can remove that from the equation.

@buschbapti
Author

buschbapti commented Dec 8, 2020

Interestingly, I tried the exact same code (a small package built from scratch, depending only on rclcpp) on Eloquent installed directly on my computer (non-dockerized), and I see the exact same behavior:

  • SingleThreadedExecutor
average rate: 1002.446
	min: 0.000s max: 0.001s std dev: 0.00007s window: 1003
average rate: 1001.183
	min: 0.000s max: 0.001s std dev: 0.00006s window: 2003
  • MultiThreadedExecutor
average rate: 104.027
	min: 0.000s max: 0.074s std dev: 0.01382s window: 105
average rate: 109.665
	min: 0.000s max: 0.074s std dev: 0.01189s window: 221

So what I thought was a combination of Docker and the MultiThreadedExecutor can actually be narrowed down to the MultiThreadedExecutor itself.

@buschbapti
Author

Same thing on a fresh Foxy install, and the same behavior with ros2_control as when using the Docker container. So the problem is definitely not coming from Docker.

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Dec 9, 2020
@fujitatomoya
Collaborator

I think the reason is the same as in ros2/ros2#1035 (comment). I tried #1382 to see if it can also fix the problem; it helps, but at this high frequency it cannot fix the problem completely. I think that to address this problem, the same TimerBase object should be taken only once and not be scheduled on other threads.

@iuhilnehc-ynos
Collaborator

I am interested in this issue.

@iuhilnehc-ynos
Collaborator

iuhilnehc-ynos commented Dec 10, 2020

After checking the source code, I found that the mutex wait_mutex_ is overused at

  std::lock_guard<std::mutex> wait_lock(wait_mutex_);
  if (!rclcpp::ok(this->context_) || !spinning.load()) {
    return;
  }
  if (!get_next_executable(any_exec, next_exec_timeout_)) {
    continue;
  }
  if (any_exec.timer) {
    // Guard against multiple threads getting the same timer.
    if (scheduled_timers_.count(any_exec.timer) != 0) {
      // Make sure that any_exec's callback group is reset before
      // the lock is released.
      if (any_exec.callback_group) {
        any_exec.callback_group->can_be_taken_from().store(true);
      }
      continue;
    }
    scheduled_timers_.insert(any_exec.timer);
  }

and again at

  std::lock_guard<std::mutex> wait_lock(wait_mutex_);

I think there is a way to fix this issue. It will break ABI, but maybe it's worth adding a new mutex:

diff --git a/rclcpp/include/rclcpp/executors/multi_threaded_executor.hpp b/rclcpp/include/rclcpp/executors/multi_threaded_executor.hpp
index c18df5b7..8ae3b44d 100644
--- a/rclcpp/include/rclcpp/executors/multi_threaded_executor.hpp
+++ b/rclcpp/include/rclcpp/executors/multi_threaded_executor.hpp
@@ -87,6 +87,7 @@ private:
   std::chrono::nanoseconds next_exec_timeout_;
 
   std::set<TimerBase::SharedPtr> scheduled_timers_;
+  std::mutex scheduled_timers_mutex_;
 };
 
 }  // namespace executors

diff --git a/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp b/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp
index 0dfdc354..a692f7c5 100644
--- a/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp
+++ b/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp
@@ -76,14 +76,17 @@ MultiThreadedExecutor::run(size_t)
   while (rclcpp::ok(this->context_) && spinning.load()) {
     rclcpp::AnyExecutable any_exec;
     {
-      std::lock_guard<std::mutex> wait_lock(wait_mutex_);
-      if (!rclcpp::ok(this->context_) || !spinning.load()) {
-        return;
-      }
-      if (!get_next_executable(any_exec, next_exec_timeout_)) {
-        continue;
+      {
+        std::lock_guard<std::mutex> wait_lock(wait_mutex_);
+        if (!rclcpp::ok(this->context_) || !spinning.load()) {
+          return;
+        }
+        if (!get_next_executable(any_exec, next_exec_timeout_)) {
+          continue;
+        }
       }
       if (any_exec.timer) {
+        std::lock_guard<std::mutex> wait_lock(scheduled_timers_mutex_);
         // Guard against multiple threads getting the same timer.
         if (scheduled_timers_.count(any_exec.timer) != 0) {
           // Make sure that any_exec's callback group is reset before
@@ -103,7 +106,7 @@ MultiThreadedExecutor::run(size_t)
     execute_any_executable(any_exec);
 
     if (any_exec.timer) {
-      std::lock_guard<std::mutex> wait_lock(wait_mutex_);
+      std::lock_guard<std::mutex> wait_lock(scheduled_timers_mutex_);
       auto it = scheduled_timers_.find(any_exec.timer);
       if (it != scheduled_timers_.end()) {
         scheduled_timers_.erase(it);

What do you think about the above patch?

@buschbapti
Author

Just tested on the nightly image of ros2 and I got better results:

average rate: 847.717
	min: 0.000s max: 0.012s std dev: 0.00086s window: 849
average rate: 890.528
	min: 0.000s max: 0.012s std dev: 0.00066s window: 1783

So I guess something has already been improved in a previous commit, yet it is not entirely resolved.

@buschbapti
Author

Confirmed that @iuhilnehc-ynos's solution makes things better:

average rate: 999.501
	min: 0.000s max: 0.005s std dev: 0.00015s window: 10000
average rate: 999.495
	min: 0.000s max: 0.005s std dev: 0.00015s window: 10000

It also resolves the problem with ros2_control_demo. I am not sure what side effects it may have, but at least it does solve this issue.

@fujitatomoya
Collaborator

@iuhilnehc-ynos

Thanks! Yeah, I considered that approach in the past, but I think there will be a multi-threading problem with that fix.

Please see the inline comments:

diff --git a/rclcpp/include/rclcpp/executors/multi_threaded_executor.hpp b/rclcpp/include/rclcpp/executors/multi_threaded_executor.hpp
index c18df5b7..8ae3b44d 100644
--- a/rclcpp/include/rclcpp/executors/multi_threaded_executor.hpp
+++ b/rclcpp/include/rclcpp/executors/multi_threaded_executor.hpp
@@ -87,6 +87,7 @@ private:
   std::chrono::nanoseconds next_exec_timeout_;
 
   std::set<TimerBase::SharedPtr> scheduled_timers_;
+  std::mutex scheduled_timers_mutex_;
 };
 
 }  // namespace executors

diff --git a/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp b/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp
index 0dfdc354..a692f7c5 100644
--- a/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp
+++ b/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp
@@ -76,14 +76,17 @@ MultiThreadedExecutor::run(size_t)
   while (rclcpp::ok(this->context_) && spinning.load()) {
     rclcpp::AnyExecutable any_exec;
     {
-      std::lock_guard<std::mutex> wait_lock(wait_mutex_);
-      if (!rclcpp::ok(this->context_) || !spinning.load()) {
-        return;
-      }
-      if (!get_next_executable(any_exec, next_exec_timeout_)) {
-        continue;
+      {
+        std::lock_guard<std::mutex> wait_lock(wait_mutex_);
+        if (!rclcpp::ok(this->context_) || !spinning.load()) {
+          return;
+        }
+        if (!get_next_executable(any_exec, next_exec_timeout_)) {
+          continue;
+        }
       }
       if (any_exec.timer) {  ##### (1) context switch here, thread-A to thread-B. (thread-A already has the Timer to be executed, thread-B comes in here with the same Timer and go on.)
                              ##### (4) thread-A be back on RUNNING, will execute the same timer twice. (or more if other threads available)
+        std::lock_guard<std::mutex> wait_lock(scheduled_timers_mutex_);
         // Guard against multiple threads getting the same timer.
         if (scheduled_timers_.count(any_exec.timer) != 0) {
           // Make sure that any_exec's callback group is reset before
@@ -103,7 +106,7 @@ MultiThreadedExecutor::run(size_t)
     execute_any_executable(any_exec);  ##### (2) thread-B executes the Timer
 
     if (any_exec.timer) {
-      std::lock_guard<std::mutex> wait_lock(wait_mutex_);
+      std::lock_guard<std::mutex> wait_lock(scheduled_timers_mutex_);
       auto it = scheduled_timers_.find(any_exec.timer);
       if (it != scheduled_timers_.end()) {
         scheduled_timers_.erase(it);  ##### (3) thread-B erases Timer from the list.

I have to admit that this is really rare, but it cannot guarantee that the same Timer will not be executed twice or more. Also, I believe #1168 is almost the same idea; maybe you could take a look. IMO, get_next_executable and scheduled_timers_ should be protected by the same mutex, otherwise there is a possibility that the same timer is executed multiple times. Hope this helps.

@iuhilnehc-ynos
Collaborator

@fujitatomoya

It definitely helped. Thanks.
I am still a bit confused.

thread-A already has the Timer to be executed, thread-B comes in here with the same Timer and go on.

get_next_executable is protected by the mutex wait_mutex_, which makes sure that the any_exec obtained by two threads is different:

std::lock_guard<std::mutex> wait_lock(wait_mutex_);
get_next_executable
    get_next_ready_executable
         get_next_ready_executable_from_map
               MemoryStrategy::get_next_timer
                     set `any_exec`
                     timer_handles_.erase(it);

Anyway, I'll look into the related issues later.

@fujitatomoya
Collaborator

Not at all 😄 I might be missing something, but if the Timer is taken exclusively and not scheduled on other threads in the current implementation, I think it would be okay to go with your fix.

@iuhilnehc-ynos
Collaborator

iuhilnehc-ynos commented Dec 11, 2020

It seems #1168 already did the same thing, so I will not create a new PR for it; maybe we could push #1168 to be merged ASAP.

Updated: It seems scheduled_timers_ and scheduled_timers_mutex_ have a long history. 👀
#1374 (comment)

@ivanpauno
Member

ivanpauno commented Jan 14, 2021

I have no idea of how to fix this, so I will unassign myself (cc @clalancette).

Well, I have ideas of things to try, but the scope of this is much bigger than what I would expect from a randomly assigned issue.
Maybe there's an obvious fix I'm not seeing, but the whole executor code in rclcpp really needs some months of love (not just to fix this issue; there are a lot of performance and thread-safety issues too).
I think that @iuhilnehc-ynos's comment is a good summary of what has happened here.

I will try to explain how I understand things to work and why that's an issue:

My idea would be to limit how "executables" can be scheduled, so that when one worker has scheduled an "executable" for execution, no other worker can take it as "ready". That completely forbids the case of wanting to execute a callback of the same executable in parallel (e.g. two messages of the same subscription), but I guess forbidding that is fine, as allowing it can potentially lead to "out of order" execution.
Anyway, that idea is also not simple to implement, and I'm not completely sure it would solve the issue.
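
As a purely conceptual illustration of that idea (this is not rclcpp code; the names Executable, try_claim, and release are made up), each ready entity could carry a claimed flag that a worker sets atomically before executing it, so no other worker can take the same entity until it is released:

#include <atomic>
#include <cstdio>

// Hypothetical stand-in for a timer/subscription handle tracked by the executor.
struct Executable
{
  std::atomic<bool> claimed{false};
};

// A worker only gets the executable if no other worker currently holds it.
bool try_claim(Executable & e)
{
  bool expected = false;
  return e.claimed.compare_exchange_strong(expected, true);
}

void release(Executable & e)
{
  e.claimed.store(false);
}

int main()
{
  Executable timer_exec;
  if (try_claim(timer_exec)) {
    std::printf("worker A executes the timer\n");
    // A second worker arriving here would see claimed == true and skip the entity.
    release(timer_exec);
  }
  return 0;
}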


The single-threaded executor also has its own problems.

@ivanpauno
Member

I now remembered #1328, which I opened a while ago.
That's sort of a hack, but maybe it fixes this issue, so I will give it a try before unassigning myself.

@ivanpauno
Member

#1328 effectively fixes this issue.
It's not an ideal workaround, but I guess that's fine for the moment.

@wjwwood
Member

wjwwood commented Jan 14, 2021

That completely forbids the case of wanting to execute a callback of the same executable in parallel (e.g. two messages of same subscription), but I guess forbidding that is fine as that can potentially led to "out of order" execution.

I don't think that's ok to forbid. It's completely reasonable to want to execute the same callback multiple times in parallel, less so for timers, but definitely so for subscriptions. Even for timers, I think it's a reasonable thing to want to do, though I don't know of any cases where that's what you'd want. Perhaps you're using a timer to poll for work to be done, and so sometimes the callback is quick (maybe just printing information) and occasionally it takes longer, even longer than the period of the timer. In that scenario you'd like to be able to have callbacks be concurrent so that while you're processing the work in the infrequent case, you can still be responsive and print status messages in the meantime from other instances of the timer callbacks.
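
For illustration, the polling scenario described above might look roughly like the sketch below. The node name, period, and simulated workload are made up; whether two instances of the callback actually overlap is exactly the executor behavior under discussion, and with a Reentrant callback group that overlap is at least intended to be allowed.

#include <atomic>
#include <chrono>
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("polling_node");

  // Reentrant group: the MultiThreadedExecutor may run its callbacks in parallel.
  auto group = node->create_callback_group(rclcpp::CallbackGroupType::Reentrant);

  std::atomic<size_t> tick{0};
  auto timer = node->create_wall_timer(
    100ms,
    [&node, &tick]() {
      const size_t n = tick++;
      RCLCPP_INFO(node->get_logger(), "poll %zu", n);
      if (n % 10 == 0) {
        // Occasionally the polled work takes longer than the timer period.
        std::this_thread::sleep_for(500ms);
      }
    },
    group);

  rclcpp::executors::MultiThreadedExecutor executor;
  executor.add_node(node);
  executor.spin();
  rclcpp::shutdown();
  return 0;
}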

It might make sense to make the trade-off in #1328, since the use cases it fixes are likely more common, but it definitely doesn't address the issue correctly.


#1168 tries to relax that problem by using two mutexes, but that reintroduces the race described in #621.

I suppose that's the case. I vaguely remember stalling that pull request with Pedro for that reason. I still think something in that direction could be the right and correct solution.

@fujitatomoya
Collaborator

@buschbapti

Would you mind trying @ivanpauno's fix #1516 to check the performance? That would be really appreciated.

@buschbapti
Author

It definitely improves things, but the rate is inconsistent. Here are some results:

average rate: 940.311
	min: 0.000s max: 0.008s std dev: 0.00041s window: 4704
average rate: 945.242
	min: 0.000s max: 0.008s std dev: 0.00040s window: 5675
average rate: 944.665
	min: 0.000s max: 0.008s std dev: 0.00039s window: 6617
average rate: 942.920
	min: 0.000s max: 0.008s std dev: 0.00039s window: 7548
average rate: 939.133
	min: 0.000s max: 0.008s std dev: 0.00041s window: 8458
average rate: 936.163
	min: 0.000s max: 0.008s std dev: 0.00041s window: 9368
average rate: 935.432
	min: 0.000s max: 0.008s std dev: 0.00041s window: 10000
average rate: 937.177
	min: 0.000s max: 0.008s std dev: 0.00041s window: 10000
average rate: 937.314
	min: 0.000s max: 0.012s std dev: 0.00042s window: 10000
average rate: 937.874
	min: 0.000s max: 0.012s std dev: 0.00043s window: 10000
average rate: 933.531
	min: 0.000s max: 0.012s std dev: 0.00044s window: 10000
average rate: 927.938
	min: 0.000s max: 0.012s std dev: 0.00045s window: 10000
average rate: 923.789
	min: 0.000s max: 0.012s std dev: 0.00046s window: 10000
average rate: 919.730
	min: 0.000s max: 0.012s std dev: 0.00047s window: 10000
average rate: 919.797
	min: 0.000s max: 0.012s std dev: 0.00048s window: 10000
average rate: 926.284
	min: 0.000s max: 0.012s std dev: 0.00046s window: 10000
average rate: 925.490
	min: 0.000s max: 0.012s std dev: 0.00046s window: 10000
average rate: 929.108
	min: 0.000s max: 0.012s std dev: 0.00046s window: 10000
average rate: 931.791
	min: 0.000s max: 0.012s std dev: 0.00046s window: 10000

The expected rate is 1 kHz, which the SingleThreadedExecutor can follow perfectly fine. As the results above show, @iuhilnehc-ynos's solution tracks it more accurately.

@fujitatomoya
Collaborator

@buschbapti thanks for checking, that really helps. According to #1487 (comment), it does have the multi-threading problem I mentioned, so I'd like to go with @ivanpauno's fix at this moment.

@buschbapti
Author

Perfect, at least it gives some pretty decent improvements and makes things usable even at these frequencies. Thanks for taking care of this, and thanks @ivanpauno for the solution.
