Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory leak #1034

Open
obi-t4 opened this issue Nov 1, 2022 · 6 comments
Open

memory leak #1034

obi-t4 opened this issue Nov 1, 2022 · 6 comments
Assignees

Comments

@obi-t4
Copy link

obi-t4 commented Nov 1, 2022

Bug report

Required Info:

  • Operating System:
    • Ubuntu20.04
  • Installation type:
    • binary and source, either is OK
  • Version or commit hash:
    • HEAD of galactic branch
  • DDS implementation:
    • rmw_cyclonedds_cpp (ros2 doctor --report says)
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

The following code is almost same as examples/rclpy/services/minimal_client/examples_rclpy_minimal_client/client.py

In this modification, main is called in the while loop and memory usage is observed.
Obviously everything is terminated cleanly, but the rss increases monotonically.

rosbridge is facing this memory leak issue.
RobotWebTools/rosbridge_suite#774
RobotWebTools/rosbridge_suite#772

from example_interfaces.srv import AddTwoInts

import rclpy


def main(args=None):
    rclpy.init(args=args)
    node = rclpy.create_node('minimal_client')
    cli = node.create_client(AddTwoInts, 'add_two_ints')

    req = AddTwoInts.Request()
    req.a = 41
    req.b = 1
    while not cli.wait_for_service(timeout_sec=1.0):
        node.get_logger().info('service not available, waiting again...')

    future = cli.call_async(req)
    rclpy.spin_until_future_complete(node, future)

    result = future.result()
    node.get_logger().info(
        'Result of add_two_ints: for %d + %d = %d' %
        (req.a, req.b, result.sum))

    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    import time
    import psutil

    while True:
        main()
        rss = psutil.Process().memory_info().rss
        print(rss)
        time.sleep(0.5)

Expected behavior

rss (=usage of memory) should not be increased.

Actual behavior

rss (=usage of memory) is increased.

Additional information

The following code also reproduces this issue.
So this is not only for service call problem.

def main(args=None):
    rclpy.init(args=args)
    node = rclpy.create_node('minimal_client')
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    import time
    import psutil

    while True:
        main()
        rss = psutil.Process().memory_info().rss
        print(rss)
        time.sleep(0.5)
@clalancette
Copy link
Contributor

A couple of observations about this example:

  1. rclpy.init and rclpy.shutdown aren't really meant to be run in a loop like that. I mean, you can do it, but it isn't totally expected. So I would pull those out of the loop and put them in the top-level.
  2. Because this is in a garbage collected language, you can't completely take the RSS as-is. At the very least, you need to call gc.collect during the loop there to attempt to make Python collect as much as it can.

All of that said, I tried this out on the latest Rolling with those changes, and it still looks like there is some leaking going on during the create_node/destroy_node pair. In a brief look with valgrind nothing obvious popped up, but there may be something deeper going on.

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Nov 1, 2022
  ros2/rclpy#1034

Signed-off-by: Tomoya Fujita <[email protected]>
@fujitatomoya
Copy link
Collaborator

Okay, this actually happens on current rolling. as far as i checked, this is RMW agnostic problem since we can make it happen with Fast-DDS and Cyclonedds.

Just FYI, similar problem for rclcpp.

@fujitatomoya
Copy link
Collaborator

CC: @iuhilnehc-ynos @Barry-Xu-2018

@iuhilnehc-ynos
Copy link
Contributor

iuhilnehc-ynos commented Nov 28, 2022

I don't think each call with "create_node" and "destroy_node" in the loop should get the same RSS(RES) value at the beginning. What I expect is it can get a steady value for RSS after a while.

I tested with fujitatomoya/ros2_test_prover@cdd3397, and found that using rmw_fastrtps_cpp can keep a steady value after a few minutes (Unfortunately, there is a tiny limitation about re-creating a participant, which is another topic.).

No matter how long I run the test with rmw_cyclonedds_cpp, the RSS value is always increased.
After applying the following two patches for rmw_cyclonedds_cpp and cyclonedds, the RSS used in the test can keep a steady value after a few minutes too.

I think this issue should be separately reported in the rmw_cyclonedds_cpp and cyclonedds, just a workaround patch below

  • rmw_cyclonedds_cpp
diff --git a/rmw_cyclonedds_cpp/src/rmw_node.cpp b/rmw_cyclonedds_cpp/src/rmw_node.cpp
index 9b712ac..781e22a 100644
--- a/rmw_cyclonedds_cpp/src/rmw_node.cpp
+++ b/rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -1322,6 +1322,17 @@ void
 rmw_context_impl_t::clean_up()
 {
   discovery_thread_stop(common);
+  dds_guid_t guid;
+  rmw_gid_t gid;
+  auto sub = static_cast<CddsSubscription *>(common.sub->data);
+  dds_get_guid (sub->enth, &guid);
+  convert_guid_to_gid(guid, gid);
+  common.graph_cache.remove_entity(gid, true);
+  auto pub = static_cast<CddsPublisher *>(common.pub->data);
+  dds_get_guid (pub->enth, &guid);
+  convert_guid_to_gid(guid, gid);
+  common.graph_cache.remove_entity(gid, false);
+  common.graph_cache.remove_participant(common.gid);
   common.graph_cache.clear_on_change_callback();
   if (common.graph_guard_condition) {
     destroy_guard_condition(common.graph_guard_condition);
  • cyclonedds
diff --git a/src/core/ddsi/src/q_thread.c b/src/core/ddsi/src/q_thread.c
index 6de01fa1..f49b1885 100644
--- a/src/core/ddsi/src/q_thread.c
+++ b/src/core/ddsi/src/q_thread.c
@@ -329,12 +329,14 @@ dds_return_t create_thread (struct thread_state1 **ts1, const struct ddsi_domain
 
 static void reap_thread_state (struct thread_state1 *ts1, bool in_thread_states_fini)
 {
+  bool lazily_created_thread = false;
   ddsrt_mutex_lock (&thread_states.lock);
   switch (ts1->state)
   {
+    case THREAD_STATE_LAZILY_CREATED:
+      lazily_created_thread = true;
     case THREAD_STATE_INIT:
     case THREAD_STATE_STOPPED:
-    case THREAD_STATE_LAZILY_CREATED:
       ts1->state = THREAD_STATE_ZERO;
       break;
     case THREAD_STATE_ZERO:
@@ -360,6 +362,9 @@ static void reap_thread_state (struct thread_state1 *ts1, bool in_thread_states_
       assert (0);
   }
   ddsrt_mutex_unlock (&thread_states.lock);
+  if (in_thread_states_fini && lazily_created_thread) {
+    ddsrt_thread_cleanup_pop(1);
+  }
 }
 
 dds_return_t join_thread (struct thread_state1 *ts1)

NOTE: we can decrease the waiting time to time.sleep(0.1) in the loop to get a steady RSS value quickly.

@fujitatomoya
Copy link
Collaborator

I tested with fujitatomoya/ros2_test_prover@cdd3397, and found that using rmw_fastrtps_cpp can keep a steady value after a few minutes

confirmed, i guess i was impatient to wait ... sorry.

I think this issue should be separately reported in the rmw_cyclonedds_cpp and cyclonedds, just a workaround patch below

removing itself from graph cache entity makes sense. i think this is just a bug. but i am not sure about cyclonedds fix.

@eboasson would you mind taking a look at above patch?

@iuhilnehc-ynos can you create PR after the confirmation?

@eboasson
Copy link

eboasson commented Dec 2, 2022

I tested with fujitatomoya/ros2_test_prover@cdd3397, and found that using rmw_fastrtps_cpp can keep a steady value after a few minutes

confirmed, i guess i was impatient to wait ... sorry.

I think this issue should be separately reported in the rmw_cyclonedds_cpp and cyclonedds, just a workaround patch below

removing itself from graph cache entity makes sense. i think this is just a bug. but i am not sure about cyclonedds fix.

That was indeed simply a missing "remove", all the others are balanced (from a quick check of the code), so the RMW fix is I believe entirely correct.

@eboasson would you mind taking a look at above patch?

I do get it and I see why this works, but I need to spend some more mental effort on it before declaring it correct 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants