Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nodes spins rather than blocking and waiting, using 100% CPU [melodic] #1651

Closed
wants to merge 9 commits into from

Conversation

meyerj
Copy link
Contributor

@meyerj meyerj commented Mar 12, 2019

Rebase of #1557, a proposed fix for #1343.

From #1557 (comment) (@dirk-thomas):

Please retarget the melodic-devel branch with this PR. In general patches are only accepted to the default branch (except if a patch is only necessary for older distros).

The patch also applies to melodic-devel with some minor merge conflict resolutions. I tested this branch with C++14 enabled on Ubuntu 16.04 with Boost 1.58. All unit tests in test_roscpp still pass. I have not tested the patch in Ubuntu 18.04 or other distros yet, neither whether it actually fixes the original issue causing spinning at 100% CPU after the rebase.

I am not sure whether this patch should be applied to melodic-devel at all. If melodic only targets distributions which normally have at least Boost 1.62 according to REP-3, the backported condition_variable implementation from 1.61 is not used at all and the respective cmake and preprocessor checks can be removed. Some changes could still be merged, like the removal of the unnecessary specialization of TimerManager<SteadyTime, WallDuration, SteadyTimerEvent>::threadFunc() in steady_timer.cpp.

Edit: According to #1651 (comment) the patch has a positive effect on ROS timers to deal with system time changes (#1558). The reason is probably that BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC is defined unconditionally for roscpp as a whole now, while before it was only set in certain compilation units (but not in timer.cpp).

@dirk-thomas
Copy link
Member

@ros-pull-request-builder retest this please

@cwecht
Copy link
Contributor

cwecht commented Apr 5, 2019

This fixes #1558

@meyerj
Copy link
Contributor Author

meyerj commented Apr 5, 2019

@dirk-thomas I am not sure, but I think the remaining unit test failures in http://build.ros.org/job/Mpr_ds__ros_comm__debian_stretch_amd64/669/ for package test_rosbag are unrelated?

On the other hand I can reproduce the same test failures in test_rosbag locally in Ubuntu xenial using this branch. The bag file recorded during the test seems to be corrupted:

$ rostopic echo -b /tmp/test_rosbag_random_record_54321.bag /cwk
ERROR: unable to use bag file: Error reading header: expecting 2373722880 bytes, read 555313

But I do not see how this should be connected to the patch. There have been quite some patches to rosbag and rosbag_storage recently on melodic-devel.

For my local test run, it looks like two instances of rostest-test_random_record.xml are running in parallel (why two?) and both write to the same output bag at /tmp/test_rosbag_random_record_54321.bag.

If run the unit tests single-threaded with -j1 the bag file is valid and the test succeeds:

$ rostopic echo -b /tmp/test_rosbag_random_record_54321.bag /cwk --noarr -n1
wvgobncy: 
  secs: 823
  nsecs: 396963481
qismd: "<array type: int32[5, length: 5>"
kbst: "ufkpvlpzssxikhwizrwkziymonapxybvgsvgetotedmbxjbspnuussyfgqxaysqxmwrukepefvuknfgzylefanthqebficaqeiyo"
suqfeqcipp: "<array type: ndtknku/iqmkmgipxf, length: 73>"
---

Does the build farm strictly serialize unit tests or can different tests run in parallel?

@cwecht
Copy link
Contributor

cwecht commented Apr 11, 2019

It seems, that this is an issue of cmake: random_play and random_play_sim are depending both on random_record. Because add_rostest uses custom targets, this issue may kick in: the random_record test is executed twice. Both invocations of the test may run in parallel, which leads to two processes writing to the same file.

The above mentioned mailing suggest to serialize the two targets, which depend on the same target.
@meyerj you could try to add

add_dependencies(run_tests_test_rosbag_rostest_test_random_play.xml run_tests_test_rosbag_rostest_test_random_play_sim.xml)

to test_rosbag/bag_migration_tests/CMakeLists.txt. I have not been able to reproduce this failure yet.

Anyway, adding this dependency might also fix occasional failures of the random_play tests, which have been found in #1661.

meyerj added a commit to meyerj/ros_comm that referenced this pull request Apr 12, 2019
@meyerj
Copy link
Contributor Author

meyerj commented Apr 12, 2019

@meyerj you could try to add

add_dependencies(run_tests_test_rosbag_rostest_test_random_play.xml run_tests_test_rosbag_rostest_test_random_play_sim.xml)

to test_rosbag/bag_migration_tests/CMakeLists.txt. I have not been able to reproduce this failure yet.

Done: 76710de

@cwecht
Copy link
Contributor

cwecht commented Apr 12, 2019

@ros-pull-request-builder retest this please

Copy link
Contributor

@flixr flixr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@flixr
Copy link
Contributor

flixr commented May 11, 2019

@dirk-thomas friendly ping..
This PR definitely improves the situation and is cleaner, IMHO it is ready to merge. Tested under melodic.

@willdzeng
Copy link

Seems like the fixes are still open? any plan to check them in? I encountered the same issue in our ROS system. Would be nice to have them in.

@matthew-reynolds
Copy link

matthew-reynolds commented Jun 15, 2019

On my embedded linux system (custom kernel), when running an empty roscore w/ its single rosout node without this patch, rosout used ~75% of the CPU. With this patch, the usage is down to between 0 and 1%. Clearly, a significant improvement. For reference, this system uses Boost 1.63.

Obviously, this isn't a supported ROS platform or anything, I'm just sharing further anecdotal evidence that this patch has significant improvements to performance even with Boost > 1.62.

Looking forward to having these changes merged in.

@matthew-reynolds
Copy link

@dirk-thomas Is there anything outstanding that you'd like to be done before this PR can be merged?

@flixr
Copy link
Contributor

flixr commented Nov 12, 2019

@dirk-thomas friendly ping again... could we please get this merged?

@umaplehurst
Copy link

@dirk-thomas another gentle ping request to ask if this can be merged ... I'm also seeing the same issue here (w/Boost 1.58) and this pull request fixes everything nicely.

@ahoarau
Copy link
Contributor

ahoarau commented Jan 31, 2020

This is fixing issues on my systems as well. +1 for merging

@peci1
Copy link
Contributor

peci1 commented Feb 5, 2020

Problem is worst with this patch + clang++-6 + libboost 1.61 (Ubuntu 18.04 defaults).

Do you mean compiling roscpp + your stuff with clang, or compiling roscpp with default settings (i.e. gcc) and your stuff with clang?

@ahoarau
Copy link
Contributor

ahoarau commented Feb 5, 2020

Do you mean compiling roscpp + your stuff with clang, or compiling roscpp with default settings (i.e. gcc) and your stuff with clang?

compiling roscpp + my stuff with clang

dirk-thomas added a commit that referenced this pull request Feb 5, 2020
meyerj and others added 9 commits February 5, 2020 09:11
…tion_variable (fix ros#1343)

ros#1014 and ros#1250 introduced a backported
version of boost::condition_variable, where support for steady (monotonic) clocks has been added in version 1.61.
But the namespace of the backported version was not changed and the symbol names might clash with the original
version.

Because the underlying clock used for the condition_variable is set in the constructor and must be
consistent with the the expectations within member variables. The compiler might choose to inline one or the
other or both, and is more likely to do so for optimized Release builds. But if it does not, the symbol ends
up in the symbol table of roscpp and depending on which other libraries will be linked into the process it
is unpredictable which of the two versions will be actually called at the end. In case the constructor defined
in `/usr/include/boost/thread/pthread/condition_variable.hpp` was called and did not configure the internal
pthread condition variable for monotonic clock, each call to the backported do_wait_until() method with a
monotonic timestamp will return immediately and hence causes `CallbackQueue::callOne(timeout)` or
`CallbackQueue::callAvailable(timeout)` to return immediately.

This patch changes the namespace of the backported condition_variable implementation to boost_161. This
removes the ambiguity with the original definition if both are used in the same process.
…ed timed_wait()

This fixes ROS timers in combination with 2c18b9f. The timer
callbacks were not called because the TimerManager's thread function blocked indefinitely on
boost::condition_variable::timed_wait().

Relative timed_wait() uses the system clock (boost::get_system_time()) unconditionally to
calculate the absolute timestamp for do_wait_until(). If the condition variable has been
initialized with BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC, it compares this timestamp
with the monotonic clock and therefore blocks.

This issue has been reported in https://svn.boost.org/trac10/ticket/12728 and will not be
fixed. The timed_wait interface is apparently deprecated.
…eadFunc() in steady_timer.cpp

The updated generic definition in timer_manager.h should do the same with a minor update.
In all cases we can call boost::condition_variable::wait_until() with an absolute time_point of the respective clock.
The conversion from system_clock to steady_clock for Time and WallTime is done internally in
boost::condition_variable::wait_until(lock_type& lock, const chrono::time_point<Clock, Duration>& t).
find_package(Boost) has to come before checking the Boost version.
Otherwise BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC was not defined which
triggered the assertion in timer_manager.h:240.

Since Boost 1.67 BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC became the default
if the platform supports it and the macro is not defined anymore. Instead, check
for BOOST_THREAD_INTERNAL_CLOCK_IS_MONO.
…ITION_VARIABLE_HEADER macros by a typedef in internal_condition_variable.h
@dirk-thomas dirk-thomas force-pushed the fix-1343-melodic-devel branch from 05f86d9 to b9e8f3a Compare February 5, 2020 17:11
@dirk-thomas
Copy link
Member

I separated the test_rosbag test dependency into its own PR (#1877) and rebased this patch on top of the latest state of the target branch.

@dirk-thomas
Copy link
Member

While this might not be a blocker to get this important fix merged into Melodic I would still like to consider the effect of this PR to API/ABI compatibility. Can someone comment on the ABI compatibility of the changed condition variable type in the various headers. Is this expected to be ABI compatible? If yes or no, for which Boost versions?

@meyerj
Copy link
Contributor Author

meyerj commented Feb 5, 2020

Problem is worst with this patch + clang++-6 + libboost 1.61 (Ubuntu 18.04 defaults).
@meyerj Did you find anything similar with clang ?

We are typically building downstream packages which depend on roscpp using clang, but not roscpp itself. Apart from that, we are using this patch in production since several months, also on Ubuntu 18.04. But the default Boost version is 1.65 on 18.04.

I did a quick test with roscpp compiled from this branch, with clang version 6.0.0-1ubuntu2 and Boost 1.65, and ran the unit tests and some ros_tutorials nodes. No issues so far at first glance. rosout does not consume 100% CPU, while with the exact same config and ros_comm from the latest melodic-devel branch (d487d58) it does.

Note that with Boost 1.65 (>= 1.61) roscpp does not select the alternative implementation in namespace boost_161. The only change that actually matters is the setting of BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC for all compilation units in roscpp and the consistent use of boost::condition_variable instances from source files exclusively. The latter implies that...

  • all constructors of classes with boost::condition_variable members are defined in a roscpp source file (and not in a header or implicitly by the compiler)
  • all usages of wait_for() and wait_until() must be in roscpp source files and not exposed to users
  • no usage of the deprecated timed_wait() member function, which is not compatible to BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC

So instead of changing the namespace and selecting which implementation of boost::condition_variable to select depending on the system Boost version it would be straight-forward to drop the alternative implementation as a whole, as long as all target platforms have Boost 1.61 or above. I only kept it like this because it is required for Kinetic on Ubuntu 16.04 (#1557), and I would still hope that the patch will be backported before Kinetic's EOL.

@dirk-thomas
Copy link
Member

So instead of changing the namespace and selecting which implementation of boost::condition_variable to select depending on the system Boost version it would be straight-forward to drop the alternative implementation as a whole, as long as all target platforms have Boost 1.61 or above.

REP 3 specifies the targeted platforms and for Melodic that is:

  • Ubuntu Artful (which is already EOL and can therefore be ignored),
  • Ubuntu Bionic (coming with Boost 1.65.1),
  • Debian Stretch (coming with Boost 1.62), and
  • Fedora 28 (which we don't build binaries for and can therefore also ignore).

Since that means all relevant platforms have Boost 1.61 or above we should do exactly that for the melodic-devel branch. Can you (or someone) please create a pull request with that proposed change?

The question still stands: does that change pose an ABI break or not? If it does the next question would be is it possible to fix the problem in an ABI compatible way?

I would still hope that the patch will be backported before Kinetic's EOL.

My main focus is to get the current default branch `melodic-devel into a releasable state and release a new version for Melodic (which will also be the base for forking for Noetic). Only after that I will try to look at what changes should be backport (which can be completely different patches like in this case) to Kinetic.

@meyerj
Copy link
Contributor Author

meyerj commented Feb 5, 2020

The question still stands: does that change pose an ABI break or not? If it does the next question would be is it possible to fix the problem in an ABI compatible way?

On Ubuntu 18.04 with gcc 7.4.0-1ubuntu1~18.04.1

This is the result of running the abi-compliance-checker to compare both versions: compat_report.html

The renaming of the boost::condition_variable members to roscpp::internal::condition_variable is not problematic because it is just a typedef for the same type and the memory layout did not change. The same should be true for all platforms which have Boost 1.61 or newer.

Other platforms with Boost 1.60 or lower

Probably not an issue for melodic-devel then, but in this case the ABI breaks *intentionally because libroscpp.so does not provide boost::condition_variable symbols anymore, but uses its own symbols in namespace boost_161 instead. Because Boost defines all of them in headers, every library that requires those symbols must also provide them as weak symbols (at least in Linux). That exactly was the problem, that it is not predictable which of these conflicting definitions will actually be selected by the dynamic linker at run-time (ODR violation). So it is actually the intention to break the ABI in this case to fix the undefined behavior.

@cwecht
Copy link
Contributor

cwecht commented Feb 5, 2020

boost::condition_variable is not directly part of the public interface of roscpp. It is a private member of some classes. That would be a problem, if 'boost:condition_variable's memory layout had changed, which it didn't. The only thing which changes are implementations of a view functions and other functions have been added to boost:condition_variable' s interface. There might be a problem if these functions are exported as symbols by roscpp. If a client uses the old 'boost:condition_variable' implementations this might lead to an ODR-violation, which can result in strange behviour/bugs. Users which have ABI-stability requirement usually (I suppose) use the release-build of roscpp, where most of boost::condition_variable functions are probably inlines, but it might still be possible that they are exported anyway. A quick look at the currently released libroscpp.so reveals that in fact boost::condition_variable-symbols are exported but none of them which might be a problem.
If a user binaries a build in Release mode this problem gets even more unlikely.

After all ABI-stability is a tricky thing. From my experience I'd say that this change is probably (likely) safe for users of the released binaries (If a user build roscpp in Debug mode , that's a whole different story). Inline-functions are ODR-violation-prone. There might be edge-cases where this change breaks a user, but I don't think that they are likly to occure.

But note, that this actually change might actually have an impact on the behaviour of users, which rebuild their code with this changes if they are using boost::condition_variable.

@meyerj
Copy link
Contributor Author

meyerj commented Feb 5, 2020

Since that means all relevant platforms have Boost 1.61 or above we should do exactly that for the melodic-devel branch. Can you (or someone) please create a pull request with that proposed change?

Done: #1878

@ahoarau
Copy link
Contributor

ahoarau commented Feb 5, 2020

Yes I mispelled : it was boost 1.65.1 👍

@dirk-thomas
Copy link
Member

This is now a duplicate of #1878 but I am keeping it open since it might serve as a template for a Kinetic-specific patch.

@meyerj
Copy link
Contributor Author

meyerj commented Jul 31, 2020

This is now a duplicate of #1878 but I am keeping it open since it might serve as a template for a Kinetic-specific patch.

This PR is obsolete now with #1878 and #1932. My original PR targeting kinetic-devel was #1557, but if you consider to backport at all, then the newer approach from #1932 should be applied, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants