Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nodes spins rather than blocking and waiting, using 100% CPU [kinetic] #1557

Closed
wants to merge 12 commits into from

Conversation

meyerj
Copy link
Contributor

@meyerj meyerj commented Dec 6, 2018

Fixes #1343, and other potential issues. Most likely SteadyTimer never has been steady.

As commented in the original issue, there is no guarantee that the compiler will inline the calls to boost::condition_variable constructor, boost::condition_variable::timed_wait() and boost::condition_variable::wait_for() and it might or might not use the backported version from Boost 1.61 introduced in #1014. The namespace has not been changed and even within roscpp some compilation units used the patched and some used the unpatched implementation (by including or not including boost_161_condition_variable.h). If those methods are mixed, because one gets inlined and the other not, or because the same instance is used in different compilation units with different includes and preprocessor macros defined, the underlying pthread condition variable gets initialized with CLOCK_MONOTONIC but is called with system clock timestamps or vice-versa. This situation ultimately leads to the 100% busy-wait problem in CallbackQueue::callAvailable() as reported in #1343, if the CallbackQueue constructor for whatever reason calls the boost::condition_variable constructor from a compilation unit where BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC was not defined.

The first patch resolves this link ambiguity by changing the namespace of the included condition_variable and condition_variable_any implementations to boost_161. New preprocessor macros ROSCPP_BOOST_CONDITION_VARIABLE_HEADER and ROSCPP_BOOST_CONDITION_VARIABLE have been introduced and exposed in ros/common.h to select the implementation when roscpp is compiled. Because the memory layout does not change, I would carefully claim that the patch is fully ABI compatible. I copied libroscpp.so compiled from a patched workspace to /opt/ros/kinetic/lib and used it as a drop-in replacement over the past days, without having to recompile any other package installed from debians.

The second patch, 56ecdfa, fixes an issue with ROS timers that popped up in combination with the first. The boost::condition_variable::timed_wait() methods are considered deprecated and uses boost::get_system_time() (system clock) unconditionally to calculate the absolute timestamp here - as called by TimerManager::threadFunc(). If the condition variable would have been constructed as expected, with the implementation from Boost 1.61 or above and BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC defined, this call blocks the thread almost indefinitely and breaks timers (failing unit tests in roscpp_tests) - the inverse problem. It has already been reported here and marked as wontfix. The solution is to switch to the Boost Chrono interface as already done in #1250 for ros::CallbackQueue. I have no good explanation why this problem has not been triggered before. It only makes sense if the backported constructor of boost::condition_variable would have never been called because of the order of source files in the roscpp library and hence SteadyTimer never has been fully steady (#1014). Any better ideas?

Last but not least, fdca164 applies the patch from #1464 to the other timer types and adds WallTimer::hasStarted() and SteadyTimer::hasStarted(), just for completeness and to match the three interfaces except for the time, duration and event types. (replaced by #1565)

To be absolutely certain that TimerManager<T, D, E> will never be instantiated outside of a compilation unit that defines BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC (see timer_manager.h:215), I suggest to move the implementation of the constructor and member functions to a source file timer_manager.cpp with explicit template instantiation for the three timer types. I tested this approach to exclude that TimerManager<T, D, E> and the internal condition variable is constructed elsewhere, without the definition of BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC. But at the end I reverted that patch because this was not the case. An alternative, but more intrusive solution would be to move the timer_manager.h header to the src folder and to not expose it at all.

Sorry for targeting kinetic-devel although it should probably be melodic-devel, but this is the version I have been working on. In general, the patch applies to all versions since kinetic (not tested yet).

…tion_variable (fix ros#1343)

ros#1014 and ros#1250 introduced a backported
version of boost::condition_variable, where support for steady (monotonic) clocks has been added in version 1.61.
But the namespace of the backported version was not changed and the symbol names might clash with the original
version.

Because the underlying clock used for the condition_variable is set in the constructor and must be
consistent with the the expectations within member variables. The compiler might choose to inline one or the
other or both, and is more likely to do so for optimized Release builds. But if it does not, the symbol ends
up in the symbol table of roscpp and depending on which other libraries will be linked into the process it
is unpredictable which of the two versions will be actually called at the end. In case the constructor defined
in `/usr/include/boost/thread/pthread/condition_variable.hpp` was called and did not configure the internal
pthread condition variable for monotonic clock, each call to the backported do_wait_until() method with a
monotonic timestamp will return immediately and hence causes `CallbackQueue::callOne(timeout)` or
`CallbackQueue::callAvailable(timeout)` to return immediately.

This patch changes the namespace of the backported condition_variable implementation to boost_161. This
removes the ambiguity with the original definition if both are used in the same process.
…ed timed_wait()

This fixes ROS timers in combination with 2c18b9f. The timer
callbacks were not called because the TimerManager's thread function blocked indefinitely on
boost::condition_variable::timed_wait().

Relative timed_wait() uses the system clock (boost::get_system_time()) unconditionally to
calculate the absolute timestamp for do_wait_until(). If the condition variable has been
initialized with BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC, it compares this timestamp
with the monotonic clock and therefore blocks.

This issue has been reported in https://svn.boost.org/trac10/ticket/12728 and will not be
fixed. The timed_wait interface is apparently deprecated.
@fujitatomoya
Copy link
Contributor

confirmed #1343 is fixed with this patch.

@flixr
Copy link
Contributor

flixr commented Dec 11, 2018

Thanks a lot for digging into this!
Looks like a much better/cleaner solution to me, will try to also test this asap.

*/
void setPeriod(const WallDuration& period, bool reset=true);

bool hasStarted() const { return impl_->hasStarted(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The corresponding Impl::hasStarted() seem to be still missing in steady_timer.cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed in c37a75b.

@flixr
Copy link
Contributor

flixr commented Dec 13, 2018

Fixes #1343, and other potential issues. Most likely SteadyTimer never has been steady.

Maybe I never ran into any issues like this, because I'm always using Release builds and then the SteadyTimer works as expected.

Sorry for targeting kinetic-devel although it should probably be melodic-devel

Melodic (or rather Ubuntu bionic) already has boost 1.65, so at least the problem with the backported headers should not arise there.

@meyerj
Copy link
Contributor Author

meyerj commented Dec 14, 2018

Maybe I never ran into any issues like this, because I'm always using Release builds and then the SteadyTimer works as expected.

I overlooked the specialization of TimerManager<T,D,E>::threadFunc() in steady_timer.cpp. This one already used the Boost Chrono interface and was working as expected.

However, the code duplication is not required anymore now that the general definition of threadFunc() in timer_manager.h also uses Boost Chrono. It has some additional checks whether time jumped backwards which are not required for the steady timer, but they also should not hurt. I merged the two implementations in bb1c24f.

@flixr
Copy link
Contributor

flixr commented Dec 14, 2018

I guess it would actually make sense to create a separate PR for the added hasStarted methods...
They have nothing to do with the rest and can then more easily be reviewed/merged separately.

…eadFunc() in steady_timer.cpp

The updated generic definition in timer_manager.h should do the same with a minor update.
In all cases we can call boost::condition_variable::wait_until() with an absolute time_point of the respective clock.
The conversion from system_clock to steady_clock for Time and WallTime is done internally in
boost::condition_variable::wait_until(lock_type& lock, const chrono::time_point<Clock, Duration>& t).
@meyerj
Copy link
Contributor Author

meyerj commented Dec 14, 2018

I guess it would actually make sense to create a separate PR for the added hasStarted methods...
They have nothing to do with the rest and can then more easily be reviewed/merged separately.

Done: #1565

@meyerj
Copy link
Contributor Author

meyerj commented Dec 14, 2018

Result of running abi-compliance-checker on libroscpp.so compiled from 1.12.14 and fix-1343 (this PR) on Ubuntu Xenial, compiled in RelWithDebInfo mode:
https://gistpreview.github.io/?e7db43ad2f12f9c44a1df0336983d131/compat_report.html

As expected, the type of two condition variable members has been changed, but because the memory layout did not, this should not be a problem.

@meyerj
Copy link
Contributor Author

meyerj commented Jan 10, 2019

Regarding the failing unit test:

13:19:44 [ RUN      ] CallbackQueue.raceConditionCallback
13:19:45 /tmp/ws/src/ros_comm/test/test_roscpp/test/test_callback_queue.cpp:466: Failure
13:19:45 Value of: condition_object.condition_one.load()
13:19:45   Actual: true
13:19:45 Expected: false
13:19:45 [  FAILED  ] CallbackQueue.raceConditionCallback (61 ms)

(http://build.ros.org/job/Kpr__ros_comm__ubuntu_xenial_amd64/464)

I could reproduce the same failure locally when running with kinetic-devel (fb6a6bd). The test case fails approx. 1 out of 20 times. The test case was introduced in #1054. From the description, I would assume that it is not related to #1343 or this patch.

Ah, kinetic is missing the patch from 507299f. Are there plans to backport bug fixes from lunar or melodic to kinetic or even indigo?

@meyerj
Copy link
Contributor Author

meyerj commented Jan 10, 2019

Ah, kinetic is missing the patch from 507299f. Are there plans to backport bug fixes from lunar or melodic to kinetic or even indigo?

I opened another PR to backport this one: #1578

@dirk-thomas dirk-thomas changed the title Fix nodes spins rather than blocking and waiting, using 100% CPU Fix nodes spins rather than blocking and waiting, using 100% CPU [kinetic] Jan 31, 2019
@dirk-thomas
Copy link
Member

Sorry for targeting kinetic-devel although it should probably be melodic-devel

Please retarget the melodic-devel branch with this PR. In general patches are only accepted to the default branch (except if a patch is only necessary for older distros).

@meyerj Can you please check #1608 and comment how this patch compares to the other one.

@dirk-thomas
Copy link
Member

@meyerj Friendly ping.

@ahoarau

This comment has been minimized.

@dirk-thomas

This comment has been minimized.

@ahoarau
Copy link
Contributor

ahoarau commented Feb 25, 2019

Does not seem to like the PR with a c++14 library (unique_lock from boost/std)

install/include/boost_161_pthread_condition_variable_fwd.h:52:13: error: reference to ‘unique_lock’ is ambiguous
             unique_lock<mutex>& lock,

@peci1
Copy link
Contributor

peci1 commented Feb 25, 2019

This PR is against ROS Kinetic, which does not officially support C++14, if I'm not mistaken

@ahoarau
Copy link
Contributor

ahoarau commented Feb 25, 2019

c++11 causes the same problems. Very identical compatibility

@ahoarau
Copy link
Contributor

ahoarau commented Feb 25, 2019

I removed all the using namespace, and now it builds fine.
fix-fix-1343.tar.gz

@dirk-thomas
Copy link
Member

@meyerj Another friendly ping. If this should be included in an upcoming release please consider rebasing this against melodic-devel.

@meyerj
Copy link
Contributor Author

meyerj commented Mar 12, 2019

I merged the patch from @ahoarau in meyerj#1, which seems to solve the C++11/C++14 compatibility issues according to #1557 (comment). Thanks for the patch!

Unfortunately I do not have the time at the moment to test this myself or to compare with the solution proposed in #1608.

Please retarget the melodic-devel branch with this PR. In general patches are only accepted to the default branch (except if a patch is only necessary for older distros).

I opened another PR against melodic-devel: #1651

However, if melodic normally demands at least Boost 1.62 according to REP-3, it might not be necessary to apply the patch there and we could rather remove the Boost 1.61 backports only required for older versions completely. See my comment in the new PR.

meyerj added 2 commits April 5, 2019 20:41
…n timer_manager.h

Since Boost 1.67 BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC became the default
if the platform supports it and the macro is not defined anymore. Instead, check
for BOOST_THREAD_INTERNAL_CLOCK_IS_MONO.
…ITION_VARIABLE_HEADER macros by a typedef in internal_condition_variable.h
@flixr
Copy link
Contributor

flixr commented Nov 12, 2019

We can close this in favor of #1651

@meyerj
Copy link
Contributor Author

meyerj commented Nov 13, 2019

We can close this in favor of #1651

The backport of #1651 to kinetic-devel is non-trivial. I am fine with closing this PR, but I would be happy if the patch could be backported and released in ROS kinetic, too. In that case the diff here could serve as a reference.

@meyerj
Copy link
Contributor Author

meyerj commented Jul 31, 2020

See #2011 for a newer patch targeting kinetic-devel that is supposed to fix the same issue.

@meyerj meyerj deleted the fix-1343 branch July 31, 2020 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants