👨‍🌾 Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs #862
Comments
After #863 was merged, has been failing in 21 out of 35 (60%) of the latest debug builds. The test failing is Most recent case: |
Without taking into account the repeated jobs, this one has occurred 13 times in the last 20 days. In the Here's a recent reference: https://ci.ros2.org/view/nightly/job/nightly_win_rel/2278/testReport/junit/(root)/rosbag2_transport/test_play_services__rmw_fastrtps_cpp_gtest_missing_result/ It affects all the |
@Blast545 I've spent some time analyzing the failures in these tests:
There is no guarantee that messages will be delivered at the transport layer. Those tests are flaky by design, and I honestly don't know how to rewrite them to be more deterministic.
This is a very strange failure which I wouldn't expect to happen. |
@Blast545 BTW. I see that I've tried to make a brief analysis of the failure. `play_next_response = successful_call(cli_play_next_);` consistently fails at line 101. It would be actually better to ask someone from Meanwhile, to mitigate this failure, I would suggest increasing `const std::chrono::seconds service_call_timeout_ {1};` to 3 seconds. |
Thanks for digging into this! @MichaelOrlov Yeah, the On Linux it happens only for It happens on other distributions for `rmw_cyclonedds`, but it's not a common scenario. On the nightly_win_rep it only fails with On the nightly_linux_repeated it could fail either only And there's an ultra rare scenario where it fails for the three rmw vendors on nightly_win_deb and nightly_win_rel: I will open the PR with your suggestion tomorrow morning @MichaelOrlov and get more feedback there as needed. |
@Blast545 I see how test fails on Let's try to increase `const std::chrono::seconds service_call_timeout_ {1};` up to 5 seconds and see if CI will only fail on |
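The effect of the timeout value discussed above can be illustrated with a minimal, self-contained sketch. It does not use the real rclcpp client or the rosbag2 test fixture: the service is modeled with `std::async`, and `fake_service_call` and `call_succeeds_within` are hypothetical names introduced only for illustration.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a service server that needs `latency` to answer.
// In the real test the response would come from the player node instead.
std::future<bool> fake_service_call(std::chrono::milliseconds latency)
{
  return std::async(std::launch::async, [latency]() {
      std::this_thread::sleep_for(latency);
      return true;
    });
}

// Returns true only if the response becomes available within `timeout`.
// This mirrors the general "wait on a future with a deadline" pattern,
// not the exact implementation of successful_call() in the test.
bool call_succeeds_within(
  std::chrono::milliseconds latency,
  std::chrono::milliseconds timeout)
{
  std::future<bool> response = fake_service_call(latency);
  const bool ready = response.wait_for(timeout) == std::future_status::ready;
  // If the deadline passed, report failure without reading the result.
  // (A future from std::async still joins its task when it is destroyed.)
  return ready && response.get();
}
```

Under this model, a response that takes longer than the timeout on a loaded machine is reported as a failure even though the service is healthy, which is why a larger `service_call_timeout_` may hide the symptom without removing the underlying cause.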
@Blast545 @clalancette I have good news about this annoying failing First of all, I was able to reproduce it locally with some extra load on my machine. The second piece of good news is that I found the breaking PR and commit: I've tried reverting commit ros2/rclcpp@679fb2b locally and the failure doesn't reproduce any more. @ivanpauno Could you please pick up further analysis of the failing https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/ from this point, since you were the author of the breaking commit ros2/rclcpp@679fb2b |
Could you summarize the analysis you have done up to now? The problem seems to be a race. |
Hi @ivanpauno Sorry for my late response.
Please let me know if you need more information or details about this issue, or if something is unclear. |
could be related to ros2/rmw_fastrtps#616. |
@fujitatomoya Unlikely, since in test_play_services we are not sending many requests in burst. |
@MichaelOrlov thanks for the comment. I wasn't sure, it just came to mind. |
I was able to reproduce the issue, but I don't fully understand why it happens yet (and it's pretty hard to reproduce). |
Could be related to the ros2/ros2#922 and https://answers.ros.org/question/411860/can-ros2-services-still-be-expected-to-be-flakey/ |
Could be related to the ros2/rclcpp#2039, eProsima/Fast-DDS#3087 and osrf/docker_images#628 |
Could be related to the ros2/rmw_cyclonedds#187, ros2/rmw_cyclonedds#74, ros2/rmw_fastrtps#392, ros2/rmw_cyclonedds#191, |
Description
The following tests have started to fail consistently (three days in a row) in the CI of https://ci.ros2.org/job/nightly_linux_repeated/:
If I'm not wrong, the build shows that the commit used is 891e081, which corresponds to pull request #848.
Expected Behavior
Tests should pass :)
Actual Behavior
Timeout
To Reproduce
Check CI job
System (please complete the following information)