Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test in test_rosbag2_record_end_to_end #790

Closed
Barry-Xu-2018 opened this issue Jun 24, 2021 · 7 comments
Closed

Flaky test in test_rosbag2_record_end_to_end #790

Barry-Xu-2018 opened this issue Jun 24, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@Barry-Xu-2018
Copy link
Contributor

Barry-Xu-2018 commented Jun 24, 2021

Description

test cases for test_rosbag2_record_end_to_end sometimes fails

Expected Behavior

Get stabilization result

Actual Behavior

Sometimes failure

To Reproduce

Repeat run test_rosbag2_record_end_to_end in rosbags_test
colcon test --event-handlers console_direct+ --packages-select rosbag2_tests --ctest-args -R test_rosbag2_record_end_to_end --repeat-until-fail 1000

System (please complete the following information)

  • OS: Ubuntu 20.04
  • ROS 2 Distro: rolling
  • Version: master

Additional context

This problem also occurs in CI system

@Barry-Xu-2018 Barry-Xu-2018 added the bug Something isn't working label Jun 24, 2021
@Barry-Xu-2018
Copy link
Contributor Author

This problem was found at #788 (comment).
And then I can reproduce this in my local environment if execute enough times.

@Barry-Xu-2018
Copy link
Contributor Author

I have analyzed problems at #788 (comment), but the solution need to be discussed here.
So I copy my analysis to here.


I find CI report failure.

Rpr__rosbag2__ubuntu_focal_amd64 --- failure
According to https://build.ros2.org/job/Rpr__rosbag2__ubuntu_focal_amd64/881/testReport/, 3 test case failed.

The failure reason includes 2 points.

  1. While publishing the message, sometimes rosbag doesn't discover the topic to be recorded.
    e.g.
    rosbag2_test_common::PublicationManager pub_manager;
    pub_manager.setup_publisher("/test_topic", message, expected_test_messages);
    pub_manager.setup_publisher("/wrong_topic", wrong_message, 3);
    auto process_handle = start_execution(
    "ros2 bag record --max-cache-size 0 --output " + root_bag_path_.string() + " /test_topic");
    wait_for_db();
    pub_manager.run_publishers();

This leads to the error as the below (The number of publishing message doesn't match the number of recorded message)
https://build.ros2.org/job/Rpr__rosbag2__ubuntu_focal_amd64/881/testReport/rosbag2_tests/RecordFixture/record_end_to_end_test_with_cache/

Now we have no interface to make sure rosbag finish discovering topic, so I want to add sleep before publishing message to make sure discover is finished.

  1. It is possible that 2 processes access the created sqlite3 database at the same time.
    auto process_handle = start_execution(
    "ros2 bag record --max-cache-size 0 --output " + root_bag_path_.string() + " /test_topic");
    wait_for_db();

start_execution() will start rosbag command in new process.

void wait_for_db()
{
const auto database_path = get_bag_file_path(0);
while (true) {
try {
std::this_thread::sleep_for(50ms); // wait a bit to not query constantly
if (database_path.exists()) {
rosbag2_storage_plugins::SqliteWrapper db{
database_path.string(), rosbag2_storage::storage_interfaces::IOFlag::READ_ONLY};
return;
}

As the above codes, current process also access sqlite3 database.
This leads that the lock occurs.

[ RUN      ] RecordFixture.rosbag2_record_and_play_multiple_topics_with_filter
[INFO] [1624441972.716871010] [rosbag2_storage]: Opened database '/tmp/tmp_test_dir_OVIieg/rosbag2_record_and_play_multiple_topics_with_filter/rosbag2_record_and_play_multiple_topics_with_filter_0.db3' for READ_WRITE.
[INFO] [1624441972.716994585] [rosbag2_recorder]: Listening for topics...
[INFO] [1624441972.718430617] [rosbag2_recorder]: Subscribed to topic '/test_topic1'
[INFO] [1624441972.719189448] [rosbag2_cpp]: Writing remaining messages from cache to the bag. It may take a while
Error when processing SQL statement. SQLite error (5): database is locked

@Barry-Xu-2018
Copy link
Contributor Author

Problem 1. While publishing the message, sometimes rosbag doesn't discover the topic to be recorded

This leads that first message is lost (Not record it)

rosbag2_test_common::PublicationManager pub_manager;
pub_manager.setup_publisher("/test_topic", message, expected_test_messages);
pub_manager.setup_publisher("/wrong_topic", wrong_message, 3);
auto process_handle = start_execution(
"ros2 bag record --max-cache-size 0 --output " + root_bag_path_.string() + " /test_topic");
wait_for_db();
pub_manager.run_publishers();

But I have no idea on how to confirm rosbag find all topics which will be recorded before pub_manager.run_publishers()
Anybody has an idea ?

Problem 2. It is possible that 2 processes access the created sqlite3 database at the same time.

void wait_for_db()
{
const auto database_path = get_bag_file_path(0);
while (true) {
try {
std::this_thread::sleep_for(50ms); // wait a bit to not query constantly
if (database_path.exists()) {
rosbag2_storage_plugins::SqliteWrapper db{
database_path.string(), rosbag2_storage::storage_interfaces::IOFlag::READ_ONLY};
return;
}

I think we need not open this file as sqlite3 data file to check. Instead, only check if file exists.

@MichaelOrlov
Copy link
Contributor

@Barry-Xu-2018 For the Problem 1 related to the discovery it would be more appropriate to add wait_for_matched(..) API for the PublicationManager. Similar as made spin_and_wait_for_matched in SubscriptionManager.

I can do a PR if you are not sure how to do it, please let me know.

As regards to the Problem 2. It might be a race condition and fail on further stages if we will not wait while DB will be created. It could be situation when path exist but DB hasn't be created yet. If we will proceed we have a high chances to fail further on, when we will be doing request to the DB. Need to find a way how to check DB existence without locking access to it. May be some specific flags exists.
However I think the Problem 2 will go away automatically as soon as we will solve Problem 1 with wait for matched. Since first DB will be created in constructor of the recorder and than will be created subscribers and we will not try open DB until it will be actually created since we will be waiting for subscribers in test.

@Barry-Xu-2018
Copy link
Contributor Author

@MichaelOrlov

For the Problem 1 related to the discovery it would be more appropriate to add wait_for_matched(..) API for the PublicationManager. Similar as made spin_and_wait_for_matched in SubscriptionManager.

Thanks. I will add this function and update test codes.

Need to find a way how to check DB existence without locking access to it. May be some specific flags exists.

However I think the Problem 2 will go away automatically as soon as we will solve Problem 1 with wait for matched. Since first DB will be created in constructor of the recorder and than will be created subscribers and we will not try open DB until it will be actually created since we will be waiting for subscribers in test.

Okay. I will confirm this. If also have problems, we can directly use sqlite3 API to check if this new DB file can be opened.

@Barry-Xu-2018
Copy link
Contributor Author

@MichaelOrlov

Please help to review #797

@Barry-Xu-2018
Copy link
Contributor Author

Fix #797 was merged.
So I close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants