Issue in thread safety after #1462 #1468

Aneoshun · 2020-05-15T17:41:40Z

Hi,

It seems that #1462 makes dart less thread-safe.
I cannot provide many details at the moment (lack of time), but I thought I should report this asap given that it was merged to master recently.

Following the use case that I mentioned in #1460, I clone a lot of skeletons and then run simulations in parallel (in short, the for loop is replaced by a parallel_for from tbb).

This has been working for years and still works with commit becbead (3 commits ago).
But using the commit associated with #1462 leads to systematic segfaults that occur at a random point in time.

Would you have an idea of what in the #1462 fix could be causing this? Otherwise, I will investigate this as soon as I have some time.


mxgrey · 2020-05-16T11:29:53Z

I wonder if maybe the signals aren't being cloned correctly. When a skeleton is cloned, all of its connections internally and externally are supposed to be completely independent of the original. If events that are happening in parallel threads are affecting each other because of the signals, then that suggests to me that the signals are being shallow copied instead of deep copied when a cloning happens.
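To make the shallow-versus-deep distinction concrete, here is a minimal sketch. `ToySignal` and `ToyBody` are hypothetical stand-ins invented for illustration; they are not DART's actual signal or skeleton classes, and this is not a claim about how DART implements cloning.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a signal: just a list of connected callbacks.
struct ToySignal {
    std::vector<std::function<void()>> slots;

    void connect(std::function<void()> slot) { slots.push_back(std::move(slot)); }
    std::size_t num_connections() const { return slots.size(); }
};

// Hypothetical stand-in for an object that owns a signal.
struct ToyBody {
    std::shared_ptr<ToySignal> on_change = std::make_shared<ToySignal>();

    // Shallow clone: only the pointer is copied, so the original and the
    // clone share one ToySignal and see each other's connections.
    ToyBody shallow_clone() const { return *this; }

    // Deep clone: the clone gets its own ToySignal object, so later
    // connections on the clone never touch the original.
    ToyBody deep_clone() const
    {
        ToyBody copy;
        copy.on_change = std::make_shared<ToySignal>(*on_change);
        return copy;
    }
};
```

With the shallow version, every connection made through a clone also grows the original's connection list, and two threads working on different clones end up mutating the same container.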

One possible workaround you could try if you don't want to be blocked by this bug is to completely set up all of your simulation worlds before the parallel loops begin and keep all of them alive until all the threads are joined back together. I think that should prevent this bug from negatively impacting you until it can be fixed.
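A rough sketch of that workaround, under stated assumptions: `ToyWorld` is a hypothetical placeholder for a simulation world with its cloned skeleton, and plain `std::thread` is used here instead of tbb to keep the example self-contained.

```cpp
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical placeholder for a simulation world holding a cloned skeleton.
struct ToyWorld {
    int steps = 0;
    void step() { ++steps; }
};

// Builds every world up front on one thread, runs one thread per world, and
// lets the worlds go out of scope only after all threads have been joined.
int run_prebuilt_worlds(std::size_t num_worlds, int steps_per_world)
{
    // 1. Set up (in the real use case: clone) everything before going parallel.
    std::vector<std::shared_ptr<ToyWorld>> worlds;
    worlds.reserve(num_worlds);
    for (std::size_t i = 0; i < num_worlds; ++i)
        worlds.push_back(std::make_shared<ToyWorld>());

    // 2. Parallel section: each thread touches only its own world.
    std::vector<std::thread> threads;
    threads.reserve(num_worlds);
    for (const auto& world : worlds) {
        threads.emplace_back([world, steps_per_world] {
            for (int s = 0; s < steps_per_world; ++s)
                world->step();
        });
    }

    // 3. Join every thread before any world can be destroyed.
    for (auto& t : threads)
        t.join();

    int total = 0;
    for (const auto& world : worlds)
        total += world->steps;
    return total;
} // the worlds are destroyed here, after the join
```

No cloning and no destruction happens while the threads are running, which is the point of the workaround.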

Aneoshun · 2020-05-16T11:55:56Z

@mxgrey Thanks for your reply. I share your intuition, as it is clear from the problem that I raised last week (#1460) that the signals are not independent between the clones and the original, i.e., making millions of clones causes the original to accumulate millions of (mainly disconnected) connections. I thought this was done on purpose (even if I did not understand why), and thought that the most important thing was to clear the closed connections.

I am using tens of millions of cloned robots, so instantiating them all at the same time would not be feasible. An alternative would be to instantiate just enough clones to cover the ones that will be used concurrently, but that won't be trivial to implement.
For now, I have simply pinned the dart commit SHA we use to a previous commit. We are still facing #1460, but this becomes a problem only after cloning a couple of million robots. We can live with that for a short period of time.

mxgrey · 2020-05-16T12:02:28Z

If all the clones are identical and don't need to be modified, then I think a "clone pool" would be a good solution. Just have something like a std::vector<SkeletonPtr> to draw from, with a std::mutex protecting it. When a thread is finished, it locks the mutex and dumps its skeleton into the container instead of just discarding it. When a new thread is starting, it locks the mutex, and pulls a skeleton from the container (unless the container is empty, in which case the thread can clone from the original). If you need to make sure they start out with the same state, then you can copy the state of the original into the clone at the start of the thread. Besides being a workaround for this specific bug, you might also see some modest performance improvements by doing this.
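As a minimal sketch of that idea: the `ClonePool` class below and its `make_clone` callback are hypothetical names, and `T` would be the skeleton type in the real use case. Cloning happens while the mutex is held, on the assumption that concurrent clone calls on the same original are what should be avoided.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Generic pool of reusable clones.
template <typename T>
class ClonePool {
public:
    using Ptr = std::shared_ptr<T>;

    // make_clone is called whenever the pool has no idle object to hand out
    // (in the real use case it would clone the original skeleton).
    explicit ClonePool(std::function<Ptr()> make_clone)
        : _make_clone(std::move(make_clone)) {}

    // Take an idle object from the pool, or create a new clone if it is empty.
    Ptr acquire()
    {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_idle.empty())
            return _make_clone(); // serialized by the mutex
        Ptr item = std::move(_idle.back());
        _idle.pop_back();
        return item;
    }

    // Hand an object back instead of discarding it.
    void release(Ptr item)
    {
        assert(item != nullptr);
        std::lock_guard<std::mutex> lock(_mutex);
        _idle.push_back(std::move(item));
    }

    std::size_t idle_count() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _idle.size();
    }

private:
    std::function<Ptr()> _make_clone;
    mutable std::mutex _mutex;
    std::vector<Ptr> _idle;
};
```

The pool never grows beyond the peak number of objects in use at once, so the number of clone calls is bounded by the concurrency level rather than by the number of evaluations. Resetting the state of a reused object is left to the caller here.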

costashatz · 2020-05-24T18:48:33Z

> If all the clones are identical and don't need to be modified, then I think a "clone pool" would be a good solution. Just have something like a std::vector<SkeletonPtr> to draw from, with a std::mutex protecting it. When a thread is finished, it locks the mutex and dumps its skeleton into the container instead of just discarding it.

I am using a clone pool in all of my tests. This is working quite robustly and I do not have any big leaks. This is not very difficult to do. @Aneoshun I can share the code if you wish.

> Besides being a workaround for this specific bug, you might also see some modest performance improvements by doing this.

Indeed, this is faster than re-cloning every time. I can confirm this in my tests. But of course, this specific bug should be addressed.

Aneoshun · 2020-06-11T09:09:29Z

Yes, I think this bug should be addressed, mainly to avoid future problems.

@costashatz your clone pool looks great. Could you please share your code? (you can send it by email if you prefer).

costashatz · 2020-06-11T21:43:30Z

> @costashatz your clone pool looks great. Could you please share your code? (you can send it by email if you prefer).

Here's one way of doing the pool:

#include <mutex>
#include <vector>

#include <dart/dart.hpp>

class MyClonePool {
public:
    static MyClonePool* instance()
    {
        static MyClonePool gdata;
        return &gdata;
    }

    MyClonePool(const MyClonePool&) = delete;
    void operator=(const MyClonePool&) = delete;

    dart::dynamics::SkeletonPtr get_skeleton()
    {
        std::lock_guard<std::mutex> lock(_skeleton_mutex);
        for (size_t i = 0; i < _num_skeletons; i++) {
            if (_free[i]) {
                _free[i] = false;
                return _skeletons[i];
            }
        }

        return nullptr;
    }

    void free_skeleton(const dart::dynamics::SkeletonPtr& skel)
    {
        std::lock_guard<std::mutex> lock(_skeleton_mutex);
        for (size_t i = 0; i < _num_skeletons; i++) {
            if (_skeletons[i] == skel) {
                _set_init_pos(skel);
                _free[i] = true;
                break;
            }
        }
    }

protected:
    std::vector<dart::dynamics::SkeletonPtr> _skeletons;
    std::vector<bool> _free;
    size_t _num_skeletons = 10; // set this number to your maximum number of concurrent skeletons (or slightly bigger to be sure! ;))
    std::mutex _skeleton_mutex;

    MyClonePool() {
        // load one skeleton from file and clone it _num_skeletons times
        // OR load _num_skeletons different skeletons from file
        // do not forget to fill _skeletons with your skeletons

        // Initialize all skeletons to free
        for (size_t i = 0; i < _num_skeletons; i++) {
            _free.push_back(true);
        }
    }

    void _set_init_pos(const dart::dynamics::SkeletonPtr& skel)
    {
        skel->resetPositions();
        skel->resetVelocities();
        skel->resetAccelerations();
        skel->clearExternalForces();

        // reset any altered properties (e.g., altered mass of a link) and set initial positions
    }
};

Then you use it like this:

void evaluation_function() {
    // Acquire a freed skeleton
    dart::dynamics::SkeletonPtr skel = nullptr;
    while (skel == nullptr) skel = MyClonePool::instance()->get_skeleton();

    // do your stuff with skel

    // Free your skeleton so that a new function can take it
    MyClonePool::instance()->free_skeleton(skel);
}

If you do not like singletons, you can put your MyClonePool object in a global namespace (this is still a singleton, just in a different form). If you do not like globals or singletons at all, you can create the object and pass it around your structures.
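One possible variation on the usage above, sketched under assumptions: instead of spinning until a skeleton is free, a condition variable can put the waiting thread to sleep, and a small RAII handle can return the skeleton automatically, even if the evaluation throws. `BlockingPool` and `Lease` are hypothetical names, and resetting the skeleton state on return is left out for brevity.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

template <typename T>
class BlockingPool {
public:
    using Ptr = std::shared_ptr<T>;

    // The pool is filled once, up front, with a fixed set of objects.
    explicit BlockingPool(std::vector<Ptr> items) : _idle(std::move(items)) {}

    // RAII handle: gives its object back to the pool when it goes out of scope.
    class Lease {
    public:
        Lease(BlockingPool& pool, Ptr item) : _pool(&pool), _item(std::move(item))
        {
            assert(_item != nullptr);
        }
        Lease(const Lease&) = delete;
        Lease& operator=(const Lease&) = delete;
        Lease(Lease&& other) noexcept
            : _pool(other._pool), _item(std::move(other._item)) {}
        ~Lease()
        {
            if (_item) // a moved-from lease holds nothing
                _pool->give_back(std::move(_item));
        }

        T& operator*() const { return *_item; }
        T* operator->() const { return _item.get(); }

    private:
        BlockingPool* _pool;
        Ptr _item;
    };

    // Sleeps until an object is idle, then hands it out.
    Lease acquire()
    {
        std::unique_lock<std::mutex> lock(_mutex);
        _available.wait(lock, [this] { return !_idle.empty(); });
        Ptr item = std::move(_idle.back());
        _idle.pop_back();
        return Lease(*this, std::move(item));
    }

    std::size_t idle_count() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _idle.size();
    }

private:
    void give_back(Ptr item)
    {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _idle.push_back(std::move(item));
        }
        _available.notify_one(); // wake one waiting thread
    }

    mutable std::mutex _mutex;
    std::condition_variable _available;
    std::vector<Ptr> _idle;
};
```

An evaluation function would then only need `auto skel = pool.acquire();` at the top, with no explicit call to free the skeleton at the end.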

I hope this is helpful.

Aneoshun · 2020-06-12T07:59:21Z

Thanks a lot. I will use this until the skeleton's clone function gets fixed.

costashatz · 2021-09-27T19:38:44Z

@mxgrey @jslee02 any updates on this one? It's not good to have non-thread-safe code hanging around.

costashatz · 2022-09-12T09:47:06Z

@mxgrey @jslee02 any updates?

Aneoshun added the "type: bug" label on May 15, 2020

This was referenced May 16, 2020

Adding ghost robots and other visualizations NOSALRO/robot_dart#57

Merged

Thread safety issue coming from recent fix in dart NOSALRO/robot_dart#59

Open
