SIGSEGV libosrm with node12 #5841

Closed · vladf90 opened this issue Sep 29, 2020 · 5 comments
Comments

@vladf90

vladf90 commented Sep 29, 2020

Hey! I am getting a segfault when using libosrm 5.22 with the node12 bindings. The service crashes when we call the route method while a new dataset is being loaded into shared memory with osrm-datastore. The crashes only happen when the exclude parameter is used in the route request; otherwise everything seems to work fine. I couldn't reproduce the problem on localhost, so I don't have a dataset that could help with debugging. The node process runs in a Docker container with --ipc=host.

I also have a stack trace:

#0  0x00007f7c0bc05820 in __gnu_cxx::__ops::_Iter_equals_val<unsigned char const>::operator()<unsigned char const*> (this=this@entry=0x7f7c20ff8a28, __it=__it@entry=0x7f7bd9cfe7d4 <error: Cannot access memory at address 0x7f7bd9cfe7d4>)
    at /usr/include/c++/7/bits/predefined_ops.h:241
#1  0x00007f7c0bc0585b in std::__find_if<unsigned char const*, __gnu_cxx::__ops::_Iter_equals_val<unsigned char const> > (__first=__first@entry=0x7f7bd9cfe7d4 <error: Cannot access memory at address 0x7f7bd9cfe7d4>,
    __last=__last@entry=0x7f7bd9cfe7dc <error: Cannot access memory at address 0x7f7bd9cfe7dc>, __pred=...) at /usr/include/c++/7/bits/stl_algo.h:120
#2  0x00007f7c0bc0593c in std::__find_if<unsigned char const*, __gnu_cxx::__ops::_Iter_equals_val<unsigned char const> > (__first=__first@entry=0x7f7bd9cfe7d4 <error: Cannot access memory at address 0x7f7bd9cfe7d4>,
    __last=__last@entry=0x7f7bd9cfe7dc <error: Cannot access memory at address 0x7f7bd9cfe7dc>, __pred=...) at /usr/include/c++/7/bits/stl_algo.h:162
#3  0x00007f7c0bc05965 in std::find<unsigned char const*, unsigned char> (__first=0x7f7bd9cfe7d4 <error: Cannot access memory at address 0x7f7bd9cfe7d4>,
    __last=__last@entry=0x7f7bd9cfe7dc <error: Cannot access memory at address 0x7f7bd9cfe7dc>, __val=@0x7f7c20ff8aa7: 16 '\020') at /usr/include/c++/7/bits/stl_algo.h:3908
#4  0x00007f7c0bc27651 in osrm::engine::DataFacadeFactory<osrm::engine::datafacade::ContiguousInternalMemoryDataFacade, osrm::engine::routing_algorithms::mld::Algorithm>::Get (this=0x3e04170, params=...)
    at /osrm/include/engine/datafacade_factory.hpp:139
#5  0x00007f7c0bc276d0 in osrm::engine::DataFacadeFactory<osrm::engine::datafacade::ContiguousInternalMemoryDataFacade, osrm::engine::routing_algorithms::mld::Algorithm>::Get<osrm::engine::api::BaseParameters> (this=<optimized out>,
    params=...) at /osrm/include/engine/datafacade_factory.hpp:41
#6  0x00007f7c0bc27715 in osrm::engine::detail::DataWatchdogImpl<osrm::engine::routing_algorithms::mld::Algorithm, osrm::engine::datafacade::ContiguousInternalMemoryDataFacade<osrm::engine::routing_algorithms::mld::Algorithm> >::Get (
    this=<optimized out>, params=...) at /osrm/include/engine/data_watchdog.hpp:78
#7  0x00007f7c0bc27758 in osrm::engine::detail::WatchingProvider<osrm::engine::routing_algorithms::mld::Algorithm, osrm::engine::datafacade::ContiguousInternalMemoryDataFacade>::Get (this=<optimized out>, params=...)
    at /osrm/include/engine/datafacade_provider.hpp:93
#8  0x00007f7c0bc27dbf in osrm::engine::Engine<osrm::engine::routing_algorithms::mld::Algorithm>::GetAlgorithms<osrm::engine::api::RouteParameters> (this=this@entry=0x46e04c0, params=...) at /osrm/include/engine/engine.hpp:129
#9  0x00007f7c0bc27e33 in osrm::engine::Engine<osrm::engine::routing_algorithms::mld::Algorithm>::Route (this=0x46e04c0, params=..., result=...) at /osrm/include/engine/engine.hpp:95
#10 0x00007f7c0bc10851 in osrm::OSRM::Route (this=<optimized out>, params=..., result=...) at /osrm/src/osrm/osrm.cpp:62
#11 0x00007f7c0bbfa017 in void node_osrm::async<std::unique_ptr<osrm::engine::api::RouteParameters, std::default_delete<osrm::engine::api::RouteParameters> > (*)(Nan::FunctionCallbackInfo<v8::Value> const&, bool), osrm::engine::Status (osrm::OSRM::*)(osrm::engine::api::RouteParameters const&, osrm::util::json::Object&) const>(Nan::FunctionCallbackInfo<v8::Value> const&, std::unique_ptr<osrm::engine::api::RouteParameters, std::default_delete<osrm::engine::api::RouteParameters> > (*)(Nan::FunctionCallbackInfo<v8::Value> const&, bool), osrm::engine::Status (osrm::OSRM::*)(osrm::engine::api::RouteParameters const&, osrm::util::json::Object&) const, bool)::Worker::Execute() (this=0x4748fa0)
    at /osrm/src/nodejs/node_osrm.cpp:155
#12 0x00007f7c0bbdd401 in Nan::AsyncExecute (req=<optimized out>) at /osrm/node_modules/nan/nan.h:2280
#13 0x000000000132e72e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:122
#14 0x00007f7c2972b6db in start_thread (arg=0x7f7c20ff9700) at pthread_create.c:463
#15 0x00007f7c29454a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
@danpat
Member

danpat commented Sep 29, 2020

I've got a possible fix over in:

#5844

however, I'm not 100% confident it's correct. I was able to reproduce the issue above by hammering a local osrm-routed instance with exclude=toll requests while running a data update in a loop; it would usually reproduce within a minute.

After the PR above, I've been able to run it for 30 minutes with no crash.

I think the problem is that we've got threads calling Get() on an object that's being destroyed:

https://github.com/Project-OSRM/osrm-backend/blob/master/include/engine/data_watchdog.hpp#L114-L118

^^ this re-assignment of facade_factory should probably be protected by a mutex, and calls to Get() should obtain shared locks on that mutex.

The fix I implemented makes the segfault less likely, but I don't think it addresses the core problem of calling functions on an object that's undergoing destruction.
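
To make the race concrete, here is a minimal standalone sketch of the pattern described above (this is not OSRM code; Factory and Watchdog are hypothetical stand-ins for facade_factory and the data watchdog): one thread reassigns a member while another thread is still inside a method of the old object.

```cpp
// Minimal standalone illustration of the race (not OSRM code): "Factory" and
// "Watchdog" are hypothetical stand-ins for facade_factory and the watchdog.
#include <chrono>
#include <memory>
#include <thread>
#include <vector>

struct Factory
{
    std::vector<int> excludes{1, 2, 3};

    // Loosely corresponds to DataFacadeFactory::Get scanning the exclude list:
    // if *this has already been destroyed, this reads freed memory.
    int Get() const
    {
        int sum = 0;
        for (auto v : excludes)
            sum += v;
        return sum;
    }
};

struct Watchdog
{
    std::unique_ptr<Factory> factory = std::make_unique<Factory>();

    // Request path: may still be running inside the old Factory...
    int Get() const { return factory->Get(); }

    // ...while a data swap destroys it here.
    void Swap() { factory = std::make_unique<Factory>(); }
};

int main()
{
    Watchdog watchdog;

    std::thread requests([&] {
        for (int i = 0; i < 1000000; ++i)
            (void)watchdog.Get();
    });
    std::thread updates([&] {
        for (int i = 0; i < 1000; ++i)
        {
            watchdog.Swap();
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    });

    requests.join();
    updates.join();
}
```

Built with -fsanitize=address (or -fsanitize=thread), a loop like this typically reports a heap-use-after-free on the reader thread, which is the same failure mode as the stack trace above.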

@danpat
Member

danpat commented Sep 30, 2020

Ran this under -fsanitize=address and it confirmed my hunch: when a data swap is triggered, data_watchdog replaces facade_factory, triggering the old version's destruction, but the old factory may still be in use by a call to Get() on another thread.

I reverted the previous change and pushed a better guarantee for the lifetime of the facade_factory. I'm not able to reproduce the crash locally using the scenario that reproduced it quite quickly without this fix. Edit: never mind, it's still crashing for roughly the same reason.

@vladf90
Author

vladf90 commented Sep 30, 2020

Perhaps we could block calls with a lock while a data swap is in progress?

@danpat
Member

danpat commented Sep 30, 2020

@vladf90 Yeah, I went ahead and did that in the PR. I think the original author either didn't realize this particular bit of code wasn't thread-safe, or was trying to keep it lock-free so as not to impact request performance. I benchmarked having a lock in place and it doesn't seem to have any significant impact.
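
A minimal sketch of that approach, assuming C++17's std::shared_mutex (this is not the actual diff in #5844; Factory again stands in for the real facade factory): the watchdog takes an exclusive lock only for the pointer swap, while request threads take shared locks around Get(), so reads run in parallel and only block briefly during a data swap.

```cpp
// Hedged sketch of the reader/writer-lock idea, not the actual change in #5844.
#include <memory>
#include <mutex>
#include <shared_mutex>

struct Factory { int Get() const { return 42; } };  // hypothetical stand-in

class Watchdog
{
  public:
    // Request threads: shared lock, so many Get() calls can run concurrently.
    int Get() const
    {
        std::shared_lock<std::shared_mutex> lock(factory_mutex);
        return factory->Get();
    }

    // Dataset swap: exclusive lock, so no request can be inside the old
    // factory while (or after) it is destroyed.
    void Swap()
    {
        auto fresh = std::make_unique<Factory>();  // build outside the lock
        std::unique_lock<std::shared_mutex> lock(factory_mutex);
        factory = std::move(fresh);
    }

  private:
    mutable std::shared_mutex factory_mutex;
    std::unique_ptr<Factory> factory = std::make_unique<Factory>();
};
```

Because the exclusive lock is held only for the pointer assignment and the shared locks are uncontended the rest of the time, the per-request overhead stays small, which is consistent with the benchmark observation above.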

@vladf90
Author

vladf90 commented Oct 9, 2020

I can confirm that it works as expected now. I've been running the new build for almost 24h and no crash so far. Thank you for solving this so fast!

vladf90 closed this as completed Oct 9, 2020