SIGSEGV libosrm with node12 #5841

Closed · vladf90 opened this issue Sep 29, 2020 · 5 comments
Comments

@vladf90

vladf90 commented Sep 29, 2020

Hey! I am getting a segfault when using libosrm 5.22 with the node12 bindings. The service crashes when we call the route method while a new dataset is being loaded into shared memory with osrm-datastore. The crashes only happen when the exclude parameter is used in the route request; otherwise everything seems to work fine. I couldn't reproduce the problem on localhost, so I don't have a dataset that could help with debugging. The node process runs in a Docker container with --ipc=host.

I also have a stack trace:

#0  0x00007f7c0bc05820 in __gnu_cxx::__ops::_Iter_equals_val<unsigned char const>::operator()<unsigned char const*> (this=this@entry=0x7f7c20ff8a28, __it=__it@entry=0x7f7bd9cfe7d4 <error: Cannot access memory at address 0x7f7bd9cfe7d4>)
    at /usr/include/c++/7/bits/predefined_ops.h:241
#1  0x00007f7c0bc0585b in std::__find_if<unsigned char const*, __gnu_cxx::__ops::_Iter_equals_val<unsigned char const> > (__first=__first@entry=0x7f7bd9cfe7d4 <error: Cannot access memory at address 0x7f7bd9cfe7d4>,
    __last=__last@entry=0x7f7bd9cfe7dc <error: Cannot access memory at address 0x7f7bd9cfe7dc>, __pred=...) at /usr/include/c++/7/bits/stl_algo.h:120
#2  0x00007f7c0bc0593c in std::__find_if<unsigned char const*, __gnu_cxx::__ops::_Iter_equals_val<unsigned char const> > (__first=__first@entry=0x7f7bd9cfe7d4 <error: Cannot access memory at address 0x7f7bd9cfe7d4>,
    __last=__last@entry=0x7f7bd9cfe7dc <error: Cannot access memory at address 0x7f7bd9cfe7dc>, __pred=...) at /usr/include/c++/7/bits/stl_algo.h:162
#3  0x00007f7c0bc05965 in std::find<unsigned char const*, unsigned char> (__first=0x7f7bd9cfe7d4 <error: Cannot access memory at address 0x7f7bd9cfe7d4>,
    __last=__last@entry=0x7f7bd9cfe7dc <error: Cannot access memory at address 0x7f7bd9cfe7dc>, __val=@0x7f7c20ff8aa7: 16 '\020') at /usr/include/c++/7/bits/stl_algo.h:3908
#4  0x00007f7c0bc27651 in osrm::engine::DataFacadeFactory<osrm::engine::datafacade::ContiguousInternalMemoryDataFacade, osrm::engine::routing_algorithms::mld::Algorithm>::Get (this=0x3e04170, params=...)
    at /osrm/include/engine/datafacade_factory.hpp:139
#5  0x00007f7c0bc276d0 in osrm::engine::DataFacadeFactory<osrm::engine::datafacade::ContiguousInternalMemoryDataFacade, osrm::engine::routing_algorithms::mld::Algorithm>::Get<osrm::engine::api::BaseParameters> (this=<optimized out>,
    params=...) at /osrm/include/engine/datafacade_factory.hpp:41
#6  0x00007f7c0bc27715 in osrm::engine::detail::DataWatchdogImpl<osrm::engine::routing_algorithms::mld::Algorithm, osrm::engine::datafacade::ContiguousInternalMemoryDataFacade<osrm::engine::routing_algorithms::mld::Algorithm> >::Get (
    this=<optimized out>, params=...) at /osrm/include/engine/data_watchdog.hpp:78
#7  0x00007f7c0bc27758 in osrm::engine::detail::WatchingProvider<osrm::engine::routing_algorithms::mld::Algorithm, osrm::engine::datafacade::ContiguousInternalMemoryDataFacade>::Get (this=<optimized out>, params=...)
    at /osrm/include/engine/datafacade_provider.hpp:93
#8  0x00007f7c0bc27dbf in osrm::engine::Engine<osrm::engine::routing_algorithms::mld::Algorithm>::GetAlgorithms<osrm::engine::api::RouteParameters> (this=this@entry=0x46e04c0, params=...) at /osrm/include/engine/engine.hpp:129
#9  0x00007f7c0bc27e33 in osrm::engine::Engine<osrm::engine::routing_algorithms::mld::Algorithm>::Route (this=0x46e04c0, params=..., result=...) at /osrm/include/engine/engine.hpp:95
#10 0x00007f7c0bc10851 in osrm::OSRM::Route (this=<optimized out>, params=..., result=...) at /osrm/src/osrm/osrm.cpp:62
#11 0x00007f7c0bbfa017 in void node_osrm::async<std::unique_ptr<osrm::engine::api::RouteParameters, std::default_delete<osrm::engine::api::RouteParameters> > (*)(Nan::FunctionCallbackInfo<v8::Value> const&, bool), osrm::engine::Status (osrm::OSRM::*)(osrm::engine::api::RouteParameters const&, osrm::util::json::Object&) const>(Nan::FunctionCallbackInfo<v8::Value> const&, std::unique_ptr<osrm::engine::api::RouteParameters, std::default_delete<osrm::engine::api::RouteParameters> > (*)(Nan::FunctionCallbackInfo<v8::Value> const&, bool), osrm::engine::Status (osrm::OSRM::*)(osrm::engine::api::RouteParameters const&, osrm::util::json::Object&) const, bool)::Worker::Execute() (this=0x4748fa0)
    at /osrm/src/nodejs/node_osrm.cpp:155
#12 0x00007f7c0bbdd401 in Nan::AsyncExecute (req=<optimized out>) at /osrm/node_modules/nan/nan.h:2280
#13 0x000000000132e72e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:122
#14 0x00007f7c2972b6db in start_thread (arg=0x7f7c20ff9700) at pthread_create.c:463
#15 0x00007f7c29454a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
@danpat
Member

danpat commented Sep 29, 2020

I've got a possible fix over in:

#5844

however, I'm not 100% confident it's correct. I was able to reproduce the issue above by hammering a local osrm-routed instance with exclude=toll requests while running a data update in a loop; it would usually reproduce within a minute.

After the PR above, I've been able to run it for 30 minutes with no crash.

I think the problem is that we've got threads calling Get() on an object that's being destroyed:

https://github.com/Project-OSRM/osrm-backend/blob/master/include/engine/data_watchdog.hpp#L114-L118

^^ this re-assignment of facade_factory should probably be protected by a mutex, and calls to Get() should obtain shared locks on that mutex.

The fix I implemented makes the segfault less likely, but I don't think it addresses the core problem of calling functions on an object that's undergoing destruction.
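
To make the race concrete, here is a minimal standalone sketch of the pattern described above (this is not OSRM code; Factory and Watchdog are hypothetical stand-ins for facade_factory and the data watchdog): one thread reassigns a member while another thread is still inside a method of the old object.

```cpp
// Minimal standalone illustration of the race (not OSRM code): "Factory" and
// "Watchdog" are hypothetical stand-ins for facade_factory and the watchdog.
#include <chrono>
#include <memory>
#include <thread>
#include <vector>

struct Factory
{
    std::vector<int> excludes{1, 2, 3};

    // Loosely corresponds to DataFacadeFactory::Get scanning the exclude list:
    // if *this has already been destroyed, this reads freed memory.
    int Get() const
    {
        int sum = 0;
        for (auto v : excludes)
            sum += v;
        return sum;
    }
};

struct Watchdog
{
    std::unique_ptr<Factory> factory = std::make_unique<Factory>();

    // Request path: may still be running inside the old Factory...
    int Get() const { return factory->Get(); }

    // ...while a data swap destroys it here.
    void Swap() { factory = std::make_unique<Factory>(); }
};

int main()
{
    Watchdog watchdog;

    std::thread requests([&] {
        for (int i = 0; i < 1000000; ++i)
            (void)watchdog.Get();
    });
    std::thread updates([&] {
        for (int i = 0; i < 1000; ++i)
        {
            watchdog.Swap();
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    });

    requests.join();
    updates.join();
}
```

Built with -fsanitize=address (or -fsanitize=thread), a loop like this typically reports a heap-use-after-free on the reader thread, which is the same failure mode as the stack trace above.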

@danpat
Member

danpat commented Sep 30, 2020

Ran this under -fsanitize=address and it confirmed my hunch: when a data swap is triggered, data_watchdog replaces facade_factory, triggering the old version's destruction, but the old factory may still be in use by a call to Get() on another thread.

I reverted the previous change and pushed a better guarantee for the lifetime of the facade_factory. I'm not able to reproduce the crash locally using the scenario that reproduced it quite quickly without this fix. Edit: never mind, it's still crashing for roughly the same reason.

@vladf90
Author

vladf90 commented Sep 30, 2020

Perhaps we could block calls with a lock while a data swap is in progress?

@danpat
Member

danpat commented Sep 30, 2020

@vladf90 Yeah, I went ahead and did that in the PR. I think the original author either didn't realize this particular bit of code wasn't thread-safe, or was trying to keep it lock-free so as not to impact request performance. I benchmarked having a lock in place and it doesn't seem to have any significant impact.
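
A minimal sketch of that approach, assuming C++17's std::shared_mutex (this is not the actual diff in #5844; Factory again stands in for the real facade factory): the watchdog takes an exclusive lock only for the pointer swap, while request threads take shared locks around Get(), so reads run in parallel and only block briefly during a data swap.

```cpp
// Hedged sketch of the reader/writer-lock idea, not the actual change in #5844.
#include <memory>
#include <mutex>
#include <shared_mutex>

struct Factory { int Get() const { return 42; } };  // hypothetical stand-in

class Watchdog
{
  public:
    // Request threads: shared lock, so many Get() calls can run concurrently.
    int Get() const
    {
        std::shared_lock<std::shared_mutex> lock(factory_mutex);
        return factory->Get();
    }

    // Dataset swap: exclusive lock, so no request can be inside the old
    // factory while (or after) it is destroyed.
    void Swap()
    {
        auto fresh = std::make_unique<Factory>();  // build outside the lock
        std::unique_lock<std::shared_mutex> lock(factory_mutex);
        factory = std::move(fresh);
    }

  private:
    mutable std::shared_mutex factory_mutex;
    std::unique_ptr<Factory> factory = std::make_unique<Factory>();
};
```

Because the exclusive lock is held only for the pointer assignment and the shared locks are uncontended the rest of the time, the per-request overhead stays small, which is consistent with the benchmark observation above.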

@vladf90
Author

vladf90 commented Oct 9, 2020

I can confirm that it works as expected now. I've been running the new build for almost 24h and no crash so far. Thank you for solving this so fast!

vladf90 closed this as completed Oct 9, 2020