Commit
[Core] No RAY_LOG in the constructor of DelayManager (ray-project#26958)
We encountered a SIGSEGV when running the Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0 0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
#1 0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2 0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3 0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4 0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5 0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6 0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7 0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8 0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>) at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when `_raylet.so` is loaded, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. At that point, the static variables declared in `logging.cc` have not been initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`), so the logger reads an unconstructed string. It's better not to rely on the initialization order of static variables in different compilation units, because C++ does not guarantee it.

I propose changing all `RAY_LOG`s in `DelayManager::Init()` to `std::cerr`.

The crash happens in Ant's internal codebase. I'm not sure why this test case passes in the community version, though.

BTW, I've tried two other approaches:

1. Using a static local variable in `get_delay_us` and removing the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initializing it in `get_delay_us`. This works, but it requires a lock to be thread-safe.

Signed-off-by: Stefan van der Kleij <[email protected]>
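For illustration, the initialization-order issue described above can be sketched in a single file. This is a minimal sketch under stated assumptions, not the actual Ray code: `LoggerName()`, `raw_config()`, `logger_name_seen()` and `GetDelayManager()` are hypothetical names, and the `DelayManager` here is heavily simplified (it stores the raw environment value and does no parsing). It shows two things that are safe inside a static initializer: writing to `std::cerr` (including `<iostream>` guarantees the standard streams are set up before statics defined later in the same translation unit), and reading a function-local static, which is constructed on first call rather than in an unspecified cross-unit order.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical stand-in for RayLog::logger_name_. As a function-local static
// it is constructed on the first call, so it is valid even when that call
// comes from the constructor of another static object.
const std::string &LoggerName() {
  static const std::string name = "ray_log_sink";
  return name;
}

// Simplified, hypothetical DelayManager: it only records the raw value of the
// environment variable and the logger name it observed during Init().
class DelayManager {
 public:
  DelayManager() { Init(); }

  void Init() {
    const char *delay_env = std::getenv("RAY_testing_asio_delay_us");
    if (delay_env != nullptr) {
      // std::cerr is usable here, unlike a logger whose own statics live in
      // another compilation unit and may not be constructed yet.
      std::cerr << "RAY_testing_asio_delay_us is set to " << delay_env
                << std::endl;
      raw_config_ = delay_env;
    }
    logger_name_seen_ = LoggerName();
  }

  const std::string &raw_config() const { return raw_config_; }
  const std::string &logger_name_seen() const { return logger_name_seen_; }

 private:
  std::string raw_config_;
  std::string logger_name_seen_;
};

namespace {
// Runs DelayManager::Init() during static initialization, as when a shared
// library is loaded.
DelayManager _delay_manager;
}  // namespace

const DelayManager &GetDelayManager() { return _delay_manager; }
```

Calling `GetDelayManager().logger_name_seen()` from `main` should return the fully constructed logger name. Had `LoggerName()` instead returned a namespace-scope `std::string` defined in a different compilation unit, the same read could have touched an unconstructed object, which matches the crash in the stack above.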