Skip to content

Commit

Permalink
[Core] No RAY_LOG in the constructor of DelayManager (ray-project#26958)
Browse files Browse the repository at this point in the history
We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0  0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
   from /lib64/libstdc++.so.6
ray-project#1  0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
ray-project#2  0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
ray-project#3  0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
ray-project#4  0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
   from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
ray-project#5  0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
ray-project#6  0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
ray-project#7  0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
ray-project#8  0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
ray-project#9  0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
ray-project#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
ray-project#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
ray-project#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
ray-project#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
ray-project#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
ray-project#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
ray-project#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
    at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).

It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and remove the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initialize it in `get_delay_us`. This works but it requires a lock to be thread-safe.

Signed-off-by: Stefan van der Kleij <[email protected]>
  • Loading branch information
kfstorm authored and Stefan van der Kleij committed Aug 18, 2022
1 parent 732c6c1 commit b3de112
Show file tree
Hide file tree
Showing 2 changed files with 18 additions and 10 deletions.
23 changes: 13 additions & 10 deletions src/ray/common/asio/asio_chaos.cc
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@
#include "absl/strings/numbers.h"
#include "absl/strings/str_split.h"
#include "ray/common/ray_config.h"
#include "ray/util/logging.h"

namespace ray {
namespace asio {
Expand Down Expand Up @@ -65,7 +64,7 @@ class DelayManager {
if (delay_env.empty()) {
return;
}
RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;
std::cerr << "RAY_testing_asio_delay_us is set to " << delay_env << std::endl;
std::vector<std::string_view> items = absl::StrSplit(delay_env, ",");
for (const auto &item : items) {
ParseItem(item);
Expand All @@ -77,8 +76,9 @@ class DelayManager {
void ParseItem(std::string_view val) {
std::vector<std::string_view> item_val = absl::StrSplit(val, "=");
if (item_val.size() != 2) {
RAY_LOG(FATAL) << "Error in syntax: " << val
<< ", expected method=min_us:max:ms. Skip this entry.";
std::cerr << "Error in syntax: " << val
<< ", expected method=min_us:max:ms. Skip this entry." << std::endl;
_Exit(1);
}
auto delay_us = ParseVal(item_val[1]);
if (item_val[0] == "*") {
Expand All @@ -91,18 +91,21 @@ class DelayManager {
std::pair<int64_t, int64_t> ParseVal(std::string_view val) {
std::vector<std::string_view> delay_str_us = absl::StrSplit(val, ":");
if (delay_str_us.size() != 2) {
RAY_LOG(FATAL) << "Error in syntax: " << val
<< ", expected method=min_us:max:ms. Skip this entry";
std::cerr << "Error in syntax: " << val
<< ", expected method=min_us:max:ms. Skip this entry" << std::endl;
_Exit(1);
}
std::pair<int64_t, int64_t> delay_us;
if (!absl::SimpleAtoi(delay_str_us[0], &delay_us.first) ||
!absl::SimpleAtoi(delay_str_us[1], &delay_us.second)) {
RAY_LOG(FATAL) << "Error in syntax: " << val
<< ", expected method=min_us:max:ms. Skip this entry";
std::cerr << "Error in syntax: " << val
<< ", expected method=min_us:max:ms. Skip this entry" << std::endl;
_Exit(1);
}
if (delay_us.first > delay_us.second) {
RAY_LOG(FATAL) << delay_us.first << " is bigger than " << delay_us.second
<< ". Skip this entry.";
std::cerr << delay_us.first << " is bigger than " << delay_us.second
<< ". Skip this entry." << std::endl;
_Exit(1);
}
return delay_us;
}
Expand Down
5 changes: 5 additions & 0 deletions src/ray/common/test/asio_defer_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -43,3 +43,8 @@ TEST(AsioChaosTest, WithGlobal) {
ASSERT_TRUE(EnsureBelow("method2", 20, 30));
ASSERT_TRUE(EnsureBelow("others", 100, 200));
}

int main(int argc, char **argv) {
::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}

0 comments on commit b3de112

Please sign in to comment.