[Don't merge] Fix bug of raft request cv missing signal #3152
Conversation
I have noticed this problem as well when trying to figure out other issues, this PR LGTM. I'll open another PR later.
This PR may or may not fix #3019; need to check later. In some cases, when the storage service exits while a leader election is ongoing, the CV would not be woken up.
Hold on a second, the …
See https://en.cppreference.com/w/cpp/thread/condition_variable/notify_all. The lock is to protect the critical section, which is …
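On the notify_all/lock point, here is a minimal sketch of the lost-wakeup race being discussed. It is hypothetical, not the Nebula code: the names `requestOnGoing_`, `noMoreRequestCV_` and `lock_` are borrowed from the gdb session below but used as plain globals.

```cpp
#include <condition_variable>
#include <mutex>

// Simplified stand-ins for the Host members seen in the gdb session.
std::mutex lock_;
std::condition_variable noMoreRequestCV_;
bool requestOnGoing_ = true;

// Waiter, analogous to Host::reset(): blocks until no request is on-going.
void waiter() {
  std::unique_lock<std::mutex> g(lock_);
  noMoreRequestCV_.wait(g, [] { return !requestOnGoing_; });
}

// Racy notifier: the shared predicate is changed WITHOUT holding lock_.
// The waiter may evaluate the predicate under the mutex, see it still
// unsatisfied, and only then block; if this write and the notify_all()
// land in that window, the notification goes to no one and the waiter
// sleeps forever. The unsynchronized write is also a data race.
void racyNotifier() {
  requestOnGoing_ = false;
  noMoreRequestCV_.notify_all();
}
```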
I noticed this problem too on a sleepless night; the good news is that the spot is still available for us to debug :-)
@critical27 I see this from gdb:

(gdb) up
#3 0x0000000003d448a5 in nebula::raftex::Host::reset (this=0x7f4d3843d310) at /root/src/wwl/nebula/src/kvstore/raftex/Host.h:55
55 noMoreRequestCV_.wait(g, [this] { return !requestOnGoing_; });
(gdb) p requestOnGoing_
$1 = true

And see this:

(gdb) up
#6 0x0000000003d530bd in nebula::raftex::RaftPart::processAppendLogRequest (this=0x7f4d384be010, req=..., resp=...)
at /root/src/wwl/nebula/src/kvstore/raftex/RaftPart.cpp:1373
warning: Source file is more recent than executable.
1373 std::lock_guard<std::mutex> g(raftLock_);
(gdb) p bufferOverFlow_
$2 = {_M_base = {static _S_alignment = 1, _M_i = true}, static is_always_lock_free = true}
(gdb) p replicatingLogs_
$3 = {_M_base = {static _S_alignment = 1, _M_i = true}, static is_always_lock_free = true}
Both are true.
New discovery: the blocked raft instance might be waiting for a response from another peer.

(gdb) up
#3 0x0000000003d448a5 in nebula::raftex::Host::reset (this=0x7f4d3843d310) at /root/src/wwl/nebula/src/kvstore/raftex/Host.h:55
55 noMoreRequestCV_.wait(g, [this] { return !requestOnGoing_; });
(gdb) p addr_
$1 = {host = {static npos = 18446744073709551615,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x7f4d3843d340 "192.168.15.11"}, _M_string_length = 13, {_M_local_buf = "192.168.15.11\000\000", _M_allocated_capacity = 3330471509296822577}},
    port = 37550}

It (the last instance) is waiting on the second instance, which is still alive:

(root@nebula) [(none)]> show hosts;
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| Host | Port | Status | Leader count | Leader distribution | Partition distribution |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 59007 | "ONLINE" | 1 | "ttos_3p3r:1" | "ttos_3p3r:1" |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 37549 | "ONLINE" | 0 | "No valid partition" | "ttos_3p3r:1" |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 56039 | "OFFLINE" | 0 | "No valid partition" | "ttos_3p3r:1" |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "Total" | | | 1 | "ttos_3p3r:1" | "ttos_3p3r:3" |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
Got 4 rows (time spent 1248/11351 us)
Wed, 20 Oct 2021 10:22:36 CST
  self->setResponse(resp);
  self->lastLogIdSent_ = self->logIdToSend_ - 1;
}
std::lock_guard<std::mutex> g(self->lock_);
This guard's lifetime was expanded (to include this log), why?
Just forget about this PR for the moment; it's probably not the right fix.
This PR tries to fix #3147, which might be one of a series of typical missing-signal problems with condition variables. The key point is that we should grab the lock associated with the condition variable at the time we send the signal, to make sure the waiting thread does not miss it.
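A minimal sketch of that idea, under the same assumptions as the earlier sketch (plain globals standing in for the Host members; this is not the actual patch):

```cpp
#include <condition_variable>
#include <mutex>

std::mutex lock_;
std::condition_variable noMoreRequestCV_;
bool requestOnGoing_ = true;

// Fixed notifier: take the mutex associated with the condition variable
// before changing the predicate and sending the signal. The waiter is
// then either not yet waiting (it will re-check the predicate under the
// same mutex and see it already satisfied) or already blocked inside
// wait() and will receive the notification; the wakeup can no longer
// fall into the gap.
void fixedNotifier() {
  std::lock_guard<std::mutex> g(lock_);
  requestOnGoing_ = false;
  noMoreRequestCV_.notify_all();
}
```

Per the cppreference page linked earlier, what must happen under the mutex is the update of the shared predicate; the notify_all() call itself may also be issued after the lock is released.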