
[Don't merge] Fix bug of raft request cv missing signal #3152

Closed

Conversation

@kikimo (Contributor) commented Oct 19, 2021

This PR tries to fix #3147, which might be a series of typical missed-signal problems with a condition variable. The key point is that we should hold the lock associated with the condition variable at the time we send the signal, to make sure the waiting thread does not miss it.
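A minimal sketch of the missed-signal race the description refers to, reusing the member names seen later in this thread (lock_, noMoreRequestCV_, requestOnGoing_); the surrounding struct and helper functions are hypothetical, not the actual NebulaGraph code:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for Host; only the synchronization pattern is
// meant to illustrate the description above.
struct HostSketch {
  std::mutex lock_;
  std::condition_variable noMoreRequestCV_;
  bool requestOnGoing_{false};

  // Racy variant: the flag is cleared without holding lock_. The notify can
  // then fire in the window between the waiter's predicate check and the
  // moment it actually blocks, so the wake-up is lost and the waiter sleeps
  // forever.
  void finishRequestRacy() {
    requestOnGoing_ = false;
    noMoreRequestCV_.notify_all();
  }

  // Variant this PR argues for: clear the flag while holding lock_, so the
  // waiter either observes the new value before sleeping or is already
  // blocked when the notification is sent.
  void finishRequestLocked() {
    std::lock_guard<std::mutex> g(lock_);
    requestOnGoing_ = false;
    noMoreRequestCV_.notify_all();
  }

  // Waiter, mirroring the wait seen in Host::reset() in the gdb output below.
  void waitNoMoreRequest() {
    std::unique_lock<std::mutex> g(lock_);
    noMoreRequestCV_.wait(g, [this] { return !requestOnGoing_; });
  }
};
```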

@critical27 (Contributor) left a comment


I have noticed this problem as well while trying to figure out other issues; this PR LGTM. I'll open another PR later.

@critical27 (Contributor) commented

This may or may not fix #3019; I need to check later. In some cases, when the storage service exits while a leader election is ongoing, the cv would not be woken up.

@critical27 (Contributor) commented

Hold on a second: requestOnGoing itself is already protected by the lock, so I doubt the cv.notify is the root cause. Once requestOnGoing is protected, whether the notify itself is issued under the lock is not so important. Why would we miss the signal?

@critical27 (Contributor) commented Oct 19, 2021

See https://en.cppreference.com/w/cpp/thread/condition_variable/notify_all. The lock is there to protect the critical section, which is requestOnGoing in this case.
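For reference, the pattern described on that cppreference page separates the two concerns: the write to the shared flag must happen while holding the mutex, but notify_all itself may be called after releasing it. A compact sketch with hypothetical names (not the actual Host code):

```cpp
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool requestOnGoing = true;  // stands in for requestOnGoing_

void finishRequest() {
  {
    std::lock_guard<std::mutex> g(m);
    requestOnGoing = false;  // the write to the shared state must be locked ...
  }
  cv.notify_all();           // ... but the notify itself may happen unlocked
}

void waitForFinish() {
  std::unique_lock<std::mutex> g(m);
  // The predicate is re-checked under the mutex, so as long as the write
  // above is serialized by the same mutex, a notify issued outside the lock
  // cannot be missed: the waiter either sees the new value before sleeping,
  // or is already blocked when notify_all() runs.
  cv.wait(g, [] { return !requestOnGoing; });
}
```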

@kikimo (Contributor, Author) commented Oct 19, 2021

> Hold on a second: requestOnGoing itself is already protected by the lock, so I doubt the cv.notify is the root cause. Once requestOnGoing is protected, whether the notify itself is issued under the lock is not so important. Why would we miss the signal?

I noticed this problem too during a sleepless night; the good news is that the scene is still available for us to debug :-).

@kikimo (Contributor, Author) commented Oct 19, 2021

@critical27 Here is what I see in gdb:

(gdb) up
#3  0x0000000003d448a5 in nebula::raftex::Host::reset (this=0x7f4d3843d310) at /root/src/wwl/nebula/src/kvstore/raftex/Host.h:55
55	    noMoreRequestCV_.wait(g, [this] { return !requestOnGoing_; });
(gdb) p requestOnGoing_
$1 = true

requestOnGoing_ is set to true, which is why this cv is blocking; there is no missed signal. Now I think #3141 is the fix for this problem. Since I've disabled logging, I can't verify this from the logs, but I still managed to attach to thread 3413735, a thread running RaftPart code that was recorded in the pstack attachment uploaded to issue #3147 (associated with this PR), and I see something interesting:

(gdb) up
#6  0x0000000003d530bd in nebula::raftex::RaftPart::processAppendLogRequest (this=0x7f4d384be010, req=..., resp=...)
    at /root/src/wwl/nebula/src/kvstore/raftex/RaftPart.cpp:1373
warning: Source file is more recent than executable.
1373	  std::lock_guard<std::mutex> g(raftLock_);
(gdb) p bufferOverFlow_
$2 = {_M_base = {static _S_alignment = 1, _M_i = true}, static is_always_lock_free = true}
(gdb) p replicatingLogs_
$3 = {_M_base = {static _S_alignment = 1, _M_i = true}, static is_always_lock_free = true}
(gdb)

Both bufferOverFlow_ and replicatingLogs_ are set to true, which means there might be ongoing log replications that are somehow stuck... Will work on this later...
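As an illustration of the state described here (the flag names come from the gdb output above; the send/callback structure is an assumption, not the actual RaftPart.cpp logic), a replication flag that is only ever cleared in an asynchronous completion callback will stay true forever if that callback never runs:

```cpp
#include <atomic>
#include <functional>

// Hypothetical sketch only: flags are raised before an async replication
// round and lowered in its completion callback. If the peer never responds
// and the callback never fires, both flags remain true, which matches the
// state captured in gdb above.
struct RaftPartSketch {
  std::atomic<bool> replicatingLogs_{false};
  std::atomic<bool> bufferOverFlow_{false};

  void replicateOnce() {
    replicatingLogs_.store(true);
    sendLogsToPeersAsync([this] {
      // Completion path: the only place the flags are reset.
      replicatingLogs_.store(false);
      bufferOverFlow_.store(false);
    });
  }

  // Stand-in for the real async RPC; in real code onFinished would be wired
  // to the peer's response.
  void sendLogsToPeersAsync(std::function<void()> onFinished) {
    (void)onFinished;
  }
};
```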

@kikimo (Contributor, Author) commented Oct 20, 2021

New discovery: the blocked raft instance might be waiting for a response from another peer, 192.168.15.11:37550:

(gdb) up
#3  0x0000000003d448a5 in nebula::raftex::Host::reset (this=0x7f4d3843d310) at /root/src/wwl/nebula/src/kvstore/raftex/Host.h:55
55	    noMoreRequestCV_.wait(g, [this] { return !requestOnGoing_; });
(gdb) p addr_
$1 = {host = {static npos = 18446744073709551615,
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
      _M_p = 0x7f4d3843d340 "192.168.15.11"}, _M_string_length = 13, {_M_local_buf = "192.168.15.11\000\000", _M_allocated_capacity = 3330471509296822577}},
  port = 37550}

It (the last instance) is waiting on the second instance, which is still alive:

(root@nebula) [(none)]> show hosts;
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| Host            | Port  | Status    | Leader count | Leader distribution  | Partition distribution |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 59007 | "ONLINE"  | 1            | "ttos_3p3r:1"        | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 37549 | "ONLINE"  | 0            | "No valid partition" | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 56039 | "OFFLINE" | 0            | "No valid partition" | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "Total"         |       |           | 1            | "ttos_3p3r:1"        | "ttos_3p3r:3"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
Got 4 rows (time spent 1248/11351 us)

Wed, 20 Oct 2021 10:22:36 CST

self->setResponse(resp);
self->lastLogIdSent_ = self->logIdToSend_ - 1;
}
std::lock_guard<std::mutex> g(self->lock_);
A reviewer (Contributor) commented on the diff context above:

The lifetime of this guard was expanded (to include this log). Why?

The author (kikimo) replied:

Just forget about this PR for the moment; it's probably not the right fix.

@critical27 critical27 changed the title Fix bug of raft request cv missing signal [Don't merge] Fix bug of raft request cv missing signal Oct 20, 2021
@kikimo kikimo closed this Oct 20, 2021
Labels
ready-for-testing (PR: ready for the CI test)
Development

Successfully merging this pull request may close these issues.

Raft deadlock
3 participants