-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #2146 E112 EHOSTDOWN in short and pooled connection #2177
base: master
Are you sure you want to change the base?
Conversation
test/brpc_channel_unittest.cpp
Outdated
@@ -1824,6 +1824,39 @@ class ChannelTest : public ::testing::Test{ | |||
MyEchoService _svc; | |||
}; | |||
|
|||
void TestBlockServer(bool single_server, bool short_connection, const char* lb) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestBlockServer 应该 是ChannelTest 的成员函数。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestBlockServer 应该 是ChannelTest 的成员函数。
哦哦 知道了
@@ -719,6 +719,37 @@ inline bool does_error_affect_main_socket(int error_code) { | |||
error_code == EINVAL/*returned by connect "0.0.0.1"*/; | |||
} | |||
|
|||
inline void maybe_block_server(int error_code, Controller* cntl, SharedLoadBalancer* lb, SocketId sock) { | |||
if (!does_error_affect_main_socket(error_code)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
连接超时的情况可以优化吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我这周再研究研究。。
int rc = _lb->SelectServer(sel_in, &sel_out); | ||
if (rc == EHOSTDOWN) { | ||
// If no server is available, include accessed server and try to SelectServer again | ||
sel_in.excluded = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这样是不是就会选到期望要excluded的实例呢?这样合理吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
目前excluded的实例是在之前的重试中选过的实例,如果返回EHOSTDOWN说明已经没有实例可选,RPC必然失败,那就退而求其次,选择excluded中的实例,这样还有成功的可能
CI单测失败了 |
嗯嗯。。 我先在本地调试一下。 |
@@ -608,11 +608,11 @@ TEST_F(SocketTest, app_level_health_check) { | |||
|
|||
for (int i = 0; i < 4; ++i) { | |||
// although ::connect would succeed, the stall in hc_service makes | |||
// the health check rpc fail. | |||
// the health check rpc overtime but not fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里的备注需要修改一下
@@ -1070,8 +1074,9 @@ TEST_F(LoadBalancerTest, revived_from_all_failed_intergrated) { | |||
// all servers are down, the very first call that trigger recovering would |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里的备注需要修改一下
(sending_sock == NULL || sending_sock->id() != peer_id)) { | ||
Socket::SetFailed(peer_id); | ||
maybe_block_server(error_code, c, c->_lb.get(), peer_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
实现是"block socket"不是block server". 或许叫checkAndSetFailed更明确一点?
// main socket should die as well | ||
// NOTE: main socket may be wrongly set failed (provided that | ||
// short/pooled socket does not hold a ref of the main socket). | ||
// E.g. a in-parallel RPC sets the peer_id to be failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'an'
@@ -719,6 +719,37 @@ inline bool does_error_affect_main_socket(int error_code) { | |||
error_code == EINVAL/*returned by connect "0.0.0.1"*/; | |||
} | |||
|
|||
inline void maybe_block_server(int error_code, Controller* cntl, SharedLoadBalancer* lb, SocketId sock) { | |||
if (!does_error_affect_main_socket(error_code)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么要把does_error_affect_main_socket放到这个函数里?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
因为这个函数最终的动作是SetFailed main socket,所以要先检查一下错误号是否满足影响main socket 的条件
What problem does this PR solve?
Issue Number:
Problem Summary:
以下引起E112错误之后无法自行恢复的几种场景:
配置dns访问外网,dns只解析出一个ip,然后网络抖动过程中出现少量连接失败,返回错误码为ECONNREFUSED或ENETUNREACH或EHOSTUNREACH或EINVAL,满足does_error_affect_main_socket从而触发main socket SetFailed。
但是正常来说,SetFailed之后过100ms就开始健康检查,每3秒重试一次。所以按理当网格抖动恢复后,最长3秒后,就不应该再出现E112了。但是实际上仍有E112持续出现,如果重启进程,则不会再出现。
What is changed and the side effects?
Changed:
目前针对各种原因导致的E112错误,提供一套通用的解决方案:
一、防止实例封禁导致的E112:在对实例进行封禁的时候,加上是否为唯一可用实例判断
二、防止重试去重或LB调权导致的E112:在选择实例的时候,增加一次额外的尝试
注意:以上方案仅对连接池和短连接模式下生效,单连接的情况下,main socket和rpc实际通信的socket是同一个,通信过程可能有各种原因导致socket被SetFail,从而导致main socket失效,无法被LB选择。解决该问题的根本办法是将单连接的main socket和rpc通信的socket进行拆分(有点类似于只有一个连接的连接池),当rpc通信socket SetFailed之后,可以从main socket新建新的socket进行连接,但这个方案对brpc的改动较大。
Side effects:
Performance effects(性能影响):
Breaking backward compatibility(向后兼容性):
Check List: