Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent_monitor线程死锁问题 #146

Open
weipatty opened this issue Nov 16, 2017 · 5 comments
Open

agent_monitor线程死锁问题 #146

weipatty opened this issue Nov 16, 2017 · 5 comments

Comments

@weipatty
Copy link

weipatty commented Nov 16, 2017

大概情况为:
三台机子,一台由于末知原因一直追同步不上。然后剩下两台选不出主(主信息过期,一直选不出)。集群不工作。

调查发现剩下的两台机子的agent_monitor线程都hang住了
一台估计hang在

int Committer :: NewValueGetIDNoRetry(const std::string & sValue, uint64_t & llInstanceID, SMCtx * poSMCtx)
{
    LogStatus();

    int iLockUseTimeMs = 0;
    bool bHasLock = m_oWaitLock.Lock(m_iTimeoutMs, iLockUseTimeMs);
    if (!bHasLock)
    {
        if (iLockUseTimeMs > 0)
        {
            BP->GetCommiterBP()->NewValueGetLockTimeout();
            PLGErr("Try get lock, but timeout, lockusetime %dms", iLockUseTimeMs);
            return PaxosTryCommitRet_Timeout; 
        }
        else
        {
            BP->GetCommiterBP()->NewValueGetLockReject();
            PLGErr("Try get lock, but too many thread waiting, reject");
            return PaxosTryCommitRet_TooManyThreadWaiting_Reject;
        }
    }

    int iLeftTimeoutMs = -1;
    if (m_iTimeoutMs > 0)
    {
        iLeftTimeoutMs = m_iTimeoutMs > iLockUseTimeMs ? m_iTimeoutMs - iLockUseTimeMs : 0;
        if (iLeftTimeoutMs < 200)
        {
            PLGErr("Get lock ok, but lockusetime %dms too long, lefttimeout %dms", iLockUseTimeMs, iLeftTimeoutMs);

            BP->GetCommiterBP()->NewValueGetLockTimeout();

            m_oWaitLock.UnLock();
            return PaxosTryCommitRet_Timeout;
        }
    }

    PLGImp("GetLock ok, use time %dms", iLockUseTimeMs);
    
    BP->GetCommiterBP()->NewValueGetLockOK(iLockUseTimeMs);

    //pack smid to value
    int iSMID = poSMCtx != nullptr ? poSMCtx->m_iSMID : 0;
    
    string sPackSMIDValue = sValue;
    m_poSMFac->PackPaxosValue(sPackSMIDValue, iSMID);

    m_poCommitCtx->NewCommit(&sPackSMIDValue, poSMCtx, iLeftTimeoutMs);
    m_poIOLoop->AddNotify();

    int ret = m_poCommitCtx->GetResult(llInstanceID);

    m_oWaitLock.UnLock();
    return ret;
}

估计在bool bHasLock = m_oWaitLock.Lock(m_iTimeoutMs, iLockUseTimeMs);这一句
是取不到锁

hang住前日志如下:

I1115 04:23:19.002990 92537 phx_glog.cpp:86]  Process check running ret 0 
I1115 04:23:20.000658 92537 phx_glog.cpp:86]  DoQuery mysql_query show variables like 'super_read_only'; done 0,  
E1115 04:23:22.002826 92537 phx_glog.cpp:78]  DoQuery[mysql] mysql_query fail 1, Lost connection to MySQL server during query, command show status like 'Slave_running'; 
I1115 04:23:22.003125 92537 phx_glog.cpp:86]  DoQuery mysql_query show status like 'Slave_running'; done 5002, Lost connection to MySQL server during query 
I1115 04:23:22.003175 92537 phx_glog.cpp:86]  CheckRunning check super read only OFF is master 1 ret 0 
I1115 04:23:22.003180 92537 phx_glog.cpp:86]  Process check running ret 0 
I1115 04:23:22.003768 92537 phx_glog.cpp:86]  DoQuery mysql_query show global variables like 'gtid_executed'; done 0,  
I1115 04:23:22.003829 92537 phx_glog.cpp:86]  IsGTIDCompleted get gtid 61a0f7cd-c47b-11e7-996a-246e960fda64:1173213 max gtid 61a0f7cd-c47b-11e7-996a-246e960fda64:1173210, master start 

I1115 04:23:22.003849 92537 phx_glog.cpp:86]  SendMasterInfo ip xxx.xxx.xxx.xxx svr id -611884052 version 92025 expire time 0 admin username size 2 replica username size 2 
E1115 04:23:22.003861 92537 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait ms 0 reject rate 0

hang住前有个 DoQuery[mysql] mysql_query fail 1, Lost connection to MySQL server during query, command show status like 'Slave_running';

在15号04:23,然后就没日志了,其中xxx.xxx.xxx.xxx是自己,然后15号白天重启过一次,mysql一直卡在so的2017-11-16 02:35:52 24033 [Note] get last send gtid failed, svrid [61a0f7cd-c47b-11e7-996a-246e960fda64] ret -1启不来, agent_monitor一直连不上mysql报错。然后是在1116 02:35:54左右可以了,选出了主。因为重启了hang后日志就不发了?

另一台hang在
Committer :: NewValueGetIDNoRetry的GetResult中

int CommitCtx :: GetResult(uint64_t & llSuccInstanceID)
{
    while (!m_bIsCommitEnd)
    {
        m_oSerialLock.WaitTime(1000);
    }

    if (m_iCommitRet == 0)
    {
        llSuccInstanceID = m_llInstanceID;
        PLGImp("commit success, instanceid %lu", llSuccInstanceID);
    }
    else
    {
        PLGErr("commit fail, ret %d", m_iCommitRet);
    }
    
    m_oSerialLock.UnLock();

    return m_iCommitRet;
}

是在等提交吧,

hang住前日志如下:

I1115 04:23:48.003327 28923 phx_glog.cpp:86]  SendMasterInfo ip xxx.xxx.xxx.xxx svr id -1222346155 v
```ersion 92025 expire time 0 admin username size 2 replica username size 2 
E1115 04:23:48.003347 28923 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait `ms` 0 reject rate 0
I1115 04:23:48.003535 28923 phx_glog.cpp:86]  Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms 
W1115 04:23:48.003551 28923 phx_glog.cpp:74]  Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135 

然后是过了快一天,
hang住后日志如下:

E1116 02:35:52.248410 28923 phx_glog.cpp:78]  ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14 
E1116 02:35:52.250871 28923 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait ms 0 reject rate 0
I1116 02:35:52.256392 28923 phx_glog.cpp:86]  Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms 
W1116 02:35:52.272595 28923 phx_glog.cpp:74]  Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135 
E1116 02:35:52.282235 28923 phx_glog.cpp:78]  ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14 
I1116 02:35:52.282516 28923 phx_glog.cpp:86]  Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms 
W1116 02:35:52.282620 28923 phx_glog.cpp:74]  Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135 
E1116 02:35:52.293189 28923 phx_glog.cpp:78]  ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14 
E1116 02:35:52.293473 28923 phx_glog.cpp:78]  proposal fail, ret -2500 
E1116 02:35:52.293752 28923 phx_glog.cpp:78]  SendMasterInfo set master fail, ret -2500 
I1116 02:35:52.294033 28923 phx_glog.cpp:86]  GetCurrentInstanceInterval 0x2c2f4f0 get newest instance id 35503177 oldest instanceid 26270944 

卡了一天最终返回 14 PaxosTryCommitRet_Conflict 是一个冲突了

请问是什么原因会导致这种死锁?

@weipatty
Copy link
Author

image

新跟踪发现: 8001的paxos协议端口,应该不可能有这么大的包吧?要分片了,有一台机子没认这个包,上层没收到,导致选主一直选不通过。

@wodesuck
Copy link
Collaborator

请问这个问题有没有复现的方法呢?

@weipatty
Copy link
Author

暂时没有重现方法,但我们用的是比较旧的版本,有没有可能是这个bug的原因?
https://github.com/Tencent/phxpaxos/releases/tag/v1.1.1

修复了一个重大BUG,某些编译环境下会导致单调时间获取不正确,从而使得设置了CLOCK_MONOTONIC的condition工作不正常,表现为运行卡死,Master无法选举等。如出现类似现象的,建议立即更新此版本。

@wodesuck
Copy link
Collaborator

我也不确定是不是版本问题,复现不了的话有点难查。。你现在用的版本是多少?

@weipatty
Copy link
Author

具体哪个版本不清楚,是在phxpaxos更新这个bug之前的一个版本,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants