Robust hash ring failure retry mechanism #14
I too thought about this. Implementing a background heartbeat check, however, may actually be too difficult and not worth the effort (I'm not sure that twemproxy's design supports anything running in the background at all; I may be wrong though). What could be done instead is something along these lines (pseudocode):
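A minimal sketch of what such a dispatch path could look like (hypothetical types and helper names, not twemproxy's actual API):

```c
/* Sketch only: retry the original server once its back-off has expired,
 * and fall back to the next live server only if that probe still fails. */
#include <stdbool.h>
#include <stdint.h>

struct srv {
    bool    failed;      /* currently marked as failed? */
    int64_t next_retry;  /* earliest time (usec) we may probe it again */
};

/* Placeholders for the real dispatch machinery. */
extern bool forward_request(struct srv *s);
extern struct srv *pick_next_live(void);

struct srv *dispatch(struct srv *chosen, int64_t now, int64_t retry_timeout)
{
    if (!chosen->failed) {
        return chosen;                         /* healthy: use it as-is */
    }
    if (now < chosen->next_retry) {
        return pick_next_live();               /* still backing off: skip it */
    }
    /* back-off expired: spend this one request probing the original server */
    if (forward_request(chosen)) {
        chosen->failed = false;                /* it came back: reinstate it */
        return chosen;
    }
    chosen->next_retry = now + retry_timeout;  /* still down: extend back-off */
    return pick_next_live();                   /* and serve from another host */
}
```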
It would allow the given request not to fail and instead be dispatched to another host if the original server is still down after the retry_timeout has expired. The cost of such an approach is one client request delayed for the duration of the failed retry.
+1 for this request, but being done in the background...
@shapirus would you be interested in submitting a patch for this? :)
Probably, but I will need to understand the current code and algorithms. Where do the server failure handling routines live and where do they get called from?
The server failure handling routine is in nc_server.c:server_failure(). It is triggered whenever a server is closed - nc_server.c:server_close() - which gets invoked from conn->close() in core_close(). The server struct in nc_server.h maintains two fields to keep track of failures:
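Roughly, the two fields in question (abridged sketch; exact comments may differ):

```c
/* struct server in nc_server.h (abridged) */
struct server {
    /* ... */
    int64_t  next_retry;     /* next retry time in usec */
    uint32_t failure_count;  /* # consecutive failures */
    /* ... */
};
```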
The server_pool struct in nc_server.h keeps track of how many failures to allow before dead/live servers are ejected and/or put back.
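The corresponding knobs on the pool side look roughly like this (abridged sketch):

```c
/* struct server_pool in nc_server.h (abridged) */
struct server_pool {
    /* ... */
    unsigned auto_eject_hosts:1;    /* eject dead servers from the ring? */
    int64_t  server_retry_timeout;  /* time before retrying an ejected server, in usec */
    uint32_t server_failure_limit;  /* consecutive failures allowed before ejection */
    /* ... */
};
```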
Whenever a server is ejected or put back, we have to rebuild the hash ring. See server_pool_run() and server_pool_update(); the latter is called whenever the pool's view of live servers needs to be refreshed.
One idea for implementing what @yashh is asking for is to use the "normal traffic" to eject a server on a failure. But once a server is ejected, we use a timer to do background "heartbeat traffic" on the bad server, and add the server back only if these heartbeat checks succeed. Timers and timer expiry are implemented in twemproxy using the event loop; see core_timeout(). @shapirus let me know if you have questions or thoughts.
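A rough sketch of that timer-driven heartbeat, assuming it is driven from the event loop's timeout path (hypothetical helpers, not existing twemproxy functions):

```c
#include <stdbool.h>
#include <stdint.h>

struct ejected_server {
    int64_t  next_heartbeat;  /* usec timestamp of the next probe */
    uint32_t ok_count;        /* consecutive successful probes */
};

/* Placeholders for the real probe and ring-rebuild machinery. */
extern bool send_heartbeat(struct ejected_server *s);   /* e.g. a simple "get" */
extern void put_server_back(struct ejected_server *s);  /* would trigger server_pool_run() */

#define HEARTBEAT_INTERVAL_USEC 1000000LL  /* 1 sec between probes */
#define HEARTBEAT_REQUIRED      3          /* successes needed before put-back */

void heartbeat_timeout(struct ejected_server *s, int64_t now)
{
    if (now < s->next_heartbeat) {
        return;                            /* probe not due yet */
    }
    s->next_heartbeat = now + HEARTBEAT_INTERVAL_USEC;

    if (!send_heartbeat(s)) {
        s->ok_count = 0;                   /* reset on any failure */
        return;
    }
    if (++s->ok_count >= HEARTBEAT_REQUIRED) {
        put_server_back(s);                /* rebuild the hash ring */
    }
}
```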
@manjuraj Hi, could you review my code for this issue? I haven't sent a patch for it yet. Briefly speaking:
@manjuraj, I added a function which checks servers' status with a 'get' command. There are 2 steps: 1] check the connection, 2] check the command. But there is a limitation: it needs a timeout setting. I think it would be better to implement a cron module like redis's serverCron and regularly check all servers' status.
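A rough sketch of that two-step check, with hypothetical helper names rather than the actual patch:

```c
#include <stdbool.h>
#include <stdint.h>

struct probe_target {
    const char *host;
    uint16_t    port;
};

/* Placeholders for the real connect/command plumbing. */
extern bool probe_connect(const struct probe_target *t, int64_t timeout_usec);
extern bool probe_get(const struct probe_target *t, const char *key,
                      int64_t timeout_usec);

/* Step 1: can we connect? Step 2: does a simple "get" come back in time?
 * The key name is an arbitrary sentinel. */
bool server_is_healthy(const struct probe_target *t, int64_t timeout_usec)
{
    if (!probe_connect(t, timeout_usec)) {
        return false;
    }
    return probe_get(t, "__heartbeat__", timeout_usec);
}
```

Run periodically, serverCron-style, a failed check would leave the server ejected, while a run of successful checks would put it back into the ring.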
"What will your client do when a server is unavailable or provides an invalid response? In the dark days of memcached, the default was to always "failover", by trying the next server in the list. That way if a server crashes, its keys will get reassigned to other instances and everything moves on happily. However there're many ways to kill a machine. Sometimes they don't even like to stay dead. Given the scenario: Sysadmin Bob walks by Server B and knocks the ethernet cable out of its port. Server B's ethernet clip was broken by Bob's folly and later falls out of its port unattended. Another erroneous client feature would actually amend the server list when a server goes out of commission, which ends up remapping far more keys than it should. Modern life encourages the use of "Failure", when possible. That is, if the server you intend to fetch or store a cache entry to is unavailable, simply proceed as though it was a cache miss. You might still flap between old and new data if you have a Server B situation, but the effects are reduced." I copied this from https://code.google.com/p/memcached/wiki/NewConfiguringClient - Failure or Failover, |
So once a host fails and we have a server_retry_timeout of 30 secs, nutcracker retries the failed host on production traffic. I think nutcracker needs to perform a background heartbeat request, like fetching a simple key, to make sure the host is up.
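For reference, the behavior being described is driven by per-pool settings in the nutcracker config; something along these lines (key names as documented in the README, values purely illustrative):

```yaml
alpha:
  listen: 127.0.0.1:22121
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true       # drop dead servers from the hash ring
  server_retry_timeout: 30000  # msec to wait before retrying an ejected server
  server_failure_limit: 3      # consecutive failures before ejection
  servers:
   - 127.0.0.1:11211:1
   - 127.0.0.1:11212:1
```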