Add a backoff to master_call() and router_call() when master discovery fails multiple times #447
Comments
I don't understand the motivation of the ticket and its roots. Is the problem that too many errors are written to the log? If so, then how will a backoff help us reduce their amount?
Should this be done inside
How will we distinguish a master discovery failure from all other errors? You say that this is not
Do we want to back off on all returned errors now? Currently it's only special ones, defined in
If we're trying to reduce the load on masters, then we should probably implement the backoff right inside
If we're trying to reduce the number of error messages, then we have to avoid logging an error when it happens several times in a row, everywhere it can happen, which I don't really like and don't understand why it should be done.
Yes.
It does. It calls
Services already have backoffs; they have nothing to do with this ticket, from what I saw.
There is no place to add it inside replicaset. The retrying is done on the layer above, in storage and router code.
Doesn't have to be. Currently the problem appears with a single call.
I honestly don't know. I didn't do any design for this ticket yet. I've only identified the issue and filed a ticket for it. I would appreciate it if, in case you want to do the ticket, you investigated a bit more yourself what the options are here.
Could be done, right. I don't remember why it wasn't done for
I guess that a call to the master assumes that the request can possibly write data. If it is unclear whether the data was written (say, the network breaks after transmitting the request), we can't retry. That doesn't mean there are no error types that could be retried on the master. I don't know whether this is the situation you're discussing; it's just a general thought.
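To make that thought concrete, here is a minimal Lua sketch (not vshard code; the error kinds below are invented purely for illustration) of how such a retryability check could look: a write may be retried on the master only when we know the request was never delivered.

-- Minimal sketch, not vshard code: the error kinds are hypothetical.
local function may_retry_on_master(err)
    if err.kind == 'connect_failed' then
        -- The request never left the router, so nothing could have been
        -- written on the master; a retry is safe.
        return true
    end
    if err.kind == 'timed_out_after_send' then
        -- The master may have already applied the write; retrying could
        -- execute it twice.
        return false
    end
    -- Be conservative about everything else.
    return false
end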
Reproduced on https://github.com/ImeevMA/tarantool/tree/imeevma/gh-8862-vshard-auto-masters. Caused by the router, which constantly retries a call, because

diff --git a/test/router-luatest/router_test.lua b/test/router-luatest/router_test.lua
index 28a3685..dbe2dc8 100644
--- a/test/router-luatest/router_test.lua
+++ b/test/router-luatest/router_test.lua
@@ -64,7 +64,7 @@ g.before_all(function()
end)
g.after_all(function()
- g.cluster:drop()
+ g.cluster:stop()
end)
g.test_basic = function(g)
@@ -645,3 +645,30 @@ g.test_router_service_info = function(g)
-- Restore everything back.
vtest.router_cfg(g.router, global_cfg)
end
+
+g.test_router_auto_master = function(g)
+ local new_cfg_template = table.deepcopy(cfg_template)
+ for _, rs in pairs(new_cfg_template.sharding) do
+ rs.master = 'auto'
+ for _, r in pairs(rs.replicas) do
+ r.master = nil
+ end
+ end
+
+ local new_cluster_cfg = vtest.config_new(new_cfg_template)
+ vtest.router_cfg(g.router, new_cluster_cfg)
+ vtest.cluster_cfg(g, new_cluster_cfg)
+
+ -- Change master of the first replicaset.
+ g.replica_1_a:eval('box.cfg{read_only = true}')
+ g.replica_1_b:eval('box.cfg{read_only = false}')
+
+ local bid = vtest.storage_first_bucket(g.replica_1_b)
+ local res, err = g.router:exec(callrw_get_uuid, {bid})
+ t.assert_equals(err, nil, 'no error')
+ t.assert_equals(res, g.replica_1_b:instance_uuid(), 'b is master')
+
+ -- Restore everything
+ vtest.cluster_cfg(g, global_cfg)
+ vtest.router_cfg(g.router, global_cfg)
+end
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index 4457b4c..316118b 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -1183,7 +1183,9 @@ local function master_search_service_f(router, service)
end
local timeout
local start_time = fiber_clock()
+ lfiber.testcancel()
local is_done, is_nop, err = master_search_step(router)
+ lfiber.testcancel()
if err then
log.error(service:set_status_error(
'Error during master search: %s', lerror.make(err)))
It takes 5 seconds for a new master to notice its transition. This is probably caused by the fact that the router is spamming calls without giving the fiber a chance to wake up and set
Solution: rework backoff of
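For illustration, here is a rough sketch of what such a reworked retry loop could look like (all names and the constant are assumptions, not the actual patch): the router sleeps between failed attempts instead of re-issuing the call immediately, which both throttles the calls and lets the master search fiber wake up and switch to the new master.

local fiber = require('fiber')

-- Illustrative constant, not a real vshard option.
local MASTER_RETRY_BACKOFF = 0.05

-- do_call is any function returning `result, error`; deadline is an
-- absolute fiber.clock() value.
local function call_with_backoff(do_call, deadline)
    local res, err
    while fiber.clock() < deadline do
        res, err = do_call()
        if res ~= nil then
            return res
        end
        -- Yield before the next attempt so other fibers, e.g. the master
        -- search service, get a chance to run.
        fiber.sleep(MASTER_RETRY_BACKOFF)
    end
    return nil, err
end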
@Gerold103, are you ok with such a solution?
Thanks for the investigation, sounds all good. Yes, 3.2 is preferable for now (until we find how to make |
If the master consistently reports garbage (for example, it says it is the master but refuses all write-mode requests), then the code shouldn't retry the calls right after each other. It might make sense to add a backoff timeout which grows with each such weird error up to some maximal value.
The alternative would be to remember the previous error and not report it when it happens again in a row. But a backoff timeout is probably better because it would reduce pressure on the faulty node.
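A minimal sketch of such a growing backoff, assuming made-up names and constants (this is not vshard's actual API): each consecutive weird error doubles the waiting time up to a cap, and any healthy response resets it.

local BACKOFF_INITIAL = 0.1
local BACKOFF_MAX = 5

local backoff = {timeout = 0}

-- Called after each suspicious error; returns how long to wait before the
-- next attempt.
function backoff:on_weird_error()
    if self.timeout == 0 then
        self.timeout = BACKOFF_INITIAL
    else
        self.timeout = math.min(self.timeout * 2, BACKOFF_MAX)
    end
    return self.timeout
end

-- Called after a successful, sane response.
function backoff:on_success()
    self.timeout = 0
end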