[DocDb] client: reduce excessive retries on NotFound errors #5932
Labels
area/docdb: YugabyteDB core features
kind/enhancement: This is an enhancement of an existing feature
priority/medium: Medium priority issue
Jira Link: DB-10827
There are some cases where the client `RetryFunc` keeps retrying RPCs that keep returning `NotFound`. For example, for `YBClient::Data::IsCreateTableInProgress`, it may get `NotFound` from master because

1. the table has not been created yet
2. the table has been deleted

In the first case, it is good to keep retrying while receiving `NotFound` because we expect the table to eventually get created. In the second case, it is useless to keep retrying because the table is dead. This second case will cost the full `deadline` amount of time, which defaults to 120 seconds.

This is particularly a pain for index backfill `YBClient::Data::WaitUntilIndexPermissionsAtLeast`. If we wait on some permission like `READ_WRITE_AND_DELETE`, we could have

1. […] `NotFound` that should be retried
2. […] `NotFound` that shouldn't be retried

Distinguishing between the two `NotFound` cases is hard, especially when the permission changes can just finish in a snap.

I think this can be improved:
1. […] `NotFound`, we can safely say it is deleted rather than waiting to be created (there will still be cases where things happen so quickly that we don't see the table and it gets deleted)
2. […] `src/yb/master/async_rpc_tasks.cc`, and don't retry when we see one of those (this is a generic enhancement)
3. […] (`IsCreateTableInProgress`) and don't delete the table until the waiters are done, instead marking the table as deleted. That way, we can distinguish `NotFound` for precreation from some other status/message for deletion. (This should solve the problem, but it opens up other handling concerns.)

To avoid making the issue too large, I think it should be good to close when 2 of the 3 items above are done. It can just be for one client function. Things that aren't covered can get separate, smaller issues created for them.