[DocDb] client: reduce excessive retries on NotFound errors #5932

jaki · 2020-10-03T00:23:50Z

Jira Link: DB-10827
There are some cases where client RetryFunc keeps retrying RPCs that keep returning NotFound. For example, for YBClient::Data::IsCreateTableInProgress, it may get NotFound from master because

the table hasn't been created yet
the table failed to create and was deleted

In the first case, it is good to keep retrying on NotFound because we expect the table to eventually get created. In the second case, retrying is useless because the table is gone for good, yet the client keeps retrying until the deadline expires (120 seconds by default).

This is particularly painful for index backfill, in YBClient::Data::WaitUntilIndexPermissionsAtLeast. If we wait for some permission like READ_WRITE_AND_DELETE, any of the following could hold:

the index isn't even created yet: NotFound that should be retried
the index is at a permission before that: retry
the index is at that permission: done
the index is past that permission: done
the index is deleted: NotFound that shouldn't be retried

Distinguishing between the two NotFound cases is hard, especially since the permission changes can complete almost instantly.
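To make the ambiguity concrete, here is a minimal sketch of the five cases as a decision function. Everything in it is hypothetical: `Lookup`, `WaitAction`, and `Classify` are illustration-only names, and permissions are modeled as plain ordered integers rather than the real permission enum.

```cpp
#include <cassert>

// Hypothetical stand-in for a master reply: found == false models NotFound.
struct Lookup {
  bool found;
  int permission;  // only meaningful when found is true; higher means later
};

enum class WaitAction { kRetry, kDone, kAmbiguous };

// Maps one reply to an action while waiting for "permission >= target".
inline WaitAction Classify(const Lookup& reply, int target) {
  if (!reply.found) {
    // Either not created yet (should retry) or already deleted (should not).
    // The reply alone cannot tell these two apart.
    return WaitAction::kAmbiguous;
  }
  // Before the target: retry. At or past the target: done.
  return reply.permission < target ? WaitAction::kRetry : WaitAction::kDone;
}
```

Three of the five cases resolve from the reply alone; both NotFound cases collapse into `kAmbiguous`, which is the part that needs extra information.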

I think this can be improved:

Keep track of whether we ever saw the table live, so that if we then get a NotFound, we can safely say the table was deleted rather than not yet created (there will still be cases where things happen so quickly that the table gets deleted before we ever see it)
Check the status for select fatal errors like we do in src/yb/master/async_rpc_tasks.cc, and don't retry when we see one of those (this is a generic enhancement)
(Stretch) Make master aware of waiters (e.g. a client calling IsCreateTableInProgress) and don't delete the table until the waiters are done; mark the table as deleted instead. That way, master can return NotFound before creation and some other status/message after deletion, and the client can tell the two apart. (This should solve the problem, but it opens up other handling concerns.)
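A rough sketch of how the first two items could combine into a single per-attempt decision. This is not the actual client code: `Code`, `RetryState`, `IsFatal`, and `Decide` are hypothetical names, `Code` is a simplified stand-in for the real status type, and the fatal set here is a placeholder for whatever list async_rpc_tasks.cc uses.

```cpp
#include <cassert>

// Simplified stand-in for a status code; not the real status type.
enum class Code { kOk, kInProgress, kNotFound, kTimedOut, kInvalidArgument };

enum class Decision { kDone, kRetry, kGiveUp };

// Per-call state carried across retry attempts.
struct RetryState {
  bool seen_live = false;  // set once any reply proves the table exists
};

// Hypothetical fatal set; a real version would mirror the server-side list.
inline bool IsFatal(Code code) {
  return code == Code::kInvalidArgument;
}

inline Decision Decide(Code code, RetryState* state) {
  if (IsFatal(code)) return Decision::kGiveUp;  // item 2: never retry these
  switch (code) {
    case Code::kOk:
      state->seen_live = true;
      return Decision::kDone;
    case Code::kInProgress:
      state->seen_live = true;
      return Decision::kRetry;
    case Code::kNotFound:
      // Item 1: NotFound after the table was seen live means it was deleted.
      return state->seen_live ? Decision::kGiveUp : Decision::kRetry;
    default:
      return Decision::kRetry;  // transient errors keep the old behavior
  }
}
```

As noted in the first item, this still retries until the deadline when the table is created and deleted before the client ever observes it, because `seen_live` never gets set.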

To avoid making the issue too large, I'd say it can be closed once two of the three items above are done, even if only for one client function. Anything that isn't covered can get its own smaller issue.

tedyu · 2020-10-03T01:52:21Z

bq. don't delete the table until the waiters are done

Perhaps this should be bounded by a certain time limit. Otherwise the table might not be deleted for an extended period of time.
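A minimal sketch of what such a bound might look like on the master side, assuming a deleted table is kept as a tombstone with a waiter count and a purge deadline. `Tombstone` and `CanPurge` are hypothetical names, not existing master code.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical record kept for a table that is marked deleted but not purged.
struct Tombstone {
  int waiters;                       // clients still polling this table
  Clock::time_point purge_deadline;  // hard upper bound on the grace period
};

// The table can be purged once all waiters are gone, or once the deadline
// passes, whichever comes first.
inline bool CanPurge(const Tombstone& t, Clock::time_point now) {
  return t.waiters == 0 || now >= t.purge_deadline;
}
```

With this shape, a waiter that never checks back in delays the purge by at most the grace period.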

jaki added the kind/enhancement This is an enhancement of an existing feature label Oct 3, 2020

jaki assigned bmatican Oct 3, 2020

nyndyny mentioned this issue Oct 28, 2023

[Snyk] Security upgrade axios from 0.21.1 to 1.6.0 nyndyny/yugabyte-db#164

Open

ryan-ally mentioned this issue Oct 28, 2023

[Snyk] Security upgrade axios from 0.21.1 to 1.6.0 ryan-ally/yugabyte-db#196

Open

rthallamko3 added the area/docdb YugabyteDB core features label Apr 11, 2024

yugabyte-ci added the priority/medium Medium priority issue label Apr 11, 2024

rthallamko3 assigned lingamsandeep and unassigned bmatican Apr 11, 2024

rthallamko3 changed the title ~~client: reduce excessive retries on NotFound errors~~ [DocDb] client: reduce excessive retries on NotFound errors Apr 11, 2024
