[DocDb] client: reduce excessive retries on NotFound errors #5932

Open
jaki opened this issue Oct 3, 2020 · 1 comment
Labels: area/docdb (YugabyteDB core features), kind/enhancement (This is an enhancement of an existing feature), priority/medium (Medium priority issue)

Comments

@jaki
Contributor

jaki commented Oct 3, 2020

Jira Link: DB-10827
There are some cases where client RetryFunc keeps retrying RPCs that keep returning NotFound. For example, for YBClient::Data::IsCreateTableInProgress, it may get NotFound from master because

  • the table hasn't been created yet
  • the table failed to create and was deleted

In the first case, it is good to keep retrying on NotFound because we expect the table to eventually get created. In the second case, retrying is useless because the table is dead, and the client only gives up once the deadline expires, which is 120 seconds by default.

This is particularly painful for index backfill, in YBClient::Data::WaitUntilIndexPermissionsAtLeast. If we wait on some permission like READ_WRITE_AND_DELETE, we could have

  • the index isn't even created yet: NotFound that should be retried
  • the index is at a permission before that: retry
  • the index is at that permission: done
  • the index is past that permission: done
  • the index is deleted: NotFound that shouldn't be retried
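The five cases above can be sketched as a small decision function. This is only an illustration: `IndexPermission`, `PollAction`, `ClassifyPoll`, and the `known_deleted` flag are all hypothetical names, not the actual client code, and obtaining a reliable `known_deleted` signal is exactly the hard part this issue is about.

```cpp
#include <optional>

// Simplified stand-in for the index permission levels; the names and the
// ordering here are illustrative assumptions, not the real enum.
enum class IndexPermission { kDeleteOnly, kWriteAndDelete, kReadWriteAndDelete };

enum class PollAction { kRetry, kDone, kFail };

// `observed` is std::nullopt when master replied NotFound.
// `known_deleted` is a hypothetical signal saying the index was deleted
// rather than not yet created.
PollAction ClassifyPoll(std::optional<IndexPermission> observed,
                        IndexPermission target,
                        bool known_deleted) {
  if (!observed.has_value()) {
    // NotFound: retry only if the index may still show up.
    return known_deleted ? PollAction::kFail : PollAction::kRetry;
  }
  // At or past the target permission: done. Before it: keep waiting.
  return *observed >= target ? PollAction::kDone : PollAction::kRetry;
}
```

Without the extra signal, both NotFound cases collapse into `kRetry`, which is the behavior described above.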

Distinguishing between the two NotFound cases is hard, especially when the permission changes can just finish in a snap.

I think this can be improved:

  • Keep track of whether we ever saw the table live so that if we get a NotFound, we can safely say it is deleted rather than waiting to be created (there will still be cases where things happen so quickly that we don't see the table and it gets deleted)
  • Check the status for select fatal errors like we do in src/yb/master/async_rpc_tasks.cc, and don't retry when we see one of those (this is a generic enhancement)
  • (Stretch) Make master aware of waiters (e.g. client calling IsCreateTableInProgress) and don't delete the table until the waiters are done, instead marking the table as deleted. That way, we can distinguish NotFound from precreation to some other status/message for deletion. (This should solve the problem, but it opens up other handling concerns.)
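A rough sketch of how the first two items might combine on the client side. Everything here is hypothetical (`Code`, `IsFatal`, `NotFoundTracker`, `Decision`), and the set of codes treated as fatal is an assumption for illustration only; it is not the list used in src/yb/master/async_rpc_tasks.cc.

```cpp
// Hypothetical status codes, standing in for the real Status type.
enum class Code { kOk, kNotFound, kTimedOut, kInvalidArgument, kNotAuthorized };

enum class Decision { kRetry, kDone, kGiveUp };

// Assumed set of codes that retrying will never fix.
bool IsFatal(Code code) {
  return code == Code::kInvalidArgument || code == Code::kNotAuthorized;
}

// Remembers whether the table was ever seen live, so that a later NotFound
// can be read as "deleted" instead of "not created yet".
class NotFoundTracker {
 public:
  // `create_done` is only meaningful when `code` is kOk.
  Decision OnReply(Code code, bool create_done) {
    if (IsFatal(code)) {
      return Decision::kGiveUp;
    }
    if (code == Code::kNotFound) {
      return seen_live_ ? Decision::kGiveUp : Decision::kRetry;
    }
    if (code == Code::kOk) {
      seen_live_ = true;
      return create_done ? Decision::kDone : Decision::kRetry;
    }
    return Decision::kRetry;  // Other errors are treated as transient.
  }

 private:
  bool seen_live_ = false;
};
```

As noted in the first item, this still retries until the deadline if the table is created and deleted between two polls, because `seen_live_` never gets set.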

To keep this issue from growing too large, I think it can be closed once 2 of the 3 items above are done, even if only for one client function. Anything not covered can get its own smaller issue.

@jaki jaki added the kind/enhancement This is an enhancement of an existing feature label Oct 3, 2020
@tedyu
Contributor

tedyu commented Oct 3, 2020

> don't delete the table until the waiters are done

Perhaps this should be bounded by a time limit. Otherwise the table might not be deleted for an extended period of time.
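One way the bound could look, as a minimal sketch: the table is held in a deleted-but-present state until either all waiters are gone or a hold deadline passes. `DeferredDelete` and `CanRemoveTable` are hypothetical names, and how master would count waiters is left open.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical bookkeeping for a table marked as deleted but not yet removed.
struct DeferredDelete {
  int active_waiters = 0;           // Clients still polling this table.
  Clock::time_point hold_deadline;  // Latest time the removal may be delayed to.
};

// The table can be removed once nobody is waiting or the hold has expired,
// whichever comes first.
bool CanRemoveTable(const DeferredDelete& d, Clock::time_point now) {
  return d.active_waiters == 0 || now >= d.hold_deadline;
}
```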

@rthallamko3 rthallamko3 added the area/docdb YugabyteDB core features label Apr 11, 2024
@yugabyte-ci yugabyte-ci added the priority/medium Medium priority issue label Apr 11, 2024
@rthallamko3 rthallamko3 changed the title client: reduce excessive retries on NotFound errors [DocDb] client: reduce excessive retries on NotFound errors Apr 11, 2024