Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexers occasionally crash due to timeout on Hasura credentials fetch when creating DmlHandler #532

Closed
Tracked by #396
darunrs opened this issue Jan 29, 2024 · 0 comments · Fixed by #547
Closed
Tracked by #396
Assignees
Labels
bug Something isn't working

Comments

@darunrs
Copy link
Collaborator

darunrs commented Jan 29, 2024

DmlHandler is lazy loaded before running user's indexer code. The problem is part of the class creation involves polling Hasura for credentials for the user's DB. These are used to then initialize a postgres client. The hasura call can fail due to timeout. If this happens, the error is stored inside the promise until the DML Handler is awaited. This happens when context.db is next called and this call is not within a try catch. The result is that the error is not caught. Unhandled exceptions in NodeJS actually kill the thread. Since the DmlHandler.create was written in a try catch but the code execution context leaves that code to continue, the promise is considered unhandled and uncaught, it seems. As a result, the worker thread is killed immediately.

We can fix the issue in two phases:

Phase 1 (Immediate patch)
DmlHandler's create call needs to be reworked so that it always happens AFTER provisioning (Which requires it to stay in runFunctions), but leverages the natural retry mechanism of runFunctions by catching and throwing the error. Ideally, we also want to ensure if DmlHandler functions were not called, that a failed DmlHandler connect doesn't cause a retry when there was no need.

Phase 2 (Integration into Control Plane)
Migrate the Hasura call outside of runFunction call path as part of #426. We can do this because when the executor is started, we can expect Coordinator to let Runner know of provisioning status, and we don't have to do guess work on if Hasura even has user DB credentials or not. DmlHandler likely can also be moved outside of runFunctions as well, as a result.

In addition, moving the Hasura calls outside of runFunctions will improve latency of calls, speeding up backfill performance. Actual impact of migrating the calls out will be known once Instrumentation is completed and released.

@darunrs darunrs added the bug Something isn't working label Jan 29, 2024
@darunrs darunrs linked a pull request Feb 5, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant