Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Executors would crash when DmlHandler.create times out #547

Merged
merged 8 commits into from
Feb 5, 2024

Conversation

darunrs
Copy link
Collaborator

@darunrs darunrs commented Feb 2, 2024

In each invocation of runFunction, on a block, DML Handler is created. It's responsible for making calls to the database. Part of setting it up involves fetching the credentials for the user's DB from Hasura, and creating a PG Client. This takes time so the process was run through an unawaited async request. While other setup and some of user code is ran, the setting up of DML Handler would be completed. The first context.db call would await its completion and subsequent calls would have it ready.

However, it was observed that when the call to Hasura for the DB credentials times out, the error, instead of propagating into a try catch, would instead be considered by the Javascript runtime as an unhandled Promise Exception, and would terminate the worker thread, stopping the indexer.

In order to fix this problem, we need to transition away from keeping DmlHandler.create as an unresolved Promise across multiple contexts. The approach I've decided to take is to defer the creation of the Hasura call promise until the first call of context.db. This adds more latency to the first context.db call as it now must wait for the entire process to complete. However, this also does not penalize Indexers that don't use context.db as their code does not need to connect to Hasura unless needed.

Very soon, we will in fact overhaul this existing logic by migrating the Hasura credentials call away from runFunctions. This eliminates the underlying problem of unresolved promises as none remain afterward. So, the focus here is to address the bug, which is a critical problem, without too much change, as the workflow will be refactored again soon anyway.

I also fix a small bug where context.db calls were getting logged under the wrong indexer logs table function name.

@darunrs darunrs force-pushed the crashing-executors branch from b34f142 to 11b80e8 Compare February 3, 2024 00:19
@@ -298,7 +287,8 @@ export default class Indexer {
}, {});
return result;
} catch (error) {
console.warn('Caught error when generating context.db methods. Building no functions. You can still use other context object methods.\n', error);
const errorContent = error as Error;
console.warn('Caught error when generating context.db methods. Building no functions. You can still use other context object methods.', errorContent.message);
Copy link
Collaborator Author

@darunrs darunrs Feb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schemas that don't adhere to the limitations of context.db emit a very verbose error message. That message is helpful in the frontend when you can change the schema, but not in the backend. Logging just the message prevents GCP logs from looking cluttered and also reduces log congestion.


// Wait for initialiiation of DmlHandler
const dmlHandler = await dmlHandlerLazyLoader;
await this.writeLog(functionName, blockHeight,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced incorrect functionName with the correct one. Context.db logs will now appear in the user's indexer logs.

return await this.getPgClientPromise;
}

async getPgClient (): Promise<PgClientModule> {
Copy link
Collaborator Author

@darunrs darunrs Feb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an error in this function, it is now thrown into the user's context. Their code controls if it is re-thrown, swallowed, logged, etc.

I think we can accept this for the time being. This specific scenario will be better handled after Hasura calls are moved out of runFunction. However, other failures such as PG connections being unavailable, will still be thrown into user context. It's probably worth updating prod indexers to not blindly eat errors.

@darunrs darunrs changed the title fix: Executors would Crash when DmlHandler.create times out fix: Executors would crash when DmlHandler.create times out Feb 5, 2024
@darunrs darunrs marked this pull request as ready for review February 5, 2024 18:17
@darunrs darunrs requested a review from a team as a code owner February 5, 2024 18:17
@darunrs darunrs merged commit 2c28dc3 into main Feb 5, 2024
3 checks passed
@darunrs darunrs deleted the crashing-executors branch February 5, 2024 19:35
morgsmccauley pushed a commit that referenced this pull request Feb 6, 2024
In each invocation of runFunction, on a block, DML Handler is created.
It's responsible for making calls to the database. Part of setting it up
involves fetching the credentials for the user's DB from Hasura, and
creating a PG Client. This takes time so the process was run through an
unawaited async request. While other setup and some of user code is ran,
the setting up of DML Handler would be completed. The first context.db
call would await its completion and subsequent calls would have it
ready.

However, it was observed that when the call to Hasura for the DB
credentials times out, the error, instead of propagating into a try
catch, would instead be considered by the Javascript runtime as an
unhandled Promise Exception, and would terminate the worker thread,
stopping the indexer.

In order to fix this problem, we need to transition away from keeping
DmlHandler.create as an unresolved Promise across multiple contexts. The
approach I've decided to take is to defer the creation of the Hasura
call promise until the first call of context.db. This adds more latency
to the first context.db call as it now must wait for the entire process
to complete. However, this also does not penalize Indexers that don't
use context.db as their code does not need to connect to Hasura unless
needed.

Very soon, we will in fact overhaul this existing logic by migrating the
Hasura credentials call away from runFunctions. This eliminates the
underlying problem of unresolved promises as none remain afterward. So,
the focus here is to address the bug, which is a critical problem,
without too much change, as the workflow will be refactored again soon
anyway.


I also fix a small bug where context.db calls were getting logged under
the wrong indexer logs table function name.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Indexers occasionally crash due to timeout on Hasura credentials fetch when creating DmlHandler
2 participants