Health / liveness check for Worker #994

michaelbromley · 2021-07-15T14:57:19Z

Is your feature request related to a problem? Please describe.
Since shifting to the standalone NestJS architecture for the worker in v1.0, we've lost the ability to use the HealthCheckRegistryService to monitor the health of worker processes.

There is also no easy way to do a liveness check for things like kubernetes deployments of the worker process.

Since the worker no longer has a networking layer, we cannot just add an HTTP endpoint, which is the usual way of handling it.

Describe the solution you'd like
Some way to support both the internal HealthCheckRegistryService checks (so it appears in the Admin UI) and also support for a generic liveness check.

Describe alternatives you've considered
Some ideas:

Add a http layer just for the liveness check. This might imply using a Nest Microservice rather than a Standalone app. What are the perf / other implications of this?
Investigate a non-http way of doing a liveness check. Honestly no idea what this might look like, I'm asking on the Nest discord to see if anyone has any idea about this.

The text was updated successfully, but these errors were encountered:

coderjib · 2021-07-19T13:32:14Z

Just to add, as a use case for which we'd benefit of the worker listening to HTTP requests, the need for it to receive a "push" request to perform jobs. This is needed, for example, when running the worker in a serverless environment, where it is not guaranteed to run continuously but must be "awaken" with an HTTP request.
Ref. discussion #968

tpimh · 2021-07-19T19:30:10Z

One non-http way to implement this would be to create a simple job on health status request, not only it would show if the worker is alive or dead, it would also show how busy it is at the moment based on how long it took to complete the job.

michaelbromley · 2021-07-22T08:18:27Z

Some suggestions from the NestJS discord:

tak1n — Yesterday at 10:30 AM

Guess it should be a full-blown Nest application which is able to serve HTTP traffic.
Have not that much experience with Nest Microservices or Standalone applications so not sure if you can just enable HTTP transport in a standalone/microservice app. Even if its possible I would first see what are the benefits (performance wise - boot time, runtime, memory usage) and whether its even worth doing that (could be premature optimization in the end no?)

BrunnerLivio — Yesterday at 6:47 PM

Yeah I think it makes sense to go with @tak1n solution. Or are there other ways to do health checks in whatever platform you’re using, aside from HTTP? (Either way I’d expose the endpoint via HTTP since that is anyway kinda the standard)
or maybe from the nest application context you could extract the HealthService and all the indicators you need & simply call these services from a e.g. express service?

Something like (await NestFactory.createApplicationContext(AppModule)).get(HealthService)
https://docs.nestjs.com/standalone-applications#getting-started
Though not 100% sure whether that is possible honestly (did not account for such a use-case)

michaelbromley · 2021-07-29T10:27:55Z

Ran a few performance tests comparing the difference between bootstrapping the worker as a standalone app (the way we currently do it) vs as a full-blown Nest server including http layer. I also created a new WorkerModule which is just the same as the AppModule but omits the ApiModule import, since we would not need that for the worker.

The code looks like this (irrelevant parts elided):

export async function bootstrapWorker(
    userConfig: Partial<VendureConfig>,
): Promise<{ app: INestApplicationContext; startJobQueue: () => Promise<void> }> {
    const mode: 'standalone' | 'server' = 'standalone';
    const module: 'app' | 'worker' = 'worker';
    console.time('bootstrap');

    // ...

    let workerApp: INestApplication | INestApplicationContext;
    let moduleClass: any;
    if (module === 'app') {
        const appModule = await import('./app.module');
        moduleClass = appModule.AppModule;
    } else {
        const workerModule = await import('./worker.module');
        moduleClass = workerModule.WorkerModule;
    }
    if (mode === 'standalone') {
        const appModule = await import('./app.module');
        workerApp = await NestFactory.createApplicationContext(moduleClass, {
            logger: new Logger(),
        });
    } else {
        const workerModule = await import('./worker.module');
        workerApp = await NestFactory.create(moduleClass, {
            logger: new Logger(),
        });
        await (workerApp as any).listen(config.apiOptions.port + 1, config.apiOptions.hostname);
    }
    // ...
    console.log(
        `Mode: ${mode}, module: ${moduleClass.name} -  The script uses approximately ${
            Math.round(used * 100) / 100
        } MB`,
    );
    console.timeEnd('bootstrap');
    return { app: workerApp, startJobQueue };
}

Here are the results (3 runs of each):

standalone with AppModule (status quo)

The script uses approximately 267.68 MB
bootstrap: 2.174s
The script uses approximately 274.61 MB
bootstrap: 1.882s
The script uses approximately 267.47 MB
bootstrap: 1.964s

server with AppModule

The script uses approximately 292.27 MB
bootstrap: 2.466s
The script uses approximately 290.86 MB
bootstrap: 2.474s
The script uses approximately 290.69 MB
bootstrap: 2.486s

standalone with WorkerModule

The script uses approximately 225.57 MB
bootstrap: 552.972ms
The script uses approximately 224.37 MB
bootstrap: 578.833ms
The script uses approximately 224.32 MB
bootstrap: 561.531ms

server with WorkerModule

The script uses approximately 224.01 MB
bootstrap: 418.242ms
The script uses approximately 224.44 MB
bootstrap: 410.18ms
The script uses approximately 224.87 MB
bootstrap: 431.568ms

Conclusion

My intuition that bootstrapping a full server is much heavier than a standalone app is not correct. The main overhead is the ApiModule which, when omitted by using the WorkerModule, has a drastic impact on memory usage and especially startup time.

So it looks perfectly viable to bootstrap the worker as a full NestJS server app, which will give us an HTTP layer we can use for the health check and at the same time reduce startup time by a factor of ~4-5x, (2s down to 400ms) which is also very helpful for the serverless use-case!

API proposal

When bootstrapping the worker as a Nest server, we need to provide a post & hostname. I propose adding an optional 2nd argument to the bootstrapWorker() function which allows these to be provided. If provided, a server is created. If not provided, a standalone app is created. Both server and standalone app return an object implementing INestApplicationContext, so nothing needs to change regarding the return type of the function.

Health Checks

Still to be solved is how to enable a health check indicator for the worker in the Admin UI using the existing HealthCheckRegistryService. Problems:

How do we tell the server at which address to ping the worker?
There can be multiple workers. They can come on and offline during the lifecycle of the server. How do we communicate this to the server so that it can add/remove checks for each worker?

Perhaps we can have an endpoint on the server than can be called to register a worker? Need to think more about this one.

Relates to #994

coderjib · 2021-08-25T12:47:00Z

Thanks for this improvement, just wanted to understand if this new feature is usable also to let the worker listen to incoming (REST) push requests regarding the tasks to perform, as for the use case I mentioned in my comment above

michaelbromley · 2021-08-25T12:54:43Z

@artcoded-dev This change will allow the worker to listen for HTTP requests like this:

bootstrapWorker(config)
  .then(worker => worker.startJobQueue())
  .then(worker => worker.startHealthCheckServer({ port: 3020 }))
  .catch(err => {
    console.log(err);
  });

GET http://<host>:3020

{ status: 'ok' }

If your use-case requires some other kind of http endpoint, please provide more details.

michaelbromley added @vendure/core design 📐 This issue deals with high-level design of a feature labels Jul 15, 2021

michaelbromley mentioned this issue Jul 15, 2021

worker fails to start: relation "zone" does not exist vendure-ecommerce/vendure-docker-compose#6

Closed

michaelbromley added a commit that referenced this issue Jul 29, 2021

feat(core): Implement health check server for worker

fd374b3

Relates to #994

michaelbromley added a commit that referenced this issue Jul 29, 2021

feat(core): Implement internal health check for worker

812b2cb

Relates to #994

michaelbromley added a commit that referenced this issue Jul 29, 2021

docs: Add docs on worker health check

0d33601

Relates to #994

michaelbromley closed this as completed Dec 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Health / liveness check for Worker #994

Health / liveness check for Worker #994

michaelbromley commented Jul 15, 2021

coderjib commented Jul 19, 2021

tpimh commented Jul 19, 2021 •

edited

Loading

michaelbromley commented Jul 22, 2021

michaelbromley commented Jul 29, 2021

coderjib commented Aug 25, 2021

michaelbromley commented Aug 25, 2021

Health / liveness check for Worker #994

Health / liveness check for Worker #994

Comments

michaelbromley commented Jul 15, 2021

coderjib commented Jul 19, 2021

tpimh commented Jul 19, 2021 • edited Loading

michaelbromley commented Jul 22, 2021

michaelbromley commented Jul 29, 2021

standalone with AppModule (status quo)

server with AppModule

standalone with WorkerModule

server with WorkerModule

Conclusion

API proposal

Health Checks

coderjib commented Aug 25, 2021

michaelbromley commented Aug 25, 2021

tpimh commented Jul 19, 2021 •

edited

Loading