
Adds a simple healthcheck endpoint for the items service #739

Merged: kenoir merged 3 commits into main from rk/add-healthcheck-endpoint-items on Jan 25, 2024


kenoir (Contributor) commented on Jan 14, 2024

What does this change?

This change follows #736 and adds an HTTP healthcheck to the items API. It ensures the Scala service has started before it is registered as healthy with the NLB and starts serving requests.

How to test?

  • Manually check that this healthcheck behaves as expected when running the items API locally. (Edit: we've improved the local testing situation but not resolved it; I'm OK to try this in stage for now.)
  • Deploy this change to stage to confirm that Terraform applies the change as expected and that tasks register as healthy.
  • Perform a stage deployment while sending requests, to find out whether we have eliminated the deployment errors.
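The third check above could be scripted roughly as follows. This is a minimal Python sketch; the items service itself is Scala, and Python is used here only as a quick client. It polls a URL at a fixed interval during a stage deployment and tallies the responses. The function name `probe`, the number of attempts and the interval are all hypothetical choices to tune.

```python
import http.client
import time
import urllib.error
import urllib.request
from collections import Counter


def probe(url: str, attempts: int, interval_s: float = 0.5, timeout_s: float = 5.0) -> Counter:
    """GET `url` `attempts` times, `interval_s` apart, and tally each outcome:
    the HTTP status code, or "error" if no HTTP response came back at all
    (connection refused or reset, timeout, malformed response)."""
    outcomes: Counter = Counter()
    for i in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                outcomes[resp.status] += 1
        except urllib.error.HTTPError as err:
            # urlopen raises HTTPError for error statuses; the code is still an answer.
            outcomes[err.code] += 1
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            outcomes["error"] += 1
        if i < attempts - 1:
            time.sleep(interval_s)
    return outcomes
```

For example, you could run `print(probe(STAGE_ITEMS_URL, attempts=600, interval_s=0.5))` across a deployment. Here `STAGE_ITEMS_URL` is a hypothetical placeholder for a stage items API URL that normally returns 200. Any outcome other than `200` in the result points at the deployment errors this PR aims to eliminate.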

How can we measure success?

No downtime during deployments. This should give visitors to the site a better experience and mean fewer errors in the alerts channel that we cannot effectively respond to.

Currently, during deployments of the catalogue API, the only errors we see come from the items service. For example, from #wc-platform-alerts in Slack during a deployment after updating the search:

[Screenshot 2024-01-14 at 15.05.34]

Have we considered potential risks?

Changing the healthchecks changes the failure modes for the API. However, we have already tested this approach successfully in #736, which reduces the risk.

kenoir requested a review from a team on January 14, 2024 15:06
kenoir self-assigned this on Jan 14, 2024
kenoir linked an issue on Jan 14, 2024 that may be closed by this pull request
kenoir force-pushed the rk/add-healthcheck-endpoint-items branch from 6bd3205 to 9a24603 on January 19, 2024 15:04
kenoir force-pushed the rk/add-sbt-dependency-graph branch from 1cf0fc1 to dace036 on January 19, 2024 15:09
kenoir force-pushed the rk/add-healthcheck-endpoint-items branch from c019866 to 44d7813 on January 23, 2024 11:45
Base automatically changed from rk/add-sbt-dependency-graph to main on January 24, 2024 10:09
kenoir force-pushed the rk/add-healthcheck-endpoint-items branch from f864ffe to 102de84 on January 25, 2024 13:43
kenoir force-pushed the rk/add-healthcheck-endpoint-items branch from 20fb200 to 3f7c0cb on January 25, 2024 13:46
kenoir marked this pull request as ready for review on January 25, 2024 13:46
agnesgaroux (Contributor) commented:

Not sure how this works. Does it check /management/healthcheck before hitting /works every time? And if the healthcheck fails (for whatever reason; it could be something other than the instance currently being deployed), what does the LB do? Try another instance?

kenoir (Contributor, Author) commented on Jan 25, 2024


This change only adds a new endpoint at /management/healthcheck. The endpoint serves the following JSON and does nothing else:

{
  "message": "ok"
}
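To make that contract concrete, here is a minimal sketch of an endpoint that serves exactly this response. It is written in Python with only the standard library. It is not the items service's Scala implementation, which isn't shown here, and the `HealthcheckHandler` and `make_server` names are hypothetical. The sketch illustrates that there is no logic to fail: the check passes as soon as the process can answer HTTP requests at all.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTHCHECK_PATH = "/management/healthcheck"


class HealthcheckHandler(BaseHTTPRequestHandler):
    """Answers GET /management/healthcheck with 200 and {"message": "ok"}.
    Every other path gets a 404."""

    def do_GET(self) -> None:
        if self.path == HEALTHCHECK_PATH:
            body = json.dumps({"message": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args) -> None:
        pass  # keep frequent healthcheck probes out of the logs


def make_server(port: int = 0) -> HTTPServer:
    # port=0 lets the OS pick a free port; read it back from server_address.
    return HTTPServer(("127.0.0.1", port), HealthcheckHandler)
```

Calling `make_server().serve_forever()` (in a thread, if needed) starts it.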

The load balancer (NLB) uses this endpoint to determine whether the instance is healthy, and therefore whether it is allowed to serve requests. At present the NLB uses a TCP healthcheck. That check only requires the nginx sidecar, which proxies requests to the app, to be available. Nginx comes up very quickly, while the slowpoke Scala app is still yawning and blinking itself awake.
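The difference between the two kinds of check can be sketched in Python with the standard library. `tcp_check` and `http_check` are hypothetical names. Treating only 2xx as success is an assumption, because the success codes an NLB accepts are configurable on the target group. A TCP check only asks whether something accepts a connection on the port, so the nginx sidecar alone satisfies it. An HTTP check on /management/healthcheck also needs a successful response from the app behind nginx.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Pass if a TCP connection can be opened, which is all a TCP check tests.
    Any process listening on the port (e.g. an nginx sidecar) makes this pass."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def http_check(url: str, timeout_s: float = 2.0) -> bool:
    """Pass only if GET `url` returns a 2xx status (assumed success range)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        return False  # the server answered, but with an error status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False  # no usable HTTP response at all
```

Consider the case where nginx is listening but the Scala app behind it is not yet up. Nginx would typically answer with a 502 Bad Gateway, so `tcp_check` returns `True` while `http_check` returns `False`. That gap is what this PR closes.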

This change makes sure the Scala app is up by requiring it to serve a request before the load balancer considers it healthy. It doesn't do any more sophisticated checks of whether the app can actually serve works; there are some musings about that in Slack.
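The question above asked whether the healthcheck is called before every request. An NLB doesn't do that. It probes each registered target on a fixed interval and changes a target's state only after a configured number of consecutive passes or failures. Requests are routed to the targets currently marked healthy. The Python sketch below models that behaviour in simplified form. The class and function names and the threshold values are illustrative assumptions, not the settings used for the items service. The sketch also ignores edge cases such as every target being unhealthy at once.

```python
from dataclasses import dataclass, field


@dataclass
class TargetHealth:
    """Simplified model of one load-balancer target's health state.
    Threshold defaults are illustrative, not the items service's settings."""
    healthy_threshold: int = 3    # consecutive passes needed to become healthy
    unhealthy_threshold: int = 3  # consecutive failures needed to become unhealthy
    healthy: bool = False         # a newly registered target starts out of rotation
    _streak: int = field(default=0, init=False, repr=False)

    def record(self, passed: bool) -> bool:
        """Record one periodic probe result; return the (possibly new) state."""
        if passed == self.healthy:
            self._streak = 0  # result agrees with the current state
        else:
            self._streak += 1
            needed = self.healthy_threshold if passed else self.unhealthy_threshold
            if self._streak >= needed:
                self.healthy = passed
                self._streak = 0
        return self.healthy


def route(targets: dict[str, TargetHealth]) -> list[str]:
    """Requests go only to targets currently marked healthy."""
    return [name for name, target in targets.items() if target.healthy]
```

In this model, a new task whose Scala app is still starting fails its probes, so it stays out of `route`'s result while already-healthy tasks keep receiving traffic. A target that starts failing for some other reason leaves rotation only after `unhealthy_threshold` consecutive failures.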

kenoir merged commit b600b23 into main on Jan 25, 2024 (1 check passed)
kenoir deleted the rk/add-healthcheck-endpoint-items branch on January 25, 2024 14:40
Linked issue that this pull request may close: Meaningful ECS service health checks

3 participants