Add retries to GCS sink healthcheck #4

Open
wants to merge 2 commits into
base: data_infra_vector_stable

alexander-jiang

Request to update the data_infra_vector_stable branch, which is used to build the discord_data_vector_base image.

Summary

We've noticed that the GCS sink healthcheck in Vector sometimes fails temporarily (due to a timeout) and then self-recovers. The Data Infra team hasn't found a clear reason why the healthchecks time out (e.g. I haven't been able to reproduce the timeouts in staging), and we want to reduce the impact of ephemeral or noisy healthcheck failures on the Vector deployments that are part of our critical event ingestion pipeline. At the same time, we shouldn't completely ignore healthcheck failures.

This PR makes the following changes:

  • Adds a retry loop to the GCS sink healthcheck. Instead of making only one attempt, the healthcheck now makes up to 3 attempts to send the HTTP request, with a 5-second delay between attempts. (Note that the overall healthcheck timeout is still 10 seconds, so we expect 2-3 attempts per Vector container restart.)
  • If any attempt fails, the HTTP response object is printed to the logs for debugging. This may help us identify the root cause of the healthcheck failures and implement a better solution.
  • The overall GCS sink healthcheck succeeds as soon as any individual attempt succeeds, avoiding unnecessary retries.
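
The retry behavior described in the list above can be sketched roughly as follows. This is a minimal, synchronous, std-only illustration and not the code in this PR: the function name `healthcheck_with_retries`, the `HealthcheckError` type, and the closure standing in for the HTTP request are all hypothetical, and the overall timeout is only checked between attempts rather than interrupting an attempt in flight.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical error type for this sketch (not part of Vector).
#[derive(Debug, PartialEq)]
pub enum HealthcheckError {
    /// Every allowed attempt failed; carries the last failure message.
    AttemptsExhausted { attempts: u32, last_error: String },
    /// Waiting for another attempt would have passed the overall deadline.
    DeadlineExceeded { attempts: u32, last_error: String },
}

/// Runs `attempt` up to `max_attempts` times, sleeping `delay` between
/// failed attempts, and returns the 1-based number of the attempt that
/// succeeded. `attempt` stands in for sending the healthcheck HTTP request.
pub fn healthcheck_with_retries<F>(
    mut attempt: F,
    max_attempts: u32,
    delay: Duration,
    overall_timeout: Duration,
) -> Result<u32, HealthcheckError>
where
    F: FnMut() -> Result<(), String>,
{
    let deadline = Instant::now() + overall_timeout;
    let mut last_error = String::from("no attempt was made");

    for n in 1..=max_attempts {
        match attempt() {
            // Succeed as soon as any single attempt succeeds.
            Ok(()) => return Ok(n),
            Err(e) => {
                // Log the failure so it can be inspected later.
                eprintln!("healthcheck attempt {n}/{max_attempts} failed: {e}");
                last_error = e;
            }
        }
        if n < max_attempts {
            // Do not sleep if the next attempt could not start in time.
            if Instant::now() + delay >= deadline {
                return Err(HealthcheckError::DeadlineExceeded { attempts: n, last_error });
            }
            sleep(delay);
        }
    }
    Err(HealthcheckError::AttemptsExhausted { attempts: max_attempts, last_error })
}
```

With the values from this PR (3 attempts, a 5-second delay, a 10-second overall timeout), this particular sketch would usually stop after two attempts, because it refuses to start a sleep that would run past the deadline. An async implementation that wraps the whole loop in a timeout could behave differently near the deadline.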

Documentation update:
The PR also updates the patches/README.md file: our Vector build pipeline no longer applies the *.patch files onto a commit from the Vector repository, but instead builds from a Discord-owned fork of the Vector repository.

@@ -1,5 +1,5 @@
diff --git a/src/gcp.rs b/src/gcp.rs
index bfc486f92..148fa9dec 100644
index bfc486f92..baa8e143d 100644

Check warning

Code scanning / check-spelling

Candidate Pattern Warning

Line matches candidate pattern "index (?:[0-9a-z]{7,40},|)[0-9a-z]{7,40}..[0-9a-z]{7,40}" (candidate-pattern)
Check failure

Code scanning / check-spelling

Unrecognized Spelling Error

bfc is not a recognized word. (unrecognized-spelling)