Enable observability of VRO services #2335

Closed
7 tasks done
dianagriffin opened this issue Dec 12, 2023 · 1 comment
Labels
epic A collection of user stories spanning multiple repositories VRO-team

Comments

dianagriffin commented Dec 12, 2023

Epic Description:
As members of the VRO Team, we aim to improve the observability of our production systems, in line with the strategic priority set by the Benefits Portfolio. To do this, we plan to thoroughly assess the current state of our platform's observability. The goal is to identify and resolve potential issues proactively, minimizing the risk that critical incidents go unnoticed. With a deeper understanding of our system's observability, we can make informed improvements that contribute to the overall stability and performance of our platform.

In Scope:

Assess the current state of observability on the VRO platform, covering all services and applications. Identify gaps in observability and develop an MVP.

MVP

  • Develop a single health dashboard tailored for the VRO Team encompassing all services
  • Incorporate essential metrics such as CPU usage, memory utilization, network traffic, pod availability, and recent deployments.
  • Establish benchmarks for CPU (X), memory (X), and network traffic (X).
  • Incorporate custom metrics for our platform applications so we can be aware when there are service outages.
  • Establish benchmarks for monitoring partner applications
  • Implement monitoring and proactive alerting mechanisms
  • Configure alerts to notify the VRO Team promptly in case of suboptimal application performance.
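
The benchmark and alerting items above could be sketched roughly as follows. This is a minimal illustration only, not the team's implementation: the `Benchmarks`, `Sample`, and `evaluate` names and the shape of a metrics sample are all hypothetical, and the benchmark values default to `None` because they are still "X" in the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Benchmarks:
    """Hypothetical per-service limits. None means the benchmark is not set yet."""
    cpu_percent: Optional[float] = None
    memory_percent: Optional[float] = None
    network_mbps: Optional[float] = None


@dataclass
class Sample:
    """Hypothetical shape of one metrics reading for a service."""
    service: str
    cpu_percent: float
    memory_percent: float
    network_mbps: float
    pods_ready: int
    pods_desired: int


def evaluate(sample: Sample, limits: Benchmarks) -> List[str]:
    """Return one alert message per breached benchmark or missing pod."""
    alerts: List[str] = []
    checks = [
        ("CPU %", sample.cpu_percent, limits.cpu_percent),
        ("memory %", sample.memory_percent, limits.memory_percent),
        ("network Mbps", sample.network_mbps, limits.network_mbps),
    ]
    for name, value, limit in checks:
        # Skip metrics whose benchmark has not been agreed on yet.
        if limit is not None and value > limit:
            alerts.append(f"{sample.service}: {name} {value} exceeds benchmark {limit}")
    # Pod availability needs no benchmark: fewer ready pods than desired is an alert.
    if sample.pods_ready < sample.pods_desired:
        alerts.append(
            f"{sample.service}: {sample.pods_ready}/{sample.pods_desired} pods ready"
        )
    return alerts
```

A caller would pass each service's latest sample through `evaluate` and forward any returned messages to whatever notification channel the team chooses. Leaving a benchmark unset simply disables that check, so the sketch works before the "X" values are decided.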

Not In Scope:

Incident response plan with defined SLAs. This will be addressed through a separate initiative.

Hypothesis:
By implementing a proactive monitoring system capable of detecting potential stability issues early, we aim to prevent issues before they arise. We hope to lessen any adverse impact on our partners' applications and give them insight into what we are monitoring and why it matters.

@dianagriffin dianagriffin added VRO-team epic A collection of user stories spanning multiple repositories labels Dec 12, 2023
@meganhicks meganhicks reopened this Feb 22, 2024
@meganhicks

@agile-josiah Please feel free to replace the hypothesis with one you think is better as you wrap up the tech spec.
