Enable observability of VRO services #2335

dianagriffin · 2023-12-12T05:12:53Z

Epic Description:
As members of the VRO Team, our overarching objective is to improve the observability of our production systems, in line with the strategic priority set by the Benefits Portfolio. To accomplish this, we plan to conduct a thorough assessment of the current state of our platform's observability. The goal is to identify and resolve potential issues proactively, minimizing the risk that critical incidents go unnoticed. With a deeper understanding of our system's observability, we can make informed improvements that contribute to the overall stability and performance of our platform.

In Scope:

Assess the current state of observability on the VRO platform, covering all services and applications. Identify gaps in observability and develop an MVP.

MVP

Develop a single health dashboard, tailored for the VRO Team, that covers all services.
Incorporate essential metrics such as CPU usage, memory utilization, network traffic, pod availability, and recent deployments.
Establish benchmarks for CPU (X), memory (X), and network traffic (X).
Incorporate custom metrics for our platform applications so that we are made aware of service outages.
Establish benchmarks for monitoring partner applications.
Implement monitoring and proactive alerting mechanisms
Configure alerts to notify the VRO Team promptly in case of suboptimal application performance.
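
As a rough illustration of how the benchmark and alerting items above could fit together, here is a minimal Python sketch. It is not the team's implementation: the `Benchmark` type, the `breached` function, and the metric names are hypothetical, and the threshold values are caller-supplied placeholders because the epic leaves the actual benchmarks as "X". Treating a missing sample as alert-worthy is also an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """A named metric and the upper bound it should stay under (hypothetical type)."""
    metric: str
    threshold: float  # placeholder value; the epic does not define real benchmarks


def breached(samples: dict[str, float], benchmarks: list[Benchmark]) -> list[str]:
    """Return one message per benchmark that is exceeded or has no data."""
    alerts = []
    for b in benchmarks:
        value = samples.get(b.metric)
        if value is None:
            # Assumption: a metric that stops reporting should be surfaced, not ignored.
            alerts.append(f"{b.metric}: no data")
        elif value > b.threshold:
            alerts.append(f"{b.metric}: {value} exceeds {b.threshold}")
    return alerts
```

A caller would pass the latest sample per metric together with whatever benchmarks the team eventually agrees on, then route any returned messages to the team's notification channel.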

Not In Scope:

Incident response plan with defined SLAs. This will be addressed through a separate initiative.

Hypothesis:
By implementing a proactive monitoring system capable of detecting potential stability issues early, we aim to prevent issues before they arise. We hope to lessen any adverse impact on our partners' applications and to give partners insight into what we are monitoring and why it matters.

meganhicks · 2024-02-22T16:23:54Z

@agile-josiah please feel free to edit the hypothesis with one you think may be better as you wrap up the tech spec

dianagriffin added the VRO-team and epic (A collection of user stories spanning multiple repositories) labels Dec 12, 2023

meganhicks closed this as completed Feb 22, 2024

meganhicks reopened this Feb 22, 2024

This was referenced Sep 11, 2024

Sprint 2 (9.24.2024 - 10.7.2024) #3303

Closed

Sprint 1 (9.10.2024 - 9.23.2024) #3248

Closed

bianca-rivera mentioned this issue Sep 17, 2024

Silent Failures #3471

Open

This was referenced Sep 18, 2024

Investigate Per-Service Monitoring Progress #3473

Closed

Sprint 3 (10.8.2024 - 10.21.2024) #3335

Closed

This was referenced Oct 3, 2024

Sprint 4 (10.22.2024- 11.4.2024) #3511

Closed

Sprint 5 (11.5.2024-11.18.2024) #3512

Open

meganhicks mentioned this issue Oct 21, 2024

Sprint 6 Plan #3601

Open

5 tasks

meganhicks closed this as completed Jan 16, 2025
