Update our Incident Response process and Slack workflow #3478

Open
7 of 8 tasks
dfitchett opened this issue Sep 18, 2024 · 4 comments

Comments

@dfitchett
Contributor

dfitchett commented Sep 18, 2024

Description

As the on-call engineer for the VRO platform team, I need as many tasks automated as possible to reduce the risk of missing manual follow-ups. Specifically, I want to update our existing Incident Response process and Slack workflow to automatically incorporate the new step of reaching out to partner teams to assess how our downed services impacted their processes and, ultimately, the veterans we serve. This will ensure that we are documenting the effects of failures on the VRO platform and reduce the potential for silent failures.

AC

  • add a new task in the Incident Report Slack workflow to gather impact metrics from partner teams when a high-severity incident occurs (as part of the workflow or as a new message sent to the #benefits-vro-on-call channel)
    • specify in the instructions that the step should be done for SEV 1 and SEV 2 incidents
    • specify a timebox or SLA for completing the task
  • update Incident Response wiki page with the new step clearly stated
    • add any other updates based on latest changes regarding silent failures and deployment workflow (optional/as time allows)
  • create an impact metric table on the incident response page where these metrics can be populated and shared easily
  • share with VRO team
  • inform partners of the change so they know what to expect: send an update message to #benefits-vro-support in the thread of the Incident Report workflow launch; talk to Derek about what alerts already exist for BIP and BGS

Resources

Reference Incidents Epic

@dfitchett added the VRO-team and needs-refinement (needs refinement before it's ready to work) labels Sep 18, 2024
@bianca-rivera changed the title from "Incident Response Workflow - Add Impact Section" to "Update the Incident Response Workflow" Sep 23, 2024
@bianca-rivera self-assigned this Sep 25, 2024
@bianca-rivera changed the title from "Update the Incident Response Workflow" to "Update our Incident Response process and Slack workflow" Oct 1, 2024
@gabezurita
Collaborator

Note: I'm on this issue as part of my onboarding to help learn about the VRO on-call process 😄

@bianca-rivera removed the needs-refinement (needs refinement before it's ready to work) label Oct 10, 2024
@bianca-rivera

bianca-rivera commented Oct 29, 2024

Made more significant updates and formatting changes; added a new AC bullet to share with the VRO team before messaging partner teams about the latest update regarding impact metrics, including highlighting any other substantial changes.

@gabezurita
Collaborator

Excellent job, @bianca-rivera !!!
