This came up as a discussion point in a recent team meeting. I am filing this issue to start a wider discussion, since the problem we're facing seems to extend beyond our team.
Problem
During the stability period, as tests have failed, it has become increasingly important to be able to find a team that owns or is familiar with a given roachtest. Because these tests don't have an owner, triaging their failures is very inefficient.
This idea could potentially be expanded to tests beyond roachtests, but roachtests have been the most painful to deal with and are what sparked this discussion.
Possible Solution
First, one important property that we think a solution should have is that any automatic triage should assign a failure to a team rather than an individual. This keeps ownership resilient to people joining and leaving teams.
Each roachtest could have a static definition which indicates which team owns that given roachtest. The team identifier could be a Slack channel. Assigning each test to a particular Slack channel has the benefit that the relevant team could be notified automatically when a roachtest fails during the nightlies.
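To make the static-definition idea concrete, here is a minimal sketch in Go. Everything in it is an assumption for illustration: the `Owner` and `testSpec` types, the `notifyTarget` helper, the team names, and the Slack channel names are hypothetical and do not reflect the existing roachtest API.

```go
package main

import "fmt"

// Owner identifies a team, never an individual (hypothetical type).
type Owner struct {
	Team         string
	SlackChannel string
}

// Hypothetical owners; the team and channel names are made up.
var (
	OwnerKV  = Owner{Team: "kv", SlackChannel: "#kv-test-failures"}
	OwnerSQL = Owner{Team: "sql", SlackChannel: "#sql-test-failures"}
)

// testSpec is a simplified stand-in for a roachtest definition.
// The Owner field is the static ownership declaration.
type testSpec struct {
	Name  string
	Owner Owner
}

// notifyTarget returns the Slack channel to notify when a test fails.
// It returns an error for tests that have no owner declared.
func notifyTarget(t testSpec) (string, error) {
	if t.Owner.SlackChannel == "" {
		return "", fmt.Errorf("roachtest %q has no owner", t.Name)
	}
	return t.Owner.SlackChannel, nil
}

func main() {
	specs := []testSpec{
		{Name: "example/kv-test", Owner: OwnerKV},
		{Name: "example/sql-test", Owner: OwnerSQL},
		{Name: "example/unowned-test"},
	}
	for _, s := range specs {
		channel, err := notifyTarget(s)
		if err != nil {
			fmt.Println("cannot triage:", err)
			continue
		}
		fmt.Printf("%s failed: notify %s\n", s.Name, channel)
	}
}
```

Making the owner a field on the test definition means a missing owner can be detected when tests are registered, rather than when a failure needs triaging.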
One limitation of this approach is that a given roachtest may exercise several parts of the system, so it is not always clear which team should own it. Some solutions to this problem have been proposed (for example, introducing phases to roachtests, so that each phase could be assigned to a particular team).
This is just one possible solution; discussion is welcome!
Additional Information
(This information is second-hand, so feel free to correct me if this is not the case)
I believe that we have previously experimented with automatically triaging such issues by looking at the commit history. However, it seems that this method would frequently pick the wrong owner, and it also ties the roachtest to an individual rather than a team.