REQUEST: New project roles: "Test Maintainer" and "Subject Matter Expert" #9735

Closed
jackfrancis opened this issue Nov 17, 2023 · 24 comments

Labels
  • area/release: Issues or PRs related to releasing
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.

Comments

@jackfrancis
Contributor

What would you like to be added (User Story)?

As a release manager, I would like to be able to easily triage test failures by escalating to a "Test Maintainer" in order to resolve test failures and unblock release progress.

As a "Test Maintainer", I would like to be able to rapidly diagnose test failures by escalating to a "Subject Matter Expert" in order to debug the specific underlying failures in a critical CI test.

Detailed Description

At present, the CAPI project has a notion of a "maintainer" (see the cluster-api-maintainers alias in the OWNERS_ALIASES file in the root of the project repository), who is responsible for approving reviewed PRs and merging them, and who holds various admin-level privileges to execute important tasks such as cutting a new release. Additionally, there is a notion of a "reviewer" (see the cluster-api-reviewers alias in the same OWNERS_ALIASES file), whose sole responsibility is to review PRs, provide feedback to submitters, and finally lgtm a PR when it can be accepted into the project.

Of the current formal roles defined in CAPI, only "maintainer" offers a path for aiding test failure triage (basically you would try your best to understand which test is failing, and perhaps why, and then reach out to a maintainer for help actually fixing things).

Because the CAPI project has grown considerably since these roles were codified, perhaps a more effective way to triage test failures is to introduce a dedicated "Test Maintainer" role with the following responsibilities:

  • Regularly observe test signal from CI (PR and periodic jobs)
  • Investigate test failures
    • Look for recurring patterns
    • Is the test itself buggy?
    • Is the failure a result of the test doing its job (i.e., the test detects something actually wrong)?
  • Triage to an appropriate subject matter expert, or fall back to a project maintainer

The above describes a new "Subject Matter Expert" role, whose responsibilities are as follows:

  • For a particular surface area of the code, understands intent and implementation at a deep level
  • Understands how this surface area is tested, and how to interpret test failures resultant from this surface area

Anything else you would like to add?

How would this work in practice?

For "Test Maintainer", we would want to agree upon a set of folks that would participate in a regular rotation. Defining how that rotation works is probably beyond this issue.

For "Subject Matter Expert", we would put an OWNERS file describing one or more maintainers for the code in the parent directory of the source that contains that code. That ownership is inherited by child directories, until another OWNERS file is encountered, which overridees the parent OWNERS file.

For example, this describes the current project-spanning ownership (i.e., current maintainers and reviewers):

/OWNERS

This would define a set of subject matter experts for the v1beta1 API:

/api/v1beta1/OWNERS

This would define a set of subject matter experts who are responsible for E2E tests:

test/OWNERS
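
For illustration, here is a minimal sketch of what one of these OWNERS files could look like, following the standard Kubernetes OWNERS format (the handles below are placeholders, not a proposal for specific people):

    # /api/v1beta1/OWNERS (illustrative sketch; handles are placeholders)
    reviewers:
      - v1beta1-sme-alice
      - v1beta1-sme-bob
    approvers:
      - v1beta1-sme-alice
    labels:
      - area/api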

Taking a concrete example, here is a recently failing CAPI test:

Note the stacktrace:

[FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:389 @ 11/17/23 11:42:09.424

A person fulfilling the "Test Maintainer" role could then do the following:

  • Find the most specific OWNERS set for /test/e2e/clusterctl_upgrade.go (a minimal lookup sketch follows this list).
  • Follow code references in the test code to gather an additional set of OWNERS references, if appropriate
  • Correlate CI test failures with similarly timed errors in capi-system controller logs, and for each relevant failure, correlate source references to their nearest OWNERS file
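
As a rough illustration of the first bullet, here is a minimal sketch (in Go; hypothetical helper code rather than existing CAPI tooling) of how the nearest OWNERS file for a failing test could be located. The function name and repo-root handling are assumptions made for the example:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // nearestOwners walks up from the directory containing failingFile toward
    // repoRoot and returns the path of the first OWNERS file it encounters.
    // This mirrors the OWNERS inheritance model: the closest file wins.
    func nearestOwners(repoRoot, failingFile string) (string, error) {
        dir := filepath.Dir(filepath.Join(repoRoot, failingFile))
        for {
            candidate := filepath.Join(dir, "OWNERS")
            if _, err := os.Stat(candidate); err == nil {
                return candidate, nil
            }
            if dir == filepath.Clean(repoRoot) {
                return "", fmt.Errorf("no OWNERS file found for %s", failingFile)
            }
            dir = filepath.Dir(dir)
        }
    }

    func main() {
        // Hypothetical usage: repo checked out in the current directory.
        owners, err := nearestOwners(".", "test/e2e/clusterctl_upgrade.go")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Println("escalate to the owners listed in:", owners)
    }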

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added the kind/feature and needs-triage labels Nov 17, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@furkatgofurov7
Member

Thanks for putting together your thoughts @jackfrancis. Overall I like the idea of having new roles for fixing CI failures more efficiently compared to what we currently have.

Because the CAPI project has grown considerably since these roles were codified, perhaps a more effective way to triage test failures is to introduce a dedicated "Test Maintainer" role with the following responsibilities:

Regularly observe test signals from CI (PR and periodic jobs)
Investigate test failures
Look for recurring patterns
Is the test itself buggy?
Is the failure a result of the test doing its job (i.e., the test detects something actually wrong)?
Triage to an appropriate subject matter expert, or fall back to a project maintainer

Most of the responsibilities described above already fall to the current CI team within the Release Team, so we may not need a new dedicated role for this; the CI Lead + CI team members should be able to cover them.

For "Test Maintainer", we would want to agree upon a set of folks that would participate in a regular rotation. Defining how that rotation works is probably beyond this issue.

The CI team has multiple members, and the RT already has a concept of membership rotation in which they take turns watching CI, triaging, etc.

The above describes a new "Subject Matter Expert" role, whose responsibilities are as follows:

For a particular surface area of the code, understands intent and implementation at a deep level
Understands how this surface area is tested, and how to interpret test failures resultant from this surface area

This, I think, is the most important role in the context of this issue, and it should help in making sure CI issues are resolved in a timely manner without blocking CAPI releases.

It would be great to hear others' opinions as well.

cc @CecileRobertMichon @fabriziopandini @sbueringer @killianmuldoon @vincepri @enxebre

@furkatgofurov7
Member

cc @kubernetes-sigs/cluster-api-release-team

@fabriziopandini
Member

I personally like the idea of having people on rotation focused on investigating flakes, for a couple of reasons:

  • It is a time/effort-bound commitment
  • Triaging issues is a great way to learn CAPI and all the machinery we use in CI
  • Also users can benefit from improving their skills in triaging CI failures

Given that, IMO this fits naturally in the CI team (which is on rotation for a release) or in a working group inside it (the triage team), but ultimately everything works for me.

My main concern is onboarding enough people willing to do the work so the project no longer depends on a very limited set of people to ensure the quality of upcoming releases; in this regard, let's consider whether titles like "Test Maintainer" and "Subject Matter Expert" might be intimidating for the folks approaching this job or available for a time-boxed/limited commitment only.

@sbueringer
Member

I'm not sure about the "Subject Matter Expert" role. Wouldn't this roughly match what we already have reviewers / maintainers for? (The problem is just that we already don't find enough folks for these roles.)

Not sure if it makes sense to create a new dedicated / separate role of a subject matter expert, given that reviewers / approvers should already be subject matter experts of the code they are responsible for and are also aware of recent changes (which are quite often the cause of newly occurring test failures).

@jackfrancis
Contributor Author

reviewers / approvers should already be subject matter experts of the code they are responsible for

I think that is what I was describing. How do the current organizational structures support the "of the code they are responsible for" part of that equation? My understanding is that a reviewer or owner of CAPZ is scoped to the entire project. If there are specific OWNERS configurations in specific parts of the code, then that's already the pattern I'm describing.

@sbueringer
Member

sbueringer commented Nov 20, 2023

In core CAPI we have additional maintainer/reviewer aliases for parts of the project: https://github.com/kubernetes-sigs/cluster-api/blob/main/OWNERS_ALIASES

These aliases are then referenced by OWNERS files in various directories.
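
For example (an illustrative sketch; the alias names here are placeholders, not the actual aliases in the repo), a directory-level OWNERS file can reference such aliases instead of listing individuals:

    # test/OWNERS (illustrative sketch)
    approvers:
      - cluster-api-test-maintainers
    reviewers:
      - cluster-api-test-reviewers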

@jackfrancis
Contributor Author

@sbueringer @fabriziopandini I think this has been a good conversation. It seems to me that the two roles we're discussing largely already exist. Perhaps a good way to better formalize this (I'm assuming a little more formality is helpful, but I could be wrong there) is to clarify the following:

  • The release team creates a new sub-role w/ rotation to monitor test signal (maybe they are already doing this and taking turns monitoring tests, but we could potentially help by defining a calendar structure)
  • The responsibility of the CI test monitor is not to fix the tests but to triage them to an appropriate owner

The 2nd line item may not be super smooth at first, as lots of things may just result in calls for help to CAPI project maintainers, but this could act as a forcing function to better represent specific areas of the code as owned by certain folks. Over time this could lead to more effective test troubleshooting, and ultimately, better code ownership.

We could write the above up as a community in a doc, w/ the release team being the key stakeholder in how we decide to do things.

WDYT?

@fabriziopandini
Member

The responsibility of the CI test monitor is not to fix the tests but to triage them to an appropriate owner

This is what we are doing now, and it doesn't work because we ultimately depend on a very limited set of people to ensure the quality of upcoming releases. (now triage to an appropriate owner = ping maintainer)

We need more support from the community if we don't want to put our future releases at risk.

What I propose is to try to assign the responsibility of fixing tests to a team of folks, let's call it "CI Reliability team"; people will try to get systems back to a steady state as quickly as possible, or escalate to maintainers when necessary.

In order to get people to volunteer for this role, I suggest that this team should be on rotation with a time/effort-bound commitment, like the release team, and to keep things simple, I think that this team could actually be an addition to the release team roster (Note: I understand the objection that this is not usually part of a release team tasks, but I'm trying to be pragmatic and use the process we already have without adding further complexity)

If we advertise this new "CI Reliability team" properly, it becomes a matter of picking the right folks in the list of candidates for the next release team; the headline for this role could be:

  • Triaging issues is a great way to learn CAPI and all the machinery we use in CI
  • Not only developers but also end users or CAPI administrators can benefit from improving their skills in triaging and solving CI failures
  • By removing flakes and test failures, you will provide a great contribution to make the next release awesome
  • ...

@sbueringer
Member

sbueringer commented Nov 22, 2023

Just to be clear, I'm 100% in favor of experimenting in this area to try to get more help because we clearly need it.

I'm just doubting the original idea of this issue as I understood it, which would have been to establish an additional role specified in OWNERS files in addition to reviewers / approvers for specific areas (OWNERS files exist to define review & approve permissions).

@CecileRobertMichon
Contributor

FWIW the Kubernetes release team used to have two separate roles for "CI signal" and "bug triage", and they are now combining those two roles into one: https://groups.google.com/a/kubernetes.io/g/dev/c/9eT38nEQUZ4/m/kNPgzhM3AAAJ?utm_medium=email&utm_source=footer

It might be interesting to understand their reasoning so we can learn from their experience

@furkatgofurov7
Member

FWIW the Kubernetes release team used to have two separate roles for "CI signal" and "bug triage", and they are now combining those two roles into one: https://groups.google.com/a/kubernetes.io/g/dev/c/9eT38nEQUZ4/m/kNPgzhM3AAAJ?utm_medium=email&utm_source=footer

It might be interesting to understand their reasoning so we can learn from their experience

We discussed the idea of introducing a dedicated "Bug Triage" team in #9271 and in the end decided not to do so, since the current CI team already has responsibilities that partly match what we would be introducing.

@fabriziopandini
Member

fabriziopandini commented Nov 23, 2023

I think that in our model "CI signal" and "Bug Triage" are already somewhat merged and it works well; the output of their work is a set of issues covering all the flakes/test failures, with all the details they managed to collect.
What I'm proposing with the "CI Reliability team" is something different; the responsibility of this team is to take charge of the issues above and to try to fix flakes/failures (or escalate to maintainers when necessary).

@furkatgofurov7
Member

cc @cahillsf FYI (since you are a release lead for 1.7)

@CecileRobertMichon
Contributor

cc @adilGhaffarDev as well since you are CI signal manager for the 1.7 release

@fabriziopandini fabriziopandini added the area/release label Jan 19, 2024
@nawazkh
Member

nawazkh commented Jan 24, 2024

Thanks for putting this together @jackfrancis !

I like the concept suggested by this request. However, from my experience, there seems to be a significant overlap in responsibilities between the "Test Maintainer" and "SME" roles and the existing duties of the CI Team (as @furkatgofurov7 already pointed out in #9735 (comment)).

I also like the idea of a "CI Reliability Team," but the expectations for this team appear to coincide with the responsibilities of the existing CI Team.

Perhaps a reassessment of the CI Team's role and expectations could provide insights into the core issue? Below is my perspective. (Also, we need to clarify the CI Team's scope via #9849.)


Summary of CI Duties:

The responsibilities section of the CI Signal/Bug Triage/Automation Manager team mainly outlines three broad duties.

  1. CI/Testgrid
    1. Continuously monitor testgrid
    2. Continuously reduce the amount of flaky tests
  2. Bug Triage
  3. Automation (of release related tooling and documentation)

The expectations surrounding "Bug Triage" and "Automation" have been somewhat unclear.

  1. "Bug Triage":

    • Overview:
      • "Bug Triage" originally expected the CI Team to triage all the issues logged onto CAPI.
      • In practice, however, an issue can only be effectively triaged by someone with a deeper understanding of CAPI.
      • Unless a CAPI maintainer or someone deeply knowledgeable about CAPI leads the CI Team, triaging all CAPI issues seems overly broad.
    • Proposal:
      • The CI Team should only be expected to triage test-related issues.
        • The team would be responsible for reproducing and tracking down issues tagged with "area/testing," "area/e2e-test," or "kind/failing-test".
      • Bug triaging should then fall to the maintainers and community members creating and scanning the GitHub issues.
      • It should be the responsibility of the maintainers and issue creators to assign the correct labels, enabling the CI Team to effectively track and address issues.
    • Result:
      • The CI Team now has a clear understanding of the issues it owns, can reproduce, track, and hopefully resolve.
      • There is now a shared responsibility between the CI Team and the maintainers in resolving CI-related issues.
  2. Automation (of release related tooling and documentation):

    • Overview:
    • Proposal:
      • Maintenance of release automation tooling should be the responsibility of the entire Release Team rather than solely the CI Team.
    • Result:
      • This adjustment narrows the CI Team's scope, allowing them to focus primarily on maintaining Testgrid's health.
      • Release Team members, being part of the release process, are encouraged to undertake any automation-related tasks.
  3. CI/Testgrid:

    • Overview:
      • The CI Team has traditionally been limited to 4-5 volunteers per release cycle, including a mix of beginners and experienced CAPI contributors/maintainers.
      • As far as I know, the CI Team has always maintained a rotation for monitoring Testgrid (@adilGhaffarDev please update if it is otherwise now, or if we need to formalize the idea of rotating the CI monitoring duties). A rotation system for monitoring Testgrid has been a key method for involving new members in CAPI through CI Team membership.
        • Monitoring CI also implied that the person monitoring CI was free to run the tests locally and poke around.
      • Since the CI Team consisted of a healthy mix of members, fixing a test issue usually involved debugging it themselves and, if it could not be fixed, collaborating with the maintainers to fix the issue.
        • In my experience I found this quite enriching: as a beginner, I was able to collaborate directly with the maintainers, and I was also able to connect CI team members with the test SMEs while resolving the test issue.
    • Proposal:
      • The CI Team should continue to focus on "Monitoring CI" and "resolving CI-related bugs in collaboration with maintainers."
      • The fundamental purpose of the CI Team's monitoring efforts is to share the maintenance burden of test signals and involve new members in CAPI.
      • We should perhaps revisit the CI team duties and list these expectations so that the CI lead (and the team) are aware of the idea behind having a CI team in the first place.
    • Result:
      • The CI Team is clearly tasked with owning test-related issues and is expected to attempt debugging before seeking help from CAPI maintainers/SMEs.
      • CI leads are empowered to structure their approach to continuous monitoring, triaging test issues, and practicing leadership skills.
      • The CI Team continues to share the burden of maintaining CI signals with key stakeholders.

  • The release team creates a new sub-role w/ rotation to monitor test signal (maybe they are already doing this and taking turns monitoring tests, but we could potentially help by defining a calendar structure)
  • The responsibility of the CI test monitor is not to fix the tests but to triage them to an appropriate owner

I hope my above explanation adds more clarity regarding the practices and functioning of the CI Team.

Moreover, I believe that the duties of the proposed "CI Reliability Team" are essentially those of the existing CI Team within the CAPI Release Team. The main difference is that bug triage might fall outside the CI Team's scope, focusing instead on addressing test-related issues.

@adilGhaffarDev
Contributor

In order to get people to volunteer for this role, I suggest that this team should be on rotation with a time/effort-bound commitment, like the release team, and to keep things simple, I think that this team could actually be an addition to the release team roster (Note: I understand the objection that this is not usually part of a release team tasks, but I'm trying to be pragmatic and use the process we already have without adding further complexity)

If we advertise this new "CI Reliability team" properly, it becomes a matter of picking the right folks in the list of candidates for the next release team; the headline for this role could be:

* Triaging issues is a great way to learn CAPI and all the machinery we use in CI

* Not only developers but also end users or CAPI administrators can benefit from improving their skills in triaging and solving CI failures

* By removing flakes and test failures, you will provide a great contribution to make the next release awesome

I agree with @fabriziopandini on this, but I am not sure if it should be a separate team or if we should just add this to the responsibilities of the CI Team. I am OK with both; if we plan to add it to the responsibilities of the "CI Team", we just have to keep it in mind when forming the CI Team and communicate it properly to the team.

For this release cycle we are doing the following:

  • On a daily basis we keep checking CAPI CI to see if there are any new flakes.
  • We have a cycle for CI signal monitoring.
  • We have also started weekly meetings, in which we go through the CAPI triage board to see if there are new flakes; we check these flakes regularly too, but in the weekly meetings we try to summarize them.
  • Regarding fixing flakes, we are planning debug sessions from next week, where we will focus on one flake at a time, try to debug it, and share our findings on the GitHub issue for the related flake so maintainers are aware of it. We might not be able to fix the flake, but it will be a good learning experience for the team.
  • Our team is also checking the automation tasks assigned to the CI team, if they are available.

@adilGhaffarDev
Contributor

  • "Bug Triage" originally expected the CI Team to triage all the issues logged onto CAPI.

I don't think this has been done by the CI team for the last 2 release cycles. I also think we should update the documentation and remove bug triaging from the CI team's responsibilities. This can be done weekly in the release team meeting by the Release Lead when the whole team is present.

2. Automation (of release related tooling and documentation):

I agree, I think this is also documented incorrectly and we need to fix it. But in practice things are moving in the right direction: automation tasks in the last 2 release cycles, and in this one too, were divided accordingly; not all tasks were assigned to the CI Team.

@cahillsf
Member

cahillsf commented Feb 28, 2024

Brought up in CAPI office hours (see discussion details here: https://docs.google.com/document/d/1X5_qNUvoY0Tk3BwXODWHAl72L-APM_GiA-IE4WoitYY/edit#heading=h.y2rfv7e8f0wp)

Next steps as understood from the release team:

  1. Clarify our team documentation for the CI team to:
    • remove any misleading tasks that currently are documented under CI team scope (i.e. general "bug triage")
    • remove automation tooling from the CI team name (this should be documented as falling under the general responsibility of the whole release team and assigned to the specific teams as applicable)
    • indicate that the desired state of CI team responsibility for CI-related flakes/bugs is not only to identify the issue, but to find the root cause and introduce a fix. This also means that maintainers and other invested community members should be available to help/assist/educate the CI team when they need an escalation path. This effectively encapsulates the concept of CI reliability (as mentioned here by Fabrizio) into the existing CI team's responsibility
  2. Take the steps necessary to bring the ideal state, i.e. having the CI team empowered/capable/responsible for the complete resolution of CI issues as quickly as possible, closer to reality. Some ideas:
    • increase the size of the CI team for release-1.8 cycle
    • define what the CI team's "escalation path" should look like when they have reached the limits of their understanding/troubleshooting paths

Feel free to correct me on any of the above. As advised by Fabrizio, I believe these action items are "small steps" that we can iterate on.

@fabriziopandini
Member

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog label Apr 11, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jul 10, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 9, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Sep 8, 2024