Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incident Report: Teams were deleted during trial deployment of CLOWarden. #2356

Closed
austinlparker opened this issue Sep 18, 2024 · 30 comments
Closed

Comments

@austinlparker
Copy link
Member

austinlparker commented Sep 18, 2024

This issue briefly covers the incident that took place on 09/18/24 starting at 8:49 AM Pacific/11:49 AM Eastern. All times in the document will be in Eastern time.

Timeline of events

Approx. 11:45 AM - A change was pushed to the OpenTelemetry CLOWarden configuration to address a new configuration file location (open-telemetry/people). This configuration file was deliberately limited -- it defined one team, and one repository. Prior to this change, a separate installation of CLOWarden was used in a separate organization to validate the expected outcome with no issues.

11:48 AM - A PR was made to the new config.yaml file. This change added a single user. CLOWarden picked up this change and began to validate the current state vs. the desired state. For an unknown reason, the behavior of CLOWarden differed in the live environment, and it began to apply changes to the current state of the open-telemetry organization prior to approving the PR.

11:49 AM - Teams are deleted via CLOWarden and the GitHub API.

11:49 AM - Infra SIG (@austinlparker and @trask) pull the most recent CLOWarden state backup (from community/config.yaml) and begin to address statefile issues in order to re-create teams from this backup.

~12:20 PM - The statefile passes validation, but CLOWarden fails to process it due to a lack of repository information in the config.yaml. Out of an abundance of caution, we decide that we do not want to add repository information without fully understanding the application code as the existing behavior was a surprise/unexpected. A Rust SME is invited into the SIG call (@TommyCpp) who suggests that we pursue an alternative restoration strategy via Python script and remove CLOWarden from the loop entirely. A brief discussion ensues, and all parties agree on this course of action.

~1:45 PM - A script is developed and tested to re-create the teams based on the backup directly via GitHub API.

~1:49 - The team recreation job begins. We begin to work on restoring team/repository permissions.

2:30 PM - A script to repair repository/team permissions has been written and tested. Beginning to roll out updates.

2:39 PM - Permissions have been modified for all repositories based on the backup and we have confirmed this in the GitHub UI.

Current Status

  • The CLOWarden app has been disabled.
  • Teams and Repository Permissions have been restored from backup.
  • The backup is from July 12th, 2024. Please double check permissions and CI/CD, as well as updating team affiliations for any modifications since July.

This incident is now closed. This thread will be left open for discussion until we have a more comprehensive retrospective and learnings available, which will be posted on opentelemetry.io. Thank you for your patience as we resolved this issue.

@trask
Copy link
Member

trask commented Sep 18, 2024

it began to apply changes to the current state of the open-telemetry organization prior to approving the PR

this is the part that most baffles me

@mtwo
Copy link
Member

mtwo commented Sep 18, 2024

Thanks @austinlparker and @trask for working so quickly to fix this! Infra / tooling changes like this are always fragile and are often impossible to test, especially when they do things like that ^

@trask
Copy link
Member

trask commented Sep 18, 2024

TODO

@dyladan

This comment was marked as outdated.

@marcalff
Copy link
Member

marcalff commented Sep 18, 2024

Also missing configuration-approvers, which was created very recently:

@TommyCpp

This comment was marked as resolved.

@austinlparker
Copy link
Member Author

Missing rust-approvers and rust-maintainers

https://github.com/orgs/open-telemetry/teams?query=rust

Looks like they're there?

@austinlparker
Copy link
Member Author

@open-telemetry/rust-approvers @open-telemetry/rust-maintainers

@trask
Copy link
Member

trask commented Sep 18, 2024

TODO

  • review teams that were created since the backup that we used from 2 months ago
  • ping all teams here to ask them to validate their current membership since the backup we used was from 2 months ago

it looks like we can get all of the changes since the backup via the GitHub audit logs:

I'll work on applying these changes...

@austinlparker
Copy link
Member Author

Thanks, @trask!

@florianl
Copy link

florianl commented Sep 18, 2024

ebpf-profiler-approvers and ebpf-profiler-maintainers (please see also recent changes from open-telemetry/opentelemetry-ebpf-profiler#151) seems to be missing for https://github.com/open-telemetry/opentelemetry-ebpf-profiler/.

@trask
Copy link
Member

trask commented Sep 18, 2024

all changes to team membership since the July 12 backup have been applied (from the github audit log)

looking now to see if I can get repo permissions updates from the audit log...

@trask
Copy link
Member

trask commented Sep 18, 2024

everything should be resolved now

please report any issues here, thanks!

@yurishkuro
Copy link
Member

  1. Could we link to the PRs mentioned in the timeline, to better understand the changes?
  2. Was CLOWarden run from a workflow or as an external app?
  3. Is there a dry-run option in CLOWarden where it logs what changes it would execute?

@austinlparker
Copy link
Member Author

austinlparker commented Sep 18, 2024

  1. Could we link to the PRs mentioned in the timeline, to better understand the changes?

  2. Was CLOWarden run from a workflow or as an external app?

  3. Is there a dry-run option in CLOWarden where it logs what changes it would execute?

  1. This PR (now repurposed) shows the changes from CLOWardens perspective. add manual management script gh-manager#2 (comment) - I linked to the specific comment it gave when modifying the limited state file; You can look at it in the PR history. Subsequent commits/messages are from when we tried to recover using CLOWarden vs. writing our own script.

  2. CLOWarden is run as an external application that watches PRs in specified repositories.

  3. No, there is no dry run option that I'm aware of or found. This is why we tried a small test on a separate organization. The setup for that test was an organization with an existing team that was not managed in the state file. A change was made to add a new team. In that instance, it did not automatically delete the existing team. The behavior in todays case was rather surprising given both our prior tests, as well as our expectations for how a desired state engine would function (for example, we expected that it would not make changes without human intervention/approval for a PR, which is what the docs suggest) lol

@tegioz
Copy link

tegioz commented Sep 19, 2024

Hi 👋

This is Sergio, one of the CLOWarden maintainers.

First of all, I'm sorry you run into some problems while testing CLOWarden. Please see my response here for more details about this incident: cncf/clowarden#261 (comment)

@tegioz
Copy link

tegioz commented Sep 19, 2024

  1. No, there is no dry run option that I'm aware of or found.

You can simulate a dry-run by using the diff subcommand of the CLI tool.

The behavior in todays case was rather surprising given both our prior tests, as well as our expectations for how a desired state engine would function (for example, we expected that it would not make changes without human intervention/approval for a PR, which is what the docs suggest).

From CLOWarden's README file, in the How it works section:

CLOWarden's main goal is to ensure that the resources' desired state, as defined in the configuration files, matches the actual state in the corresponding services. This is achieved by running reconciliation jobs, that can be triggered on-demand or periodically.

CLOWarden monitors pull requests created in the configuration repository and, when applicable, it validates the proposed changes to access rules and creates reconciliation jobs to apply the necessary changes. This is what we call an on-demand reconciliation job: it's created as a result of a user's submitted pull request, and changes are applied immediately once the pull request is approved and merged.

Sometimes, however, this may not be enough. Changes can be applied manually to the service bypassing the configuration files (i.e. from the GitHub settings UI), and CLOWarden still needs to make sure that the actual state matches the desired state documented in the configuration files. So in addition to on-demand reconciliation jobs, CLOWarden runs periodic ones to ensure everything is all right all the time.

In any case, we'll try to improve the docs to make it clearer how it works (i.e. highlighting some important parts) and avoid potential problems in the future to other users.

@austinlparker
Copy link
Member Author

I didn't see a way to configure the periodic reconciliation, or even an explanation of when it runs. Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version. When I ran the diff (which is how we had the backup to begin with), CLOWarden complained about it (you can see the complaints in the linked PR above).

Ultimately, we did not fully understand or appreciate how CLOWarden functioned here, and did not have the requisite expertise in Rust to really audit the code or appreciate its functionality. I believe that better documentation would have helped here, or perhaps the ability to clearly configure/understand the periodic reconciliation loop.

@tegioz
Copy link

tegioz commented Sep 19, 2024

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

I understand, that's fine. We needed to be able to remove teams/permissions/etc manually added from the GitHub UI without maintainers interaction or approval, it was a requirement for us.

When I ran the diff (which is how we had the backup to begin with), CLOWarden complained about it (you can see the complaints in the linked PR above).

Could you point me to those logs please? I saw the clowarden-server logs in the issue you created, but I didn't see any run of the diff subcommand (dry-run). Would love to see what it reported.

I mean the output of this command, which should detail all changes that would be applied when the service is launched (all resources to be deleted in this case):

export GITHUB_TOKEN=...
clowarden-cli diff --org open-telemetry --repo gh-manager --branch main

@austinlparker
Copy link
Member Author

I don't have the logs for the diff command unfortunately, I ran it several weeks ago.

@yurishkuro
Copy link
Member

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

Isn't that what you would expect from the automation? The config becomes the source of truth, and if the change to config is merged the automation should apply it without additional manual confirmation. Are we saying that in this case the config accurately represented SOTW but CLOWarden went and erased everything?

@austinlparker
Copy link
Member Author

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

Isn't that what you would expect from the automation? The config becomes the source of truth, and if the change to config is merged the automation should apply it without additional manual confirmation. Are we saying that in this case the config accurately represented SOTW but CLOWarden went and erased everything?

I would expect that if you had a PR open that modified the state, that would be the one that took effect, not the diff between the empty state and the PR, if that makes sense?

@yurishkuro
Copy link
Member

yurishkuro commented Sep 19, 2024

My understanding of the explanation provided that PR had nothing to do with it. Both periodic and adhoc syncs are still run off the merged config. So if the config was actually empty in main then deleting all groups would be the expected behavior once the app is enabled. But if the config was actually reflective of SOTW but CLOWarden's some internal state was empty and it deleted everything - then yes, it would be a big problem. I still haven't seen the actual PR - the timeline says there was a config change where the config file location was changed, and perhaps to some new version of config that didn't represent SOTW - then again, the behavior was expected (but not before PR was merged).

if CLOWarden has a CLI that can apply the changes, I would avoid using external app and start running the CLI in dry mode for PRs, and only tell it to apply the changes in manual runs (e.g. a dispatched workflow, after diff is verified). Once we gain confidence in it, we can enable auto-runs.

@austinlparker
Copy link
Member Author

My understanding of the explanation provided that PR had nothing to do with it. Both periodic and adhoc syncs are still run off the merged config. So if the config was actually empty in main then deleting all groups would be the expected behavior once the app is enabled. But if the config was actually reflective of SOTW but CLOWarden's some internal state was empty and it deleted everything - then yes, it would be a big problem. I still haven't seen the actual PR - the timeline says there was a config change where the config file location was changed, and perhaps to some new version of config that didn't represent SOTW - then again, the behavior was expected (but not before PR was merged).

if CLOWarden has a CLI that can apply the changes, I would avoid using external app and start running the CLI in dry mode for PRs, and only tell it to apply the changes in manual runs (e.g. a dispatched workflow, after diff is verified).

The PR is the PR I linked upthread. I understand from the explanation given why things worked the way they did, even if I didn't feel that it was clear it would happen this way before.

  1. We created a new repo and added a new config file (config.yaml, seen here open-telemetry/gh-manager@4bceb97) that had the initial team we wanted to manage through CLOWarden. We did this because prior maintainer feedback was that certain teams did not wish to use CLOWarden, so our initial spike was going to be managing a limited set of teams. In testing, this appeared to work fine, but that is because we got lucky.
  2. Once the new config file was created, the CLOWarden service was restarted in k8s to pick up the new config file location (previously, it was pointing to community/config.yaml). This is the critical difference between the test environment and the live one, we did not restart CLOWarden in that case.
  3. As CLOWarden restarted, it read the config.yaml file that existed with only one team, and its periodic reconciliation engine deleted all other teams to match the state in that file.

This is all relatively academic because we aren't going to be using CLOWarden after this since we don't feel that we can support a tool we don't really understand, and it won't allow us to slowly roll out features to the entire organization.

@yurishkuro
Copy link
Member

Fair enough - if we achieve the same with ~300 lines of Python, I would also go with that for better control.

@trask
Copy link
Member

trask commented Sep 27, 2024

Closing this now that recovery is completed and no new issues have been reported for a bit. If there's any more issues or details people want just let us know, thanks.

@kaylareopelle
Copy link
Contributor

@trask - Users of multiple ruby-related teams, including the ruby-maintainers and ruby-contrib-maintainers teams are are no longer able to update Project boards in the open-telemetry org.

Were some permissions related to Projects changed in the recent incident?

Someone may be able to follow these instructions to update our permissions.

@trask
Copy link
Member

trask commented Oct 16, 2024

hi @kaylareopelle! yeah, this was an unfortunate side effect

I've just granted write access for both @open-telemetry/ruby-maintainers and @open-telemetry/ruby-contrib-maintainers to these projects:

@kaylareopelle
Copy link
Contributor

kaylareopelle commented Oct 16, 2024

Thank you, @trask! Everything's working on my end now! I believe approvers also previously had write access. Is that the norm for other SIGs?

If so, would you mind granting access to ruby-approvers and ruby-contrib-approvers too?

@trask
Copy link
Member

trask commented Oct 16, 2024

done 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants