Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate regularly failing workflows #4872

Closed
26 of 31 tasks
davidkelliott opened this issue Aug 23, 2023 · 35 comments
Closed
26 of 31 tasks

Investigate regularly failing workflows #4872

davidkelliott opened this issue Aug 23, 2023 · 35 comments
Assignees
Labels
code quality github_actions Pull requests that update Github_actions code technical debt This issue is either technical debt or an issue that will lead to technical debt as time goes by.

Comments

@davidkelliott
Copy link
Contributor

davidkelliott commented Aug 23, 2023

User Story

As an MP engineer
I want to reduce the amount of workflow failures
So that we have more successful deployments

User Type(s)

MP engineers

Value

Less deployment failures
More consistent deployments

Assumptions / Hypothesis / Questions / Unknowns

Hypothesis

If we search for failed workflows on main branches in our repositories (excluding environments repo)
Then we can see which ones are regularly failing and try to improve them

Proposal

Search github for failed workflows on main:

https://github.com/ministryofjustice/modernisation-platform/actions?query=is%3Afailure+branch%3Amain+

Go through them and see which ones can be fixed, if they are failing regularly and not being addressed do we need a slack alert for them?

Unknowns

We have data from DORA metrics which indicates around 50% of our workflows fail. We don't know where these are now, if there is a lot of work to be done we may need to split this out into smaller issues.

Definition of done

  • workflows across all repos excluding the environments repo analysed:
    • modernisation-platform
    • modernisation-platform-terraform-ec2-autoscaling-group
    • modernisation-platform-ami-builds
    • modernisation-platform-terraform-ec2-instance
    • modernisation-platform-configuration-management
    • modernisation-platform-terraform-ecs
    • modernisation-platform-cp-network-test
    • modernisation-platform-terraform-ecs-cluster
    • modernisation-platform-terraform-environments
    • modernisation-platform-github-oidc-provider
    • modernisation-platform-terraform-iam-superadmins
    • modernisation-platform-github-oidc-role
    • modernisation-platform-terraform-lambda-function
    • modernisation-platform-terraform-loadbalancer
    • modernisation-platform-terraform-member-vpc
    • modernisation-platform-infrastructure-test
    • modernisation-platform-instance-scheduler
    • modernisation-platform-terraform-pagerduty-integration
    • modernisation-platform-terraform-aws-vm-import
    • modernisation-platform-terraform-s3-bucket
    • modernisation-platform-terraform-baselines
    • modernisation-platform-terraform-s3-bucket-replication-role
    • modernisation-platform-terraform-bastion-linux
    • modernisation-platform-terraform-ssm-patching
    • modernisation-platform-terraform-cross-account-access
    • modernisation-platform-terraform-trusted-advisor
  • workflows to fix identified
  • workflows fixed or issues created to fix
  • another team member has reviewed
  • tests are green

Reference

How to write good user stories

@davidkelliott davidkelliott added github_actions Pull requests that update Github_actions code needs refining technical debt This issue is either technical debt or an issue that will lead to technical debt as time goes by. code quality labels Aug 23, 2023
@ep-93 ep-93 assigned ep-93 and unassigned ep-93 Sep 5, 2023
@ewastempel ewastempel self-assigned this Sep 5, 2023
@ewastempel ewastempel moved this from To Do to In Progress in Modernisation Platform Sep 5, 2023
@ewastempel
Copy link
Contributor

@ewastempel is working on the modernisation-platform

@SteveLinden
Copy link
Contributor

SteveLinden commented Sep 7, 2023

@SteveLinden picked up the modernisation-platform-terraform-baselines code

for is: failure branch: main I get no results (leaving out the spaces gives an error)

@SteveLinden
Copy link
Contributor

@SteveLinden looked at modernisation-platform-terraform-s3-bucket

No issues listed with the same search

@connormaglynn
Copy link
Contributor

@connormaglynn looking at modernisation-platform-instance-scheduler 👀

@connormaglynn connormaglynn self-assigned this Sep 8, 2023
@SteveLinden SteveLinden self-assigned this Sep 8, 2023
@SteveLinden
Copy link
Contributor

SteveLinden commented Sep 8, 2023

Search incorrect. I will look again.
Search needs to be done on the actions with a filter rather than a search

@SteveLinden
Copy link
Contributor

So for modernisation-platform-terraform-baselines code there are many checkov errors. Need to go through and see which can be removed.

@SteveLinden
Copy link
Contributor

SteveLinden commented Sep 11, 2023

Added some chatbot ignores to modernisation platform terraform-baselines #ministryofjustice/modernisation-platform-terraform-baselines#283

@SteveLinden
Copy link
Contributor

SteveLinden commented Sep 11, 2023

Repeated the same for the s3-bucket one. ministryofjustice/modernisation-platform-terraform-s3-bucket#252

This was replaced by another call - #4996 see comment below

@SteveLinden
Copy link
Contributor

SteveLinden commented Sep 11, 2023

Waiting for the above releases to be approved so I will look at modernisation-platform-terraform-trusted-advisor

Errors reported in the tfsec but I am, currently, unable to locate those. Working on the checkov ones first

@connormaglynn
Copy link
Contributor

modernisation-platform-instance-scheduler integration tests fixed and this ticket has been created to remedy potential future issues

@ewastempel
Copy link
Contributor

ewastempel commented Sep 12, 2023

Analysis from modernisation-platform repository workflows:

Repo: https://github.com/ministryofjustice/modernisation-platform
Workflows: failed in main (search: "is:failure branch:main")
Workflow Name                         | Failure number  |               Comments            |    Action
______________________________________|_________________|___________________________________|________________
format-code.yml                       | 51              | no logs available                 | none
add-issues-to-project.yml             | 8               | no logs available                 | none
bootstrap-sprinkler.yml               | 0               | -                                 | none
close-stale-prs.yml                   | 0               | -                                 | none
core-logging-deployment.yml           | 6               | no logs available                 | none
core-network-services-deployment.yml  | 34              | run #1351 - timeout issue         | rerun and fixed 
                                      |                 | run #1338 - unknown               | fixed forward
                                      |                 | run #1312 - bad code              | fixed forward
                                      |                 | run #1309 - unknown               | fixed forward
                                      |                 | other runs have no logs           | none
core-security-deployment.yml          | 2               | no logs available                 | none
core-shared-services-deployment.yml   | 38              | state lock issue                  | none
core-vpc-development-deployment.yml   | 26              | run #994 - bad code               | fixed forward
                                      |                 | run #991 - bad code               | fixed forward
                                      |                 | run #986 - bad code               | fixed forward
                                      |                 | run #978 - bad code               | fixed forward
                                      |                 | run #976 - bad code               | fixed forward
                                      |                 | other runs have no logs           | none
core-vpc-preproduction-deployment.yml | 26              | bad code - same PRs as in dev     | fixed forward
core-vpc-production-deployment.yml    | 23              | bad code - same PRs as in dev     | fixed forward
core-vpc-test-deployment.yml          | 24              | bad code - same PRs as in dev     | fixed forward
generate-dependabot-file.yml          | 0               | -                                 | none
documentation.yml                     | 0               | -                                 | none
new-environment-files.yml             | 0               | -                                 | none
opa-policies.yml                      | 0               | -                                 | none
pages-build-deployment                | 0               | -                                 | none
publish-gh-pages.yml                  | 4               | run #160 - bad code               | fixed forward
labeler.yml                           | 0               | -                                 | none
scorecards.yml                        | 31              | mostly timeout issue when uploadi | 
                                      |                 | ng the artifact SARIF file from   |
                                      |                 | 2 months ago;                     |
                                      |                 | ScoreCards found 3 HIGH vulnerabil| create a ticket
                                      |                 | ities: https://github.com/ministr | to fix vulnerabilities
                                      |                 | yofjustice/modernisation-platform |
                                      |                 | /security/code-scanning           |
code-scanning.yml                     | 112             | all fixed/suppressed, mostly for: | create a ticket
                                      |                 | # checkov:skip=CKV2_AWS_57        | to revisit fixing CKV_AWS_149
                                      |                 | # checkov:skip=CKV_AWS_149        | to use customer-managed KMS 
SonarCloud analysis                   | 0               | workflow no longer exists         | none
terraform-static-analysis.yml         | 233             | all fixed/suppressed, mostly for: | same ticket as for
                                      |                 | # checkov:skip=CKV2_AWS_57        | code-scanning.yml
                                      |                 | # checkov:skip=CKV_AWS_149        | 
terraform-github.yml                  | 14              | run #1250 - actions secrets rate  |
                                      |                 | limit exceeded                    | none
terraform-member-environment.yml      | 4               | bad code mostly, no longer issue  | fixed forward
terraform-pagerduty.yml               | 4               | bad code mostly, no longer issue  | fixed forward
scheduled-baseline.yml                | 91              | noticed some pipelines were never | should aim to rerun failed
                                      |                 | rerun, but scheduled runs take ca | deployments when merging
                                      |                 | of it; latest run passes          |
Testing slack message                 | 0               | the workflow no longer exists     | none
Testing SSO assumption                | 0               | the workflow no longer exists     | none

Summarising, three tickets to be created:

  1. SonarCloud vulnerability issues: https://github.com/ministryofjustice/modernisation-platform/security/code-scanning
  2. Chekov issue:
    terraform-static-analysis.yml : https://github.com/ministryofjustice/modernisation-platform/actions/runs/5923065434/job/16058039084
    code-scanning.yml: https://github.com/ministryofjustice/modernisation-platform/actions/runs/5921047939/job/16053081949
  3. Additionally add a ticket to update all pipelines to use slack notifications when a pipeline fails.

Tickets created:
#4998
#4999
#5000

@SteveLinden
Copy link
Contributor

SteveLinden commented Sep 13, 2023

New issue for the modernisation-platform-terraform-s3-bucket created with Ewa's comment in there. #4996

@connormaglynn
Copy link
Contributor

Looked at modernisation-platform-cp-network-test. Only failing pipeline on main was the Secure Code Analysis from over 5 months ago, though they've been running fine since

GitHub had disabled the scheduled actions due to inactivity, which I've now re-enabled. This feature is probably something to be mindful of if we're depending on scheduled checks for inactive repos🤔

@SteveLinden
Copy link
Contributor

SteveLinden commented Sep 13, 2023

modernisation-platform-terraform-ssm-patching - checkov errors relate to calling the test area. I have commented out (#checkov:skip=) the offending items as they do not appear to be related to what we want to do.
Also failures on tflint (doesn't start) and tfsec for which I can't see any obvious errors

@ewastempel
Copy link
Contributor

ewastempel commented Sep 13, 2023

@ewastempel is picking up modernisation-platform-terraform-ecs-cluster

Repo: https://github.com/ministryofjustice/modernisation-platform-terraform-ecs-cluster/
All workflows that failed in main: 95
Workflow Name                         | Failure number  |               Comments            |    Action
______________________________________|_________________|___________________________________|________________
go-terratest.yml                      | 0               | -                                 | none
format-code.yml                       | 0               | -                                 | none
documentation.yml                     | 0               | -                                 | none
scorecards.yml                        | 1               | logs no longer available          | none
code-scanning.yml                     | 94              | no longer failing                 | none
terraform-static-analysis.yml         | 0               | -                                 | none

@SteveLinden
Copy link
Contributor

Changes to modernisation-platform-terraform-ssm-patching have been pushed through. I will check the results tomorrow to see if that fixes the checkov issues.

@ewastempel
Copy link
Contributor

@SteveLinden
Copy link
Contributor

Still trying to fix the tfsec error. Changing the scan_type to changed so, in theory, it should ignore other bits. Bit of a desperate change more in hope than anticipation. The error doesn't cause an issue but shows up as a fail on the actions list.
This is for modernisation-platform-terraform-trusted-advisor initially. If it does fix it I will do it for modernisation-platform-terraform-ssm-patching too.

@SteveLinden
Copy link
Contributor

modernisation-platform-terraform-cross-account-access had been failing on the tfsec part of the routine. This has been working for the past 3 weeks so it appears to be fixed,

@SteveLinden
Copy link
Contributor

SteveLinden commented Oct 16, 2023

Currently looking at modernisation-platform-terraform-bastion-linux. This calls the item I have changed in #4996

If that is successful I will apply a new latest release called 7.1.0 and apply this to the code and see what impact that has. Assuming it's none I will check what needs to be changed on the code to cure
Check: CKV_TF_1: "Ensure Terraform module sources use a commit hash"
FAILED for resource: module.bastion_linux.s3-bucket
Error: File: /main.tf:71-131

Also fails on CKV2_AWS_64. Neither of these is listed in the action for modernisation-platform-terraform-s3-bucket so will need to be checked/skipped or corrected.

Current PR for this is

@dms1981
Copy link
Contributor

dms1981 commented Oct 17, 2023

Applied fixes / addressed security code analysis alerts for the following repositories :

  • modernisation-platform-infrastructure-test
  • modernisation-platform-terraform-aws-vm-import
  • modernisation-platform-terraform-bastion-linux
  • modernisation-platform-terraform-pagerduty-integration
  • modernisation-platform-terraform-s3-bucket-replication-role

@SteveLinden
Copy link
Contributor

I've checked modernisation-platform-github-oidc-role and there are no errors showing other than dependabot issues.

Also checked modernisation-platform-incident-response which had lots of errors but I have been informed that this should not be on the list and the git should have been archived.

@SteveLinden
Copy link
Contributor

SteveLinden commented Oct 18, 2023

The changes made to the modernisation-platform-terraform-bastion-linux have corrected the issue and the check runs successfully now.
It's in

@SteveLinden
Copy link
Contributor

The version of modernisation-platform-terraform-ecs in the library is archived so this has not been looked at - last run was 6 months ago.

@SteveLinden
Copy link
Contributor

modernisation-platform-configuration-management - only failures are undertaken by user groups so this does not need any changes at the moment.

@SteveLinden
Copy link
Contributor

SteveLinden commented Oct 18, 2023

Lookin at modernisation-platform-terraform-iam-superadmins

There are many vpc errors - related to using * in the arns for example - which we are unlikely to be able to fix.
On checkov there are lots of CKV_TF_1 which can only be resolved by adding the "?ref=" in there. These are terraform code sections so we do not have access to the release details to set the value correctly.

It's possible we cannot resolve these so maybe reduce the number of runs?

@dms1981
Copy link
Contributor

dms1981 commented Oct 18, 2023

modernisation-platform-terraform-member-vpc - this one passes security checks now

@SteveLinden
Copy link
Contributor

SteveLinden commented Oct 18, 2023

modernisation-platform-terraform-lambda-function has not general failures but there are issues go tests such as being unable to find credentials ( Error: No valid credential sources found) or invalid count (Error: Invalid count argument) but they are infrequent.
Related to this I get an error when I try to test
Error: Cannot assume IAM Role
Error: operation error STS: AssumeRole, https response error StatusCode:
│ 403, RequestID: 17366048-281e-496d-b5f6-49455efd31ef, api error
│ AccessDenied: User:
│ arn:aws:sts:::assumed-role/AWSReservedSSO_AdministratorAccess_b8b1bbf0e9094790/[email protected]
│ is not authorized to perform: sts:AssumeRole on resource:

It was caused by an error in my config. This has been corrected.

@SteveLinden
Copy link
Contributor

On the modernisation-platform-terraform-lambda-function providers.tf code under test/unit-test it defaulted to the wrong ID rather than the correct one I wanted which was testing-ci-user. I commented out the code that was causing the issue and the test ran with no issues.
Everything is now fine on this code.

@SteveLinden
Copy link
Contributor

SteveLinden commented Oct 25, 2023

Looking at modernisation-platform-github-oidc-provider which has one checkov failure that should be cured with the addition of another condition (CKV_AWS_358: "Ensure GitHub Actions OIDC trust policies only allows actions from a specific known organization") and should be relatively easy to fix.

@SteveLinden
Copy link
Contributor

PR for the above has been posted. It's here

@markgov
Copy link
Contributor

markgov commented Oct 27, 2023

PR added for documentation workflow error on modernisation-platform-terraform-ec2-instance
ministryofjustice/modernisation-platform-terraform-ec2-instance#224

@markgov
Copy link
Contributor

markgov commented Oct 27, 2023

PR added for documentation workflow error on modernisation-platform-terraform-ec2-autoscaling-group
ministryofjustice/modernisation-platform-terraform-ec2-autoscaling-group#191

@davidkelliott
Copy link
Contributor Author

davidkelliott commented Oct 31, 2023

@dms1981
Copy link
Contributor

dms1981 commented Nov 1, 2023

I'll pick up the S3 bucket checks today
(ministryofjustice/modernisation-platform-terraform-s3-bucket#303)

@dms1981 dms1981 self-assigned this Nov 1, 2023
@dms1981 dms1981 closed this as completed Nov 1, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Modernisation Platform Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code quality github_actions Pull requests that update Github_actions code technical debt This issue is either technical debt or an issue that will lead to technical debt as time goes by.
Projects
Archived in project
Development

No branches or pull requests

7 participants