Signed-off-by: Toby Lorne <[email protected]> Co-authored-by: Stephen <[email protected]> Co-authored-by: Toby <[email protected]>
Thoughts on CI/CD, based on conversations I've had over the last 18 months of talking about pipelines and Concourse.
CI / CD
This proposal is to cover continuous integration (CI) rather than continuous deployment (CD). As an organisation we have a standardised way of continuously deploying things without human intervention using Concourse, and a large number of Concourse deployments for this purpose. We also use Concourse for triggering jobs manually. We currently do not use software-as-a-service (SaaS) solutions for deployment, because we (depending on the provider):
Instead we have met our own user needs using Concourse/Jenkins; this is a valid exception to the TechOps principles/strategy. We are comfortable that we can operate Concourse at scale, given our years of experience operating it.
CI / CD and trust
We have concerns about the IA/security of using SaaS solutions for CD; however, these concerns do not apply to CI tools. GOV.UK PaaS and GSP both developed techniques for deploying code hosted by untrusted sources using commit signing. As a result we can host our code wherever we like, and instead trust developers to sign code. A trusted deployer (Concourse pipeline) is responsible for verifying the integrity of the source code from an untrusted repository (e.g. GitHub). Therefore, we should have no concerns about the security / information assurance posture of any CI tool we pick, and instead we can evaluate the CI tools based on developer experience and cost to the taxpayer.
The following steps use an untrusted CI tool, but the code is still trusted all the way from laptop to prod:
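As an illustration of the verification a trusted deployer performs (not a reconstruction of the full list of steps), here is a minimal Python sketch. It assumes the cryptographic check has already been done by something like `git verify-commit`, and that the deployer keeps its own allow-list of developer key fingerprints. The `Commit` type, the `is_deployable` function and the fingerprints are hypothetical names for illustration, not the GOV.UK PaaS or GSP implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Commit:
    """A commit as seen by the trusted deployer (hypothetical model)."""
    sha: str
    signer_fingerprint: Optional[str]  # None when the commit is unsigned
    signature_valid: bool              # result of an external signature check


def is_deployable(commit: Commit, trusted_fingerprints: FrozenSet[str]) -> bool:
    """Deploy only commits with a valid signature from a trusted developer key."""
    if commit.signer_fingerprint is None:
        return False
    if not commit.signature_valid:
        return False
    return commit.signer_fingerprint in trusted_fingerprints


# Hypothetical allow-list, held by the deployer rather than the code host or CI tool.
TRUSTED = frozenset({"AAAA1111", "BBBB2222"})

signed_by_developer = Commit("abc123", "AAAA1111", True)
signed_by_stranger = Commit("def456", "CCCC3333", True)
unsigned = Commit("0a1b2c", None, False)

for commit in (signed_by_developer, signed_by_stranger, unsigned):
    verdict = "deploy" if is_deployable(commit, TRUSTED) else "reject"
    print(commit.sha, verdict)
```

Because the allow-list lives with the deployer in this model, neither the code host nor the CI tool needs to be trusted: an unsigned commit, or one signed by an unknown key, is rejected wherever it came from.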
(This is the happy path; there are other threat models to consider, but they are not relevant to this proposal.)
User needs for CI
This proposal is really about making developers and service teams productive, whilst also getting the benefits of consolidation and consistency. There are two high level user needs (non-exhaustive):
As a developer writing code and pushing it to version control, I expect my tests to be run automatically using a CI service against real (ephemeral) databases, so that I do not have to run tests on my local machine.
As a code reviewer, I expect the tests for an open pull request to be run automatically using a CI service against real (ephemeral) databases, so I do not have to check out the code and run the tests locally, and so I can see at a glance whether the tests (and other status checks) are passing.
(This does not explicitly mention the "coding in the open" aspect, but that must be a consideration.)
Technical considerations
Concourse and Jenkins, the two pieces of self-hosted software we currently use for CI, meet our needs, but not optimally, for the following (non-exhaustive) reasons:
|
I love the thought that's gone into this. One PaaS tenant deploys each PR to a new, temporary app on the PaaS. This allows real-world previewing of how it works. I think that in your model this would require the untrusted CI tool to have PaaS credentials, or to have the trusted CD tool also watching PRs. Do you have any thoughts on using that model for relatively small GDS services? |
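The per-PR preview model described above could look roughly like the following Python sketch, which only builds the Cloud Foundry CLI commands and does not run them. The naming scheme and the function names are assumptions for illustration, not how the tenant in question does it.

```python
from typing import List


def preview_app_name(service: str, pr_number: int) -> str:
    """Derive a per-PR app name (hypothetical naming scheme)."""
    if pr_number <= 0:
        raise ValueError("pr_number must be positive")
    return f"{service}-pr-{pr_number}"


def push_command(service: str, pr_number: int) -> List[str]:
    """Command to create or update the preview app when a PR is opened or updated."""
    return ["cf", "push", preview_app_name(service, pr_number)]


def teardown_command(service: str, pr_number: int) -> List[str]:
    """Command to remove the preview app (-f: no prompt, -r: delete its routes)."""
    return ["cf", "delete", preview_app_name(service, pr_number), "-f", "-r"]


print(" ".join(push_command("example-service", 42)))
print(" ".join(teardown_command("example-service", 42)))
```

Whichever tool runs these commands needs PaaS credentials, which is the crux of the question: either the untrusted CI tool holds them, or the trusted CD tool also watches PRs.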
This section of this comment is a direct response to @46bit.
I have no concerns about using "untrusted" CI tools for deploying applications to GOV.UK PaaS, provided the credentials are scoped accordingly.
i.e. for PR previews, a SaaS CI tool has:
i.e. for team-manuals, other microsites, tools, etc., a SaaS CI tool has:
I think it would depend on the needs of the developer setting up PR checks.
This section of this comment is more general thoughts.
Developer experience
The user experience of setting up deployment pipelines with SaaS CI tools is, in my experience, frustrating; whereas Concourse is flexible and extensible enough for the needs of Platforms at GDS. The user experience of setting up simple CI pipelines with SaaS CI tools is, in my experience, an absolute dream.
Financial resources
The financial resources required for a self-hosted CD tool running fairly custom pipelines that deploy our platforms are quite low, because deployment pipelines tend not to require much compute. We pay for capacity (e.g. we have 3 t3.medium VMs running 24x7); self-hosting also allows us to mitigate IA concerns and to have very flexible pipelines.
The financial resources required for self-hosting CI tools are quite high, because the compute resources needed are roughly correlated with developer activity/productivity. During the day, when people are working, we need to ensure tests run quickly, and capacity planning becomes harder: enter autoscaling, etc. SaaS CI tools' pricing models, which charge based on usage rather than capacity, fit this stage of the development cycle better IMO. |
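The capacity-versus-usage argument above can be made concrete with a small Python sketch. All prices and build-minute figures below are made-up placeholders, not real AWS or SaaS quotes; only the VM count of 3 comes from the comment.

```python
HOURS_PER_MONTH = 24 * 30  # a 30-day month of round-the-clock running


def capacity_cost_pence(vm_count: int, pence_per_vm_hour: int,
                        hours: int = HOURS_PER_MONTH) -> int:
    """Self-hosted: pay for every VM hour, whether or not builds are running."""
    return vm_count * pence_per_vm_hour * hours


def usage_cost_pence(build_minutes: int, pence_per_build_minute: int) -> int:
    """SaaS: pay only for the build minutes actually consumed."""
    return build_minutes * pence_per_build_minute


def break_even_minutes(vm_count: int, pence_per_vm_hour: int,
                       pence_per_build_minute: int) -> float:
    """Monthly build minutes at which the two pricing models cost the same."""
    return capacity_cost_pence(vm_count, pence_per_vm_hour) / pence_per_build_minute


# Placeholder prices for illustration only.
VM_COUNT = 3
PENCE_PER_VM_HOUR = 5
PENCE_PER_BUILD_MINUTE = 1

fixed = capacity_cost_pence(VM_COUNT, PENCE_PER_VM_HOUR)
quiet_month = usage_cost_pence(2_000, PENCE_PER_BUILD_MINUTE)
busy_month = usage_cost_pence(20_000, PENCE_PER_BUILD_MINUTE)

print("capacity pricing:", fixed)
print("usage pricing, quiet month:", quiet_month)
print("usage pricing, busy month:", busy_month)
print("break-even minutes:", break_even_minutes(VM_COUNT, PENCE_PER_VM_HOUR, PENCE_PER_BUILD_MINUTE))
```

The capacity figure is flat regardless of activity, while the usage figure follows developer activity, which is why the two models suit CD and CI differently. The sketch ignores the operational cost of autoscaling and maintenance.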
The approach of running CI using SaaS products and CD using something self-hosted is sensible. I am also fine with PR CD happening with an appropriately scoped SaaS tool. I suggest that as part of this we write patterns that non-RE team members can follow easily, to reduce the proliferation of tools. I would lean towards using Circle as the SaaS tool and Concourse for the CD. Jenkins use is reducing and it looks like there is only going to be one programme left using it. That programme has much higher value things to do than migrate CI / CD. |
Some thoughts
GSP's trust model
GSP's trust model has evolved over time; but we have got to a point where we trust GitHub because a) they are not a major part of our threat model and b) managing our own list of GPG keys has been painful. (I won't go into more detail here because the detail is off topic)
Terminology: PR builds vs CI vs CD
I think there are distinct but related things here:
In particular, I don't see PR builds as "CI": I see CI as the first step of the CD pipeline. It might be a bit prescriptivist of me, but the word "integration" in "continuous integration" means "merging your branch to master", so if it isn't on master, it's not CI.
Although PR builds and the CI step might strongly overlap in what they do, they differ in why: for a PR build, it's "is this code okay to merge?" whereas in the CI step it's "is this code okay to push to production?". CI might have outputs such as built code artefacts which the rest of the pipeline promotes to successive environments; these outputs are not relevant to PR builds.
I don't mind that other people might think differently; I mainly call this out so that we can come to agreement on the definition of terms in the context of this document.
I also think it's important to note that while you might be happy running PR builds on CircleCI, you might prefer to run your CI step on Concourse, for example. You also might be happier with a lower level of assurance for PR builds than for your CI step. PR builds and CI have different but overlapping needs.
The other technology bit: artefact repositories
It's (sometimes) impossible to implement a CD pipeline without an artefact repository of some sort. We have a proliferation of tools and SaaS services at GDS:
Any discussion of CD is incomplete without a nod to this. I think content trust is more concerning with artefact repositories than with source code repositories, because injecting nefarious code into a built binary is less obvious than injecting nefarious code into a source code repo.
Deployment privileges and maintenance burden
Any CD server which has privileges to deploy to a production environment is something which we should be careful about. Access to that server should be (roughly) as tightly controlled as access to the production environment itself. In the best case, each prod environment has its own CD server for deploying to it.
However, CD servers cost time and knowledge to maintain. Having lots of them is a maintenance burden, especially in terms of keeping them all up to date. So we may prefer to have fewer CD servers for maintenance simplicity.
In short, we can either:
I am sure there is a reasonable way of synthesising these approaches usefully. We have something approaching this in the multi-tenant Concourse, where we have separate workers for each prod environment; but I have ideas about how we could tighten this further.
My view
My view is that we should have something like this:
|
If you're running more/different tests against code on master than you are against a PR, you're not failing fast.
From my experience, drawing a big distinction between PR builds and CI means you'll probably end up with two places where you need to maintain the same test setup in slightly different ways. If your tests have any external dependencies that becomes particularly annoying and you soon end up with things like
GitHub-integrated CI tools such as Travis can be set up to run tests against master on a merge, so I think that gets rid of the first step of the CD pipeline. At that point your CD system picks up the tested code, builds artifacts and continues from there.
Also, maybe we want to stop using giant headings in comments, because they draw attention away from comments which don't have them. |
Both Phil and Anshul have raised good points about PR builds & CI/CD. My preference would be for the same checks to be run at PR time and after merge to master, so that:
There are additional comments in the document about:
|
I'm not really sure where this fits into the doc, so I'll comment here.
Something we rely on heavily in GOV.UK is parameterised builds. Many of our tests require running things against another repository - for example, every PR triggers a build of the publishing-e2e-tests integration test suite after the app's own test suite passes. Such a build has many parameters in Jenkins:
Sometimes we need to manually tweak these parameters, for example if we're making changes to two different apps (which live in two different git repositories) and need to test them against each other.
So any CI solution we go for absolutely has to support human tweaking of build parameters, preferably in a more friction-free way than having to commit a list of parameters to the branch. I accept that this isn't ideal, and that it would be nice for the repository to define the exact state of the thing and not need to rely on a human filling in the right parameters, but we're a long way from that at the moment.
A little while back we looked at using Concourse for CI, and this lack was a major part of why we didn't proceed. |
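The parameterised-build requirement described above amounts to "sensible defaults, with per-build human overrides". A minimal Python sketch of that idea follows. The parameter names and the `resolve_parameters` helper are hypothetical, chosen for illustration, and are not the actual Jenkins job parameters used by GOV.UK.

```python
from typing import Dict, Optional

# Hypothetical defaults: by default, test every repository at master.
DEFAULT_PARAMETERS: Dict[str, str] = {
    "app_commitish": "master",
    "e2e_tests_commitish": "master",
}


def resolve_parameters(defaults: Dict[str, str],
                       overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge manual overrides onto the defaults, rejecting unknown parameter names."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise ValueError(f"unknown build parameters: {', '.join(unknown)}")
    # Later keys win, so a human-supplied value replaces the default.
    return {**defaults, **overrides}


# An automatically triggered build uses the defaults unchanged.
automatic = resolve_parameters(DEFAULT_PARAMETERS)

# A manually triggered build tests the app against a branch of another repository.
manual = resolve_parameters(
    DEFAULT_PARAMETERS, {"e2e_tests_commitish": "my-feature-branch"}
)

print(automatic)
print(manual)
```

The override is supplied when the build is triggered rather than committed to the branch, which is the low-friction behaviour the comment asks for.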
Just to represent the GOV.UK Design System team's needs for CI as a public comment:
The GOV.UK Design System aims to do open source rather than coding in the open, so we have many contributions from people outside of GDS (alphagov). Representing the community outside of government is essential for the success of the Design System. Our workflow relies on the fact that CI runs when they open a pull request; it helps contributors fix mistakes themselves and reduces the overhead for us, as we tend to get lots of smaller contributions. We will not be able to use any CI approach that blocks external contribution. Thank you for taking all these needs into account when you figure out the next steps. We're available to chat about details of contribution if you need :) |
People have been asking on the status of this, and we're currently working on back-office things in order to proceed. We've reviewed the comments and confirmed the hypotheses we had before we started this process:
What we are doing is:
In the meantime:
Separately, we have admin work to do for evaluating how we administer version control, and we will use the comments from this PR to that end. |
Closing this as we now have GitHub Enterprise procured, which is in mild use for CI. We still want a guidance page here, and I'll reference this PR when writing it. |
⏰ The deadline for first pass of submissions is Friday 29th November ⏰
After this deadline, we will work on presenting a more concrete proposal based on the user needs externalised in this document...
What
We are writing up https://docs.google.com/document/d/1PZXZwD9yrP-toI1gpaSi9NWpwl0zAVHUoJgb_65zKko/edit?ts=5dc99a4a#heading=h.2s8g0l8netcs into some documentation about how GDS should do continuous integration.
We think that a GitHub thread of comments that we can write up will be more collaborative across GDS than a Google doc.
In the following comments I will be posting some snippets from the Google doc.
How to contribute
Please comment your thoughts / feedback and Harker and I will add them to the proposal once we've collated people's thoughts.
Some prompts:
What does your ideal development workflow look like? (assuming continuous deployment)
What software do your tests require to run?
Is your project open-source, does it have open source contributors, how do they expect to contribute?
How do you verify the integrity of your code? (e.g. commit signing, etc)
Status
1. Collection of comments/opinions
⏰ Deadline Wednesday 27th November (DONE)
Once this deadline expires we will work to make a concrete proposal out of people's contributions. The proposal will take the form (these are examples not the actual proposal):
Procure a SaaS tool for doing continuous integration with the following features:
Use X open-source tool and run it centrally from TechOps, specifically for Continuous Integration, with the following features:
2. Proposal
This is in progress
We will write the proposal, and people should add clarifying comments.
3. Do the work
We will take the actions from the concrete proposal, for instance