Commit changes made by code formatters
github-actions[bot] committed Jan 16, 2024
1 parent ed48ac8 commit afd89e6
Showing 6 changed files with 43 additions and 45 deletions.
@@ -18,14 +18,12 @@ We are proposing that we aim for a "single sign on" approach where users can use

The current most complete source of this information for people who will be the first users of the cloud platform is GitHub. So our proposal is to use GitHub as our initial user directory - authentication for the new services that we are building will be through GitHub.


## Decision

We will use GitHub as the identity provider for the cloud platform.

We will design and build the new cloud platform with the assumption that users will log in to all components using a single GitHub ID.


## Consequences

We will define users and groups in GitHub and use GitHub's integration tools to provide access to other tools that require authentication.
@@ -22,10 +22,10 @@ After consideration of the pros and cons of each approach we went with one clust

Some important reasons behind this move were:

* A single k8s cluster can be made powerful enough to run all of our workloads
* Managing a single cluster keeps our operational overhead and costs to a minimum.
* Namespaces and RBAC keep different workloads isolated from each other.
* It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments
- A single k8s cluster can be made powerful enough to run all of our workloads
- Managing a single cluster keeps our operational overhead and costs to a minimum.
- Namespaces and RBAC keep different workloads isolated from each other.
- It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments

To clarify the last point: to be useful, a development cluster must be as similar as possible to the production cluster. However, given multiple clusters, with different security and other constraints, some 'drift' is inevitable - e.g. the development cluster might be upgraded to a newer Kubernetes version before staging and production, or it could have different connectivity into private networks, or different performance constraints from the production cluster.

@@ -39,6 +39,6 @@ If namespace segregation is not sufficient for this, then the whole cloud platfo

Having a single cluster to maintain works well for us.

* Service teams know that their development environments accurately reflect the production environments they will eventually create
* There is no duplication of effort, maintaining multiple, slightly different clusters
* All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies)
- Service teams know that their development environments accurately reflect the production environments they will eventually create
- There is no duplication of effort, maintaining multiple, slightly different clusters
- All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies)
36 changes: 18 additions & 18 deletions architecture-decision-record/021-Multi-cluster.md
@@ -8,27 +8,27 @@ Date: 2021-05-11

## What’s proposed

We host user apps across *more than one* Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster *may* be further isolated by placing them in separate VPCs or separate AWS accounts.
We host user apps across _more than one_ Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster _may_ be further isolated by placing them in separate VPCs or separate AWS accounts.

## Context

Service teams' apps currently run on [one Kubernetes cluster](012-One-cluster-for-dev-staging-prod.html). That includes their dev/staging/prod environments - they are not split off. The key reasoning was:

* Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
* Maintaining clusters for each environment is a cost in effort
* You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.
- Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
- Maintaining clusters for each environment is a cost in effort
- You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.

(We also have clusters for other purposes: a 'management' cluster for Cloud Platform team's CI/CD and ephemeral 'test' clusters for the Cloud Platform team to test changes to the cluster.)

However we have seen some problems with using one cluster, and advantages to moving to multi-cluster:

* Scaling limits
* Single point of failure
* Derisk upgrading of k8s
* Reduce blast radius for security
* Reduce blast radius of accidental deletion
* Pre-prod cluster
* Cattle not pets
- Scaling limits
- Single point of failure
- Derisk upgrading of k8s
- Reduce blast radius for security
- Reduce blast radius of accidental deletion
- Pre-prod cluster
- Cattle not pets

### Scaling limits

@@ -40,11 +40,11 @@ Running everything on a single cluster is a 'single point of failure', which is

Several elements in the cluster are a single point of failure:

* ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls))
* external-dns
* cert manager
* kiam
* OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58))
- ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls))
- external-dns
- cert manager
- kiam
- OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58))

### Derisk upgrading of k8s

@@ -76,8 +76,8 @@ Multi-cluster will allow us to put pre-prod environments on a separate cluster t

If we were to create a fresh cluster, and an app is moved onto it, then there are a lot of impacts:

* **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl.
* **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that are accessing our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third-party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated.
- **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl.
- **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that are accessing our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third-party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated.

## Steps to achieve it

22 changes: 11 additions & 11 deletions architecture-decision-record/023-Logging.md
@@ -12,10 +12,10 @@ Cloud Platform's existing strategy for logs has been to **centralize** them in a

Concerns with existing ElasticSearch logging:

* ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes)
* CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs`
* Service teams have access to other teams' logs, which is a concern should personal information be inadvertently logged
* Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES
- ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes)
- CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs`
- Service teams have access to other teams' logs, which is a concern should personal information be inadvertently logged
- Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES
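
The last concern above, pulling structured data out of a single log field, might look like the following in practice. This is a minimal Python sketch, not the team's tooling: it assumes each shipped record is a JSON object whose `log` field carries the application's own JSON line as a string. The field name `log` and the helper `extract_app_fields` are hypothetical.

```python
import json


def extract_app_fields(raw_record: str) -> dict:
    """Decode a shipped log record and, if possible, the JSON nested in its `log` field.

    Assumption: the record is a JSON object and the application's output sits
    in a string field called `log` (a hypothetical name for illustration).
    """
    record = json.loads(raw_record)
    inner = record.get("log", "")
    try:
        parsed = json.loads(inner)
    except (TypeError, ValueError):
        # The application logged plain text, so keep it as a message.
        return {"message": inner}
    # Only treat the nested value as structured if it is a JSON object.
    return parsed if isinstance(parsed, dict) else {"message": inner}


# Example record: the app logged JSON, but it arrives as one opaque string.
sample = json.dumps(
    {"log": json.dumps({"level": "error", "msg": "db timeout"}), "stream": "stdout"}
)
print(extract_app_fields(sample))
```

Every team that wants structured fields ends up writing a step like this outside the log store, which is the overhead the concern describes.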

With these concerns in mind, and the [migration to EKS](022-EKS.html) meaning we'd need to reimplement log shipping, we reevaluate this strategy.

@@ -37,11 +37,11 @@ Rather than centralized logging in ES, we'll evaluate different logging solution

**AWS services for logging** - with the cluster now in EKS, it wouldn't be too much of a leap to centralize logs in CloudWatch and make use of the AWS managed tools. On one hand it's proprietary to AWS, so it adds a cost of switching away. But it might be preferable to the cost of running ES, and related tools like GuardDuty and Security Hub, with use across Modernization Platform, are attractive.

### Observing apps**
### Observing apps\*\*

* Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it *requires* heavy use of memory.) Could set up an instance per team. Need to evaluate how we'd integrate it, and usability.
* CloudWatch Logs - possible and low operational overhead - needs further evaluation.
* Sentry - Some teams have been using Sentry for logs, but [Sentry itself says it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging.
- Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it _requires_ heavy use of memory.) Could set up an instance per team. Need to evaluate how we'd integrate it, and usability.
- CloudWatch Logs - possible and low operational overhead - needs further evaluation.
- Sentry - Some teams have been using Sentry for logs, but [Sentry itself says it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging.

### Observing the platform

@@ -53,9 +53,9 @@ TBD

### Security

* MLAP was designed for this, but it is stalled, so probably best to manage it ourselves.
* ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period.
* AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous.
- MLAP was designed for this, but it is stalled, so probably best to manage it ourselves.
- ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period.
- AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous.

## Next steps

8 changes: 4 additions & 4 deletions architecture-decision-record/034-EKS-Fargate.md
@@ -14,17 +14,17 @@ Move from EKS managed nodes to EKS Fargate.

This is really attractive because:

* it would reduce our operational overhead
* it would improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container).
- it would reduce our operational overhead
- it would improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container).

However, there are plenty of things we’d need to tackle to achieve this (copied from [ADR022 EKS - Fargate considerations](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/022-EKS.md#future-fargate-considerations)):

**Pod limits** - there is a quota limit of [500 Fargate pods per region per AWS Account](https://aws.amazon.com/about-aws/whats-new/2020/09/aws-fargate-increases-default-resource-count-service-quotas/) which could be an issue, considering we currently run ~2000 pods. We can request AWS raise the limit - not currently sure what scope there is. With Multi-cluster stage 5, the separation of loads into different AWS accounts will settle this issue.
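
As a rough check on the figures above, the sketch below works out how many AWS accounts (or regions) would be needed to stay within the default quota. It is illustrative Python only: the function name and the `headroom` parameter are assumptions, and the numbers are the approximate ones quoted in this ADR.

```python
import math

DEFAULT_FARGATE_POD_QUOTA = 500  # per region per AWS account, as quoted above
CURRENT_POD_COUNT = 2000         # approximate current pod count, as quoted above


def accounts_needed(pods: int, quota: int, headroom: float = 0.0) -> int:
    """Return how many quota-sized buckets are needed for `pods`.

    `headroom` is a hypothetical safety margin, e.g. 0.25 reserves 25% extra
    capacity for growth and rolling deployments.
    """
    return math.ceil(pods * (1 + headroom) / quota)


print(accounts_needed(CURRENT_POD_COUNT, DEFAULT_FARGATE_POD_QUOTA))        # 4
print(accounts_needed(CURRENT_POD_COUNT, DEFAULT_FARGATE_POD_QUOTA, 0.25))  # 5
```

At the default quota the current load would need at least four accounts or regions, which is why either a quota increase or the account separation mentioned above matters.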

**Daemonset functionality** - needs replacement:

* fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch.
* prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network
- fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch.
- prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network

**No EBS support** - Prometheus will still run in a managed node group. There are likely other workloads to consider too.

6 changes: 3 additions & 3 deletions runbooks/source/add-concourse-to-cluster.html.md.erb
@@ -77,12 +77,12 @@ Follow the URL this command outputs, choose to login with Username/Password, and

- Apply your pipeline

Please do not deploy the bootstrap pipeline in the [Concourse repository](https://github.com/ministryofjustice/cloud-platform-terraform-concourse/tree/main/pipelines/manager/main) into your test cluster. It is
for production level deployment and may trigger false alarms to our Slack Channel.

To ensure an isolated testing environment, please create a new folder on your local machine and start with a simple pipeline. You may use [this link](https://concourse-ci.org/tutorial-hello-world.html) as reference
to deploy the first pipeline into your test cluster and not the one under `manager/main`.

```
fly --target david-test1 set-pipeline \
--pipeline plan-pipeline \
```
