[refresh-credential] rotate the AWS ECR credential by schedule #15313

Merged
merged 14 commits into from
Dec 15, 2022

Conversation

Contributor

@jenting jenting commented Dec 12, 2022

Description

Add a new component, refresh-credential, that rotates the AWS ECR authorization token on a schedule.

The refresh-credential component is installed only when containerRegistry.external.url is an AWS ECR URL.
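
For illustration, the ECR URL check could look roughly like the sketch below. The commit notes in this PR mention an IsAWSECRURL check, but the function name, the regular expression, and the accepted URL shape here are assumptions, not the code in this PR. It only covers private registries in the standard partition (no amazonaws.com.cn hosts).

```go
package main

import (
	"fmt"
	"regexp"
)

// ecrHostPattern is an assumed pattern for private ECR registry hosts:
// <12-digit account ID>.dkr.ecr.<region>.amazonaws.com, optionally followed by a path.
var ecrHostPattern = regexp.MustCompile(`^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com(/.*)?$`)

// isAWSECRURL reports whether url looks like a private AWS ECR registry URL.
// Hypothetical helper for illustration only.
func isAWSECRURL(url string) bool {
	return ecrHostPattern.MatchString(url)
}

func main() {
	fmt.Println(isAWSECRURL("012345678969.dkr.ecr.us-west-1.amazonaws.com")) // true
	fmt.Println(isAWSECRURL("gcr.io/some-project"))                          // false
}
```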

Remember to configure the following:

  • Create a Kubernetes secret with the AWS access/secret key pair.
    aws configure
    
    # default path is ~/.aws/credentials
    kubectl create secret generic aws-iam-credential -n default --from-file=</path/to/credentialsfile>
  • Set containerRegistry.inCluster = false to use an external container registry.
  • Set containerRegistry.external.certificate to the secret that the AWS ECR authorization token is written to.
  • Set containerRegistry.external.credentials to the secret that holds the AWS access/secret key pair.
    Example yaml manifest:
    containerRegistry:
      inCluster: false
      external:
        url: 012345678969.dkr.ecr.us-west-1.amazonaws.com
        certificate:
          kind: secret
          name: aws-ecr-credential
        credentials:
          kind: secret
          name: aws-iam-credential

The refresh-credential component runs as a Kubernetes CronJob. The AWS ECR authorization token expires after 12 hours, and the CronJob runs every 6 hours, so whoever is on call has time to mitigate a failed rotation before the current token expires.

We can add an alert rule that fires when the job hits the backoffLimit. The trigger criterion is kube_job_status_failed{job_name=~"refresh-credential.*",reason="BackoffLimitExceeded"} > 0.
See kube_job_status_failed for reference.

https://www.loom.com/share/6d4b9dedc2df4a32bea87c12e3533230

Note
We haven't yet migrated/synced all the Gitpod component container images from GCP GCR to AWS ECR. Therefore, the secret with .dockerconfigjson must contain both the GCP GCR and the AWS ECR authorization tokens: image-builder pushes to AWS ECR, while blobserve and registry-facade pull from GCP GCR.

Note
Because we set the Job restartPolicy = OnFailure, the pod is terminated once the backoff limit is reached, so we need to rely on the logging system to troubleshoot.

Warning
Please be aware of the service quotas.

Related Issue(s)

Fixes #12104

How to test

Try https://g704b4152c75da6e18a1605.workspace-preview.gitpod-io-dev.com/workspaces

  1. Trigger an image build
  2. Check the base-image and workspace-image on the AWS ECR (us-west-1)

Release Notes

Support AWS ECR container registry

Documentation

None

Werft options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh

According to Kubernetes doc, a container using a Secret as a
subPath volume mount will not receive Secret updates.

Signed-off-by: JenTing Hsiao <[email protected]>
@werft-gitpod-dev-com

started the job as gitpod-build-jenting-aws-ecr-14914.68 because the annotations in the pull request description changed
(with .werft/ from main)

@jenting jenting marked this pull request as ready for review December 13, 2022 04:19
@jenting jenting requested review from a team December 13, 2022 04:19
@github-actions github-actions bot added team: SID team: IDE team: workspace Issue belongs to the Workspace team labels Dec 13, 2022
@jenting jenting marked this pull request as draft December 13, 2022 04:23
@jenting jenting force-pushed the jenting/aws-ecr-14914 branch 4 times, most recently from c962856 to 37a3115 Compare December 13, 2022 08:12
@jenting jenting force-pushed the jenting/aws-ecr-14914 branch from ee58365 to fe1542d Compare December 14, 2022 11:37
Contributor

@mrsimonemms mrsimonemms left a comment

Looks broadly ok, but I've put a couple of questions/comments in there. Around on Slack if you want to chat about them.

Contributor

@Pothulapati Pothulapati left a comment

Small nits, but looks good to me!

/hold just in case. Feel free to remove

@@ -81,6 +81,7 @@ func deployment(ctx *common.RenderContext) ([]runtime.Object, error) {
VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
SecretName: secretName,
Items: []corev1.KeyToPath{{Key: ".dockerconfigjson", Path: "pull-secret.json"}},
Contributor

nit: Is pull-secret the right name, given that we would also use it for pushes here? Could be tackled with https://github.com/gitpod-io/security/issues/89

Contributor Author

@jenting jenting Dec 14, 2022

The name is pull-secret.json now, ref code.
I updated https://github.com/gitpod-io/security/issues/89 to track it.

- Provide AWS doc/code link
- Check IsAWSECRURL in Object once
- Check credential AWS access/secret key pair exists

Signed-off-by: JenTing Hsiao <[email protected]>
@jenting jenting force-pushed the jenting/aws-ecr-14914 branch from 10bc7a3 to ae4f39d Compare December 14, 2022 13:52
@jenting jenting requested a review from mrsimonemms December 14, 2022 14:15
Contributor Author

jenting commented Dec 14, 2022

@mrsimonemms PTAL 🙏

Contributor

@mrsimonemms mrsimonemms left a comment

Looks good, just one additional question

@mrsimonemms mrsimonemms self-requested a review December 14, 2022 14:34
Contributor

@kylos101 kylos101 left a comment

Hi @jenting , wow, solid work!

Can you share an updated Loom video showing the test?

Also, I think we'll need follow-on PRs:

  1. We should be using IRSA, instead of persisting client_id and client_secret as a Kubernetes secret
  2. What do you think of renaming registry-credential to refresh-ecr-credential? Naming things is hard. 😉
  3. What is the plan for creating an alert for when this job fails? If it attempts 10 times and still fails, how much time would that leave for us to react, before the cluster is unable to interact with the registry?

Comment on lines 64 to 65
accessKey := string(credSecret.Data[accessKeyIdName])
secretKey := string(credSecret.Data[secretAccessKeyName])
Contributor

@jenting we should not be persisting an access and secret key to Kubernetes to interface with AWS. Instead, we should be leveraging an IAM role for a service account, aka IRSA.

Contributor

In other words, we should avoid persisting long lived secrets to workspace clusters, and negotiate a temporary AWS role credential instead, to do the GetAuthorizationToken work.

After validating the token's signature, IAM exchanges the Kubernetes issued token for a temporary AWS role credential.

Contributor

I think you'll need to use awsconfig.WithAssumeRoleCredentialOptions as an alternative to awsconfig.WithCredentialsProvider

Contributor

You can see an example from @mrzarquon here via ecr-helper service account.

Contributor Author

@kylos101 I saw that for S3 object storage, we use the shared config.

@Furisto Could you tell me how you prepared the credentials file you used when developing the S3 support?

Contributor Author

@Furisto Looks like you were using the shared config for development.

Contributor Author

I updated the doc; the credentials file is now loaded the same way as for S3.

Member

@Furisto Furisto Dec 15, 2022

@jenting I created a Kubernetes secret that contains the AWS shared config file (which gives access only to a specific bucket) and then mounted the file into the container. Then I reference that file when creating the client. The file contains the credentials for a user that has no permissions except for the ones assigned by a bucket policy.
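
As an illustration of the "check that the access/secret key pair exists" step against a mounted credentials file, here is a small sketch. The function name and the minimal INI handling are assumptions, not this PR's code; aws_access_key_id and aws_secret_access_key are the standard key names in the AWS shared credentials file.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasKeyPair reports whether the given profile in an AWS shared credentials
// file defines non-empty aws_access_key_id and aws_secret_access_key values.
// Hypothetical helper; it understands only sections, comments, and key=value lines.
func hasKeyPair(content, profile string) bool {
	section := ""
	found := map[string]bool{}
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue // skip blank lines and comments
		}
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			section = strings.TrimSpace(line[1 : len(line)-1])
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if ok && section == profile && strings.TrimSpace(value) != "" {
			found[strings.TrimSpace(key)] = true
		}
	}
	return found["aws_access_key_id"] && found["aws_secret_access_key"]
}

func main() {
	// Placeholder values, not real credentials.
	sample := "[default]\naws_access_key_id = EXAMPLEKEYID\naws_secret_access_key = EXAMPLESECRET\n"
	fmt.Println(hasKeyPair(sample, "default"))
}
```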

Certificate ObjectRef `json:"certificate" validate:"required"`
URL string `json:"url" validate:"required"`
Certificate ObjectRef `json:"certificate" validate:"required"`
Credential *ObjectRef `json:"credential,omitempty"`
Contributor

I am skeptical we need this Credential field, please see my IRSA feedback for example.


// CredentialSecret points to a Kubernetes secret which contains the credential to rotate
// the container registry credential .
CredentialSecret string `json:"credentialSecret"`
Contributor

CredentialSecret should not be needed, I think, if we rely on IRSA.

log.Infof("Secret %s/%s updated with new ECR credentials", cfg.SecretToUpdate, cfg.Namespace)
}

func newAWSConfig(region, accessKeyId, secretAccessKey, session string) (aws.Config, error) {
Contributor

This will need to be changed, for example, to configure AWS given the service account associated with this new component.

@@ -101,7 +101,7 @@ func configmap(ctx *common.RenderContext) ([]runtime.Object, error) {
MaxSize: MaxSizeBytes,
},
},
-AuthCfg: "/mnt/pull-secret.json",
+AuthCfg: "/mnt/pull-secret/pull-secret.json",
Contributor

I forget, but, nesting the secret within a sub-directory was necessary, so that we can watch it for changes, right?

Contributor Author

You're right 💯

Comment on lines 21 to 24
credentialSecretName, err := credentialSecretName(ctx)
if err != nil {
return nil, err
}
Contributor

Should be able to remove, given IRSA feedback

Contributor Author

jenting commented Dec 15, 2022

@kylos101

  1. What do you think of renaming registry-credential to refresh-ecr-credential? Naming things is hard. 😉

We could rename it to refresh-credential, because I don't want this component to be specific to ECR.

  1. What is the plan for creating an alert for when this job fails? If it attempts 10 times and still fails, how much time would that leave for us to react, before the cluster is unable to interact with the registry?

According to the Kubernetes doc, 10 retries mean the alert triggers after 10s+20s+40s+80s+160s+320s+360s+360s+360s+360s = 34.5 minutes.
Given that the CronJob runs every 6 hours and the authorization token is valid for 12 hours, we have at least 5 hours to react.

@jenting jenting force-pushed the jenting/aws-ecr-14914 branch from 3998706 to 796b970 Compare December 15, 2022 03:56
Contributor Author

jenting commented Dec 15, 2022

  1. We should be using IRSA, instead of persisting client_id and client_secret as a Kubernetes secret

I talked with @kylos101, and Kyle said since S3 requires the same secret, don't worry about my feedback for IRSA.
Therefore, I will change this PR following how S3 loads the credential files.

Correct me if I am wrong, but if I remember correctly, IRSA (AWS STS assume-role) works when the Kubernetes cluster runs on AWS EC2, because the STS assume-role relies on the AWS EC2 information to grant the temporary credential.
We could change how we get an AWS credential file in a follow-up PR when required.

@jenting jenting changed the title [registry-credential] rotate the AWS ECR credential by schedule [refresh-credential] rotate the AWS ECR credential by schedule Dec 15, 2022
@jenting jenting force-pushed the jenting/aws-ecr-14914 branch 2 times, most recently from d4f8cbe to 9d61e17 Compare December 15, 2022 05:15
@jenting jenting force-pushed the jenting/aws-ecr-14914 branch 2 times, most recently from 2697c6a to c09d2f9 Compare December 15, 2022 06:46
When using concurrencyPolicy=Replace, if the job has failed but hasn't reached
the backoff limit, the new job replaces the original one when the schedule
interval is shorter than the sum of the backoff times.

As a result, the job alert
`kube_job_status_failed{job_name=~"refresh-credential.*",reason="BackoffLimitExceeded"}` can't fire.

Signed-off-by: JenTing Hsiao <[email protected]>
Contributor Author

jenting commented Dec 15, 2022

/unhold

All the comments are addressed. Thanks to all the reviewers for their comments.

We could have follow up PRs

  • if IRSA is required
  • if we want to separate the AWS IAM policy for the image-builder and registry-facade/blobserve.

@roboquat roboquat merged commit e7233ec into main Dec 15, 2022
@roboquat roboquat deleted the jenting/aws-ecr-14914 branch December 15, 2022 08:48
@roboquat roboquat added deployed: IDE IDE change is running in production deployed: workspace Workspace team change is running in production labels Dec 15, 2022
Labels
deployed: IDE IDE change is running in production deployed: workspace Workspace team change is running in production release-note size/XXL team: IDE team: SID team: workspace Issue belongs to the Workspace team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rotate container registry secret for Amazon's Elastic Container Registry (ECR)
9 participants