Testnet Deployment via CI/CD #396
Conversation
I'm bringing in `skopeo`. This means new development environment variables need to be set. Skopeo works like this:

```sh
skopeo --insecure-policy copy \
  docker-archive:$(nix-build release.nix -A docker) \
  docker://015248367786.dkr.ecr.ap-southeast-2.amazonaws.com
```

Haven't run the above yet, and that is the AWS ECR we have. As for AWS, once the container image is uploaded, we have to trigger an update of the ECS service:

```sh
aws ecs update-service \
  --cluster polykey \
  --service polykey \
  --desired-count 1 \
  --force-new-deployment
```

I think I will also create a new cluster like `polykey-testnet`.
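Something like this could then be used to wait for the rollout to finish (a hedged sketch, not tried yet; the cluster/service names are just the ones above):

```sh
# Wait until the forced deployment has stabilised, then list the
# deployments to confirm only the new one remains PRIMARY.
aws ecs wait services-stable \
  --cluster polykey \
  --services polykey
aws ecs describe-services \
  --cluster polykey \
  --services polykey \
  --query 'services[0].deployments'
```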
With the new 22.05 revision of NixOS that we are on, we can finally use the docker build again. So this should make some testing easier.
Funny how aws/aws-cli#194 was opened last year, and it took an entire year of work to get to this point again.
The `skopeo` command needs credentials to talk to the ECR registry, and there are a few ways to supply them. The first is that by default it will use the `${XDG_RUNTIME_DIR}/containers/auth.json` file. The other way is through the command line parameter `--authfile`. So lastly there is the `REGISTRY_AUTH_FILE` environment variable. In our CI/CD we will rely on `REGISTRY_AUTH_FILE`.
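To make the three lookups concrete, a quick sketch (the registry path is ours; the default `auth.json` location is skopeo's documented behaviour):

```sh
# 1. Default: skopeo reads ${XDG_RUNTIME_DIR}/containers/auth.json
skopeo inspect docker://015248367786.dkr.ecr.ap-southeast-2.amazonaws.com/polykey

# 2. Explicit command line parameter
skopeo inspect --authfile=./tmp/auth.json \
  docker://015248367786.dkr.ecr.ap-southeast-2.amazonaws.com/polykey

# 3. Environment variable, which is what our CI/CD will use
REGISTRY_AUTH_FILE=./tmp/auth.json \
  skopeo inspect docker://015248367786.dkr.ecr.ap-southeast-2.amazonaws.com/polykey
```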
The `REGISTRY_AUTH_FILE` is a JSON file. Basically:

```json
{
  "auths": {
    "015248367786.dkr.ecr.ap-southeast-2.amazonaws.com": {
      "auth": "..."
    }
  }
}
```

To actually get the ECR login, we have to convert our AWS credentials into it:

```sh
# assume you have the `AWS_*` env variables set
aws ecr get-login-password --region ap-southeast-2 \
  | skopeo login \
      --username AWS \
      --password-stdin \
      --authfile=./tmp/auth.json \
      015248367786.dkr.ecr.ap-southeast-2.amazonaws.com
```

Notice that the username is `AWS`. So this means that in our CI/CD we have to perform this login step inside the job to produce the auth file before pushing.
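Putting it together, the login step inside a CI job might look roughly like this (a sketch; the `./tmp/auth.json` location is carried over from above, and the `AWS_*` variables are assumed to be injected by GitLab):

```sh
# Sketch of the registry login as a CI job step.
mkdir -p ./tmp
export REGISTRY_AUTH_FILE="$(pwd)/tmp/auth.json"
aws ecr get-login-password --region ap-southeast-2 \
  | skopeo login \
      --username AWS \
      --password-stdin \
      015248367786.dkr.ecr.ap-southeast-2.amazonaws.com
# Subsequent skopeo commands now pick up REGISTRY_AUTH_FILE automatically.
```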
Ok, now that we are authenticated to the ECR registry, and we also have the auth file, we can attempt the actual image copy. If this works, we will reify this command into one of our scripts. Optionally it can be accessible with `npm run`. I think right now there's a number of relevant scripts.
I've updated the setup so these commands now work:

```sh
skopeo list-tags docker://$CONTAINER_REPOSITORY
skopeo inspect docker://$CONTAINER_REPOSITORY
skopeo inspect --config docker://$CONTAINER_REPOSITORY:latest
```

The variables are defined like this:

```sh
# Container repository domain
CONTAINER_REGISTRY='015248367786.dkr.ecr.ap-southeast-2.amazonaws.com'
# Container name located on the registry
CONTAINER_REPOSITORY="$CONTAINER_REGISTRY/polykey"
```
I'm playing around with the image tags. The tag currently corresponds to the nix output hash, which helps us connect the nix store output to the uploaded container images on ECR. Right now, in order to extract the tag:

```sh
container_tag="$(skopeo list-tags docker-archive://$(nix-build release.nix -A docker) | jq -r '.Tags[0] | split(":")[1]')"
```

Which gives us the nix output hash as the tag. Which means we should also have a `latest` tag. Then we use:

```sh
# preserve the $container_tag
skopeo --insecure-policy copy \
  docker-archive:$(nix-build release.nix -A docker) \
  "docker://$CONTAINER_REPOSITORY:$container_tag"
# now set it to the latest as well
skopeo --insecure-policy copy \
  "docker://$CONTAINER_REPOSITORY:$container_tag" \
  "docker://$CONTAINER_REPOSITORY:latest"
```

Note that containers also have an image id. This image id is calculated separately, based on the internal layers and maybe the rootfs of the container image? Not sure. But it's possible to have a different nix output hash with the same image id (if the nix derivation changed, but the output image didn't). If nix used content addressing, this wouldn't be an issue...
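If I understand correctly, the image id is the SHA-256 of the image's raw config JSON, so it should be inspectable like this (a hedged sketch, not verified against ECR):

```sh
# The image id should equal the sha256 of the raw config blob.
skopeo inspect --raw --config "docker://$CONTAINER_REPOSITORY:latest" | sha256sum
```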
The pushing to the ECR is now in `scripts/deploy-image.sh`. These variables are now embedded in the CI/CD settings: the AWS creds and the container registry variables. Also added `awscli`, `jq`, and `skopeo` as tools in `nix-shell`.
Due to the new environment variables necessary, I've created a new user on AWS: `matrix-ai-polykey`.
After some reading, we have some solutions for the security of the capabilities passed down via environment variables (and dealing with software supply chain security).

The first idea is that gitlab supports scoped environment variables. This means it's possible to limit the injection of an environment variable to a specific environment scope. This is done through two factors: the variable declares an environment scope, and the job declares which environment it deploys to. So basically we can have variables that are only injected into deployment jobs.

Additionally, the environment variable scope can be specified with wildcards. However, if you need a variable that works for both `staging` and `production`, you need a scope that matches both. If it is possible to define multiple variables with the same name but with different scopes, this helps alleviate the problem of variable collision (where you end up unioning the capability permissions), such as when we need to deal with the nix cache, and also deal with ECS and ECR. Although I haven't tried yet. EDIT: checked, it is in fact possible to have multiple variables with the same name as long as their scopes are different. Cannot confirm whether regex-like scope patterns work.

One additional thing regarding the usage of yaml: I find that pulumi is a better idea overall; a domain specific language is superior to just yaml.

In other news, regarding the security of chocolatey packages. Right now chocolatey is using packages provided by the chocolatey community. In particular the bill of materials includes nodejs and python, although extra packages may be needed in the future. In that sense, it's no more or less secure than npm packages and nixpkgs. All rely on the community. Officially they recommend hosting your own packages and internalizing them to avoid network access, especially given that packages are not "pinned" in chocolatey unlike nixpkgs (and part of the reason why we like nixpkgs). This we would like to do simply to improve our CI/CD performance and to avoid 429 too-many-requests rate limiting.

But from a security perspective, no matter what, you're always going to be running trusted (ideally) but unverified code. This is not ideal; all signatures can do is reify the trust chain, and that ultimately results in a chain of liability. But this is an ex post facto security technique. Regardless of the trust chain (https://www.chainguard.dev/), a vulnerability means the damage is already done. Preventing damage ahead of time requires more than just trust. And this leads to the principle of least privilege, which is enforced through one of 2 ways:

1. Isolation: start from an open world and selectively close things off.
2. Capabilities: start from nothing and explicitly grant each capability needed.
Most security attempts are done through the first technique: some form of isolation, whether by virtual machines, containerisation, network isolation, or even environment variable filtering with an explicit allowlist. The fundamental problem with technique one is that everything starts as open, and we are trying to selectively close things; this is privacy as an after-thought. This is doomed to failure, because it is fundamentally not scalable. The fundamental problem with technique two is that it makes interoperability something that requires forethought; this is privacy by default.
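As an illustrative sketch of technique one (not something from our CI), environment variable filtering can be done with an explicit allowlist via `env -i`:

```sh
# env -i starts from an empty environment and re-introduces only
# the variables we explicitly allow through to the child process.
env -i \
  PATH="$PATH" \
  HOME="$HOME" \
  AWS_DEFAULT_REGION="$AWS_DEFAULT_REGION" \
  ./scripts/deploy-image.sh
```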
Anyway, back to pinning: this is different from nixpkgs, since nixpkgs hashes are specified by the package set already, all done via our pinned nixpkgs revision. Homebrew would need something similar. Details on chocolatey usage will be documented further on our development wiki.
…r nix-build

Note that even though `/.*` was ignored, the `.env.example` is still in the filtered source. This is because `nix-gitignore` appears to prepend the additional ignores, and thus `!.env.example` in `.gitignore` overrides the `/.*`.
…TRY`, `CONTAINER_REPOSITORY`, and `REGISTRY_AUTH_FILE`
…d using environment scopes for deployment jobs
Turns out the intermediate build artifacts were hardcoded to `/var/tmp`. This can be overridden by `$TMPDIR`. Now the issue is that different platforms have different temporary locations. In Unix we use `/tmp`. We have a project-specific temporary directory and that's in `./tmp`. So we can do something like:
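```sh
# Sketch (assumed wiring; the original snippet was lost): point TMPDIR
# at the project-local ./tmp so intermediate build artifacts land there
# instead of /var/tmp.
mkdir -p "$(pwd)/tmp"
export TMPDIR="$(pwd)/tmp"
```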
Now on windows jobs, they must override the temporary directory locations too.

I believe this should work, but I'm not sure if we should be using…

So I'm also adding…
Ok, an update on the `TMPDIR` change. On the windows runners it appears to create a problem:

…

Not sure what these are for. So redefining the windows temporary files is a no-go. That's ok for now since we aren't reliant on them there.
As for the ECR login password, it is a short-lived token (valid for 12 hours). This means we have to acquire it directly in the job instead of setting it as a long-term credential. Basically we have to exchange the long-term credential of the `AWS_*` variables for the short-lived registry token.
Image deployment worked.
…mediate artifacts in `$TMPDIR` instead of hardcoded to `/var/tmp`
Creating an ECS cluster can be done via the CLI like this:

```sh
aws ecs create-cluster \
  --cluster-name 'polykey-testnet' \
  --capacity-providers 'FARGATE' \
  --default-capacity-provider-strategy 'capacityProvider=FARGATE' \
  --output json
```

If the cluster has already been created with the same parameters, it just returns the information that already exists. The only issue is if the cluster was created with different parameters: then this command actually returns an error. The only useful thing to do then is to update or recreate the cluster.

What we need to decide is to what extent we expect infrastructure to be created from the js-polykey repository. Do we want to orchestrate the entire AWS setup here, or do we already have expectations of certain things being set up? I think we can do something simple right now, and rely on these idempotent commands that basically specify desired state, except for the fact that partial changes are not possible without performing updates.
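One hedged way to avoid the error case entirely is to check for the cluster first (a sketch; `describe-clusters` and the `ACTIVE` status are standard ECS, but the overall flow is an assumption):

```sh
# Only create the cluster if it doesn't already exist in an ACTIVE state.
status="$(aws ecs describe-clusters \
  --clusters 'polykey-testnet' \
  --query 'clusters[0].status' \
  --output text 2>/dev/null)"
if [ "$status" != 'ACTIVE' ]; then
  aws ecs create-cluster \
    --cluster-name 'polykey-testnet' \
    --capacity-providers 'FARGATE' \
    --default-capacity-provider-strategy 'capacityProvider=FARGATE' \
    --output json
fi
```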
Something I didn't realise before: AWS's awsvpc networking mode for Fargate containers has some automatic DNS server being used. I can't find any docs on what DNS servers AWS uses, and it's not possible to inject our own DNS servers into it. So we just have to use what they provide.
The task definition registration is trickier. While creating the cluster was idempotent, the registration of the task definition is not; it just adds a new task definition revision every time. It's sort of a waste to have loads of old task definitions around, especially if nothing actually changed. For now, as we generate new task definitions, once we perform the service update, old task definitions can be automatically garbage collected (we may keep at least the last 10 just in case things change), as sketched below.

It also turns out we don't need the full ARN for the execution role ahead of time; we can resolve it with:

```sh
aws --profile=matrix iam get-role --role-name 'ecsTaskExecutionRole' | jq -r '.Role.Arn'
```
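A sketch of the garbage collection idea (keeping the last 10 revisions; `list-task-definitions` and `deregister-task-definition` are the relevant subcommands, but the `polykey-testnet` family name here is an assumption):

```sh
# Deregister all but the 10 most recent task definition revisions
# for the (assumed) 'polykey-testnet' family.
aws ecs list-task-definitions \
  --family-prefix 'polykey-testnet' \
  --sort DESC \
  --query 'taskDefinitionArns[10:]' \
  --output text \
  | tr '\t' '\n' \
  | while read -r arn; do
      [ -n "$arn" ] && aws ecs deregister-task-definition \
        --task-definition "$arn" > /dev/null
    done
```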
Ok so I'm going to push the image here too, using the GitLab container registry.
Successful deployment! https://gitlab.com/MatrixAI/open-source/js-polykey/-/jobs/2700813855

Do note the usage of…
The deployment is all working. Now I'm updating to 22.05, and that also worked.
Once merged, it will still depend on tests passing, and only occur on the `staging` branch.
Just tested an agent start... there are some errors to be fixed with testnet too.
Last few things to do:

…
This blog post https://www.opensourcerers.org/2020/11/16/container-images-multi-architecture-manifests-ids-digests-whats-behind/ provides some interesting information on multi-architecture container images. I noticed that AWS now offers ARM architecture instances, and they are in fact cheaper than x86. Probably due to ARM CPU efficiencies.
Container registry push working https://gitlab.com/MatrixAI/open-source/js-polykey/container_registry/3225961. |
ECS deployment now looks like this:

…
Thinking about the deployment from testnet to mainnet. The way we upload images right now is to always associate them with the `latest` tag. However we should not point `latest` at an image until we are releasing to mainnet. Otherwise it's possible that mainnet will pick up a latest image that is only meant for testnet. There's a few ways to solve this...

We could create 2 ECR repositories, one for testnet and one for mainnet. Or we could use the same ECR repository, but not use the `latest` tag at all. Now lastly we can make use of tags themselves: rather than using just the nix output hash and `latest`, we can maintain moving `testnet` and `mainnet` tags that point at whatever image each network should be running. A sketch of this follows.
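Sketch of the tag-based approach (assuming we settle on `testnet`/`mainnet` as the moving tags; commands mirror the earlier copy steps):

```sh
# Integration: push the image and move the 'testnet' tag to it.
skopeo --insecure-policy copy \
  docker-archive:$(nix-build release.nix -A docker) \
  "docker://$CONTAINER_REPOSITORY:$container_tag"
skopeo --insecure-policy copy \
  "docker://$CONTAINER_REPOSITORY:$container_tag" \
  "docker://$CONTAINER_REPOSITORY:testnet"

# Release: only now promote the tested image to 'mainnet'.
skopeo --insecure-policy copy \
  "docker://$CONTAINER_REPOSITORY:testnet" \
  "docker://$CONTAINER_REPOSITORY:mainnet"
```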
Some logs regarding the strange behaviour connecting to testnet. Will be useful to you @emmacasolin
Local logs:

…

Then on the testnet node:

…
… stages of the pipeline

* `integration:deployment` - deploys to testnet
* `integration:prerelease` - deploys to GitLab container registry as `testnet`
* `release:deployment:branch` - deploys to mainnet
* `release:deployment:tag` - deploys to mainnet
* `release:distribution` - deploys to GitLab container registry as `mainnet`

mainnet deployment is still a stub
Description
This PR works on the `integration:deployment` job in order to get PK deployed on `testnet.polykey.io`. The `release:deployment` jobs will be done later, after `mainnet` is available. But first we will focus on just the testnet.

See https://about.gitlab.com/blog/2021/04/09/demystifying-ci-cd-variables/ to understand how variables are inherited on gitlab.
All of our deployment will occur with shell scripts and usage of command line tools like `aws` and `skopeo`. No usage of `terraform` yet for specifying infrastructure resources. I actually think `pulumi` is a better idea overall for making infrastructure as code.
is a better idea overall for making infrastructure as code.See: https://aws.amazon.com/blogs/aws/amazon-ec2-update-virtual-private-clouds-for-everyone/ regarding the default VPC and how it works and https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html.
See: https://www.opensourcerers.org/2020/11/16/container-images-multi-architecture-manifests-ids-digests-whats-behind/ for explanation of container image internals.
Issues Fixed
Tasks

1. [x] `AWS_*` env variables for authenticating to ECS and ECR, for controlling the ECS for running the containers
2. [x] `matrix-ai-polykey` AWS bot account that manipulates ECS, ECR and nix cache (due to lack of better POLA, and token composition)
3. [x] `CONTAINER_REGISTRY` and `CONTAINER_REPOSITORY` variables to point to ECR for hosting our container images; also `REGISTRY_AUTH_FILE` is used to authenticate skopeo to the registry
4. [x] `awscli`, `jq`, and `skopeo` as tools in `nix-shell`
5. [x] `scripts/deploy-image.sh` that uses `skopeo` to push the image up to the ECR
6. [x] `scripts/deploy-image.sh` wired into the `integration:deployment` job
7. [x] Configured the `js-polykey` gitlab project for it to use the `matrix-ai-polykey` account and `REGISTRY_AUTH_FILE` for `staging` and `production` scoped jobs
8. [x] `polykey-testnet` cluster to serve as the testnet cluster on AWS
9. [ ] Made `AWS_*` variables protected, even the `matrix-ai-nix` user account; this prevents the usage of our s3 cache on non-protected references, so any other user who submits a pull-request will need manual running of the CICD. We will figure out how to open the CICD to non-members of MatrixAI in the future after verifying software supply chain security. This will require an update to our `gitlab-runner`. This makes all non-protected branches/PRs incapable of running CI/CD jobs, because our gitlab-runner will error out when `AWS_*` variables are not present
10. [x] `scripts/deploy-service.sh` that uses `aws` to replace the container image used for the service on `polykey-testnet`
11. [x] `npm run deploy-service` into `integration:deployment` - this was done with `npm run` even though it doesn't have anything to do with NPM; it's just a script
15. [ ] Swap to using secret root keys based on aws/aws-cli#285 - doing this after we have fixed several bugs on the testnet (aws/aws-cli#403, aws/aws-cli#398, aws/aws-cli#399) and infrastructure issues
16. [x] Updated `utils.nix` to avoid bringing in unnecessary files into the nix src for `nix-build`
Final checklist