add prow build clusters #830
resource "google_project" "project" {
  name       = var.project_name
  project_id = var.project_name
  org_id     = "758905017065" // kubernetes.io
Could be a string variable with a default value.
I started with that, but moved to hardcoded values because I don't think we want to give people a choice to use other orgs/billing (in much the same way our ensure_project bash sets these for everything)
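The two options weighed in this thread can be sketched side by side. This is only an illustration, not the module's actual code: the variable name org_id is hypothetical, and only the org ID value and the google_project arguments come from the diff.

```hcl
// Option A (reviewer's suggestion): expose the org as a string variable
// with a default. The variable name "org_id" is hypothetical.
variable "org_id" {
  type    = string
  default = "758905017065" // kubernetes.io
}

variable "project_name" {
  type = string
}

resource "google_project" "project" {
  name       = var.project_name
  project_id = var.project_name

  // Option B (what the PR does) would replace var.org_id with the
  // literal "758905017065", so callers cannot pick another org.
  org_id = var.org_id
}
```

With Option A a caller could still override the default, which is the choice the author wanted to rule out.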
/hold
    enabled = true
  }
}
resource "google_container_cluster" "test_cluster" {
I'm not sure I understand this resource. Why not define a module test-k8s-infra-gke-cluster?
I felt like copy-pasting resource definitions between modules was more likely to fall out-of-sync than copy-pasting within the same module. Copy-paste is the only approach I can use for any resource whose lifecycle depends on a flag/environment, since terraform doesn't allow these values to be derived from variables.
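The constraint behind this copy-paste is that Terraform requires literal values inside a lifecycle block, so prevent_destroy cannot be computed from a variable. A minimal sketch of the resulting pattern follows. The cluster name, location, and node count are placeholders rather than the module's real settings, and the count expression on test_cluster is an assumption that mirrors the one shown for prod_cluster in the diff.

```hcl
variable "is_prod_cluster" {
  type    = string
  default = "false"
}

// Created only when is_prod_cluster == "true"; protected from destroy.
resource "google_container_cluster" "prod_cluster" {
  count = var.is_prod_cluster == "true" ? 1 : 0

  name               = "example-cluster" // placeholder
  location           = "us-central1"     // placeholder
  initial_node_count = 1                 // placeholder

  lifecycle {
    prevent_destroy = true // must be a literal, not var.something
  }
}

// Created only when is_prod_cluster != "true"; identical apart from
// the count expression and the lifecycle setting.
resource "google_container_cluster" "test_cluster" {
  count = var.is_prod_cluster == "true" ? 0 : 1

  name               = "example-cluster" // placeholder
  location           = "us-central1"     // placeholder
  initial_node_count = 1                 // placeholder

  lifecycle {
    prevent_destroy = false
  }
}
```

Keeping both copies in one module means any drift between them shows up in a single file.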
infra/gcp/clusters/k8s-infra-prow-build-trusted/prow-build-trusted/00-provider.tf
It turns out that to support kind-ipv6 jobs, the build cluster needs to run an Ubuntu image, since COS doesn't provide an IPv6 stack. Using
Running # gcloud container operations describe operation-1588648687093-274e5e0a --region=us-central1
detail: 'Done with 3 out of 6 nodes (50.0%): 1 being processed, 3 succeeded'
name: operation-1588648687093-274e5e0a
operationType: UPGRADE_NODES
progress:
  metrics:
  - intValue: '6'
    name: NODES_TOTAL
  - intValue: '3'
    name: NODES_DONE
  - intValue: '0'
    name: NODES_DRAINING
  - intValue: '0'
    name: NODES_UPGRADING
  - intValue: '1'
    name: NODES_CREATING
  - intValue: '0'
    name: NODES_REGISTERING
  - intValue: '3'
    name: NODES_COMPLETE
selfLink: https://container.googleapis.com/v1/projects/773781448124/locations/us-central1/operations/operation-1588648687093-274e5e0a
startTime: '2020-05-05T03:18:07.093064626Z'
status: RUNNING
targetLink: https://container.googleapis.com/v1/projects/773781448124/locations/us-central1/clusters/prow-build/nodePools/pool1-20200430220922185300000001
zone: us-central1
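The switch to an Ubuntu node image mentioned above might be expressed in Terraform roughly as follows. This is a sketch under assumptions, not the PR's nodepool module: the resource label, the use of name_prefix (guessed from the timestamp-suffixed pool name in the targetLink), and the node count are illustrative. Only the project, cluster, region, and the Ubuntu requirement come from the conversation.

```hcl
resource "google_container_node_pool" "pool1" {
  project  = "k8s-infra-prow-build"
  cluster  = "prow-build"
  location = "us-central1"

  // Assumption: a prefix lets Terraform generate a unique pool name.
  name_prefix = "pool1-"

  // Placeholder size; the real pool's sizing is not shown in the thread.
  initial_node_count = 1

  node_config {
    // COS lacks the IPv6 stack needed by kind-ipv6 jobs, so use Ubuntu.
    image_type = "UBUNTU"
  }
}
```

Changing image_type on an existing pool triggers a rolling node upgrade like the UPGRADE_NODES operation shown above.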
de7a266 to 0b1bc3c
We could also add support for labels and taints
created 2 manual projects and 40 boskos projects
actuated via:
- terraform destroy before removing the .tf files
- gcloud projects delete for the e2e projects
- gcloud compute addresses delete for the boskos-metrics ip
0b1bc3c to 9d9483e
I think this is a great idea, but I'd like to cap this PR off here. I'm starting to get concerned about the amount of uncommitted infra I have running live. If we agree we're fine with the terraform structured as is, I'll open a followup issue to support labels and would welcome the help.
Ping for reviews. I know it's a
overall lgtm
// to set lifecycle.prevent_destroy to false if is_prod_cluster
// keep prod_ and test_ identical except for "unique to " comments
resource "google_container_cluster" "prod_cluster" {
  count = var.is_prod_cluster == "true" ? 1 : 0
What happens if you toggle this on the same object?
I think going from not->prod with a bare terraform apply would nuke the test resources and create the prod resources.
I think going from prod->not would leave you with two copies of the resources actuated, even if terraform "forgot" about the prod resources.
I suspect you could get out of this and just have terraform start treating a resource differently if you used terraform state mv + terraform plan to verify there were no changes to actuate
ping @thockin I want to make sure we get your eyes on this
I understood most of this

locals {
  project_id   = "k8s-infra-prow-build"
  cluster_name = "prow-build" // The name of the cluster defined in this file
Are we ever going to need more than one? Should this be prow-build-1 or prow-build-aaa or something?
Historically the prow team hasn't ever had more than one build cluster per GCP project.
We have here a project named prow-build and a cluster named prow-build. I guess I am skeptical that we'll never want to use some GKE change that requires rebuilding the cluster, is all. If you're confident, I'll go with it, since it's likely you who has to deal with the mess if you are wrong :)
*/

locals {
  project_id = "k8s-infra-prow-build-trusted"
Can I pick on names? I like symmetry, so I'd expect to see
k8s-infra-prow-build + k8s-infra-prow-trusted
or
k8s-infra-prow-untrusted + k8s-infra-prow-trusted
or
k8s-infra-prow-build-untrusted + k8s-infra-prow-build-trusted
Is there a reason not to?
Well. I like all of these suggestions better than what I chose. I prefer the first since the names are shortest.
The reason not to would be that renaming the project id at this point is going to involve creating a new project/cluster/nodepool combo along with the requisite coordination with prow.k8s.io oncall. The blocker at the moment is our projects being capped at billing quota.
I can file an issue to redo the trusted cluster as k8s-infra-prow-trusted/prow-trusted and use it as an opportunity to have someone shadow, or someone else go through this while I watch.
If you think the rename is worth it I'll open a ticket, WDYT?
Fair.
infra/gcp/clusters/k8s-infra-prow-build-trusted/prow-build-trusted/main.tf
make a projects dir and move all projects into it, instead of having modules be "the one dir that isn't a project"
update README to follow suit
update READMEs accordingly
this has been broken out into a separate issue
Update PTAL
ping for review
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: spiffxp, thockin
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/hold cancel
Implements: #752 (comment)
I tried to break this down into logical commits with details in each commit, so it may be easier to review that way. Alternatively, I included READMEs for the modules and clusters, which try to capture the current state and how I arrived there.
This sets up three modules in infra/gcp/clusters/modules:
- k8s-infra-gke-project - provisions a project to host a build cluster, using kubernetes-public as a reference
- k8s-infra-gke-cluster - provisions a gke cluster with node service account and usage info dumped into a bigquery dataset
- k8s-infra-gke-nodepool - provisions a gke nodepool with some defaults that are k8s-infra specific

(note: they all depend on terraform ~> 0.12.20 and google ~> 3.19.0; I have not figured out the best way to non-destructively migrate aaa to this setup)

I used those modules to create two project+cluster+nodepool combos:
- k8s-infra-prow-build/prow-build: the untrusted build cluster, and service accounts for pods and boskos-janitor
- k8s-infra-prow-build-trusted/prow-build-trusted: the trusted build cluster, and service accounts for pods and gcb-builder (for jobs that push to releng/staging projects)
- resources/ dir containing kubernetes resource .yaml files destined for the cluster (I deployed these manually from a cloud-shell instance)

I added e2e projects intended for the untrusted build cluster:
- k8s-infra-e2e-boskos-nnn: 40 projects for boskos to manage in a k8s-infra-gce-project pool
- k8s-infra-e2e-gce-project: for pinning to an e2e job for development/debugging
- k8s-infra-e2e-node-e2e-project: for pinning to a node e2e job for development/debugging

I updated some of the ensure-* scripts:
- gcb-builder service account to push to releng/staging projects
- k8s-auditor-gcp service account to run on prow-build-trusted

I hooked the clusters up to prow.k8s.io with on-call's help:
- gs://kubernetes-release-dev won't allow non-google.com accounts to be added to iam, which will prevent us from migrating some e2e's

I migrated some jobs over:

I hooked prow-build's boskos instance up to monitoring.k8s.prow.io

I removed the old kubernetes-public/prow-build-test cluster and spiffxp-* e2e projects

Followup work I've broken out into other issues:
- greenhouse (bazel build cache) in prow-build (opened Setup greenhouse in k8s-infra-prow-build cluster #842)
- ghproxy in prow-build-trusted (for tools like peribolos) (opened Setup ghproxy in k8s-infra-prow-build-trusted #843)
- aaa did) (opened Develop better ACLs for prow build clusters and e2e projects #844)

/cc @thockin @cblecker @bartsmykla
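The version constraints noted in the description (terraform ~> 0.12.20 and google ~> 3.19.0) could be pinned with a block like the one below. This is a sketch of how such pins are commonly written for Terraform 0.12, not a copy of the PR's 00-provider.tf.

```hcl
terraform {
  // "~>" lets only the rightmost version component increase:
  // any 0.12.x at or above 0.12.20, but not 0.13.
  required_version = "~> 0.12.20"

  required_providers {
    // Any 3.19.x release of the google provider, but not 3.20.
    google = "~> 3.19.0"
  }
}
```

Pinning at the patch level keeps every module on the same provider behavior, at the cost of a manual bump for each minor upgrade.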