
add prow build clusters #830

Merged: 19 commits into kubernetes:master, May 26, 2020

Conversation

spiffxp
Member

@spiffxp spiffxp commented May 1, 2020

Implements: #752 (comment)

I tried to break this down into logical commits with details in each, so it may be easier to review that way. Alternatively, the READMEs I included for the modules and clusters describe the current state and how I arrived at it.

This sets up three modules in infra/gcp/clusters/modules:

  • k8s-infra-gke-project - provisions a project to host a build cluster, using kubernetes-public as a reference
  • k8s-infra-gke-cluster - provisions a GKE cluster with a node service account, and dumps usage info into a BigQuery dataset
  • k8s-infra-gke-nodepool - provisions a GKE nodepool with some k8s-infra-specific defaults

(note: they all depend on terraform ~> 0.12.20 and google ~> 3.19.0; I have not figured out the best way to non-destructively migrate aaa to this setup)

I used those modules to create two project+cluster+nodepool combos (a minimal sketch of the wiring follows this list):

  • k8s-infra-prow-build/prow-build: the untrusted build cluster, and service accounts for pods and boskos-janitor
  • k8s-infra-prow-build-trusted/prow-build-trusted: the trusted build cluster, and service accounts for pods and gcb-builder (for jobs that push to releng/staging projects)
  • each of these has a resources/ dir containing kubernetes resource .yaml files destined for the cluster (I deployed these manually from a cloud-shell instance)
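
For reference, a minimal sketch of how one of these combos might compose the three modules; the variable names, module outputs, and machine type here are illustrative assumptions, not necessarily what this PR uses:

module "project" {
  source       = "../../modules/k8s-infra-gke-project"
  project_name = "k8s-infra-prow-build"
}

module "cluster" {
  source       = "../../modules/k8s-infra-gke-cluster"
  project_name = module.project.project_name // assumed module output
  cluster_name = "prow-build"
}

module "nodepool" {
  source       = "../../modules/k8s-infra-gke-nodepool"
  project_name = module.project.project_name
  cluster_name = module.cluster.cluster_name // assumed module output
  name         = "pool1"
  machine_type = "n1-highmem-8" // illustrative; pick per workload
}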

I added e2e projects intended for the untrusted build cluster:

  • k8s-infra-e2e-boskos-nnn: 40 projects for boskos to manage in a k8s-infra-gce-project pool
  • k8s-infra-e2e-gce-project: for pinning to an e2e job for development/debugging
  • k8s-infra-e2e-node-e2e-project: for pinning to a node e2e job for development/debugging

I updated some of the ensure-* scripts:

  • allow gcb-builder service account to push to releng/staging projects
  • allow k8s-auditor-gcp service account to run on prow-build-trusted

I hooked the clusters up to prow.k8s.io with on-call's help:

  • we discovered that gs://kubernetes-release-dev won't allow non-google.com accounts to be added to IAM, which will prevent us from migrating some e2es

I migrated some jobs over:

I hooked prow-build's boskos instance up to monitoring.prow.k8s.io

I removed the old kubernetes-public/prow-build-test cluster and spiffxp-* e2e projects

Followup work I've broken out into other issues:

/cc @thockin @cblecker @bartsmykla

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. wg/k8s-infra approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 1, 2020
resource "google_project" "project" {
name = var.project_name
project_id = var.project_name
org_id = "758905017065" // kubernetes.io
Member

Could be a string variable with a default value.

Member Author

I started with that, but moved to hardcoding because I don't think we want to give people the choice to use other orgs/billing (much the same way our ensure_project bash sets these for everything)
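
For comparison, the suggested variable-with-default approach would look roughly like this (a sketch of the alternative, not what the PR does):

variable "org_id" {
  type        = string
  description = "numeric id of the GCP org to create projects under"
  default     = "758905017065" // kubernetes.io
}

resource "google_project" "project" {
  name       = var.project_name
  project_id = var.project_name
  org_id     = var.org_id
}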

@spiffxp spiffxp changed the title from "[wip] add prow build clusters" to "add prow build clusters" May 2, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 2, 2020
@spiffxp
Member Author

spiffxp commented May 2, 2020

/hold
for comment

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 2, 2020
    enabled = true
  }
}
resource "google_container_cluster" "test_cluster" {
Member

I'm not sure I understand this resource. Why not define a test-k8s-infra-gke-cluster module?

Member Author

I felt like copy-pasting resource definitions between modules was more likely to fall out of sync than copy-pasting within the same module. Copy-paste is the only approach available for any resource whose lifecycle depends on a flag/environment, since terraform doesn't allow lifecycle values to be derived from variables.
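
Condensed, the pattern is two near-identical copies of the resource with count selecting which one exists; the exact expressions here are paraphrased from the module, not quoted:

resource "google_container_cluster" "prod_cluster" {
  count = var.is_prod_cluster == "true" ? 1 : 0
  // ... config identical to test_cluster ...
  lifecycle {
    prevent_destroy = true // unique to prod_cluster
  }
}

resource "google_container_cluster" "test_cluster" {
  count = var.is_prod_cluster == "true" ? 0 : 1
  // ... config identical to prod_cluster ...
  lifecycle {
    prevent_destroy = false // unique to test_cluster
  }
}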

@spiffxp
Member Author

spiffxp commented May 5, 2020

Turns out that to support kind-ipv6 jobs the build cluster needs to run an Ubuntu image, since COS doesn't provide an ipv6 stack. I'm using terraform apply to change the node pool configuration; curious how this will roll out.
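
The change itself is a one-line switch in the node pool's node_config; a minimal sketch, with surrounding config elided:

resource "google_container_node_pool" "node_pool" {
  // ...
  node_config {
    image_type = "UBUNTU" // was "COS"; COS provides no ipv6 stack
    // ...
  }
}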

@spiffxp
Member Author

spiffxp commented May 5, 2020

terraform apply timed out with

Error: Error waiting for updating GKE node pool: timeout while waiting for state to become 'DONE' (last state: 'RUNNING', timeout: 30m0s)

Running terraform plan again shows it wants to do the same update-in-place of the nodepool as before. Meanwhile I see the operation still ongoing via gcloud. Not going to bother re-running terraform.

# gcloud container operations describe operation-1588648687093-274e5e0a --region=us-central1
detail: 'Done with 3 out of 6 nodes (50.0%): 1 being processed, 3 succeeded'
name: operation-1588648687093-274e5e0a
operationType: UPGRADE_NODES
progress:
  metrics:
  - intValue: '6'
    name: NODES_TOTAL
  - intValue: '3'
    name: NODES_DONE
  - intValue: '0'
    name: NODES_DRAINING
  - intValue: '0'
    name: NODES_UPGRADING
  - intValue: '1'
    name: NODES_CREATING
  - intValue: '0'
    name: NODES_REGISTERING
  - intValue: '3'
    name: NODES_COMPLETE
selfLink: https://container.googleapis.com/v1/projects/773781448124/locations/us-central1/operations/operation-1588648687093-274e5e0a
startTime: '2020-05-05T03:18:07.093064626Z'
status: RUNNING
targetLink: https://container.googleapis.com/v1/projects/773781448124/locations/us-central1/clusters/prow-build/nodePools/pool1-20200430220922185300000001
zone: us-central1
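
If this recurs, one option (an untested sketch, not something this PR adds) is to raise the provider's default update timeout on the node pool resource:

resource "google_container_node_pool" "node_pool" {
  // ...
  timeouts {
    update = "60m" // default is 30m; a rolling upgrade of a 6-node pool can exceed it
  }
}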

@ameukam
Member

ameukam commented May 6, 2020

We could also add support for labels and taints in k8s-infra-gke-cluster. It could be useful for greenhouse.

@spiffxp spiffxp mentioned this pull request May 7, 2020
@spiffxp
Member Author

spiffxp commented May 7, 2020

@ameukam

We could also add support for labels and taints in k8s-infra-gke-cluster. It could be useful for greenhouse.

I think this is a great idea, but I'd like to cap this PR off here. I'm starting to get concerned about the amount of uncommitted infra I have running live.

If we agree we're fine with the terraform as structured, I'll open a followup issue to support labels, and would welcome the help.
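
For the record, a hypothetical sketch of what label/taint support in the nodepool module could look like; the variable names are invented here, not from this PR:

variable "node_labels" {
  type    = map(string)
  default = {}
}

variable "node_taints" {
  type = list(object({
    key    = string
    value  = string
    effect = string // e.g. "NO_SCHEDULE" to dedicate nodes to greenhouse
  }))
  default = []
}

resource "google_container_node_pool" "node_pool" {
  // ...
  node_config {
    labels = var.node_labels
    dynamic "taint" {
      for_each = var.node_taints
      content {
        key    = taint.value.key
        value  = taint.value.value
        effect = taint.value.effect
      }
    }
  }
}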

@spiffxp
Member Author

spiffxp commented May 7, 2020

Ping for reviews. I know it's a size/XXL; if there's something I can do to make this easier to digest, let me know.

Member

@cblecker cblecker left a comment

overall lgtm

// to set lifecycle.prevent_destroy to false if is_prod_cluster
// keep prod_ and test_ identical except for "unique to " comments
resource "google_container_cluster" "prod_cluster" {
  count = var.is_prod_cluster == "true" ? 1 : 0
Member

What happens if you toggle this on the same object?

Member Author

I think going from not->prod with a bare terraform apply would nuke the test resources and create the prod resources.

I think going from prod->not would leave you with two copies of the resources actuated, even if terraform "forgot" about the prod resources.

I suspect you could get out of this, and have terraform start treating the resource differently, by using terraform state mv and then terraform plan to verify there were no changes to actuate.
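
Concretely, that recovery would look something like this (the resource addresses are illustrative):

terraform state mv \
  'google_container_cluster.test_cluster[0]' \
  'google_container_cluster.prod_cluster[0]'
terraform plan  # should now report no changes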

@spiffxp
Member Author

spiffxp commented May 11, 2020

ping @thockin I want to make sure we get your eyes on this

Member

@thockin thockin left a comment

I understood most of this


locals {
  project_id   = "k8s-infra-prow-build"
  cluster_name = "prow-build" // The name of the cluster defined in this file
Member

Are we ever going to need more than one? Should this be prow-build-1 or prow-build-aaa or something?

Member Author

Historically the prow team hasn't ever had more than one build cluster per GCP project.

Member

We have here a project named prow-build and a cluster named prow-build. I guess I am skeptical that we'll never want to use some GKE change that requires rebuilding the cluster, is all. If you're confident, I'll go with it, since it's likely you who has to deal with the mess if you are wrong :)

*/

locals {
  project_id = "k8s-infra-prow-build-trusted"
Member

Can I pick on names? I like symmetry, so I'd expect to see

k8s-infra-prow-build + k8s-infra-prow-trusted

or

k8s-infra-prow-untrusted + k8s-infra-prow-trusted

or

k8s-infra-prow-build-untrusted + k8s-infra-prow-build-trusted

Is there a reason not to?

Member Author

Well. I like all of these suggestions better than what I chose. I prefer the first since the names are shortest.

The reason not to is that renaming the project id at this point would involve creating a new project/cluster/nodepool combo, along with the requisite coordination with prow.k8s.io oncall. The blocker at the moment is that our projects are capped by billing quota.

I can file an issue to redo the trusted cluster as k8s-infra-prow-trusted/prow-trusted and use it as an opportunity to have someone shadow, or someone else go through this while I watch.

Member Author

If you think the rename is worth it I'll open a ticket, WDYT?

Member

Fair.

Commits pushed:

  • make a projects dir and move all projects into it, instead of having modules be "the one dir that isn't a project"
  • update README to follow suit
  • this has been broken out into a separate issue
@spiffxp
Member Author

spiffxp commented May 16, 2020

Update PTAL

@spiffxp
Member Author

spiffxp commented May 21, 2020

ping for review

Member

@thockin thockin left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spiffxp, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spiffxp
Member Author

spiffxp commented May 26, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 26, 2020
@k8s-ci-robot k8s-ci-robot merged commit e3db4e2 into kubernetes:master May 26, 2020
@spiffxp spiffxp deleted the prow-build-clusters branch May 27, 2020 17:20
@spiffxp spiffxp added the area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters label Aug 25, 2020