Deploying Cluster for Pangeo #489
Do we have any existing measurements from the Pangeo community of the load on their current clusters, so we can match them with appropriate node sizes?
I think it should match the profiles present in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d4f868e4d5c1ea92675facdabff86d49f31c7253/deployments/gcp-uscentral1b/config/common.yaml#L77
I agree; the remaining question is how we select nodes to optimize those profiles in terms of cost (among other things).
In general, the current Pangeo clusters are not dense: in practice, each user usually gets their own node. So each pod's resource requests/limits track the instance type. This is what we do now for our hubs too!
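Since each user effectively gets a whole node, picking a node size for a profile comes down to finding the smallest machine type whose capacity covers that profile's requests. A minimal sketch, assuming a hypothetical 90% headroom factor for what GKE reserves for system components, and using vCPU count then memory as a rough stand-in for cost (no prices are modelled). The profile numbers in the usage are hypothetical, not taken from common.yaml:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MachineType:
    name: str
    vcpus: int
    memory_gb: float


# Published vCPU/memory sizes for a few GCE N1 machine types.
N1_MACHINES: List[MachineType] = [
    MachineType("n1-standard-2", 2, 7.5),
    MachineType("n1-highmem-2", 2, 13.0),
    MachineType("n1-standard-4", 4, 15.0),
    MachineType("n1-highmem-4", 4, 26.0),
    MachineType("n1-standard-8", 8, 30.0),
    MachineType("n1-highmem-8", 8, 52.0),
]

# Hypothetical fraction of a node assumed allocatable to the user pod;
# real GKE allocatable capacity varies with machine size and should be checked.
HEADROOM = 0.9


def smallest_fit(cpu: float, memory_gb: float,
                 machines: List[MachineType] = N1_MACHINES,
                 headroom: float = HEADROOM) -> Optional[MachineType]:
    """Return the smallest machine (by vCPUs, then memory) that fits one user pod."""
    for machine in sorted(machines, key=lambda m: (m.vcpus, m.memory_gb)):
        if cpu <= machine.vcpus * headroom and memory_gb <= machine.memory_gb * headroom:
            return machine
    return None  # nothing in the list is large enough


if __name__ == "__main__":
    # Hypothetical profile: 2 CPUs and 20 GB of memory per user.
    print(smallest_fit(2, 20).name)
```

Ordering candidates by vCPUs and then memory is only a proxy for price; a real comparison would use current GCP pricing for the region.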
I agree! But I don't want to block this PR on that :D
Exploiting the fact that the machine type is listed in the node names, I used the command below:
kubectl get pods -o wide | grep MACHINE_TYPE | wc -l
and got the following pod counts for different node types:
Namespace: prod
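The same counts can be produced for every machine type in a single pass instead of one grep per type. A minimal sketch, assuming node names embed an N1 machine type string such as n1-highmem-4 (as the command above relies on) and that the script reads kubectl get pods -o wide output on stdin; like the grep, it matches anywhere on the line:

```python
import re
import sys
from collections import Counter

# Matches N1 machine type strings such as n1-highmem-4 or n1-standard-8.
MACHINE_RE = re.compile(r"\bn1-(?:standard|highmem|highcpu)-\d+\b")


def count_pods_by_machine(kubectl_output: str) -> Counter:
    """Count pods per machine type found in `kubectl get pods -o wide` output."""
    counts: Counter = Counter()
    for line in kubectl_output.splitlines()[1:]:  # first line is the column header
        match = MACHINE_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


if __name__ == "__main__":
    for machine, n in sorted(count_pods_by_machine(sys.stdin.read()).items()):
        print(f"{machine}: {n}")
```

Usage would be something like kubectl get pods -o wide -n prod | python count_pods.py (the script name is hypothetical). Pending pods with no node assigned are not counted.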
So I think I have a permissions issue. I get the following error when running terraform init:
$ terraform init
Initializing the backend...
Error: Failed to get existing workspaces: querying Cloud Storage failed: googleapi: Error 403: [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket., forbidden
This was solved by making me an "owner" rather than an "organisation admin" on the gcp-org-admins group.
We know we'll need a scratch bucket, hence Config Connector, so the n1-highmem-4 machine is the best fit.
@sgibson91, do you want to take this one out of draft now? Or are you thinking of deploying it as is to test it first, and then marking it ready?
Marked as ready for review. I will do a manual deploy. I'm interested to see what CI, if any, breaks when we merge this, as I think we don't have permissions to the Pangeo project that this repo expects.
@sgibson91 terraform deploys are still manual; we have no CD for those. Deploy by hand and iterate, and then we can merge?
I realise I've been commenting on the related issue, thinking it was this PR 🤦🏻♀️ I tried a manual deploy (or rather, just terraform plan) and got a whole bunch of errors, even after using the access token you suggested.
What I did
What I got:
module.gke.module.gcloud_delete_default_kube_dns_configmap.module.gcloud_kubectl.null_resource.module_depends_on[0]: Refreshing state... [id=8865553039709359568]
module.gke.random_string.cluster_service_account_suffix: Refreshing state... [id=owc3]
module.gke.random_shuffle.available_zones: Refreshing state... [id=-]
google_artifact_registry_repository.container_repository: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1/repositories/]
module.gke.google_project_iam_member.cluster_service_account-metric_writer[0]: Refreshing state... [id=two-eye-two-see/roles/monitoring.metricWriter/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_project_iam_member.cluster_service_account-log_writer[0]: Refreshing state... [id=two-eye-two-see/roles/logging.logWriter/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.service_accounts.google_project_iam_member.project-roles[0]: Refreshing state... [id=two-eye-two-see/roles/container.admin/serviceaccount:[email protected]]
module.service_accounts.google_project_iam_member.project-roles[2]: Refreshing state... [id=two-eye-two-see/roles/compute.instanceAdmin.v1/serviceaccount:[email protected]]
module.service_accounts.google_service_account.service_accounts[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/[email protected]]
module.gke.google_project_iam_member.cluster_service_account-resourceMetadata-writer[0]: Refreshing state... [id=two-eye-two-see/roles/stackdriver.resourceMetadata.writer/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_project_iam_member.cluster_service_account-monitoring_viewer[0]: Refreshing state... [id=two-eye-two-see/roles/monitoring.viewer/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_container_node_pool.pools["dask-worker-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/dask-worker-pool]
module.gke.google_container_node_pool.pools["user-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/user-pool]
module.service_accounts.google_service_account_key.keys[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/[email protected]/keys/107710688b17ce563c639416dbc445ee4998ae53]
module.gke.google_container_cluster.primary: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster]
module.service_accounts.google_project_iam_member.project-roles[1]: Refreshing state... [id=two-eye-two-see/roles/artifactregistry.writer/serviceaccount:[email protected]]
module.gke.google_service_account.cluster_service_account[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
google_project_iam_member.project: Refreshing state... [id=two-eye-two-see/roles/artifactregistry.reader/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_container_node_pool.pools["core-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/core-pool]
╷
│ Error: Error when reading or editing Service Account "projects/two-eye-two-see/serviceAccounts/[email protected]": googleapi: Error 403: Permission iam.serviceAccounts.get is required to perform this operation on service account projects/two-eye-two-see/serviceAccounts/[email protected]., forbidden
│
│
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/monitoring.metricWriter" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│
│
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/stackdriver.resourceMetadata.writer" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│
│
╵
╷
│ Error: Error when reading or editing Container Cluster "low-touch-hubs-cluster": googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│
│
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/artifactregistry.writer" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│
│
╵
╷
│ Error: Error when reading or editing Container NodePool dask-worker-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│
│
╵
╷
│ Error: Error when reading or editing ArtifactRegistryRepository "projects/two-eye-two-see/locations/us-central1/repositories/": googleapi: Error 403: Permission 'artifactregistry.repositories.get' denied on resource '//artifactregistry.googleapis.com/projects/two-eye-two-see/locations/us-central1/repositories/low-touch-hubs' (or it may not exist).
│ Details:
│ [
│ {
│ "@type": "type.googleapis.com/google.rpc.ErrorInfo",
│ "domain": "artifactregistry.googleapis.com",
│ "metadata": {
│ "permission": "artifactregistry.repositories.get",
│ "resource": "projects/two-eye-two-see/locations/us-central1/repositories/low-touch-hubs"
│ },
│ "reason": "IAM_PERMISSION_DENIED"
│ }
│ ]
│
│
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/logging.logWriter" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│
│
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/compute.instanceAdmin.v1" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│
│
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/container.admin" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│
│
╵
╷
│ Error: Error when reading or editing Container NodePool core-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│
│
╵
╷
│ Error: Error when reading or editing Service Account "projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": googleapi: Error 403: Permission iam.serviceAccounts.get is required to perform this operation on service account projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com., forbidden
│
│
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/monitoring.viewer" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│
│
╵
╷
│ Error: Error when reading or editing Container NodePool user-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│
│
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/artifactregistry.reader" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│
│
╵
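With this many 403s it can help to collapse the log into the distinct permissions that were refused. A minimal sketch, assuming the error wording shown in the log above; mapping the "Error retrieving IAM policy" lines to resourcemanager.projects.getIamPolicy is an inference (that is the permission needed to read a project's IAM policy), not something Terraform prints:

```python
import re
import sys
from collections import Counter

# Phrasings of 403 errors seen in the terraform output above.
PERMISSION_PATTERNS = [
    re.compile(r"Permission ([\w.]+) is required"),
    re.compile(r'Required "([\w.]+)" permission'),
    re.compile(r"Permission '([\w.]+)' denied"),
    re.compile(r"does not have ([\w.]+) access"),
]
# These errors do not name a permission; reading a project's IAM policy
# needs resourcemanager.projects.getIamPolicy, so count them under that name.
IAM_POLICY_MESSAGE = "Error retrieving IAM policy for project"
IAM_POLICY_PERMISSION = "resourcemanager.projects.getIamPolicy"


def missing_permissions(log_text: str) -> Counter:
    """Count refused permissions across every line that reports an Error 403."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        if "Error 403" not in line:
            continue
        if IAM_POLICY_MESSAGE in line:
            counts[IAM_POLICY_PERMISSION] += 1
            continue
        for pattern in PERMISSION_PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break
        else:
            counts["<unrecognised 403>"] += 1
    return counts


if __name__ == "__main__":
    for permission, n in missing_permissions(sys.stdin.read()).most_common():
        print(f"{n:3d}  {permission}")
```

Usage would be something like terraform plan 2>&1 | python summarise_403s.py (the script name is hypothetical).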
I fiddled around with this a bit and realized that we needed to create a new terraform workspace for this to work. I ran
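The exact command run here is not shown. As a general illustration only: Terraform keeps separate state per workspace in the same backend, so switching to a fresh workspace stops plan from refreshing resources recorded in another workspace's state (presumably why the errors above refer to the two-eye-two-see project). A minimal sketch that selects a workspace if it exists and creates it otherwise; the workspace name is hypothetical:

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def parse_workspaces(listing: str) -> Tuple[List[str], Optional[str]]:
    """Parse `terraform workspace list` output; the current workspace is prefixed with '* '."""
    names: List[str] = []
    current: Optional[str] = None
    for raw in listing.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("* "):
            line = line[2:].strip()
            current = line
        names.append(line)
    return names, current


def ensure_workspace(name: str, run=subprocess.run) -> None:
    """Switch to workspace `name`, creating it first if it does not exist."""
    listing = run(["terraform", "workspace", "list"],
                  capture_output=True, text=True, check=True).stdout
    names, current = parse_workspaces(listing)
    if current == name:
        return
    # `terraform workspace new` both creates the workspace and switches to it.
    action = "select" if name in names else "new"
    run(["terraform", "workspace", action, name], check=True)


if __name__ == "__main__":
    # Hypothetical workspace name; run from the directory holding the Terraform config.
    ensure_workspace(sys.argv[1] if len(sys.argv) > 1 else "pangeo")
```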
Hooray! The new workspace worked and the output of terraform plan looks much more sensible now!
terraform plan output
Ok, I'm gonna deploy now 🙂
Ok, got a couple more errors, but thankfully I don't think these have anything to do with permissions!
and
This second error looks like it's "just" a case of enabling the Artifact Registry API on the project 😜
Enabled the Artifact Registry API and retrying...
The registry deployed successfully; now I just need to figure out the cluster. It looks like an organisation policy constraint (constraints/compute.vmExternalIpAccess) is preventing the cluster's nodes from being assigned external IPs: https://cloud.google.com/resource-manager/docs/organization-policy/org-policy-constraints#:~:text=INSTANCE-,constraints%2Fcompute.vmexternalipaccess,-is
Yeah, I remember hearing about this from some other staff at Columbia. I think this requires @rabernat to intervene now?
@sgibson91 can you open an issue with the error you encountered? @rabernat I think we need to:
Do you know where we can learn about (1)? Unfortunately my contact has moved on from Columbia, but if we don't make progress via other means, I can reach out to him.
Ok, since this issue is public, I think I'll just refer them to this. I have emailed my contact at CUIT with a request for assistance. |
Question from Parixit which needs a "correct / incorrect" response:
Correct
It belongs in 2i2c-org#489
Ordering is important here: sops uses the first creation rule whose regex matches the file path, and does not work through the rest of the list if that rule fails.
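To illustrate why the order matters: if rules are checked top to bottom and the first match wins, a broad pattern placed above a narrower one shadows it. A minimal model of that first-match behaviour, using hypothetical path patterns and key names rather than this repository's actual .sops.yaml:

```python
import re
from typing import List, Optional, Tuple

# Each rule is (path_regex, key); the values used below are hypothetical.
Rule = Tuple[str, str]


def first_matching_rule(path: str, rules: List[Rule]) -> Optional[str]:
    """Return the key of the first rule whose regex matches `path`, else None.

    Later rules are never consulted once an earlier one matches, so a broad
    pattern listed first will shadow more specific ones below it.
    """
    for path_regex, key in rules:
        if re.search(path_regex, path):
            return key
    return None


broad_first = [(r"secrets/", "shared-key"), (r"secrets/pangeo/", "pangeo-key")]
specific_first = [(r"secrets/pangeo/", "pangeo-key"), (r"secrets/", "shared-key")]

if __name__ == "__main__":
    path = "secrets/pangeo/staging.yaml"
    print(first_matching_rule(path, broad_first))     # shared-key: the pangeo rule is shadowed
    print(first_matching_rule(path, specific_first))  # pangeo-key
```

The practical consequence is to list the most specific path_regex entries first and any catch-all rule last.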
This LGTM! We should auto-deploy it (like with #569) from CI, but that doesn't need to block this PR.
@sgibson91 I'd suggest that you:
- Do one final deploy to make sure things work ok,
- Self-merge this, so state of infra matches what is in master.
Alternatively, you can add Continuous Deployment of this first, so the state is maintained automatically.
Excited to get this done!
I had to add a
This PR adds a tfvars file to terraform/projects that will deploy a Kubernetes cluster into the Pangeo GCP project pangeo-integration-te-3eea.
pangeo-181919
project pangeo-scratch
according to the infrastructure overview here: Documenting the current GCP deployment pangeo-data/pangeo-cloud-federation#874
Task issue: fix #488
Hub issue: #482
At this point, I am accepting feedback on just about everything, from naming conventions to machine choice :)
Update: Waiting on private cluster support #538