Deploying Cluster for Pangeo #489

sgibson91 · 2021-06-28T16:22:43Z

This PR adds a tfvars file to terraform/projects that will deploy a Kubernetes cluster into the Pangeo GCP project pangeo-integration-te-3eea.

I took the machine sizes from the nodes Pangeo currently has deployed in their pangeo-181919 project.
I named the scratch bucket pangeo-scratch, following the infrastructure overview in: Documenting the current GCP deployment pangeo-data/pangeo-cloud-federation#874

Task issue: fix #488
Hub issue: #482

At this point, I am accepting feedback on just about everything, from naming conventions to machine choice :)

Update: Waiting on private cluster support #538

damianavila · 2021-06-28T23:40:45Z

machine choice

Do we have any pre-existing measurements from the Pangeo community on the load of the existing clusters, so we can match them with proper node sizes?

yuvipanda · 2021-06-29T07:34:02Z

Do we have any pre-existing measurements from the Pangeo community on the load of the existing clusters, so we can match them with proper node sizes?

I think it should match the profiles present in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d4f868e4d5c1ea92675facdabff86d49f31c7253/deployments/gcp-uscentral1b/config/common.yaml#L77

damianavila · 2021-06-29T15:36:44Z

It should match the profiles

I agree, the remaining question is how we select nodes to optimize those profiles in terms of cost (among other things).
Data from the existing Pangeo clusters could give us indications to perform that optimization.
I might be trying to optimize things a little early, I know, but since there is already data available, let's get the most out of the previous experience! 😉

yuvipanda · 2021-06-29T15:40:01Z

In general, the current pangeo clusters are very non-dense - usually each user gets their own node (practically). So each pod's resource requests / limits track with the instance type. This is what we do now for our hubs too!

yuvipanda · 2021-06-29T15:40:23Z

let's get the most out of the previous experience! 😉

I agree! But I don't want to block this PR on that :D

sgibson91 · 2021-06-29T16:22:47Z

By exploiting the fact that the machine size is listed in the node names when using the below command:

kubectl get pods -o wide | grep MACHINE_TYPE | wc -l

I got the following pod counts for different node types on the staging and prod deployments. (default-pool is type n1-standard-1.)

Namespace: staging
default-pool: 9
n1-highmem-4: 1

Namespace: prod
default-pool: 8
n1-highmem-2: 6
n1-highmem-4: 13
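
The counting above could be scripted along these lines. This is a minimal sketch, not the exact commands that were run: count_pods_by_type is a hypothetical helper name, and it relies on the same assumption as the grep above, namely that the pool or machine type appears somewhere on each line of the kubectl get pods -o wide output.

```shell
# Hypothetical helper: count pods per node pool / machine type.
# Reads `kubectl get pods -o wide` output on stdin; each argument is a
# substring to look for (for example a pool name or a machine type).
count_pods_by_type() {
  input="$(cat)"
  for pattern in "$@"; do
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so `|| true` keeps the loop going.
    count="$(printf '%s\n' "$input" | grep -c -- "$pattern" || true)"
    printf '%s: %s\n' "$pattern" "$count"
  done
}
```

It would be used like this: kubectl get pods -n prod -o wide | count_pods_by_type default-pool n1-highmem-2 n1-highmem-4. As with the plain grep, a pattern that also happens to appear in a pod name would be counted too.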

sgibson91 · 2021-07-01T12:35:36Z

So I think I have a permissions issue. I get the following error when running terraform init in this repo. I assume it's related to pulling the terraform state from GCP storage buckets.

$ terraform init

Initializing the backend...

Error: Failed to get existing workspaces: querying Cloud Storage failed: googleapi: Error 403: [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket., forbidden

sgibson91 · 2021-07-01T12:49:24Z

This was solved by making me an "owner" rather than an "organisation admin" on the gcp-org-admins group
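
For anyone hitting the same 403, a quick check before running terraform init might look like the sketch below. It is only an illustration: can_list_bucket is a hypothetical helper, and the bucket name is a placeholder argument because the state bucket is not named in this thread. It relies on gsutil ls needing the storage.objects.list permission that the error mentions.

```shell
# Hypothetical helper: report whether the active gcloud account can list
# the objects in a bucket (which needs storage.objects.list).
# The bucket name is passed in; it is a placeholder, not the real state bucket.
can_list_bucket() {
  bucket="$1"
  if gsutil ls "gs://${bucket}" >/dev/null 2>&1; then
    echo "ok: can list gs://${bucket}"
  else
    echo "failed: could not list gs://${bucket}"
    return 1
  fi
}
```

A failure here does not prove a permissions problem on its own (the bucket might not exist, for example), so it is a first hint rather than a diagnosis.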

terraform/projects/pangeo-hubs.tfvars

We know we'll need a scratch bucket, and therefore config connector, so the n1-highmem-4 machine is the best fit.

damianavila · 2021-07-06T21:32:57Z

@sgibson91, do you want to take this one out of draft now? Or are you thinking of deploying it as is to test it, and then taking it out of draft?

sgibson91 · 2021-07-07T10:00:20Z

Marked as ready for review. I will do a manual deploy. I'm interested to see what CI, if any, breaks when we merge this, as I don't think we have the permissions to the Pangeo project that this repo expects.

yuvipanda · 2021-07-07T17:31:14Z

@sgibson91 terraform deploys are still manual, we have no CD for those. Deploy by hand and iterate, and we can merge?

sgibson91 · 2021-07-07T18:10:46Z

I've just realised I've been commenting on the related issue thinking it was this PR 🤦🏻‍♀️ I tried a manual deploy (or rather, just terraform plan) and got a whole bunch of errors, even after trying with the access token you suggested.

What I did

1. Logged into gcloud with 2i2c.org account
2. Generated an access token with gcloud auth application-default print-access-token
3. Edited the block below in main.tf to include the access token:
   https://github.com/2i2c-org/pilot-hubs/blob/861b55c1e98a5ee5c5111975e32bb9e0fdcd6980/terraform/main.tf#L1-L6
4. Ran terraform init -reconfigure (maybe this should have been -migrate-state?)
5. Logged into gcloud with Columbia account
6. Ran terraform plan -var-file=projects/pangeo-hubs.tfvars

What I got

module.gke.module.gcloud_delete_default_kube_dns_configmap.module.gcloud_kubectl.null_resource.module_depends_on[0]: Refreshing state... [id=8865553039709359568]
module.gke.random_string.cluster_service_account_suffix: Refreshing state... [id=owc3]
module.gke.random_shuffle.available_zones: Refreshing state... [id=-]
google_artifact_registry_repository.container_repository: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1/repositories/]
module.gke.google_project_iam_member.cluster_service_account-metric_writer[0]: Refreshing state... [id=two-eye-two-see/roles/monitoring.metricWriter/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_project_iam_member.cluster_service_account-log_writer[0]: Refreshing state... [id=two-eye-two-see/roles/logging.logWriter/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.service_accounts.google_project_iam_member.project-roles[0]: Refreshing state... [id=two-eye-two-see/roles/container.admin/serviceaccount:[email protected]]
module.service_accounts.google_project_iam_member.project-roles[2]: Refreshing state... [id=two-eye-two-see/roles/compute.instanceAdmin.v1/serviceaccount:[email protected]]
module.service_accounts.google_service_account.service_accounts[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/[email protected]]
module.gke.google_project_iam_member.cluster_service_account-resourceMetadata-writer[0]: Refreshing state... [id=two-eye-two-see/roles/stackdriver.resourceMetadata.writer/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_project_iam_member.cluster_service_account-monitoring_viewer[0]: Refreshing state... [id=two-eye-two-see/roles/monitoring.viewer/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_container_node_pool.pools["dask-worker-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/dask-worker-pool]
module.gke.google_container_node_pool.pools["user-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/user-pool]
module.service_accounts.google_service_account_key.keys[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/[email protected]/keys/107710688b17ce563c639416dbc445ee4998ae53]
module.gke.google_container_cluster.primary: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster]
module.service_accounts.google_project_iam_member.project-roles[1]: Refreshing state... [id=two-eye-two-see/roles/artifactregistry.writer/serviceaccount:[email protected]]
module.gke.google_service_account.cluster_service_account[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
google_project_iam_member.project: Refreshing state... [id=two-eye-two-see/roles/artifactregistry.reader/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_container_node_pool.pools["core-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/core-pool]
╷
│ Error: Error when reading or editing Service Account "projects/two-eye-two-see/serviceAccounts/[email protected]": googleapi: Error 403: Permission iam.serviceAccounts.get is required to perform this operation on service account projects/two-eye-two-see/serviceAccounts/[email protected]., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/monitoring.metricWriter" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/stackdriver.resourceMetadata.writer" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Container Cluster "low-touch-hubs-cluster": googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/artifactregistry.writer" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Container NodePool dask-worker-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing ArtifactRegistryRepository "projects/two-eye-two-see/locations/us-central1/repositories/": googleapi: Error 403: Permission 'artifactregistry.repositories.get' denied on resource '//artifactregistry.googleapis.com/projects/two-eye-two-see/locations/us-central1/repositories/low-touch-hubs' (or it may not exist).
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.ErrorInfo",
│     "domain": "artifactregistry.googleapis.com",
│     "metadata": {
│       "permission": "artifactregistry.repositories.get",
│       "resource": "projects/two-eye-two-see/locations/us-central1/repositories/low-touch-hubs"
│     },
│     "reason": "IAM_PERMISSION_DENIED"
│   }
│ ]
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/logging.logWriter" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/compute.instanceAdmin.v1" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/container.admin" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Container NodePool core-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Service Account "projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": googleapi: Error 403: Permission iam.serviceAccounts.get is required to perform this operation on service account projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/monitoring.viewer" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Container NodePool user-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/artifactregistry.reader" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵

yuvipanda · 2021-07-07T20:57:38Z

I fiddled around with this a bit, and realized that we needed to create a new terraform workspace for this to work.

I ran terraform workspace new pangeo-hubs, and then terraform plan started showing me sensible outputs. I've deleted that workspace now, so it can be recreated again to do a full run through. Try it out, @sgibson91?
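
The workspace step can be folded into a small wrapper so that a plan is never run against another project's state by accident. This is a rough sketch under stated assumptions: plan_in_workspace is a hypothetical helper, and it assumes terraform init has already succeeded in the terraform directory.

```shell
# Hypothetical helper: run `terraform plan` for one project inside its own
# workspace, so the plan is not compared against another project's state.
plan_in_workspace() {
  workspace="$1"   # for example: pangeo-hubs
  var_file="$2"    # for example: projects/pangeo-hubs.tfvars
  # Select the workspace if it already exists, otherwise create it.
  terraform workspace select "$workspace" || terraform workspace new "$workspace"
  terraform plan -var-file="$var_file" -out=plan
}
```

For this PR the call would be plan_in_workspace pangeo-hubs projects/pangeo-hubs.tfvars. Running terraform workspace show beforehand is a cheap way to see which workspace is currently selected.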

sgibson91 · 2021-07-08T12:27:08Z

Hooray! The new workspace worked and the output of terraform plan looks way more sensible now!

terraform plan output

terraform plan -var-file=projects/pangeo-hubs.tfvars -out=plan

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_artifact_registry_repository.registry will be created
  + resource "google_artifact_registry_repository" "registry" {
      + create_time   = (known after apply)
      + format        = "DOCKER"
      + id            = (known after apply)
      + location      = "us-central1"
      + name          = (known after apply)
      + project       = "pangeo-integration-te-3eea"
      + repository_id = "pangeo-hubs-registry"
      + update_time   = (known after apply)
    }

  # google_container_cluster.cluster will be created
  + resource "google_container_cluster" "cluster" {
      + cluster_ipv4_cidr           = (known after apply)
      + datapath_provider           = (known after apply)
      + default_max_pods_per_node   = (known after apply)
      + enable_binary_authorization = false
      + enable_intranode_visibility = (known after apply)
      + enable_kubernetes_alpha     = false
      + enable_l4_ilb_subsetting    = false
      + enable_legacy_abac          = false
      + enable_shielded_nodes       = (known after apply)
      + enable_tpu                  = false
      + endpoint                    = (known after apply)
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + label_fingerprint           = (known after apply)
      + location                    = "us-central1-b"
      + logging_service             = (known after apply)
      + master_version              = (known after apply)
      + monitoring_service          = (known after apply)
      + name                        = "pangeo-hubs-cluster"
      + network                     = "default"
      + networking_mode             = (known after apply)
      + node_locations              = (known after apply)
      + node_version                = (known after apply)
      + operation                   = (known after apply)
      + private_ipv6_google_access  = (known after apply)
      + project                     = "pangeo-integration-te-3eea"
      + remove_default_node_pool    = true
      + self_link                   = (known after apply)
      + services_ipv4_cidr          = (known after apply)
      + subnetwork                  = (known after apply)
      + tpu_ipv4_cidr_block         = (known after apply)

      + addons_config {
          + cloudrun_config {
              + disabled           = (known after apply)
              + load_balancer_type = (known after apply)
            }

          + config_connector_config {
              + enabled = true
            }

          + dns_cache_config {
              + enabled = (known after apply)
            }

          + gce_persistent_disk_csi_driver_config {
              + enabled = (known after apply)
            }

          + horizontal_pod_autoscaling {
              + disabled = true
            }

          + http_load_balancing {
              + disabled = true
            }

          + istio_config {
              + auth     = (known after apply)
              + disabled = (known after apply)
            }

          + kalm_config {
              + enabled = (known after apply)
            }

          + network_policy_config {
              + disabled = (known after apply)
            }
        }

      + authenticator_groups_config {
          + security_group = (known after apply)
        }

      + cluster_autoscaling {
          + autoscaling_profile = "OPTIMIZE_UTILIZATION"
          + enabled             = false

          + auto_provisioning_defaults {
              + min_cpu_platform = (known after apply)
              + oauth_scopes     = (known after apply)
              + service_account  = (known after apply)
            }
        }

      + cluster_telemetry {
          + type = (known after apply)
        }

      + confidential_nodes {
          + enabled = (known after apply)
        }

      + database_encryption {
          + key_name = (known after apply)
          + state    = (known after apply)
        }

      + default_snat_status {
          + disabled = (known after apply)
        }

      + ip_allocation_policy {
          + cluster_ipv4_cidr_block       = (known after apply)
          + cluster_secondary_range_name  = (known after apply)
          + services_ipv4_cidr_block      = (known after apply)
          + services_secondary_range_name = (known after apply)
        }

      + master_auth {
          + client_certificate     = (known after apply)
          + client_key             = (sensitive value)
          + cluster_ca_certificate = (known after apply)
          + password               = (sensitive value)
          + username               = (known after apply)

          + client_certificate_config {
              + issue_client_certificate = (known after apply)
            }
        }

      + network_policy {
          + enabled = true
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = (known after apply)
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = (known after apply)
          + local_ssd_count   = (known after apply)
          + machine_type      = (known after apply)
          + metadata          = (known after apply)
          + oauth_scopes      = (known after apply)
          + preemptible       = false
          + service_account   = (known after apply)
          + taint             = (known after apply)

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + node_metadata = (known after apply)
            }
        }

      + node_pool {
          + initial_node_count  = (known after apply)
          + instance_group_urls = (known after apply)
          + max_pods_per_node   = (known after apply)
          + name                = (known after apply)
          + name_prefix         = (known after apply)
          + node_count          = (known after apply)
          + node_locations      = (known after apply)
          + version             = (known after apply)

          + autoscaling {
              + max_node_count = (known after apply)
              + min_node_count = (known after apply)
            }

          + management {
              + auto_repair  = (known after apply)
              + auto_upgrade = (known after apply)
            }

          + node_config {
              + boot_disk_kms_key = (known after apply)
              + disk_size_gb      = (known after apply)
              + disk_type         = (known after apply)
              + guest_accelerator = (known after apply)
              + image_type        = (known after apply)
              + labels            = (known after apply)
              + local_ssd_count   = (known after apply)
              + machine_type      = (known after apply)
              + metadata          = (known after apply)
              + min_cpu_platform  = (known after apply)
              + oauth_scopes      = (known after apply)
              + preemptible       = (known after apply)
              + service_account   = (known after apply)
              + tags              = (known after apply)
              + taint             = (known after apply)

              + ephemeral_storage_config {
                  + local_ssd_count = (known after apply)
                }

              + kubelet_config {
                  + cpu_cfs_quota        = (known after apply)
                  + cpu_cfs_quota_period = (known after apply)
                  + cpu_manager_policy   = (known after apply)
                }

              + linux_node_config {
                  + sysctls = (known after apply)
                }

              + sandbox_config {
                  + sandbox_type = (known after apply)
                }

              + shielded_instance_config {
                  + enable_integrity_monitoring = (known after apply)
                  + enable_secure_boot          = (known after apply)
                }

              + workload_metadata_config {
                  + node_metadata = (known after apply)
                }
            }

          + upgrade_settings {
              + max_surge       = (known after apply)
              + max_unavailable = (known after apply)
            }
        }

      + notification_config {
          + pubsub {
              + enabled = (known after apply)
              + topic   = (known after apply)
            }
        }

      + release_channel {
          + channel = "UNSPECIFIED"
        }

      + workload_identity_config {
          + identity_namespace = "pangeo-integration-te-3eea.svc.id.goog"
        }
    }

  # google_container_node_pool.core will be created
  + resource "google_container_node_pool" "core" {
      + cluster             = "pangeo-hubs-cluster"
      + id                  = (known after apply)
      + initial_node_count  = 1
      + instance_group_urls = (known after apply)
      + location            = "us-central1-b"
      + max_pods_per_node   = (known after apply)
      + name                = "core-pool"
      + name_prefix         = (known after apply)
      + node_count          = (known after apply)
      + node_locations      = (known after apply)
      + operation           = (known after apply)
      + project             = "pangeo-integration-te-3eea"
      + version             = (known after apply)

      + autoscaling {
          + max_node_count = 5
          + min_node_count = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = 30
          + disk_type         = (known after apply)
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "core"
              + "k8s.dask.org/node-purpose"    = "core"
            }
          + local_ssd_count   = (known after apply)
          + machine_type      = "n1-highmem-4"
          + metadata          = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = (known after apply)
          + taint             = (known after apply)

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + node_metadata = (known after apply)
            }
        }

      + upgrade_settings {
          + max_surge       = (known after apply)
          + max_unavailable = (known after apply)
        }
    }

  # google_container_node_pool.dask_worker["worker"] will be created
  + resource "google_container_node_pool" "dask_worker" {
      + cluster             = "pangeo-hubs-cluster"
      + id                  = (known after apply)
      + initial_node_count  = 0
      + instance_group_urls = (known after apply)
      + location            = "us-central1-b"
      + max_pods_per_node   = (known after apply)
      + name                = "dask-worker"
      + name_prefix         = (known after apply)
      + node_count          = (known after apply)
      + node_locations      = (known after apply)
      + operation           = (known after apply)
      + project             = "pangeo-integration-te-3eea"
      + version             = (known after apply)

      + autoscaling {
          + max_node_count = 100
          + min_node_count = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-ssd"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "k8s.dask.org/node-purpose" = "worker"
            }
          + local_ssd_count   = (known after apply)
          + machine_type      = "n1-highmem-4"
          + metadata          = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = true
          + service_account   = (known after apply)
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "k8s.dask.org_dedicated"
                  + value  = "worker"
                },
            ]

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + node_metadata = "GKE_METADATA_SERVER"
            }
        }

      + upgrade_settings {
          + max_surge       = (known after apply)
          + max_unavailable = (known after apply)
        }
    }

  # google_container_node_pool.notebook["user"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster             = "pangeo-hubs-cluster"
      + id                  = (known after apply)
      + initial_node_count  = 0
      + instance_group_urls = (known after apply)
      + location            = "us-central1-b"
      + max_pods_per_node   = (known after apply)
      + name                = "nb-user"
      + name_prefix         = (known after apply)
      + node_count          = (known after apply)
      + node_locations      = (known after apply)
      + operation           = (known after apply)
      + project             = "pangeo-integration-te-3eea"
      + version             = (known after apply)

      + autoscaling {
          + max_node_count = 20
          + min_node_count = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = (known after apply)
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + machine_type      = "n1-highmem-4"
          + metadata          = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = (known after apply)
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + node_metadata = "GKE_METADATA_SERVER"
            }
        }

      + upgrade_settings {
          + max_surge       = (known after apply)
          + max_unavailable = (known after apply)
        }
    }

  # google_project_iam_custom_role.identify_project_role will be created
  + resource "google_project_iam_custom_role" "identify_project_role" {
      + deleted     = (known after apply)
      + description = "Minimal role for hub users on pangeo-hubs to identify as current project"
      + id          = (known after apply)
      + name        = (known after apply)
      + permissions = [
          + "serviceusage.services.use",
        ]
      + project     = "pangeo-integration-te-3eea"
      + role_id     = "pangeo_hubs_user_sa_role"
      + stage       = "GA"
      + title       = "Identify as project role for users in pangeo-hubs"
    }

  # google_project_iam_member.cd_sa_roles["roles/artifactregistry.writer"] will be created
  + resource "google_project_iam_member" "cd_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/artifactregistry.writer"
    }

  # google_project_iam_member.cd_sa_roles["roles/container.admin"] will be created
  + resource "google_project_iam_member" "cd_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/container.admin"
    }

  # google_project_iam_member.cluster_sa_roles["roles/artifactregistry.reader"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/artifactregistry.reader"
    }

  # google_project_iam_member.cluster_sa_roles["roles/logging.logWriter"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/logging.logWriter"
    }

  # google_project_iam_member.cluster_sa_roles["roles/monitoring.metricWriter"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/monitoring.metricWriter"
    }

  # google_project_iam_member.cluster_sa_roles["roles/monitoring.viewer"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/monitoring.viewer"
    }

  # google_project_iam_member.cluster_sa_roles["roles/stackdriver.resourceMetadata.writer"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/stackdriver.resourceMetadata.writer"
    }

  # google_project_iam_member.identify_project_binding will be created
  + resource "google_project_iam_member" "identify_project_binding" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = (known after apply)
    }

  # google_service_account.cd_sa will be created
  + resource "google_service_account" "cd_sa" {
      + account_id   = "pangeo-hubs-cd-sa"
      + display_name = "Continuous Deployment SA for pangeo-hubs"
      + email        = (known after apply)
      + id           = (known after apply)
      + name         = (known after apply)
      + project      = "pangeo-integration-te-3eea"
      + unique_id    = (known after apply)
    }

  # google_service_account.cluster_sa will be created
  + resource "google_service_account" "cluster_sa" {
      + account_id   = "pangeo-hubs-cluster-sa"
      + display_name = "Cluster SA for pangeo-hubs"
      + email        = (known after apply)
      + id           = (known after apply)
      + name         = (known after apply)
      + project      = "pangeo-integration-te-3eea"
      + unique_id    = (known after apply)
    }

  # google_service_account_key.cd_sa will be created
  + resource "google_service_account_key" "cd_sa" {
      + id                 = (known after apply)
      + key_algorithm      = "KEY_ALG_RSA_2048"
      + name               = (known after apply)
      + private_key        = (sensitive value)
      + private_key_type   = "TYPE_GOOGLE_CREDENTIALS_FILE"
      + public_key         = (known after apply)
      + public_key_type    = "TYPE_X509_PEM_FILE"
      + service_account_id = (known after apply)
      + valid_after        = (known after apply)
      + valid_before       = (known after apply)
    }

  # google_storage_bucket.user_buckets["pangeo-scratch"] will be created
  + resource "google_storage_bucket" "user_buckets" {
      + bucket_policy_only          = (known after apply)
      + force_destroy               = false
      + id                          = (known after apply)
      + location                    = "US-CENTRAL1"
      + name                        = "pangeo-hubs-pangeo-scratch"
      + project                     = "pangeo-integration-te-3eea"
      + self_link                   = (known after apply)
      + storage_class               = "STANDARD"
      + uniform_bucket_level_access = (known after apply)
      + url                         = (known after apply)
    }

  # google_storage_bucket_iam_member.member["pangeo-scratch"] will be created
  + resource "google_storage_bucket_iam_member" "member" {
      + bucket = "pangeo-hubs-pangeo-scratch"
      + etag   = (known after apply)
      + id     = (known after apply)
      + member = (known after apply)
      + role   = "roles/storage.admin"
    }

Plan: 19 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + ci_deployer_key = (sensitive value)

Ok, I'm gonna deploy now 🙂

sgibson91 · 2021-07-08T12:38:45Z

Ok, got a couple more errors, but thankfully I don't think these have anything to do with permissions!

Error: Error waiting for creating GKE cluster: Not all instances running in IGM after 40.240072593s. Expected 1, running 0, transitioning 1. Current errors: [CONDITION_NOT_MET]: Instance 'gke-pangeo-hubs-cluster-default-pool-65fa3508-485z' creation failed: Constraint constraints/compute.vmExternalIpAccess violated for project 291560455175. Add instance projects/pangeo-integration-te-3eea/zones/us-central1-b/instances/gke-pangeo-hubs-cluster-default-pool-65fa3508-485z to the constraint to use external IP with it.

with google_container_cluster.cluster,
on cluster.tf line 1, in resource "google_container_cluster" "cluster":
1: resource "google_container_cluster" "cluster" {

and

Error: Error creating Repository: googleapi: Error 403: Artifact Registry API has not been used in project 291560455175 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/artifactregistry.googleapis.com/overview?project=291560455175 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.Help",
    "links": [
      {
        "description": "Google developers console API activation",
        "url": "https://console.developers.google.com/apis/api/artifactregistry.googleapis.com/overview?project=291560455175"
      }
    ]
  },
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadata": {
      "consumer": "projects/291560455175",
      "service": "artifactregistry.googleapis.com"
    },
    "reason": "SERVICE_DISABLED"
  }
]

with google_artifact_registry_repository.registry,
on registry.tf line 6, in resource "google_artifact_registry_repository" "registry":
6: resource "google_artifact_registry_repository" "registry" {

This second error looks like it's "just" a case of enabling the Artifact Registry API on the project 😜
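
The Details payload in the 403 error is structured, so the disabled service can be read out of it instead of being picked from the message text. Below is a minimal sketch, assuming the payload has been copied out of the Terraform error as a JSON string; the function name `disabled_services` is hypothetical and not part of any Google or Terraform tooling.

```python
import json

# Details payload copied from the 403 error above.
DETAILS_JSON = """
[
  {
    "@type": "type.googleapis.com/google.rpc.Help",
    "links": [
      {
        "description": "Google developers console API activation",
        "url": "https://console.developers.google.com/apis/api/artifactregistry.googleapis.com/overview?project=291560455175"
      }
    ]
  },
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadata": {
      "consumer": "projects/291560455175",
      "service": "artifactregistry.googleapis.com"
    },
    "reason": "SERVICE_DISABLED"
  }
]
"""


def disabled_services(details_json):
    """Return the services named by SERVICE_DISABLED ErrorInfo entries.

    Hypothetical helper for illustration only.
    """
    services = []
    for entry in json.loads(details_json):
        is_error_info = entry.get("@type", "").endswith("google.rpc.ErrorInfo")
        if is_error_info and entry.get("reason") == "SERVICE_DISABLED":
            services.append(entry["metadata"]["service"])
    return services


if __name__ == "__main__":
    print(disabled_services(DETAILS_JSON))
```

The service name it returns is the one that needs enabling on the project, either through the console link in the error or with `gcloud services enable`.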

sgibson91 · 2021-07-08T12:42:25Z

Enabled Artifact Registry API and retrying...

sgibson91 · 2021-07-08T12:54:04Z

Registry successfully deployed, now just gotta figure out the cluster. Looks like an organisational constraint is preventing the cluster from assigning an external IP https://cloud.google.com/resource-manager/docs/organization-policy/org-policy-constraints#:~:text=INSTANCE-,constraints%2Fcompute.vmexternalipaccess,-is

yuvipanda · 2021-07-08T13:06:36Z

Looks like an organisational constraint is preventing the cluster from assigning an external IP https://cloud.google.com/resource-manager/docs/organization-policy/org-policy-constraints#:~:text=INSTANCE-,constraints%2Fcompute.vmexternalipaccess,-is

Yeah, I remember hearing about this from some other staff at Columbia. I think this requires @rabernat to intervene now?

yuvipanda · 2021-07-12T11:13:04Z

@sgibson91 can you open an issue with the error you encountered?

@rabernat I think we need to:

1. Understand what Columbia's policies are about getting traffic into the cluster. Is it a blanket approval after you get one external IP? Or does it need approval for each external IP? Cluster design will have to change based on that.
2. Ask whoever manages cloud policy at Columbia to allow us to get traffic into the cluster, and figure out what the process for that is.

Do you know where we can learn about (1)? My contact has moved on from Columbia unfortunately, but if we don't make progress via other means I can reach out to him.

rabernat · 2021-07-12T11:14:10Z

Ok, since this issue is public, I think I'll just refer them to this. I have emailed my contact at CUIT with a request for assistance.

rabernat · 2021-07-12T14:48:04Z

Question from Parixit which needs a "correct / incorrect" response:

After enabling the API details, the only error that is present is the organizational constraint that is preventing the cluster from assigning an external IP, correct?

sgibson91 · 2021-07-12T14:52:54Z

Correct

yuvipanda

This LGTM! We should auto-deploy it (like with #569) from CI, but that doesn't need to block this PR.

@sgibson91 I'd suggest that you:

1. Do one final deploy to make sure things work OK.
2. Self-merge this, so the state of the infra matches what is in master.

Alternatively, you can add Continuous Deployment of this first, so the state is maintained automatically.

Excited to get this done!

sgibson91 · 2021-08-04T09:34:03Z

I had to add a labels attribute to the user and dask notebook blocks in the tfvars file to solve the below. I just left them blank. e821387

│ Error: Invalid value for input variable
│
│ on projects/pangeo-hubs.tfvars line 13:
│ 13: notebook_nodes = {
│ 14: "user" : {
│ 15: min : 0,
│ 16: max : 20,
│ 17: machine_type : "n1-highmem-4"
│ 18: },
│ 19: }
│
│ The given value is not valid for variable "notebook_nodes": element "user": attribute "labels" is required.
╵
╷
│ Error: Invalid value for input variable
│
│ on projects/pangeo-hubs.tfvars line 21:
│ 21: dask_nodes = {
│ 22: "worker" : {
│ 23: min : 0,
│ 24: max : 100,
│ 25: machine_type : "n1-highmem-4"
│ 26: },
│ 27: }
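
The error above says every element of `notebook_nodes` (and `dask_nodes`) must define `labels`, even if it is empty. A minimal sketch of that check, assuming the required attributes are the three visible in the tfvars excerpt plus `labels`; the real variable definition is not shown in this thread, and `missing_attributes` is a hypothetical helper, not Terraform's own validation.

```python
# Assumed set of required attributes: the three shown in the tfvars
# excerpt above, plus "labels", which the error reports as required.
REQUIRED_ATTRIBUTES = ("min", "max", "machine_type", "labels")


def missing_attributes(pools, required=REQUIRED_ATTRIBUTES):
    """Map each pool name to the required attributes it does not define."""
    problems = {}
    for name, pool in pools.items():
        missing = [attr for attr in required if attr not in pool]
        if missing:
            problems[name] = missing
    return problems


# The value from the error message, expressed as a Python dict.
notebook_nodes = {
    "user": {"min": 0, "max": 20, "machine_type": "n1-highmem-4"},
}

# The fix described in the comment: add a labels attribute and leave it blank.
notebook_nodes_fixed = {
    "user": {**notebook_nodes["user"], "labels": {}},
}

if __name__ == "__main__":
    print(missing_attributes(notebook_nodes))
    print(missing_attributes(notebook_nodes_fixed))
```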

Add first pass at tfvars file for Pangeo hubs

3122d9c

sgibson91 added the 🏷️ pangeo label Jun 28, 2021

sgibson91 requested a review from yuvipanda June 28, 2021 16:24

sgibson91 self-assigned this Jun 28, 2021

sgibson91 mentioned this pull request Jul 1, 2021

Document how to provide GCP access 2i2c-org/team-compass#133

Closed

2 tasks

damianavila reviewed Jul 1, 2021

terraform/projects/pangeo-hubs.tfvars (outdated, resolved)

Update core pool machine type

b413759

We know we'll need a scratch bucket, hence config connector, so the n1-highmem-4 machine is best suited

sgibson91 mentioned this pull request Jul 5, 2021

Team Sync - Jul 05, 2021 2i2c-org/team-compass#142

Closed

sgibson91 marked this pull request as ready for review July 7, 2021 09:59

sgibson91 changed the title from "[WIP] Deploying Cluster for Pangeo" to "Deploying Cluster for Pangeo" Jul 7, 2021

sgibson91 requested a review from damianavila July 7, 2021 10:00

sgibson91 mentioned this pull request Jul 7, 2021

Deploy a cluster into GCP pangeo-integration-te-3eea project for Pangeo pilot hubs #488

Closed

4 tasks

sgibson91 mentioned this pull request Jul 12, 2021

Team Sync - Jul 12, 2021 2i2c-org/team-compass#150

Closed

sgibson91 mentioned this pull request Jul 21, 2021

Enabling private nodes for GKE clusters #538

Merged

3 tasks

sgibson91 mentioned this pull request Jul 26, 2021

Team Sync - Jul 26, 2021 2i2c-org/team-compass#172

Closed

Update pangeo-hubs input vars to provision a private cluster

5430f40

sgibson91 added a commit to sgibson91/infrastructure that referenced this pull request Aug 2, 2021

Shouldn't add the deployer key in this PR

8cd8202

It belongs in 2i2c-org#489

sgibson91 added 3 commits August 2, 2021 20:08

Merge branch 'master' into new-cluster/pangeo-hub

a1e04b7

Add sops config for key stored in pangeo project

ba38653

Ordering is important here. sops tries to use the first rule that matches the regex and does not work through the list if it fails
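
The first-match behaviour described in this commit message can be sketched as follows. This is an illustration only: the rule list, key names and paths are hypothetical, and Python's `re` module stands in for the regex engine sops itself uses.

```python
import re

# Hypothetical creation rules, in the order they would appear in .sops.yaml.
# The key names and path patterns are made up for illustration.
CREATION_RULES = [
    {"path_regex": r".*/secrets/pangeo-hubs\..*", "gcp_kms": "hypothetical-pangeo-key"},
    {"path_regex": r".*/secrets/.*", "gcp_kms": "hypothetical-default-key"},
]


def select_rule(path, rules):
    """Return the first rule whose path_regex matches the path, or None.

    Later rules are never consulted once one matches, even if the chosen
    key then turns out to be unusable.
    """
    for rule in rules:
        if re.search(rule["path_regex"], path):
            return rule
    return None


if __name__ == "__main__":
    path = "config/secrets/pangeo-hubs.yaml"
    # Specific rule listed first: the pangeo key is chosen.
    print(select_rule(path, CREATION_RULES)["gcp_kms"])
    # Broad rule listed first: it shadows the specific one.
    print(select_rule(path, list(reversed(CREATION_RULES)))["gcp_kms"])
```

Listing the more specific `path_regex` before the broader one is what keeps the broad rule from shadowing it.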

Add encrypted deployer key for pangeo cluster

5eb0f46

sgibson91 commented Aug 3, 2021

.sops.yaml (outdated, resolved)

Use more explicit path_regex for pangeo-hubs sops key

3df4f02

sgibson91 commented Aug 3, 2021

.sops.yaml (outdated, resolved)

Fix typo

94f4be1

sgibson91 mentioned this pull request Aug 3, 2021

Support multiple backends for SOPS #575

Closed

yuvipanda approved these changes Aug 4, 2021

sgibson91 added 4 commits August 4, 2021 10:26

Merge branch 'master' into new-cluster/pangeo-hub

d9fb04e

Merge branch 'new-cluster/pangeo-hub' of github.com:sgibson91/pilot-h…

3fc0393

…ubs into new-cluster/pangeo-hub

Move pangeo-hubs.tfvars into appropriate folder

528dfaf

Add empty labels to notebook and dask nodes

e821387

Update ci_deployer_key for the cluster

ff71b92

sgibson91 merged commit 021da9c into 2i2c-org:master Aug 4, 2021

sgibson91 deleted the new-cluster/pangeo-hub branch August 4, 2021 09:39

choldgraf mentioned this pull request Sep 4, 2021

Blog post: 2i2c quarterly update Q3 2i2c-org/team-compass#235

Closed

4 tasks
