Deploying Cluster for Pangeo #489

Merged: 13 commits, Aug 4, 2021

Conversation

@sgibson91 (Member) commented Jun 28, 2021

This PR adds a tfvars file to terraform/projects that will deploy a Kubernetes cluster into the Pangeo GCP project pangeo-integration-te-3eea.

Task issue: fix #488
Hub issue: #482

At this point, I am accepting feedback on just about everything, from naming conventions to machine choice :)

Update: Waiting on private cluster support #538

@damianavila (Contributor)

machine choice

Do we have any pre-existing measurements from the Pangeo community to know the load of the existing clusters, so we can match them with proper node sizes?

@yuvipanda (Member)

Do we have any pre-existing measurements from the Pangeo community to know the load of the existing clusters, so we can match them with proper node sizes?

It should match the profiles present in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d4f868e4d5c1ea92675facdabff86d49f31c7253/deployments/gcp-uscentral1b/config/common.yaml#L77 I think

@damianavila (Contributor)

It should match the profiles

I agree; the remaining question is how we select nodes to optimize for those profiles in terms of cost (among other things).
Data from the existing Pangeo clusters could give us indications for performing that optimization.
I might be trying to optimize a little early, I know, but since there is already data available, let's get the most out of the previous experience! 😉

@yuvipanda (Member)

In general, the current Pangeo clusters are very non-dense: practically, each user gets their own node. So each pod's resource requests/limits track the instance type. This is what we do for our hubs now too!

@yuvipanda (Member)

let's get the most out of the previous experience! 😉

I agree! But I don't want to block this PR on that :D

@sgibson91 (Member, Author) commented Jun 29, 2021

Exploiting the fact that the machine type appears in the node names reported by kubectl get pods -o wide, I counted pods per node type with:

kubectl get pods -o wide | grep MACHINE_TYPE | wc -l

This gave the following pod counts on the staging and prod deployments. (default-pool is type n1-standard-1.)

Namespace: staging
default-pool: 9
n1-highmem-4: 1

Namespace: prod
default-pool: 8
n1-highmem-2: 6
n1-highmem-4: 13
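
For reference, here is a small shell sketch that automates this count via node labels instead of grepping node names (the label key is an assumption; older clusters expose beta.kubernetes.io/instance-type instead):

for mt in $(kubectl get nodes -L node.kubernetes.io/instance-type --no-headers | awk '{print $NF}' | sort -u); do
  total=0
  # sum pods scheduled on every node of this machine type, across all namespaces
  for node in $(kubectl get nodes -l node.kubernetes.io/instance-type="$mt" -o name | cut -d/ -f2); do
    n=$(kubectl get pods --all-namespaces --no-headers --field-selector spec.nodeName="$node" | wc -l)
    total=$((total + n))
  done
  echo "$mt: $total pods"
done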

@sgibson91 (Member, Author)

So I think I have a permissions issue. I get the following error when running terraform init in this repo. I assume it's related to pulling the terraform state from GCP storage buckets.

$ terraform init

Initializing the backend...

Error: Failed to get existing workspaces: querying Cloud Storage failed: googleapi: Error 403: [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket., forbidden

@sgibson91 (Member, Author)

So I think I have a permissions issue. I get the following error when running terraform init in this repo. I assume it's related to pulling the terraform state from GCP storage buckets.

$ terraform init

Initializing the backend...

Error: Failed to get existing workspaces: querying Cloud Storage failed: googleapi: Error 403: [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket., forbidden

This was solved by making me an "owner" rather than an "organisation admin" on the gcp-org-admins group.
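
For future debugging, a quick way to check whether the active credentials can list the state bucket (bucket name below is hypothetical):

gsutil ls gs://STATE_BUCKET  # fails with an AccessDenied error when storage.objects.list is missing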

We know we'll need a scratch bucket, and hence Config Connector, so the n1-highmem-4 machine is best suited.
@damianavila (Contributor)

@sgibson91, do you want to take this one out of draft now? Or are you thinking of deploying it as-is to test it first, and then taking it out of draft?

@sgibson91 marked this pull request as ready for review on July 7, 2021
@sgibson91 changed the title from "[WIP] Deploying Cluster for Pangeo" to "Deploying Cluster for Pangeo" on Jul 7, 2021
@sgibson91 (Member, Author)

Marked as ready for review. I will do a manual deploy. I'm interested to see what CI, if any, breaks when we merge this, since I don't think we have the permissions to the Pangeo project that this repo expects.

@yuvipanda (Member)

@sgibson91 terraform deploys are still manual; we have no CD for those. Deploy by hand and iterate, and we can merge?

@sgibson91 (Member, Author) commented Jul 7, 2021

I realise I've been commenting on the related issue thinking it was this PR 🤦🏻‍♀️. I tried a manual deploy (or rather, just terraform plan) and got a whole bunch of errors, even after trying with the access token you suggested.


What I did

  1. Logged into gcloud with my 2i2c.org account
  2. Generated an access token with gcloud auth application-default print-access-token
  3. Edited the block below in main.tf to include the access token (see the sketch after this list)

https://github.com/2i2c-org/pilot-hubs/blob/861b55c1e98a5ee5c5111975e32bb9e0fdcd6980/terraform/main.tf#L1-L6

  4. Ran terraform init -reconfigure (maybe this should have been -migrate-state?)
  5. Logged into gcloud with my Columbia account
  6. Ran terraform plan -var-file=projects/pangeo-hubs.tfvars
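
A sketch of the edited backend block (the bucket and prefix values here are assumptions; the real ones are in the linked main.tf):

terraform {
  backend "gcs" {
    bucket       = "STATE_BUCKET"     # hypothetical name; see the linked main.tf
    prefix       = "terraform/state"  # hypothetical prefix
    access_token = "ya29...."         # pasted from `gcloud auth application-default print-access-token`
  }
}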

What I got

module.gke.module.gcloud_delete_default_kube_dns_configmap.module.gcloud_kubectl.null_resource.module_depends_on[0]: Refreshing state... [id=8865553039709359568]
module.gke.random_string.cluster_service_account_suffix: Refreshing state... [id=owc3]
module.gke.random_shuffle.available_zones: Refreshing state... [id=-]
google_artifact_registry_repository.container_repository: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1/repositories/]
module.gke.google_project_iam_member.cluster_service_account-metric_writer[0]: Refreshing state... [id=two-eye-two-see/roles/monitoring.metricWriter/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_project_iam_member.cluster_service_account-log_writer[0]: Refreshing state... [id=two-eye-two-see/roles/logging.logWriter/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.service_accounts.google_project_iam_member.project-roles[0]: Refreshing state... [id=two-eye-two-see/roles/container.admin/serviceaccount:[email protected]]
module.service_accounts.google_project_iam_member.project-roles[2]: Refreshing state... [id=two-eye-two-see/roles/compute.instanceAdmin.v1/serviceaccount:[email protected]]
module.service_accounts.google_service_account.service_accounts[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/[email protected]]
module.gke.google_project_iam_member.cluster_service_account-resourceMetadata-writer[0]: Refreshing state... [id=two-eye-two-see/roles/stackdriver.resourceMetadata.writer/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_project_iam_member.cluster_service_account-monitoring_viewer[0]: Refreshing state... [id=two-eye-two-see/roles/monitoring.viewer/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_container_node_pool.pools["dask-worker-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/dask-worker-pool]
module.gke.google_container_node_pool.pools["user-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/user-pool]
module.service_accounts.google_service_account_key.keys[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/[email protected]/keys/107710688b17ce563c639416dbc445ee4998ae53]
module.gke.google_container_cluster.primary: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster]
module.service_accounts.google_project_iam_member.project-roles[1]: Refreshing state... [id=two-eye-two-see/roles/artifactregistry.writer/serviceaccount:[email protected]]
module.gke.google_service_account.cluster_service_account[0]: Refreshing state... [id=projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
google_project_iam_member.project: Refreshing state... [id=two-eye-two-see/roles/artifactregistry.reader/serviceaccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com]
module.gke.google_container_node_pool.pools["core-pool"]: Refreshing state... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/low-touch-hubs-cluster/nodePools/core-pool]
╷
│ Error: Error when reading or editing Service Account "projects/two-eye-two-see/serviceAccounts/[email protected]": googleapi: Error 403: Permission iam.serviceAccounts.get is required to perform this operation on service account projects/two-eye-two-see/serviceAccounts/[email protected]., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/monitoring.metricWriter" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/stackdriver.resourceMetadata.writer" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Container Cluster "low-touch-hubs-cluster": googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/artifactregistry.writer" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Container NodePool dask-worker-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing ArtifactRegistryRepository "projects/two-eye-two-see/locations/us-central1/repositories/": googleapi: Error 403: Permission 'artifactregistry.repositories.get' denied on resource '//artifactregistry.googleapis.com/projects/two-eye-two-see/locations/us-central1/repositories/low-touch-hubs' (or it may not exist).
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.ErrorInfo",
│     "domain": "artifactregistry.googleapis.com",
│     "metadata": {
│       "permission": "artifactregistry.repositories.get",
│       "resource": "projects/two-eye-two-see/locations/us-central1/repositories/low-touch-hubs"
│     },
│     "reason": "IAM_PERMISSION_DENIED"
│   }
│ ]
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/logging.logWriter" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/compute.instanceAdmin.v1" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/container.admin" Member "serviceAccount:[email protected]": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Container NodePool core-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Service Account "projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": googleapi: Error 403: Permission iam.serviceAccounts.get is required to perform this operation on service account projects/two-eye-two-see/serviceAccounts/tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/monitoring.viewer" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Container NodePool user-pool: googleapi: Error 403: Required "container.clusters.get" permission(s) for "projects/two-eye-two-see/zones/us-central1-b/clusters/low-touch-hubs-cluster"., forbidden
│ 
│ 
╵
╷
│ Error: Error when reading or editing Resource "project \"two-eye-two-see\"" with IAM Member: Role "roles/artifactregistry.reader" Member "serviceAccount:tf-gke-low-touch-hubs--owc3@two-eye-two-see.iam.gserviceaccount.com": Error retrieving IAM policy for project "two-eye-two-see": googleapi: Error 403: The caller does not have permission, forbidden
│ 
│ 
╵

@yuvipanda (Member)

I fiddled around with this a bit, and realized that we needed to create a new terraform workspace for this to work.

I ran terraform workspace new pangeo-hubs, and then terraform plan started showing me sensible outputs. I've deleted that workspace now, so it can be recreated again to do a full run through. Try it out, @sgibson91?
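
For anyone following along, the full sequence is roughly (a sketch, run from the terraform directory):

terraform workspace new pangeo-hubs    # create and switch to the new workspace
terraform workspace list               # confirm which workspace is active
terraform plan -var-file=projects/pangeo-hubs.tfvars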

@sgibson91 (Member, Author)

Hooray! The new workspace worked and the output of terraform plan looks way more sensible now!

terraform plan output:

terraform plan -var-file=projects/pangeo-hubs.tfvars -out=plan

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_artifact_registry_repository.registry will be created
  + resource "google_artifact_registry_repository" "registry" {
      + create_time   = (known after apply)
      + format        = "DOCKER"
      + id            = (known after apply)
      + location      = "us-central1"
      + name          = (known after apply)
      + project       = "pangeo-integration-te-3eea"
      + repository_id = "pangeo-hubs-registry"
      + update_time   = (known after apply)
    }

  # google_container_cluster.cluster will be created
  + resource "google_container_cluster" "cluster" {
      + cluster_ipv4_cidr           = (known after apply)
      + datapath_provider           = (known after apply)
      + default_max_pods_per_node   = (known after apply)
      + enable_binary_authorization = false
      + enable_intranode_visibility = (known after apply)
      + enable_kubernetes_alpha     = false
      + enable_l4_ilb_subsetting    = false
      + enable_legacy_abac          = false
      + enable_shielded_nodes       = (known after apply)
      + enable_tpu                  = false
      + endpoint                    = (known after apply)
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + label_fingerprint           = (known after apply)
      + location                    = "us-central1-b"
      + logging_service             = (known after apply)
      + master_version              = (known after apply)
      + monitoring_service          = (known after apply)
      + name                        = "pangeo-hubs-cluster"
      + network                     = "default"
      + networking_mode             = (known after apply)
      + node_locations              = (known after apply)
      + node_version                = (known after apply)
      + operation                   = (known after apply)
      + private_ipv6_google_access  = (known after apply)
      + project                     = "pangeo-integration-te-3eea"
      + remove_default_node_pool    = true
      + self_link                   = (known after apply)
      + services_ipv4_cidr          = (known after apply)
      + subnetwork                  = (known after apply)
      + tpu_ipv4_cidr_block         = (known after apply)

      + addons_config {
          + cloudrun_config {
              + disabled           = (known after apply)
              + load_balancer_type = (known after apply)
            }

          + config_connector_config {
              + enabled = true
            }

          + dns_cache_config {
              + enabled = (known after apply)
            }

          + gce_persistent_disk_csi_driver_config {
              + enabled = (known after apply)
            }

          + horizontal_pod_autoscaling {
              + disabled = true
            }

          + http_load_balancing {
              + disabled = true
            }

          + istio_config {
              + auth     = (known after apply)
              + disabled = (known after apply)
            }

          + kalm_config {
              + enabled = (known after apply)
            }

          + network_policy_config {
              + disabled = (known after apply)
            }
        }

      + authenticator_groups_config {
          + security_group = (known after apply)
        }

      + cluster_autoscaling {
          + autoscaling_profile = "OPTIMIZE_UTILIZATION"
          + enabled             = false

          + auto_provisioning_defaults {
              + min_cpu_platform = (known after apply)
              + oauth_scopes     = (known after apply)
              + service_account  = (known after apply)
            }
        }

      + cluster_telemetry {
          + type = (known after apply)
        }

      + confidential_nodes {
          + enabled = (known after apply)
        }

      + database_encryption {
          + key_name = (known after apply)
          + state    = (known after apply)
        }

      + default_snat_status {
          + disabled = (known after apply)
        }

      + ip_allocation_policy {
          + cluster_ipv4_cidr_block       = (known after apply)
          + cluster_secondary_range_name  = (known after apply)
          + services_ipv4_cidr_block      = (known after apply)
          + services_secondary_range_name = (known after apply)
        }

      + master_auth {
          + client_certificate     = (known after apply)
          + client_key             = (sensitive value)
          + cluster_ca_certificate = (known after apply)
          + password               = (sensitive value)
          + username               = (known after apply)

          + client_certificate_config {
              + issue_client_certificate = (known after apply)
            }
        }

      + network_policy {
          + enabled = true
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = (known after apply)
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = (known after apply)
          + local_ssd_count   = (known after apply)
          + machine_type      = (known after apply)
          + metadata          = (known after apply)
          + oauth_scopes      = (known after apply)
          + preemptible       = false
          + service_account   = (known after apply)
          + taint             = (known after apply)

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + node_metadata = (known after apply)
            }
        }

      + node_pool {
          + initial_node_count  = (known after apply)
          + instance_group_urls = (known after apply)
          + max_pods_per_node   = (known after apply)
          + name                = (known after apply)
          + name_prefix         = (known after apply)
          + node_count          = (known after apply)
          + node_locations      = (known after apply)
          + version             = (known after apply)

          + autoscaling {
              + max_node_count = (known after apply)
              + min_node_count = (known after apply)
            }

          + management {
              + auto_repair  = (known after apply)
              + auto_upgrade = (known after apply)
            }

          + node_config {
              + boot_disk_kms_key = (known after apply)
              + disk_size_gb      = (known after apply)
              + disk_type         = (known after apply)
              + guest_accelerator = (known after apply)
              + image_type        = (known after apply)
              + labels            = (known after apply)
              + local_ssd_count   = (known after apply)
              + machine_type      = (known after apply)
              + metadata          = (known after apply)
              + min_cpu_platform  = (known after apply)
              + oauth_scopes      = (known after apply)
              + preemptible       = (known after apply)
              + service_account   = (known after apply)
              + tags              = (known after apply)
              + taint             = (known after apply)

              + ephemeral_storage_config {
                  + local_ssd_count = (known after apply)
                }

              + kubelet_config {
                  + cpu_cfs_quota        = (known after apply)
                  + cpu_cfs_quota_period = (known after apply)
                  + cpu_manager_policy   = (known after apply)
                }

              + linux_node_config {
                  + sysctls = (known after apply)
                }

              + sandbox_config {
                  + sandbox_type = (known after apply)
                }

              + shielded_instance_config {
                  + enable_integrity_monitoring = (known after apply)
                  + enable_secure_boot          = (known after apply)
                }

              + workload_metadata_config {
                  + node_metadata = (known after apply)
                }
            }

          + upgrade_settings {
              + max_surge       = (known after apply)
              + max_unavailable = (known after apply)
            }
        }

      + notification_config {
          + pubsub {
              + enabled = (known after apply)
              + topic   = (known after apply)
            }
        }

      + release_channel {
          + channel = "UNSPECIFIED"
        }

      + workload_identity_config {
          + identity_namespace = "pangeo-integration-te-3eea.svc.id.goog"
        }
    }

  # google_container_node_pool.core will be created
  + resource "google_container_node_pool" "core" {
      + cluster             = "pangeo-hubs-cluster"
      + id                  = (known after apply)
      + initial_node_count  = 1
      + instance_group_urls = (known after apply)
      + location            = "us-central1-b"
      + max_pods_per_node   = (known after apply)
      + name                = "core-pool"
      + name_prefix         = (known after apply)
      + node_count          = (known after apply)
      + node_locations      = (known after apply)
      + operation           = (known after apply)
      + project             = "pangeo-integration-te-3eea"
      + version             = (known after apply)

      + autoscaling {
          + max_node_count = 5
          + min_node_count = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = 30
          + disk_type         = (known after apply)
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "core"
              + "k8s.dask.org/node-purpose"    = "core"
            }
          + local_ssd_count   = (known after apply)
          + machine_type      = "n1-highmem-4"
          + metadata          = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = (known after apply)
          + taint             = (known after apply)

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + node_metadata = (known after apply)
            }
        }

      + upgrade_settings {
          + max_surge       = (known after apply)
          + max_unavailable = (known after apply)
        }
    }

  # google_container_node_pool.dask_worker["worker"] will be created
  + resource "google_container_node_pool" "dask_worker" {
      + cluster             = "pangeo-hubs-cluster"
      + id                  = (known after apply)
      + initial_node_count  = 0
      + instance_group_urls = (known after apply)
      + location            = "us-central1-b"
      + max_pods_per_node   = (known after apply)
      + name                = "dask-worker"
      + name_prefix         = (known after apply)
      + node_count          = (known after apply)
      + node_locations      = (known after apply)
      + operation           = (known after apply)
      + project             = "pangeo-integration-te-3eea"
      + version             = (known after apply)

      + autoscaling {
          + max_node_count = 100
          + min_node_count = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-ssd"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "k8s.dask.org/node-purpose" = "worker"
            }
          + local_ssd_count   = (known after apply)
          + machine_type      = "n1-highmem-4"
          + metadata          = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = true
          + service_account   = (known after apply)
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "k8s.dask.org_dedicated"
                  + value  = "worker"
                },
            ]

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + node_metadata = "GKE_METADATA_SERVER"
            }
        }

      + upgrade_settings {
          + max_surge       = (known after apply)
          + max_unavailable = (known after apply)
        }
    }

  # google_container_node_pool.notebook["user"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster             = "pangeo-hubs-cluster"
      + id                  = (known after apply)
      + initial_node_count  = 0
      + instance_group_urls = (known after apply)
      + location            = "us-central1-b"
      + max_pods_per_node   = (known after apply)
      + name                = "nb-user"
      + name_prefix         = (known after apply)
      + node_count          = (known after apply)
      + node_locations      = (known after apply)
      + operation           = (known after apply)
      + project             = "pangeo-integration-te-3eea"
      + version             = (known after apply)

      + autoscaling {
          + max_node_count = 20
          + min_node_count = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = (known after apply)
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + machine_type      = "n1-highmem-4"
          + metadata          = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = (known after apply)
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + node_metadata = "GKE_METADATA_SERVER"
            }
        }

      + upgrade_settings {
          + max_surge       = (known after apply)
          + max_unavailable = (known after apply)
        }
    }

  # google_project_iam_custom_role.identify_project_role will be created
  + resource "google_project_iam_custom_role" "identify_project_role" {
      + deleted     = (known after apply)
      + description = "Minimal role for hub users on pangeo-hubs to identify as current project"
      + id          = (known after apply)
      + name        = (known after apply)
      + permissions = [
          + "serviceusage.services.use",
        ]
      + project     = "pangeo-integration-te-3eea"
      + role_id     = "pangeo_hubs_user_sa_role"
      + stage       = "GA"
      + title       = "Identify as project role for users in pangeo-hubs"
    }

  # google_project_iam_member.cd_sa_roles["roles/artifactregistry.writer"] will be created
  + resource "google_project_iam_member" "cd_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/artifactregistry.writer"
    }

  # google_project_iam_member.cd_sa_roles["roles/container.admin"] will be created
  + resource "google_project_iam_member" "cd_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/container.admin"
    }

  # google_project_iam_member.cluster_sa_roles["roles/artifactregistry.reader"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/artifactregistry.reader"
    }

  # google_project_iam_member.cluster_sa_roles["roles/logging.logWriter"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/logging.logWriter"
    }

  # google_project_iam_member.cluster_sa_roles["roles/monitoring.metricWriter"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/monitoring.metricWriter"
    }

  # google_project_iam_member.cluster_sa_roles["roles/monitoring.viewer"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/monitoring.viewer"
    }

  # google_project_iam_member.cluster_sa_roles["roles/stackdriver.resourceMetadata.writer"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = "roles/stackdriver.resourceMetadata.writer"
    }

  # google_project_iam_member.identify_project_binding will be created
  + resource "google_project_iam_member" "identify_project_binding" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "pangeo-integration-te-3eea"
      + role    = (known after apply)
    }

  # google_service_account.cd_sa will be created
  + resource "google_service_account" "cd_sa" {
      + account_id   = "pangeo-hubs-cd-sa"
      + display_name = "Continuous Deployment SA for pangeo-hubs"
      + email        = (known after apply)
      + id           = (known after apply)
      + name         = (known after apply)
      + project      = "pangeo-integration-te-3eea"
      + unique_id    = (known after apply)
    }

  # google_service_account.cluster_sa will be created
  + resource "google_service_account" "cluster_sa" {
      + account_id   = "pangeo-hubs-cluster-sa"
      + display_name = "Cluster SA for pangeo-hubs"
      + email        = (known after apply)
      + id           = (known after apply)
      + name         = (known after apply)
      + project      = "pangeo-integration-te-3eea"
      + unique_id    = (known after apply)
    }

  # google_service_account_key.cd_sa will be created
  + resource "google_service_account_key" "cd_sa" {
      + id                 = (known after apply)
      + key_algorithm      = "KEY_ALG_RSA_2048"
      + name               = (known after apply)
      + private_key        = (sensitive value)
      + private_key_type   = "TYPE_GOOGLE_CREDENTIALS_FILE"
      + public_key         = (known after apply)
      + public_key_type    = "TYPE_X509_PEM_FILE"
      + service_account_id = (known after apply)
      + valid_after        = (known after apply)
      + valid_before       = (known after apply)
    }

  # google_storage_bucket.user_buckets["pangeo-scratch"] will be created
  + resource "google_storage_bucket" "user_buckets" {
      + bucket_policy_only          = (known after apply)
      + force_destroy               = false
      + id                          = (known after apply)
      + location                    = "US-CENTRAL1"
      + name                        = "pangeo-hubs-pangeo-scratch"
      + project                     = "pangeo-integration-te-3eea"
      + self_link                   = (known after apply)
      + storage_class               = "STANDARD"
      + uniform_bucket_level_access = (known after apply)
      + url                         = (known after apply)
    }

  # google_storage_bucket_iam_member.member["pangeo-scratch"] will be created
  + resource "google_storage_bucket_iam_member" "member" {
      + bucket = "pangeo-hubs-pangeo-scratch"
      + etag   = (known after apply)
      + id     = (known after apply)
      + member = (known after apply)
      + role   = "roles/storage.admin"
    }

Plan: 19 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + ci_deployer_key = (sensitive value)

Ok, I'm gonna deploy now 🙂
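
(Deploying here just means applying the plan file saved via the -out=plan flag above; a minimal sketch:)

terraform apply plan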

@sgibson91 (Member, Author)

Ok, got a couple more errors, but thankfully I don't think these have anything to do with permissions!

Error: Error waiting for creating GKE cluster: Not all instances running in IGM after 40.240072593s. Expected 1, running 0, transitioning 1. Current errors: [CONDITION_NOT_MET]: Instance 'gke-pangeo-hubs-cluster-default-pool-65fa3508-485z' creation failed: Constraint constraints/compute.vmExternalIpAccess violated for project 291560455175. Add instance projects/pangeo-integration-te-3eea/zones/us-central1-b/instances/gke-pangeo-hubs-cluster-default-pool-65fa3508-485z to the constraint to use external IP with it.

with google_container_cluster.cluster,
on cluster.tf line 1, in resource "google_container_cluster" "cluster":
1: resource "google_container_cluster" "cluster" {

and

Error: Error creating Repository: googleapi: Error 403: Artifact Registry API has not been used in project 291560455175 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/artifactregistry.googleapis.com/overview?project=291560455175 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.Help",
    "links": [
      {
        "description": "Google developers console API activation",
        "url": "https://console.developers.google.com/apis/api/artifactregistry.googleapis.com/overview?project=291560455175"
      }
    ]
  },
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadata": {
      "consumer": "projects/291560455175",
      "service": "artifactregistry.googleapis.com"
    },
    "reason": "SERVICE_DISABLED"
  }
]

with google_artifact_registry_repository.registry,
on registry.tf line 6, in resource "google_artifact_registry_repository" "registry":
6: resource "google_artifact_registry_repository" "registry" {

This second error looks like it's "just" a case of enabling the Artifact Registry API on the project 😜

@sgibson91 (Member, Author)

Enabled Artifact Registry API and retrying...
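
(For reference, the same thing can be done from the CLI; a sketch:)

gcloud services enable artifactregistry.googleapis.com --project=pangeo-integration-te-3eea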

@sgibson91 (Member, Author)

Registry successfully deployed; now I just gotta figure out the cluster. Looks like an organisational constraint is preventing the cluster from assigning an external IP: https://cloud.google.com/resource-manager/docs/organization-policy/org-policy-constraints#:~:text=INSTANCE-,constraints%2Fcompute.vmexternalipaccess,-is
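
For reference, the effective constraint can be inspected from the CLI (a sketch; requires org-policy viewer permissions):

gcloud resource-manager org-policies describe compute.vmExternalIpAccess \
    --project=pangeo-integration-te-3eea --effective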

@yuvipanda (Member)

Looks like an organisational constraint is preventing the cluster from assigning an external IP https://cloud.google.com/resource-manager/docs/organization-policy/org-policy-constraints#:~:text=INSTANCE-,constraints%2Fcompute.vmexternalipaccess,-is

Yeah, I remember hearing about this from some other staff at Columbia. I think this requires @rabernat to intervene now?

@yuvipanda (Member)

@sgibson91 can you open an issue with the error you encountered?

@rabernat I think we need to:

  1. Understand what Columbia's policies are about getting traffic into the cluster. Is it a blanket approval after you get one external IP? Or does each external IP need approval? Cluster design will have to change based on that.
  2. Ask whoever manages cloud policy at Columbia to allow us to get traffic into the cluster, and figure out what the process for that is.

Do you know where we can learn about (1)? My contact has moved on from Columbia, unfortunately, but if we don't make progress via other means I can reach out to him.

@rabernat (Contributor) commented Jul 12, 2021

Ok, since this issue is public, I think I'll just refer them to this. I have emailed my contact at CUIT with a request for assistance.

@rabernat (Contributor)

Question from Parixit that needs a "correct / incorrect" response:

After enabling the API details, the only error that is present is the organizational constraint that is preventing the cluster from assigning an external IP, correct?

@sgibson91 (Member, Author)

Correct

sgibson91 added a commit to sgibson91/infrastructure that referenced this pull request on Aug 2, 2021:

Ordering is important here. sops tries to use the first rule that matches the regex and does not work through the list if it fails.
@yuvipanda (Member) left a comment


This LGTM! We should auto-deploy it (like with #569) from CI, but that doesn't need to block this PR.

@sgibson91 I'd suggest that you:

  1. Do one final deploy to make sure things work ok,
  2. Self-merge this, so the state of the infrastructure matches what is in master.

Alternatively, you can add Continuous Deployment of this first, so the state is maintained automatically.

Excited to get this done!

@sgibson91 (Member, Author)

I had to add a labels attribute to the "user" and "worker" blocks (notebook_nodes and dask_nodes) in the tfvars file to solve the errors below. I just left them empty (sketched after the errors); the actual change is in e821387.

│ Error: Invalid value for input variable

│ on projects/pangeo-hubs.tfvars line 13:
│ 13: notebook_nodes = {
│ 14: "user" : {
│ 15: min : 0,
│ 16: max : 20,
│ 17: machine_type : "n1-highmem-4"
│ 18: },
│ 19: }

│ The given value is not valid for variable "notebook_nodes": element "user": attribute "labels" is required.


│ Error: Invalid value for input variable

│ on projects/pangeo-hubs.tfvars line 21:
│ 21: dask_nodes = {
│ 22: "worker" : {
│ 23: min : 0,
│ 24: max : 100,
│ 25: machine_type : "n1-highmem-4"
│ 26: },
│ 27: }

@sgibson91 merged commit 021da9c into 2i2c-org:master on Aug 4, 2021
@sgibson91 deleted the new-cluster/pangeo-hub branch on August 4, 2021 at 09:39