diff --git a/content/en/docs/gke/anthos.md b/content/en/docs/gke/anthos.md
index 4673637da1..ecb63d7f1f 100644
--- a/content/en/docs/gke/anthos.md
+++ b/content/en/docs/gke/anthos.md
@@ -16,9 +16,8 @@ open source technologies, including Kubernetes, Istio, and Knative.
 Using Anthos, you can create a consistent setup across your on-premises and cloud
 environments, helping you to automate policy and security at scale.
 
-If you're interested in running Kubeflow on Anthos GKE, email the Kubeflow team
-at
-[google-kubeflow-support@google.com](mailto:google-kubeflow-support@google.com).
+Kubeflow on GKE On Prem is a work in progress. To track progress, you can subscribe
+to the GitHub issue [kubeflow/gcp-blueprints#138](https://github.com/kubeflow/gcp-blueprints/issues/138).
 
 ## Next steps
 
diff --git a/content/en/docs/gke/private-clusters.md b/content/en/docs/gke/private-clusters.md
index 6b42ed1c96..b5d0e2036f 100644
--- a/content/en/docs/gke/private-clusters.md
+++ b/content/en/docs/gke/private-clusters.md
@@ -1,301 +1,241 @@
 +++
 title = "Securing Your Clusters"
-description = "How to secure Kubeflow clusters using VPC service controls and private GKE"
+description = "How to secure Kubeflow clusters using private GKE"
 weight = 70
 +++
 
-{{% alert title="Out of date" color="warning" %}}
-This guide contains outdated information pertaining to Kubeflow 1.0. This guide
-needs to be updated for Kubeflow 1.1.
-{{% /alert %}}
-
-{{% alert title="Alpha" color="warning" %}}
-This feature is currently in **alpha** release status with limited support. The
-Kubeflow team is interested in any feedback you may have, in particular with
-regards to usability of the feature. Note the following issues already reported:
-
-* [Documentation on how to use Kubeflow with private GKE and VPC service controls](https://github.com/kubeflow/website/issues/1705)
-* [Replicating Docker images to private Container Registry](https://github.com/kubeflow/kubeflow/issues/3210)
-* [Installing Istio for Kubeflow on private GKE](https://github.com/kubeflow/kubeflow/issues/3650)
-* [Profile-controller crashes on GKE private cluster](https://github.com/kubeflow/kubeflow/issues/4661)
-* [kfctl should work with private GKE without public endpoint](https://github.com/kubeflow/kfctl/issues/158)
-{{% /alert %}}
-
-This guide describes how to secure Kubeflow using [VPC Service Controls](https://cloud.google.com/vpc-service-controls/docs/) and private GKE.
-
-Together these two features signficantly increase security
-and mitigate the risk of data exfiltration.
-
- * VPC Service Controls allow you to define a perimeter around
-   Google Cloud Platform (GCP) services.
-
-   Kubeflow uses VPC Service Controls to prevent applications
-   running on GKE from writing data to GCP resources outside
-   the perimeter.
- * Private GKE removes public IP addresses from GKE nodes making
-   them inaccessible from the public internet.
+These instructions explain how to deploy Kubeflow using private GKE.
 
-   Kubeflow uses IAP to make Kubeflow web apps accessible
-   from your browser.
+1. Follow the [blueprint instructions](../deploy/management-setup/) to set up a management cluster.
 
-VPC Service Controls allow you to restrict which Google services are accessible from your
-GKE/Kubeflow clusters. This is an important part of security and in particular
-mitigating the risks of data exfiltration.
+1. As a workaround for Issue
+   [kubeflow/gcp-blueprints#32](https://github.com/kubeflow/gcp-blueprints/issues/32)
+   (in CNRM 1.9.1, the [CustomResourceDefinition
+   (CRD)](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions)
+   for container cluster is missing `ipAllocationPolicy` fields needed to create
+   a private GKE cluster), modify the container cluster CRD schema in your
+   management cluster to include the missing fields.
 
-For more information refer to the [VPC Service Control Docs](https://cloud.google.com/vpc-service-controls/docs/overview).
+   * Check Issue [kubeflow/gcp-blueprints#32](https://github.com/kubeflow/gcp-blueprints/issues/32)
+     to find out if it has been resolved in later versions of CNRM. If the issue hasn't been resolved,
+     you can follow the instructions in the issue to work around the problem.
 
-Creating a [private Kubernetes Engine cluster](https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters)
-means the Kubernetes Engine nodes won't have public IP addresses. This can improve security by blocking unwanted outbound/inbound
-access to nodes. Removing IP addresses means external services (such as GitHub, PyPi, and DockerHub) won't be accessible
-from the nodes. Google services (such as BigQuery and Cloud Storage) are still accessible.
+1. Fetch the blueprint by running this command:
 
-Importantly this means you can continue to use your [Google Container Registry (GCR)](https://cloud.google.com/container-registry/docs/) to host your Docker images. Other Docker registries (for example, DockerHub) will not be accessible. If you need to use Docker images
-hosted outside GCR you can use the scripts provided by Kubeflow to mirror them to your GCR registry.
+   ```
+   kpt pkg get https://github.com/kubeflow/gcp-blueprints.git/kubeflow@master ./${PKGDIR}
+   ```
+   * This will create the directory `${PKGDIR}` with the blueprint.
 
-## Before you start
+1. Change to the Kubeflow directory:
 
-Before installing Kubeflow ensure you have installed the following tools:
-
- * [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
- * [gcloud](https://cloud.google.com/sdk/)
+   ```
+   cd ${PKGDIR}
+   ```
+1. Fetch Kubeflow manifests:
 
-You will need to know your gcloud organization ID and project number; you can get them via gcloud.
+   ```
+   make get-pkg
+   ```
 
-```
-export PROJECT=
-export ORGANIZATION_NAME=
-export ORGANIZATION=$(gcloud organizations list --filter=DISPLAY_NAME=${ORGANIZATION_NAME} --format='value(name)')
-export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT} --format='value(projectNumber)')
-```
+1. Add the private GKE patches to your kustomization:
 
- * Projects are identified by names, IDs, and numbers. For more info, see [Identifying projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).
+   1. Open `instance/gcp_config`
+   1. In `patchesStrategicMerge` add
 
-## Enable VPC Service Controls In Your Project
+      ```
+      - ../../upstream/manifests/gcp/v2/privateGKE/cluster-private-patch.yaml
+      ```
+   1. In `resources` add
 
-1. Enable VPC service controls:
+      ```
+      - ../../upstream/manifests/gcp/v2/privateGKE/
+      ```
 
-   ```
-   export PROJECT=
-   gcloud services enable accesscontextmanager.googleapis.com \
-   cloudresourcemanager.googleapis.com \
-   dns.googleapis.com --project=${PROJECT}
-   ```
-1. Check if you have an access policy object already created:
+   * **Note**: Do not use `kustomize edit` to perform the above actions until
+     [kubernetes-sigs/kustomize#2310](https://github.com/kubernetes-sigs/kustomize/issues/2310) is fixed.
 
-   ```
-   gcloud beta access-context-manager policies list \
-   --organization=${ORGANIZATION}
-   ```
+1. Open the `Makefile` and edit the `set-values` rule to invoke `kpt cfg set` with the desired values for
+   your deployment.
 
-   * An [access policy](https://cloud.google.com/vpc-service-controls/docs/overview#terminology) is a GCP resource object that defines service perimeters. There can be only one access policy object in an organization, and it is a child of the Organization resource.
+   * Change `kpt cfg set ./instance gke.private false` to `kpt cfg set ./instance gke.private true`.
+   * You need to set the region, location, and zone because the deployment is a mix of zonal and regional resources, plus some that could be either.
 
+### Deploy Kubeflow
 
-1. If you don't have an access policy object, create one:
 
-   ```
-   gcloud beta access-context-manager policies create \
-   --title "default" --organization=${ORGANIZATION}
-   ```
+1. Configure the setters:
 
-1. Save the Access Policy Object ID as an environment variable so that it can be used in subsequent commands:
+   ```
+   make set-values
+   ```
 
-   ```
-   export POLICYID=$(gcloud beta access-context-manager policies list --organization=${ORGANIZATION} --limit=1 --format='value(name)')
-   ```
-1. Create a service perimeter:
+1. Set environment variables with the OAuth Client ID and Secret for IAP:
 
-   ```
-   gcloud beta access-context-manager perimeters create KubeflowZone \
-   --title="Kubeflow Zone" --resources=projects/${PROJECT_NUMBER} \
-   --restricted-services=bigquery.googleapis.com,containerregistry.googleapis.com,storage.googleapis.com \
-   --project=${PROJECT} --policy=${POLICYID}
-   ```
+   ```
+   export CLIENT_ID=
+   export CLIENT_SECRET=
+   ```
 
-   * Here we have created a service perimeter with the name KubeflowZone.
+1. Deploy Kubeflow:
 
-   * The perimeter is created in PROJECT_NUMBER and restricts access to GCS (storage.googleapis.com), BigQuery (bigquery.googleapis.com), and GCR (containerregistry.googleapis.com).
+   ```
+   make apply
+   ```
 
-   * Placing GCS (Google Cloud Storage) and BigQuery in the perimeter means that access to GCS and BigQuery
-     resources owned by this project is now restricted. By default, access from outside
-     the perimeter will be blocked
+   * If `make apply` fails waiting for the container cluster to become ready, you can simply edit the Makefile and comment out the line
 
-   * More than one project can be added to the same perimeter
+     ```
+     kubectl --context=$(MGMTCTXT) wait --for=condition=Ready --timeout=600s containercluster $(NAME)
+     ```
 
-1. Create an access level to allow Google Container Builder to access resources inside the perimiter:
+   * Then rerun `make apply`.
 
-   * Create a members.yaml file with the following contents
+1. The cloud endpoints controller doesn't work with private GKE
+   ([kubeflow/gcp-blueprints#36](https://github.com/kubeflow/gcp-blueprints/issues/36)). As a workaround,
+   you can run `kfctl` locally to create the endpoint:
 
-     ```
-     - members:
-       - serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com
-       - user:
-     ```
+   ```
+   kfctl apply -f .build/iap-ingress/ctl.isla.solutions_v1_cloudendpoint_${KFNAME}.yaml
+   ```
 
-   * Google Container Builder is used to mirror Kubeflow images into the perimeter
-   * Adding your email allows you to access the GCP services
-     inside the perimeter from outside the cluster
+## Architectural notes
 
-     * This is convenient for building and pushing images and data
-       from your local machine.
+* The reference architecture uses [Cloud NAT](https://cloud.google.com/nat/docs/overview) to allow outbound
+  internet access from nodes even though they don't have public IPs.
 
-   * For more information refer to the [docs](https://cloud.google.com/access-context-manager/docs/create-access-level#members-example).
+  * Outbound traffic can be restricted using firewall rules.
 
-1. Create the access level:
+  * Outbound internet access is needed to download the JWKs (JSON Web Keys) used to verify the JWTs attached by IAP.
 
-   ```
-   gcloud beta access-context-manager levels create kubeflow \
-   --basic-level-spec=members.yaml \
-   --policy=${POLICYID} \
-   --title="Kubeflow ${PROJECT}"
-   ```
+  * If you want to completely disable all outbound internet access, you will have to find some alternative solution
+    to keep the JWKs in sync with your Istio policy.
 
-   * The name for the level can't have any hyphens
-1. Bind Access Level to a Service Perimeter:
+## Troubleshooting
 
-   ```
-   gcloud beta access-context-manager perimeters update KubeflowZone \
-   --add-access-levels=kubeflow \
-   --policy=${POLICYID}
-   ```
+* The cluster is stuck in the provisioning state.
 
-## Set up container registry for GKE private clusters:
+  * Use the UI or gcloud to figure out what state the cluster is stuck in.
+  * If you use gcloud, you need to look at the operation; for example:
 
-Follow the step belows to configure your GCR registry to be accessible from your secured clusters.
-For more info see [instructions](https://cloud.google.com/vpc-service-controls/docs/set-up-gke).
+    1. Find the operations:
 
+       ```
+       gcloud --project=${PROJECT} container operations list
+       ```
 
-1. Create a managed private zone
+    1. Get the operation details:
 
-   ```
-   export ZONE_NAME=kubeflow
-   export NETWORK=
-   gcloud beta dns managed-zones create ${ZONE_NAME} \
-   --visibility=private \
-   --networks=https://www.googleapis.com/compute/v1/projects/${PROJECT}/global/networks/${NETWORK} \
-   --description="Kubeflow DNS" \
-   --dns-name=gcr.io \
-   --project=${PROJECT}
-   ```
+       ```
+       gcloud --project=${PROJECT} container operations describe --region=${REGION} ${OPERATION}
+       ```
 
-1. Start a transaction
+* Cluster health checks are failing.
 
-   ```
-   gcloud dns record-sets transaction start \
-   --zone=${ZONE_NAME} \
-   --project=${PROJECT}
-   ```
+  * This is usually because the firewall rules allowing the GKE health checks are not configured correctly.
 
-1. Add a CNAME record for \*.gcr.io
+  * A good place to start is verifying that they were created correctly:
 
-   ```
-   gcloud dns record-sets transaction add \
-   --name=*.gcr.io. \
-   --type=CNAME gcr.io. \
-   --zone=${ZONE_NAME} \
-   --ttl=300 \
-   --project=${PROJECT}
-   ```
+    ```
+    kubectl --context=${MGMTCTXT} describe computefirewall
+    ```
 
-1. Add an A record for the restricted VIP
+  * Turn on firewall rule logging to see what traffic is being blocked.
-   ```
-   gcloud dns record-sets transaction add \
-   --name=gcr.io. \
-   --type=A 199.36.153.4 199.36.153.5 199.36.153.6 199.36.153.7 \
-   --zone=${ZONE_NAME} \
-   --ttl=300 \
-   --project=${PROJECT}
-   ```
+    ```
+    kpt cfg set ./upstream/manifests/gcp/v2/privateGKE/ log-firewalls true
+    make apply
+    ```
 
-1. Commit the transaction
+  * To look for traffic blocked by firewall rules in Stackdriver, use a filter like the following:
 
-   ```
-   gcloud dns record-sets transaction execute \
-   --zone=${ZONE_NAME} \
-   --project=${PROJECT}
+    ```
+    logName: "projects/${PROJECT}/logs/compute.googleapis.com%2Ffirewall"
+    jsonPayload.disposition = "DENIED"
     ```
 
-## Mirror Kubeflow Application Images
+  * **Logging must be enabled** on your firewall rules. You can enable it by using a kpt setter:
 
-Since private GKE can only access gcr.io, we need to mirror all images outside gcr.io for Kubeflow applications. We will use the `kfctl` tool to accomplish this.
+    ```
+    kpt cfg set ./upstream/manifests/gcp/v2/privateGKE/ log-firewalls true
+    ```
+  * Change `${PROJECT}` in the filter to your project ID.
 
-1. Set your user credentials. You only need to run this command once:
-
-   ```
-   gcloud auth application-default login
-   ```
+  * Then look at the `jsonPayload.connection` field; it shows the source and destination IPs.
+  * Based on the IPs, try to figure out where the traffic is coming from and going to (e.g. node to master), and
+    then match it to the appropriate firewall rules.
 
-1. Inside your `${KFAPP}` directory create a local configuration file `mirror.yaml` based on this [template](https://github.com/kubeflow/manifests/blob/master/experimental/mirror-images/gcp_template.yaml)
+  * For example:
 
-   1. Change destination to your project gcr registry.
+    ```
+    connection: {
+      dest_ip: "172.16.0.34"
+      dest_port: 443
+      protocol: 6
+      src_ip: "10.10.10.31"
+      src_port: 60556
+    }
+    disposition: "DENIED"
+    ```
 
-1. Generate pipeline files to mirror images by running
-
-   ```
-   cd ${KFAPP}
-   ./kfctl alpha mirror build mirror.yaml -V -o pipeline.yaml --gcb
-   ```
+  * The destination IP in this case is a GKE master, so the firewall rules are not correctly configured to allow
+    traffic to the master.
 
-   * If you want to use Tekton rather than Google Cloud Build(GCB) drop `--gcb` to emit a Tekton pipeline
-   * The instructions below assume you are using GCB
-1. Edit the couldbuild.yaml file
+* A common cause of networking-related issues is that some of the network resources (e.g. the Network, Routes, Firewall Rules, etc.) don't get created.
 
-   1. In the `images` section add
+  * This could be because a reference is incorrect (e.g. firewall rules reference the wrong network).
 
-      ```
-      - //docker.io/istio/proxy_init:1.1.6
-      ```
-
-      * Replace `/` with your registry
+  * You can double-check resources by running `kubectl describe` and looking for errors.
 
-   1. Under `steps` section add
+  * [kubeflow/gcp-blueprints#38](https://github.com/kubeflow/gcp-blueprints/issues/38) is tracking
+    tools to automate this.
 
-      ```
-      - args:
-        - build
-        - -t
-        - //docker.io/istio/proxy_init:1.1.6
-        - --build-arg=INPUT_IMAGE=docker.io/istio/proxy_init:1.1.6
-        - .
-        name: gcr.io/cloud-builders/docker
-        waitFor:
-        - '-'
-      ```
+* Google Container Registry (GCR) images can't be pulled.
 
-   1. Remove the mirroring of cos-nvidia-installer:fixed image. You don’t need it to be replicated because this image is privately available through GKE internal repo.
+  * This likely indicates an issue with access to private GCR. This could be because of:
 
-   1. Remove the images from the `images` section
-   1. 
Remove it from the `steps` section
+    * DNS configurations: Check that the `DNSRecordSet` and `DNSManagedZone` CNRM resources are in a ready state.
+    * Routes: Make sure any default route to the internet has a larger priority value
+      than any routes to private GCP APIs, so that the private routes match first.
 
-1. Create a cloud build job to do the mirroring
+    * If image pull errors show IP addresses and not the `restricted.googleapis.com` VIP, then you have
+      an issue with networking.
 
-   ```
-   gcloud builds submit --async gs://kubeflow-examples/image-replicate/replicate-context.tar.gz --project --config cloudbuild.yaml
-   ```
+    * Firewall rules
 
-1. Update your manifests to use the mirror'd images
+* Access to allowed (non-Google) sites is blocked.
 
-   ```
-   kfctl alpha mirror overwrite -i pipeline.yaml
-   ```
+  * The configuration uses Cloud NAT to allow selective access to sites.
 
-1. Edit file “kustomize/istio-install/base/istio-noauth.yaml”:
+  * In addition to allowing IAP, this allows sites like GitHub to be accessed.
 
-   1. Replace `docker.io/istio/proxy_init:1.16` to `gcr.io//docker.io/istio/proxy_init:1.16`
-   1. Replace `docker.io/istio/proxyv2:1.1.6` to `gcr.io//docker.io/istio/proxyv2:1.1.6`
+  * In order for Cloud NAT to work you need:
 
-## Deploy Kubeflow with Private GKE
+    * A default route to the internet
+    * Firewall rules to allow egress traffic to allowed sites
 
-{{% alert title="Coming Soon" color="warning" %}}
-You can follow the issue: [Documentation on how to use Kubeflow with private GKE and VPC service controls](https://github.com/kubeflow/website/issues/1705)
-{{% /alert %}}
+    * These rules need to be a higher priority than the deny-all egress firewall rules.
+
+### Kubernetes Webhooks are blocked by firewall rules
+
+A common failure mode is that webhooks for custom resources are blocked by default firewall rules.
+As explained in the [GKE docs](https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules), only connections from the master to ports 443 and 10250
+are allowed by default. If you have a webhook serving on a different port,
+you will need to add an explicit ingress firewall rule to allow that port to be accessed
+(an example rule is sketched at the end of this page).
+
+These errors usually manifest as failures to create custom resources that depend on webhooks. An example
+error is:
+
+```
+Error from server (InternalError): error when creating ".build/kubeflow-apps/cert-manager.io_v1alpha2_certificate_admission-webhook-cert.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request
+```
 
 ## Next steps
 
@@ -303,5 +243,5 @@
 
 * Learn more about [VPC Service Controls](https://cloud.google.com/vpc-service-controls/docs/)
 * See how to [delete](/docs/gke/deploy/delete-cli) your Kubeflow deployment using the CLI.
-* [Troubleshoot](/docs/gke/troubleshooting-gke) any issues you may
+* [Troubleshoot](/docs/gke/troubleshooting-gke) any GKE issues you may
   find.
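+
+For reference, here is a minimal sketch of such an ingress rule created with `gcloud`
+(see the "Kubernetes Webhooks are blocked by firewall rules" section above). The network
+name (`kf-private`), node target tag (`kf-nodes`), master range (`172.16.0.32/28`), and
+webhook port (`8443`) are example values only, not values defined by the blueprint;
+substitute the values used by your deployment.
+
+```
+# Example only: allow the GKE master range to reach a webhook served on port 8443 on the nodes.
+# Replace the network, target tag, source range, and port with your own values.
+gcloud compute firewall-rules create allow-master-to-webhook \
+  --project=${PROJECT} \
+  --network=kf-private \
+  --direction=INGRESS \
+  --action=ALLOW \
+  --rules=tcp:8443 \
+  --source-ranges=172.16.0.32/28 \
+  --target-tags=kf-nodes
+```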
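+
+Similarly, to check whether a GCR image pull failure is a DNS problem (see the
+troubleshooting section above), you can look at what `gcr.io` resolves to from inside the
+cluster. This is an illustrative check, not part of the blueprint: it assumes a kubectl
+context for the Kubeflow cluster (shown here as the placeholder `${KFCTXT}`) and an image
+containing `nslookup` that your nodes are able to pull. With private DNS configured
+correctly, `gcr.io` should resolve to the `restricted.googleapis.com` range
+(199.36.153.4 to 199.36.153.7) rather than to public addresses.
+
+```
+# Example only: run a throwaway pod and check what gcr.io resolves to.
+# ${KFCTXT} and the busybox image are placeholder values.
+kubectl --context=${KFCTXT} run dns-check --rm -it --restart=Never \
+  --image=gcr.io/google-containers/busybox -- nslookup gcr.io
+```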