Table of Contents generated with DocToc
- Working with GCP test infrastructure
- Logs
- Debugging Failed Tests
- Testing Changes to the ProwJobs
- Cleaning up leaked resources
- Integration with K8s Prow Infrastructure.
- Setting up Kubeflow Test Infrastructure
- Setting up Kubeflow Release Clusters For Testing
- Setting up a Kubeflow Repository to Use Prow
The tests store the results of tests in a shared NFS filesystem. To inspect the results you can mount the NFS volume.
To make this easy, we run a stateful set in our test cluster that mounts the same volumes as our Argo workers. This stateful set also uses an environment (GCP credentials, docker image, etc.) that mimics our Argo workers. You can open a shell in a pod of this stateful set to get access to the NFS volume.
kubectl exec -it debug-worker-0 -- /bin/bash
This can be very useful for reproducing test failures.
Logs from the E2E tests are available in a number of places and can be used to troubleshoot test failures.
These should be publicly accessible.
The logs from each step are copied to GCS and made available through Spyglass. The k8s-ci-robot should post a link to the Spyglass UI in the PR. You can also find them as follows:
- Open up the prow jobs dashboard e.g. for kubeflow/kubeflow
- Find your job
- Click on the link under job; this goes to the Gubernator dashboard
- Click on artifacts
- Navigate to artifacts/logs
If these logs aren't available it could indicate a problem running the step that uploads the artifacts to GCS for spyglass. In this case you can use one of the alternative methods listed below.
The argo UI is publicly accessible at http://testing-argo.kubeflow.org/timeline.
- Find and click on the workflow corresponding to your pre/post/periodic job
- Select the workflow tab
- From here you can select a specific step and then see the logs for that step
Since we run our E2E tests on GKE, all logs are persisted in Stackdriver logging.
Viewer access to Stackdriver logs is available by joining one of the following groups
We use the new Stackdriver Kubernetes logging, which means we use the k8s_pod and k8s_container resource types.
Below are some relevant filters:
Get container logs for a specific pod
resource.type="k8s_container"
resource.labels.cluster_name="kubeflow-testing"
resource.labels.pod_name="${POD_NAME}"
Get logs using pod label
resource.type="k8s_container"
resource.labels.cluster_name="kubeflow-testing"
metadata.userLabels.${LABEL_KEY}="${LABEL_VALUE}"
Get events for a pod
resource.type="k8s_pod"
resource.labels.cluster_name="${CLUSTER}"
resource.labels.pod_name="${POD_NAME}"
The Kubeflow docs have some useful gcloud one liners for fetching logs.
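The filters above follow a fixed pattern, so they can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the Kubeflow tooling; the function names `container_log_filter` and `pod_events_filter` are hypothetical, and the only fields used are the ones shown in the filters above.

```python
def container_log_filter(cluster, pod_name=None, labels=None):
    """Return a Stackdriver filter for k8s_container logs in one cluster.

    pod_name and labels are optional; labels maps pod label keys to values.
    """
    clauses = [
        'resource.type="k8s_container"',
        f'resource.labels.cluster_name="{cluster}"',
    ]
    if pod_name:
        clauses.append(f'resource.labels.pod_name="{pod_name}"')
    # One clause per label, in a stable order; each clause goes on its own
    # line, mirroring the layout of the hand-written filters.
    for key, value in sorted((labels or {}).items()):
        clauses.append(f'metadata.userLabels.{key}="{value}"')
    return "\n".join(clauses)


def pod_events_filter(cluster, pod_name):
    """Return a Stackdriver filter for the events of a single pod."""
    return "\n".join([
        'resource.type="k8s_pod"',
        f'resource.labels.cluster_name="{cluster}"',
        f'resource.labels.pod_name="{pod_name}"',
    ])
```

For example, `container_log_filter("kubeflow-testing", pod_name="my-pod")` yields the three-line container filter shown first, with the pod name filled in. The result can be pasted into the Stackdriver UI.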
Our tests are split across three projects
- k8s-prow-builds
  - This is owned by the prow team
  - This is where the prow jobs are defined
- kubeflow-ci
  - This is where the prow jobs run in the test-pods namespace
  - This is where the Argo E2E workflows kicked off by the prow jobs run
  - This is where other Kubeflow test infra (e.g. various cron jobs) runs
- kubeflow-ci-deployment
  - This is the project where E2E tests actually create Kubeflow clusters
We currently have the following levels of access
- ci-viewer-only
  - This is controlled by the group ci-viewer
  - This group basically grants viewer only access to projects kubeflow-ci and kubeflow-ci-deployment
  - This provides access to stackdriver for both projects
  - Folks making regular and continual contributions to Kubeflow and in need of access to debug tests can generally have access
- ci-edit/admin
  - This is controlled by the group ci-team
  - This group grants permissions necessary to administer the infrastructure running in kubeflow-ci and kubeflow-ci-deployment
  - Access to this group is highly restricted since this is critical infrastructure for the project
  - Following standard operating procedures we want to limit the number of folks with direct access to infrastructure
    - Rather than granting more people access we want to develop scalable practices that eliminate the need for granting large numbers of people access (e.g. developing git ops processes)
- example-maintainers
  - This is controlled by the group example-maintainers
  - This group provides more direct access to the Kubeflow clusters running in kubeflow-ci-deployment
  - This group is intended for the folks actively developing and maintaining tests for Kubeflow examples
  - Continuous testing for kubeflow examples should run against regularly updated, auto-deployed clusters in project kubeflow-ci-deployment
    - Example maintainers are granted elevated access to these clusters in order to facilitate development of these tests
If no results show up in Spyglass this means the prow job didn't get far enough to upload any results/logs to GCS.
To debug this you need the pod logs. You can access the pod logs via the build log link for your job in the prow jobs UI
- Pod logs are ephemeral so you need to check shortly after your job runs.
The pod logs are available in StackDriver but only the Google Kubeflow Team has access
- Prow controllers run on a cluster (k8s-prow/prow) owned by the K8s team
- Prow jobs (i.e. pods) run on a build cluster (kubeflow-ci/kubeflow-testing) owned by the Kubeflow team
- The policy for controller logs is owned by K8s, while the policy for job logs is governed by Kubeflow
To access the stackdriver logs
- Open stackdriver for project kubeflow-ci
- Get the pod ID by clicking on the build log in the prow jobs UI
- Filter the logs using
resource.type="container"
resource.labels.pod_id=${POD_ID}
- For example, if the TF serving workflow failed, filter the logs using
resource.type="container"
resource.labels.cluster_name="kubeflow-testing"
labels."container.googleapis.com/namespace_name"=WORKFLOW_NAME
resource.labels.container_name="mnist-cpu"
The Argo UI will surface logs for the pod but only if the pod hasn't been deleted yet by Kubernetes.
Using stackdriver to fetch pod logs is more reliable/durable but requires viewer permissions for Kubeflow's CI infrastructure.
Suppose an Argo workflow fails, you click on the failed step in the Argo UI to get the logs, and you see the error
failed to get container status {"docker" "b84b751b0102b5658080a520c9a5c2655855960c4695cf557c0c1e45999f7429"}:
rpc error: code = Unknown desc = Error: No such container: b84b751b0102b5658080a520c9a5c2655855960c4695cf557c0c1e45999f7429
This error is a red herring; it means the pod is probably gone so Argo couldn't get the logs.
The logs should be in StackDriver but to get them we need to identify the pod.
- Get the workflow spec:
  - Get the workflow YAML using kubectl
    kubectl get wf -o yaml ${WF_NAME} > /tmp/${WF_NAME}.yaml
    - This requires appropriate K8s RBAC permissions
    - You'll need to be added to the Google group [email protected]
      - Create a PR adding yourself to ci-team
    - Add credentials to your $HOME/.kube/config:
      gcloud --project kubeflow-ci container clusters get-credentials kubeflow-testing --zone us-east1-d
  - Get the workflow YAML from Prow artifacts
    - Find your Prow job from https://prow.k8s.io/?repo=kubeflow%2Ftesting.
    - Find the artifacts from the Spyglass link of the Prow job, e.g. https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_testing/360/kubeflow-testing-presubmit/1120174107468500992/.
    - Download ${WF_NAME}.yaml from the GCS artifacts page.
- Search the YAML spec for the pod information for the failed step
  - We need to find information that can be used to fetch logs for the pod from stackdriver
  - Using Pod labels
    - In the workflow spec look at the step metadata to see if it contains labels
      metadata:
        labels:
          BUILD_ID: "1405"
          BUILD_NUMBER: "1405"
          JOB_NAME: kubeflow-examples-presubmit
          JOB_TYPE: presubmit
          PULL_BASE_SHA: 8a26b23e3d35d5993d93e8b9ecae52371598d1cc
          PULL_NUMBER: "522"
          PULL_PULL_SHA: 9aecf80f1c41059cd8ff13d1ca8b9e821dc462bf
          REPO_NAME: examples
          REPO_OWNER: kubeflow
          step_name: tfjob-test
          workflow: kubeflow-examples-presubmit-gis-522-9aecf80-1405-9055
          workflow_template: gis
    - Follow the stackdriver instructions to query for the logs
      - Use labels BUILD_ID and step_name to identify the pod
  - If no labels are specified for the step you can use displayName to match the text in the UI to step status
    kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-2243590372:
      boundaryID: kubeflow-presubmit-kfctl-1810-70210d5-3900-218a
      displayName: kfctl-apply-gcp
      finishedAt: 2018-10-17T05:07:58Z
      id: kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-2243590372
      message: failed with exit code 1
      name: kubeflow-presubmit-kfctl-1810-70210d5-3900-218a.kfctl-apply-gcp
      phase: Failed
      startedAt: 2018-10-17T05:04:20Z
      templateName: kfctl-apply-gcp
      type: Pod
    - id will be the name of the pod.
    - Follow the instructions to get the stackdriver logs for the pod or use the following gcloud command
      gcloud --project=kubeflow-ci logging read --format="table(timestamp, resource.labels.container_name, textPayload)" \
        --freshness=24h \
        --order asc \
        "resource.type=\"k8s_container\" resource.labels.pod_name=\"${POD}\" "
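The displayName lookup described above can also be scripted once the workflow YAML has been parsed into a dictionary (for instance with PyYAML's `yaml.safe_load`, which is an assumption and not something this repository is known to provide). The sketch below is illustrative only: `failed_pods` is a hypothetical helper, and it assumes the per-step entries live under `status.nodes`, keyed by node id as in the excerpt above.

```python
def failed_pods(workflow, display_name=None):
    """Return the sorted ids (pod names) of failed Pod nodes in a workflow.

    workflow is the parsed workflow YAML as a dict. If display_name is
    given, only nodes whose displayName matches it are returned.
    """
    # Assumption: step status entries are stored under status.nodes.
    nodes = (workflow.get("status") or {}).get("nodes") or {}
    return sorted(
        node.get("id", key)
        for key, node in nodes.items()
        if node.get("type") == "Pod"
        and node.get("phase") == "Failed"
        and (display_name is None or node.get("displayName") == display_name)
    )


# Example input, abbreviated from the status excerpt above.
example_workflow = {
    "status": {
        "nodes": {
            "kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-2243590372": {
                "displayName": "kfctl-apply-gcp",
                "id": "kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-2243590372",
                "phase": "Failed",
                "type": "Pod",
            }
        }
    }
}
print(failed_pods(example_workflow, display_name="kfctl-apply-gcp"))
```

Each returned id can be used as the value of `${POD}` in the gcloud command above.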
If an E2E test fails because one of the Kubeflow applications (e.g. the Jupyter web app) isn't reported as deploying successfully we can follow these instructions to debug it.
To debug it we want to look at the K8s events indicating why the K8s deployment failed. In most cases the cluster will already be torn down so we need to look at the kubernetes events associated with that deployment.
- Get the cluster used for Kubeflow
  - In prow look at artifacts and find the YAML spec for the Argo workflow that ran your e2e test
  - Identify the step that deployed Kubeflow
  - Open up Stackdriver logging
  - Use a filter (advanced) like the following to find the log entry getting the credentials for your deployment
    resource.type="k8s_container"
    resource.labels.pod_name="<POD NAME>"
    resource.labels.container_name="main"
    get-credentials
  - The log output should look like the following
    get-credentials kfctl-6742 --zone=us-east1-d --project=kubeflow-ci-deployment
    - The argument kfctl-6742 is the name of the cluster
- You can use the script py/kubeflow/testing/troubleshoot_deployment.py to fetch logs. Alternatively, you can follow the steps below to filter the logs in the Stackdriver UI
  - Use a filter like the following to get the events associated with the deployment or statefulset
    resource.labels.cluster_name="kfctl-6742"
    logName="projects/kubeflow-ci-deployment/logs/events"
    jsonPayload.involvedObject.name="jupyter-web-app"
  - Change the name of the involvedObject and cluster name to match your deployment.
  - If a pod was created the name of the replica set that owns it should be present e.g.
    Scaled up replica set jupyter-web-app-5fcddbf75c to 1
  - You can continue to look at event logs for the replica set to eventually get to the name of a pod and potentially the pod.
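The manual steps above (read the cluster name off the get-credentials log line, build the events filter, then follow the replica set name) can be sketched in Python as follows. This is not how troubleshoot_deployment.py is implemented, which is not shown here; all three function names are hypothetical, and the parsing assumes log lines and event messages shaped like the samples above.

```python
import re


def parse_get_credentials(line):
    """Extract cluster, zone and project from a get-credentials log line.

    Assumes the cluster name directly follows "get-credentials", as in the
    sample log output. Returns None if the line does not match.
    """
    match = re.search(r"get-credentials\s+([^\s-]\S*)", line)
    if not match:
        return None
    info = {"cluster": match.group(1)}
    for flag in ("zone", "project"):
        found = re.search(rf"--{flag}[=\s](\S+)", line)
        if found:
            info[flag] = found.group(1)
    return info


def deployment_events_filter(project, cluster, involved_object):
    """Build the events filter for one deployment or statefulset."""
    return "\n".join([
        f'resource.labels.cluster_name="{cluster}"',
        f'logName="projects/{project}/logs/events"',
        f'jsonPayload.involvedObject.name="{involved_object}"',
    ])


def replica_set_from_event(message):
    """Return the replica set named in a "Scaled up" event, or None."""
    match = re.search(r"Scaled up replica set (\S+) to \d+", message)
    return match.group(1) if match else None
```

Feeding the replica set name back into `deployment_events_filter` as the involved object gives the next filter in the chain.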
Changes to our ProwJob configs in config.yaml should be relatively infrequent since most of the code invoked as part of our tests lives in the repository.
However, in the event we need to make changes here are some instructions for testing them.
Follow Prow's getting started guide to create your own prow cluster.
- Note: the only part you really need is the ProwJob CRD and controller.
Check out kubernetes/test-infra.
git clone https://github.com/kubernetes/test-infra git_k8s-test-infra
Build the mkpj binary
bazel build //prow/cmd/mkpj
Generate the ProwJob Config
./bazel-bin/prow/cmd/mkpj/mkpj --job=$JOB_NAME --config-path=$CONFIG_PATH
- This binary will prompt for needed information like the sha #
- The output will be a ProwJob spec which can be instantiated using kubectl
Create the ProwJob
kubectl create -f ${PROW_JOB_YAML_FILE}
- To rerun the job bump metadata.name and status.startTime
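Bumping the two fields for a rerun can be done by hand in the YAML file, or with a small helper like the one below. This is a sketch under assumptions: `bump_prow_job` is a hypothetical name, the spec is assumed to be already parsed into a dict, and a random UUID is used as the new name only because any unique name should do.

```python
import copy
import uuid
from datetime import datetime, timezone


def bump_prow_job(job, now=None):
    """Return a copy of a ProwJob spec with a fresh name and start time.

    job is the parsed ProwJob spec as a dict; the input is not modified.
    now is an optional timezone-aware datetime, defaulting to current UTC.
    """
    bumped = copy.deepcopy(job)
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Assumption: any unique name is acceptable, so use a random UUID.
    bumped.setdefault("metadata", {})["name"] = str(uuid.uuid4())
    bumped.setdefault("status", {})["startTime"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return bumped
```

The result can be written back to a file and passed to the same kubectl create command.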
To monitor the job open Prow's UI by navigating to the external IP associated with the ingress for your Prow cluster or using kubectl proxy.
Test failures sometimes leave resources (GCP deployments, VMs, GKE clusters) running. The following script, for example, can be used to garbage collect all resources. It can also GC specific resources with different commands.
cd py
python -m kubeflow.testing.cleanup_ci --project kubeflow-ci-deployment all
This script is set up as a cronjob by cd test-infra/cleanup && make hydrate.
We rely on the K8s instance of Prow to actually run our jobs.
Here's a dashboard with the results.
Our jobs should be added to K8s config
Our tests require:
- a K8s cluster
- Argo installed on the cluster
- A shared NFS filesystem
Our prow jobs execute Argo workflows in projects/clusters owned by Kubeflow. We don't use the shared Kubernetes test clusters for this.
- This gives us more control of the resources we want to use e.g. GPUs
This section provides the instructions for setting this up.
Create a GKE cluster
PROJECT=kubeflow-ci
ZONE=us-east1-d
CLUSTER=kubeflow-testing
NAMESPACE=kubeflow-test-infra
gcloud --project=${PROJECT} container clusters create \
--zone=${ZONE} \
--machine-type=n1-standard-8 \
${CLUSTER}
gcloud compute --project=${PROJECT} addresses create argo-ui --global
gcloud services --project=${PROJECT} enable cloudbuild.googleapis.com
gcloud services --project=${PROJECT} enable containerregistry.googleapis.com
gcloud services --project=${PROJECT} enable container.googleapis.com
- The tests need a GCP service account to upload data to GCS for Gubernator
SERVICE_ACCOUNT=kubeflow-testing
gcloud iam service-accounts --project=${PROJECT} create ${SERVICE_ACCOUNT} --display-name "Kubeflow testing account"
# add-iam-policy-binding takes one --role per call, so bind each role separately
for ROLE in roles/container.admin roles/viewer roles/cloudbuild.builds.editor \
    roles/logging.viewer roles/storage.admin roles/compute.instanceAdmin.v1; do
  gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
    --role=${ROLE}
done
- Our tests create K8s resources (e.g. namespaces) which is why we grant it developer permissions.
- Project Viewer (because GCB requires this with gcloud)
- Kubernetes Engine Admin (some tests create GKE clusters)
- Logs viewer (for GCB)
- Compute Instance Admin to create VMs for minikube
- Storage Admin (For GCR)
GCE_DEFAULT=${PROJECT_NUMBER}-compute@developer.gserviceaccount.com
FULL_SERVICE=${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
gcloud --project=${PROJECT} iam service-accounts add-iam-policy-binding \
${GCE_DEFAULT} --member="serviceAccount:${FULL_SERVICE}" \
--role=roles/iam.serviceAccountUser
- Service Account User of the Compute Engine Default Service account (to avoid this error)
Create a secret containing a GCP private key for the service account
KEY_FILE=<path to key>
SECRET_NAME=gcp-credentials
gcloud iam service-accounts keys create ${KEY_FILE} \
--iam-account ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
kubectl create secret generic ${SECRET_NAME} \
--namespace=${NAMESPACE} --from-file=key.json=${KEY_FILE}
Make the service account a cluster admin
kubectl create clusterrolebinding ${SERVICE_ACCOUNT}-admin --clusterrole=cluster-admin \
--user=${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
- The service account is used to deploy Kubeflow, which entails creating various roles, so it needs sufficient RBAC permissions to do so.
Add a clusterrolebinding that uses the numeric id of the service account as a work around for ksonnet/ksonnet#396
NUMERIC_ID=$(gcloud --project=${PROJECT} iam service-accounts describe ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com --format="value(oauth2ClientId)")
kubectl create clusterrolebinding ${SERVICE_ACCOUNT}-numeric-id-admin --clusterrole=cluster-admin \
--user=${NUMERIC_ID}
You need to use a GitHub token with ksonnet otherwise the test quickly runs into GitHub API limits.
TODO(jlewi): We should create a GitHub bot account to use with our tests and then create API tokens for that bot.
You can use the GitHub API to create a token
- The token doesn't need any scopes because it's only accessing public data and is needed only for API metering.
To create the secret run
kubectl create secret generic github-token --namespace=${NAMESPACE} --from-literal=github_token=${GITHUB_TOKEN}
We use GCP Cloud FileStore to create an NFS filesystem.
There is a deployment manager config in the directory test-infra/gcp_configs
The ksonnet app test-infra contains ksonnet configs to deploy the test infrastructure.
First, install the kubeflow package
ks pkg install kubeflow/core
Then change the server ip in test-infra/environments/prow/spec.json to point to your cluster.
You can deploy argo as follows (you don't need to use argo's CLI)
Set up the environment
NFS_SERVER=<Internal GCE IP address of the NFS Server>
ks env add ${ENV}
ks param set --env=${ENV} argo namespace ${NAMESPACE}
ks param set --env=${ENV} debug-worker namespace ${NAMESPACE}
ks param set --env=${ENV} nfs-external namespace ${NAMESPACE}
ks param set --env=${ENV} nfs-external nfsServer ${NFS_SERVER}
In the testing environment (but not release) we also expose the UI
ks param set --env=${ENV} argo exposeUi true
ks apply ${ENV} -c argo
Create the PVs corresponding to external NFS
ks apply ${ENV} -c nfs-external
The e2e test that runs the click-to-deploy app tests deploying Kubeflow to a cluster under project kubeflow-ci-deployment, so it needs to know a client ID and secret for that project. Check out this page and look for the client ID called deployapp-test-client.
kubectl create secret generic --namespace=${NAMESPACE} kubeflow-oauth --from-literal=client_id=${CLIENT_ID} --from-literal=client_secret=${CLIENT_SECRET}
The user or service account deploying the test infrastructure needs sufficient permissions to create the roles that are created as part of deploying the test infrastructure. So you may need to run the following command before using ksonnet to deploy the test infrastructure.
kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=<your account email>
We maintain a pool of Kubeflow clusters corresponding to different releases of Kubeflow. These can be used for
- Running continuous integration of our examples against a particular release
- Manual testing of features in each release
The configs for each deployment are stored in the test-infra directory
The deployments should be named using one of the following patterns
- kf-vX.Y-n??
  - For clusters corresponding to a particular release
- kf-vmaster-n??
  - For clusters corresponding to master
This naming scheme is chosen to allow us to cycle through a fixed set of names e.g.
kf-v0.4-n00
...
kf-v0.4-n04
The reason we want to cycle through names is because the endpoint name for the deployment needs to be manually set in the OAuth credential used for IAP. By cycling through a fixed set of names we can automate redeployment without having to manually configure the OAuth credential.
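As an illustration of cycling through a fixed set of names, here is a minimal Python sketch. It does not describe how create_kf_instance actually picks a name, which is not stated here; `deployment_names` and `next_deployment_name` are hypothetical helpers, and the pool size of 5 is an assumption taken from the n00 to n04 example.

```python
import re


def deployment_names(base_name, pool_size=5):
    """List the fixed pool of names for a base such as kf-v0.4 or kf-vmaster."""
    return [f"{base_name}-n{index:02d}" for index in range(pool_size)]


def next_deployment_name(current, pool_size=5):
    """Return the name that follows current, wrapping back to n00."""
    match = re.fullmatch(r"(?P<base>kf-v[\w.]+)-n(?P<index>\d{2})", current)
    if not match:
        raise ValueError(f"unexpected deployment name: {current}")
    following = (int(match["index"]) + 1) % pool_size
    return f"{match['base']}-n{following:02d}"
```

Because the names wrap around, the set of endpoints registered in the OAuth credential never has to grow.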
- Get kfctl for the desired release
- Run the following command
  python -m kubeflow.testing.create_kf_instance --base_name=<kf-vX.Y|kf-vmaster>
- Create a PR with the resulting config.
- Define ProwJobs (see pull/4951)
  - Add prow jobs to prow/config.yaml
  - Add trigger plugin to prow/plugins.yaml
  - Add test dashboards to testgrid/config.yaml
  - Modify testgrid/cmd/configurator/config_test.go to allow presubmits for the new repo.
- Add the ci-bots team to the repository with write access
  - Write access will allow bots in the team to update status
- Follow instructions for adding a repository to the PR dashboard.
- Add an OWNERS file to your Kubeflow repository. The OWNERS file, like this one, will specify who can review and approve on this repo.
Webhooks for prow should already be configured according to these instructions for the org so you shouldn't need to set hooks per repository.
- Use https://prow.k8s.io/hook as the target
- Get HMAC token from k8s test team