diff --git a/content/docs/about/events.md b/content/docs/about/events.md
index 41b29c5b19..fddeb8b1cc 100644
--- a/content/docs/about/events.md
+++ b/content/docs/about/events.md
@@ -11,13 +11,24 @@ Please raise a [GitHub issue](https://github.com/kubeflow/website/issues/new) if
 * [Data Day Texas, Austin](https://datadaytexas.com/), 26 January, 2019
   - Kubeflow: Portable Machine Learning on Kubernetes: Michelle Casbon
 * [KubeCon, Seattle](https://events.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2018/), 11-13 December, 2018
-  - [Workshop: Kubeflow End-to-End: GitHub Issue Summarization](https://sched.co/GrWE): Amy Unruh, Michelle Casbon
-  - [Natural Language Code Search for GitHub Using Kubeflow](https://sched.co/GrVn): Jeremy Lewi, Hamel Husain
-  - [Eco-Friendly ML: How the Kubeflow Ecosystem Bootstrapped Itself](https://sched.co/GrTc): Peter McKinnon
   - [Deep Dive: Kubeflow BoF](https://sched.co/Ha1X): Jeremy Lewi, David Aronchick
+    - [Slides](https://docs.google.com/presentation/d/1QP-o4O3ygpJ6aVfu6lAm0tMWYhAvdKE9FD6_92DB3EY)
+    - [Video](https://www.youtube.com/watch?v=gbZJ8eSIfJg)
+  - [Eco-Friendly ML: How the Kubeflow Ecosystem Bootstrapped Itself](https://sched.co/GrTc): Peter McKinnon
+    - [Slides](https://docs.google.com/presentation/d/1DUJHiYxz0D6qexBbNGjtRHYi4ERTKUOZ-LvqoHVKS-E)
+    - [Video](https://www.youtube.com/watch?v=EVSfp8HGJXY)
   - [Machine Learning as Code](https://sched.co/GrVh): Jay Smith
+    - [Slides](https://docs.google.com/presentation/d/1XKyf5fAM9KfF4OSnZREoDu-BP8JBtGkuSv72iX7CARY)
+    - [Video](https://www.youtube.com/watch?v=VXrGp5er1ZE)
+  - [Natural Language Code Search for GitHub Using Kubeflow](https://sched.co/GrVn): Jeremy Lewi, Hamel Husain
+    - [Slides](https://drive.google.com/open?id=1jHE61fAqZNgaDrpItk5L_tCzLU0DuL86rCz4yAKz4Ss)
+    - [Video](https://www.youtube.com/watch?v=SF77UBvfTHU)
+  - [Workshop: Kubeflow End-to-End: GitHub Issue Summarization](https://sched.co/GrWE): Amy Unruh, Michelle Casbon
+    - [Codelab](g.co/codelabs/kubecon18)
+    - [Slides](https://docs.google.com/presentation/d/1FFftSbWidin3opCIl4U0HVPvS6xk17izUFrrMR7e5qk)
+    - [Video](https://www.youtube.com/watch?v=UdthJEq8YsA)
 * Women in ML & Data Science, Melbourne, 5 December, 2018
-  - Panel: Juliet Hougland, Michelle Casbon
+  - Panel: Michelle Casbon
 * [YOW!, Melbourne](https://melbourne.yowconference.com.au/), 4-7 December, 2018
   - [Kubeflow Explained: NLP Architectures on Kubernetes](https://melbourne.yowconference.com.au/proposal/?id=6858): Michelle Casbon
 * [YOW!, Brisbane](https://brisbane.yowconference.com.au/), 3-4 December, 2018
@@ -26,9 +37,25 @@ Please raise a [GitHub issue](https://github.com/kubeflow/website/issues/new) if
   - [Kubeflow Explained: NLP Architectures on Kubernetes](https://sydney.yowconference.com.au/proposal/?id=6860): Michelle Casbon
 * [Scale By the Bay, San Francisco](http://scale.bythebay.io/), 15-17 November, 2018
   - [Data Engineering & AI Panel](https://sched.co/Fndz): Michelle Casbon
+    - [Video](https://www.youtube.com/watch?v=sJd9RRmgCH4)
 * [KubeCon, Shanghai](https://www.lfasiallc.com/events/kubecon-cloudnativecon-china-2018/), 13-15 November, 2018
-  - [CI/CD Pipelines & Machine Learning](https://sched.co/FuJo): Jeremy Lewi
+  - [A Tale of Using Kubeflow to Make the Electricity Smarter in China](https://sched.co/FzGn): Julia Han, Xin Zhang
+    - [Slides](https://schd.ws/hosted_files/kccncchina2018english/34/XinZhang_JuliaHan_En.pdf)
+    - [Video](https://www.youtube.com/watch?v=fad1FsfEvNY)
   - [A Year of Democratizing ML With Kubernetes & Kubeflow](https://sched.co/FuLr): David Aronchick, Fei Xue
+    - [Video](https://www.youtube.com/watch?v=oMlddDdJgEg)
+  - [Benchmarking Machine Learning Workloads on Kubeflow](https://sched.co/FuJw): Xinyuan Huang, Ce Gao
+    - [Video](https://www.youtube.com/watch?v=9sLRIBYYUlQ)
+  - [CI/CD Pipelines & Machine Learning](https://sched.co/FuJo): Jeremy Lewi
+    - [Video](https://www.youtube.com/watch?v=EH850bIQVag)
+  - [Kubeflow From the End User's Perspective](https://sched.co/FuJx): Xin Zhang
+    - [Video](https://www.youtube.com/watch?v=x0CKhyoV9aI)
+  - [Kubernetes CI/CD Hacks with MicroK8s and Kubeflow](https://sched.co/FuJc): Land Lu, Zhang Lei Mao
+    - [Video](https://www.youtube.com/watch?v=1SSvS2w5OMQ)
+  - [Machine Learning on Kubernetes BoF](https://sched.co/FuJs): David Aronchick
+    - [Video](https://www.youtube.com/watch?v=0eEAZ7lmLbo)
+  - [Operating Deep Learning Pipelines Anywhere Using Kubeflow](https://sched.co/FuJt): Jörg Schad, Gilbert Song
+    - [Video](https://www.youtube.com/watch?v=63HJgZK27mU)
 * [DevFest, Seattle](https://www.eventbrite.com/e/devfest-seattle-2018-tickets-50408043816), 3 November, 2018
   - Kubeflow End to End: Amy Unruh
 * [Data@Scale, Boston](https://dataatscale2018.splashthat.com/), 25 October, 2018
diff --git a/content/docs/about/kubeflow.md b/content/docs/about/kubeflow.md
index 2a135781be..36b28f052b 100644
--- a/content/docs/about/kubeflow.md
+++ b/content/docs/about/kubeflow.md
@@ -51,20 +51,6 @@ You can choose to deploy your workloads locally or to a cloud environment.
 
 Kubeflow started as an open sourcing of the way Google ran [TensorFlow](https://www.tensorflow.org/) internally, based on a pipeline called [TensorFlow Extended](https://www.tensorflow.org/tfx/). It began as just a simpler way to run TensorFlow jobs on Kubernetes, but has since expanded to be a multi-architecture, multi-cloud framework for running entire machine learning pipelines.
 
-## Workflow
-
-The basic workflow is:
-
-* Download the Kubeflow scripts and configuration files.
-* Customize the configuration.
-* Run the scripts to deploy your containers to your chosen environment.
-
-You adapt the configuration to choose the platforms and services that you want
-to use for each stage of the ML workflow: data preparation, model training,
-prediction serving, and service management.
-
-You can choose to deploy your workloads locally or to a cloud environment.
-
 ## Notebooks
 
 Included in Kubeflow is [JupyterHub](https://jupyterhub.readthedocs.io/en/stable/) to create and manage multi-user interactive Jupyter notebooks. Project Jupyter is a non-profit, open-source project to support interactive data science and scientific computing across all programming languages.
diff --git a/content/docs/guides/components/pytorch.md b/content/docs/guides/components/pytorch.md
index 7c8da38337..2eeaa5f67b 100644
--- a/content/docs/guides/components/pytorch.md
+++ b/content/docs/guides/components/pytorch.md
@@ -40,99 +40,101 @@ ks apply ${ENVIRONMENT} -c pytorch-operator
 
 ## Creating a PyTorch Job
 
-You can create PyTorch Job by defining a PyTorchJob config file. See [distributed MNIST example](https://github.com/kubeflow/pytorch-operator/blob/master/examples/dist-mnist/pytorch_job_mnist.yaml) config file. You may change the config file based on your requirements.
+You can create a PyTorch job by defining a PyTorchJob config file. See the [distributed MNIST example](https://github.com/kubeflow/pytorch-operator/blob/master/examples/tcp-dist/mnist/v1beta1/pytorch_job_mnist.yaml) config file. You may change the config file based on your requirements.
 
 ```
-cat examples/dist-mnist/pytorch_job_mnist.yaml
+cat pytorch_job_mnist.yaml
 ```
 Deploy the PyTorchJob resource to start training:
 
 ```
-kubectl create -f examples/dist-mnist/pytorch_job_mnist.yaml
+kubectl create -f pytorch_job_mnist.yaml
 ```
 You should now be able to see the created pods matching the specified number of replicas.
 
 ```
-kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test
+kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist
 ```
 Training should run for about 10 epochs and takes 5-10 minutes on a cpu cluster. Logs can be inspected to see its training progress.
 
 ```
-PODNAME=$(kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test,task_index=0 -o name)
+PODNAME=$(kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist,pytorch-replica-type=master,pytorch-replica-index=0 -o name)
 kubectl logs -f ${PODNAME}
 ```
 
 ## Monitoring a PyTorch Job
 
 ```
-kubectl get -o yaml pytorchjobs dist-mnist-for-e2e-test
+kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist
 ```
 See the status section to monitor the job status. Here is sample output when the job is successfully completed.
 
 ```
-apiVersion: v1
-items:
-- apiVersion: kubeflow.org/v1alpha1
-  kind: PyTorchJob
-  metadata:
-    clusterName: ""
-    creationTimestamp: 2018-06-22T08:16:14Z
-    generation: 1
-    name: dist-mnist-for-e2e-test
-    namespace: default
-    resourceVersion: "3276193"
-    selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/pytorchjobs/dist-mnist-for-e2e-test
-    uid: 87772d3b-75f4-11e8-bdd9-42010aa00072
-  spec:
-    RuntimeId: kmma
-    pytorchImage: pytorch/pytorch:v0.2
-    replicaSpecs:
-    - masterPort: 23456
-      replicaType: MASTER
+apiVersion: kubeflow.org/v1beta1
+kind: PyTorchJob
+metadata:
+  clusterName: ""
+  creationTimestamp: 2018-12-16T21:39:09Z
+  generation: 1
+  name: pytorch-tcp-dist-mnist
+  namespace: default
+  resourceVersion: "15532"
+  selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
+  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
+spec:
+  cleanPodPolicy: None
+  pytorchReplicaSpecs:
+    Master:
       replicas: 1
+      restartPolicy: OnFailure
       template:
         metadata:
           creationTimestamp: null
         spec:
           containers:
           - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
-            imagePullPolicy: IfNotPresent
             name: pytorch
+            ports:
+            - containerPort: 23456
+              name: pytorchjob-port
             resources: {}
-          restartPolicy: OnFailure
-    - masterPort: 23456
-      replicaType: WORKER
+    Worker:
       replicas: 3
+      restartPolicy: OnFailure
       template:
         metadata:
           creationTimestamp: null
         spec:
           containers:
           - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
-            imagePullPolicy: IfNotPresent
             name: pytorch
+            ports:
+            - containerPort: 23456
+              name: pytorchjob-port
             resources: {}
-          restartPolicy: OnFailure
-    terminationPolicy:
-      master:
-        replicaName: MASTER
-        replicaRank: 0
-  status:
-    phase: Done
-    reason: ""
-    replicaStatuses:
-    - ReplicasStates:
-        Succeeded: 1
-      replica_type: MASTER
-      state: Succeeded
-    - ReplicasStates:
-        Running: 1
-        Succeeded: 2
-      replica_type: WORKER
-      state: Running
-    state: Succeeded
-kind: List
-metadata:
-  resourceVersion: ""
-  selfLink: ""
+status:
+  completionTime: 2018-12-16T21:43:27Z
+  conditions:
+  - lastTransitionTime: 2018-12-16T21:39:09Z
+    lastUpdateTime: 2018-12-16T21:39:09Z
+    message: PyTorchJob pytorch-tcp-dist-mnist is created.
+    reason: PyTorchJobCreated
+    status: "True"
+    type: Created
+  - lastTransitionTime: 2018-12-16T21:39:09Z
+    lastUpdateTime: 2018-12-16T21:40:45Z
+    message: PyTorchJob pytorch-tcp-dist-mnist is running.
+    reason: PyTorchJobRunning
+    status: "False"
+    type: Running
+  - lastTransitionTime: 2018-12-16T21:39:09Z
+    lastUpdateTime: 2018-12-16T21:43:27Z
+    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
+    reason: PyTorchJobSucceeded
+    status: "True"
+    type: Succeeded
+  replicaStatuses:
+    Master: {}
+    Worker: {}
+  startTime: 2018-12-16T21:40:45Z
 ```
diff --git a/content/docs/guides/components/seldon.md b/content/docs/guides/components/seldon.md
index afc8834ba5..0592de5d22 100644
--- a/content/docs/guides/components/seldon.md
+++ b/content/docs/guides/components/seldon.md
@@ -25,6 +25,12 @@ ks param set seldon seldonVersion 0.1.8
 ks generate seldon seldon
 ```
 
+Deploy the Seldon cluster manager:
+
+```
+ks apply ${KF_ENV} -c seldon
+```
+
 ### Seldon Deployment Graphs
 
 Seldon allows complex runtime graphs for model inference to be deployed. Some example prototypes have been provided to help you get started. Follow the [Seldon docs](https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/readme.md) to wrap your model code into an image that can be managed by Seldon. In the examples below we will use a model image ```seldonio/mock_classifier``` ; replace this with your actual model image. You will also need to choose between the v1alpha2 and v1alpha1 prototype examples depending on which version of Seldon you generated above. The following prototypes are available:
diff --git a/content/docs/guides/components/tfserving_new.md b/content/docs/guides/components/tfserving_new.md
index 4e71f497cd..8bea8e4fec 100644
--- a/content/docs/guides/components/tfserving_new.md
+++ b/content/docs/guides/components/tfserving_new.md
@@ -14,7 +14,7 @@ Generate the service(model) component
 
 ```
 ks generate tf-serving-service mnist-service
-ks param set mnist-service modelName mnist
+ks param set mnist-service modelName mnist // match your deployment model name
 ks param set mnist-service trafficRule v1:100 // optional, it's the default value
 ```
 
@@ -53,10 +53,10 @@ See [doc](https://cloud.google.com/docs/authentication/) for more detail.
 To use S3, generate a different prototype
 
 ```
-ks generate tf-serving-aws ${MODEL_COMPONENT} --name=${MODEL_NAME}
+ks generate tf-serving-deployment-aws ${MODEL_COMPONENT} --name=${MODEL_NAME}
 ```
 
-First you need to create secret that will contain access credentials.
+First you need to create a secret that will contain access credentials. Use base64 to encode your credentials and check the details in the Kubernetes guide to [creating a secret manually](https://kubernetes.io/docs/concepts/configuration/secret/#creating-a-secret-manually)
 ```
 apiVersion: v1
 metadata:
@@ -71,8 +71,8 @@ Enable S3, set url and point to correct Secret
 
 ```
 MODEL_PATH=s3://kubeflow-models/inception
-ks param set ${MODEL_COMPONENT} modelPath ${MODEL_PATH}
-ks param set ${MODEL_COMPONENT} s3Enable True
+ks param set ${MODEL_COMPONENT} modelBasePath ${MODEL_PATH}
+ks param set ${MODEL_COMPONENT} s3Enable true
 ks param set ${MODEL_COMPONENT} s3SecretName secretname
 ```
 
@@ -82,7 +82,7 @@ Optionally you can also override default parameters of S3
 # S3 region
 ks param set ${MODEL_COMPONENT} s3AwsRegion us-west-1
 
-# true Whether or not to use https for S3 connections
+# Whether or not to use https for S3 connections
 ks param set ${MODEL_COMPONENT} s3UseHttps true
 
 # Whether or not to verify https certificates for S3 connections
diff --git a/content/docs/guides/components/tftraining.md b/content/docs/guides/components/tftraining.md
index b9625044a9..f7773ed2a3 100644
--- a/content/docs/guides/components/tftraining.md
+++ b/content/docs/guides/components/tftraining.md
@@ -160,7 +160,18 @@ VERSION=v0.2-branch
 
 ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
 ks pkg install kubeflow-git/examples
+```
+
+Choose a tf-job prototype from the following list of available prototypes to match the CRD you're using:
 
+> Type `ks prototype list` to list all available prototypes
+* `io.ksonnet.pkg.tf-job-operator` - A TensorFlow job operator.
+* `io.ksonnet.pkg.tf-job-simple` - A simple TFJob to run CNN benchmark
+* `io.ksonnet.pkg.tf-job-simple-v1alpha1` - A simple TFJob to run CNN benchmark
+* `io.ksonnet.pkg.tf-job-simple-v1beta1` - A simple TFJob to run CNN benchmark
+
+Run the `generate` command:
+```
 ks generate tf-job-simple ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
 ```
 
@@ -240,6 +251,7 @@ To use GPUs your cluster must be configured to use GPUs.
 * For more information:
   * [K8s Instructions For Scheduling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
   * [GKE Instructions](https://cloud.google.com/kubernetes-engine/docs/concepts/gpus)
+  * [EKS Instructions](https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html)
 
 To attach GPUs specify the GPU resource on the container in the replicas that should contain the GPUs; for example.
 
@@ -674,4 +686,4 @@ Events:
 in the previous section.
 
 ## More information
-* See how to [run a job with gang-scheduling](/docs/guides/job-scheduling).
\ No newline at end of file
+* See how to [run a job with gang-scheduling](/docs/guides/job-scheduling).