Merge branch 'master' of https://github.com/kubeflow/website into docsy
sarahmaddox committed Dec 22, 2018
2 parents 1df13a8 + 4f706ac commit 1c9f27f
Showing 6 changed files with 112 additions and 79 deletions.
37 changes: 32 additions & 5 deletions content/docs/about/events.md
@@ -11,13 +11,24 @@ Please raise a [GitHub issue](https://github.com/kubeflow/website/issues/new) if
* [Data Day Texas, Austin](https://datadaytexas.com/), 26 January, 2019
- Kubeflow: Portable Machine Learning on Kubernetes: Michelle Casbon
* [KubeCon, Seattle](https://events.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2018/), 11-13 December, 2018
- [Workshop: Kubeflow End-to-End: GitHub Issue Summarization](https://sched.co/GrWE): Amy Unruh, Michelle Casbon
- [Natural Language Code Search for GitHub Using Kubeflow](https://sched.co/GrVn): Jeremy Lewi, Hamel Husain
- [Eco-Friendly ML: How the Kubeflow Ecosystem Bootstrapped Itself](https://sched.co/GrTc): Peter McKinnon
- [Deep Dive: Kubeflow BoF](https://sched.co/Ha1X): Jeremy Lewi, David Aronchick
- [Slides](https://docs.google.com/presentation/d/1QP-o4O3ygpJ6aVfu6lAm0tMWYhAvdKE9FD6_92DB3EY)
- [Video](https://www.youtube.com/watch?v=gbZJ8eSIfJg)
- [Eco-Friendly ML: How the Kubeflow Ecosystem Bootstrapped Itself](https://sched.co/GrTc): Peter McKinnon
- [Slides](https://docs.google.com/presentation/d/1DUJHiYxz0D6qexBbNGjtRHYi4ERTKUOZ-LvqoHVKS-E)
- [Video](https://www.youtube.com/watch?v=EVSfp8HGJXY)
- [Machine Learning as Code](https://sched.co/GrVh): Jay Smith
- [Slides](https://docs.google.com/presentation/d/1XKyf5fAM9KfF4OSnZREoDu-BP8JBtGkuSv72iX7CARY)
- [Video](https://www.youtube.com/watch?v=VXrGp5er1ZE)
- [Natural Language Code Search for GitHub Using Kubeflow](https://sched.co/GrVn): Jeremy Lewi, Hamel Husain
- [Slides](https://drive.google.com/open?id=1jHE61fAqZNgaDrpItk5L_tCzLU0DuL86rCz4yAKz4Ss)
- [Video](https://www.youtube.com/watch?v=SF77UBvfTHU)
- [Workshop: Kubeflow End-to-End: GitHub Issue Summarization](https://sched.co/GrWE): Amy Unruh, Michelle Casbon
- [Codelab](g.co/codelabs/kubecon18)
- [Slides](https://docs.google.com/presentation/d/1FFftSbWidin3opCIl4U0HVPvS6xk17izUFrrMR7e5qk)
- [Video](https://www.youtube.com/watch?v=UdthJEq8YsA)
* Women in ML & Data Science, Melbourne, 5 December, 2018
- Panel: Juliet Hougland, Michelle Casbon
- Panel: Michelle Casbon
* [YOW!, Melbourne](https://melbourne.yowconference.com.au/), 4-7 December, 2018
- [Kubeflow Explained: NLP Architectures on Kubernetes](https://melbourne.yowconference.com.au/proposal/?id=6858): Michelle Casbon
* [YOW!, Brisbane](https://brisbane.yowconference.com.au/), 3-4 December, 2018
@@ -26,9 +37,25 @@ Please raise a [GitHub issue](https://github.com/kubeflow/website/issues/new) if
- [Kubeflow Explained: NLP Architectures on Kubernetes](https://sydney.yowconference.com.au/proposal/?id=6860): Michelle Casbon
* [Scale By the Bay, San Francisco](http://scale.bythebay.io/), 15-17 November, 2018
- [Data Engineering & AI Panel](https://sched.co/Fndz): Michelle Casbon
- [Video](https://www.youtube.com/watch?v=sJd9RRmgCH4)
* [KubeCon, Shanghai](https://www.lfasiallc.com/events/kubecon-cloudnativecon-china-2018/), 13-15 November, 2018
- [CI/CD Pipelines & Machine Learning](https://sched.co/FuJo): Jeremy Lewi
- [A Tale of Using Kubeflow to Make the Electricity Smarter in China](https://sched.co/FzGn): Julia Han, Xin Zhang
- [Slides](https://schd.ws/hosted_files/kccncchina2018english/34/XinZhang_JuliaHan_En.pdf)
- [Video](https://www.youtube.com/watch?v=fad1FsfEvNY)
- [A Year of Democratizing ML With Kubernetes & Kubeflow](https://sched.co/FuLr): David Aronchick, Fei Xue
- [Video](https://www.youtube.com/watch?v=oMlddDdJgEg)
- [Benchmarking Machine Learning Workloads on Kubeflow](https://sched.co/FuJw): Xinyuan Huang, Ce Gao
- [Video](https://www.youtube.com/watch?v=9sLRIBYYUlQ)
- [CI/CD Pipelines & Machine Learning](https://sched.co/FuJo): Jeremy Lewi
- [Video](https://www.youtube.com/watch?v=EH850bIQVag)
- [Kubeflow From the End User's Perspective](https://sched.co/FuJx): Xin Zhang
- [Video](https://www.youtube.com/watch?v=x0CKhyoV9aI)
  - [Kubernetes CI/CD Hacks with MicroK8s and Kubeflow](https://sched.co/FuJc): Land Lu, Zhang Lei Mao
- [Video](https://www.youtube.com/watch?v=1SSvS2w5OMQ)
- [Machine Learning on Kubernetes BoF](https://sched.co/FuJs): David Aronchick
- [Video](https://www.youtube.com/watch?v=0eEAZ7lmLbo)
- [Operating Deep Learning Pipelines Anywhere Using Kubeflow](https://sched.co/FuJt): Jörg Schad, Gilbert Song
- [Video](https://www.youtube.com/watch?v=63HJgZK27mU)
* [DevFest, Seattle](https://www.eventbrite.com/e/devfest-seattle-2018-tickets-50408043816), 3 November, 2018
- Kubeflow End to End: Amy Unruh
* [Data@Scale, Boston](https://dataatscale2018.splashthat.com/), 25 October, 2018
14 changes: 0 additions & 14 deletions content/docs/about/kubeflow.md
@@ -51,20 +51,6 @@ You can choose to deploy your workloads locally or to a cloud environment.

Kubeflow started as an open-source implementation of the way Google ran [TensorFlow](https://www.tensorflow.org/) internally, based on a pipeline called [TensorFlow Extended](https://www.tensorflow.org/tfx/). It began as a simpler way to run TensorFlow jobs on Kubernetes, but has since expanded into a multi-architecture, multi-cloud framework for running entire machine learning pipelines.

## Workflow

The basic workflow is:

* Download the Kubeflow scripts and configuration files.
* Customize the configuration.
* Run the scripts to deploy your containers to your chosen environment.

You adapt the configuration to choose the platforms and services that you want
to use for each stage of the ML workflow: data preparation, model training,
prediction serving, and service management.

You can choose to deploy your workloads locally or to a cloud environment.
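For example, with the ksonnet-based setup this flow looks roughly like the sketch below. The version, package, prototype, and environment names are assumptions tied to older Kubeflow releases and will vary with your release and cluster; treat this as an outline rather than exact commands.

```
# Sketch of the basic deployment flow; adjust VERSION and the environment name for your setup.
VERSION=v0.3.2
ks init my-kubeflow
cd my-kubeflow

# Download the Kubeflow packages and generate the core components.
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow/core@${VERSION}
ks generate kubeflow-core kubeflow-core

# Customize the configuration with `ks param set` as needed, then deploy the containers
# to the environment backed by your current kubectl context (local or cloud).
ks apply default -c kubeflow-core
```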

## Notebooks

Kubeflow includes [JupyterHub](https://jupyterhub.readthedocs.io/en/stable/) for creating and managing multi-user interactive Jupyter notebooks. Project Jupyter is a non-profit, open-source project that supports interactive data science and scientific computing across all programming languages.
108 changes: 55 additions & 53 deletions content/docs/guides/components/pytorch.md
@@ -40,99 +40,101 @@ ks apply ${ENVIRONMENT} -c pytorch-operator

## Creating a PyTorch Job

You can create a PyTorch job by defining a PyTorchJob config file. See the [distributed MNIST example](https://github.com/kubeflow/pytorch-operator/blob/master/examples/dist-mnist/pytorch_job_mnist.yaml) config file. You can modify the config file to suit your requirements.
You can create a PyTorch job by defining a PyTorchJob config file. See the [distributed MNIST example](https://github.com/kubeflow/pytorch-operator/blob/master/examples/tcp-dist/mnist/v1beta1/pytorch_job_mnist.yaml) config file. You can modify the config file to suit your requirements.
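For orientation, the shape of such a config file is sketched below, assuming the v1beta1 API shown later on this page; the job name, image, and replica counts are placeholders taken from the MNIST example, and the linked file remains the authoritative version.

```
# Sketch of a minimal v1beta1 PyTorchJob manifest; the values are placeholders from the MNIST example.
cat <<'EOF' | kubectl create -f -
apiVersion: kubeflow.org/v1beta1
kind: PyTorchJob
metadata:
  name: pytorch-tcp-dist-mnist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
EOF
```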

```
cat examples/dist-mnist/pytorch_job_mnist.yaml
cat pytorch_job_mnist.yaml
```
Deploy the PyTorchJob resource to start training:

```
kubectl create -f examples/dist-mnist/pytorch_job_mnist.yaml
kubectl create -f pytorch_job_mnist.yaml
```
You should now be able to see the created pods matching the specified number of replicas.

```
kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test
kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist
```
Training runs for about 10 epochs and takes 5-10 minutes on a CPU cluster. Inspect the logs to follow the training progress.

```
PODNAME=$(kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test,task_index=0 -o name)
PODNAME=$(kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist,pytorch-replica-type=master,pytorch-replica-index=0 -o name)
kubectl logs -f ${PODNAME}
```
## Monitoring a PyTorch Job

```
kubectl get -o yaml pytorchjobs dist-mnist-for-e2e-test
kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist
```
See the status section to monitor the job status. Here is a sample of the output when the job has completed successfully.

```
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha1
kind: PyTorchJob
metadata:
clusterName: ""
creationTimestamp: 2018-06-22T08:16:14Z
generation: 1
name: dist-mnist-for-e2e-test
namespace: default
resourceVersion: "3276193"
selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/pytorchjobs/dist-mnist-for-e2e-test
uid: 87772d3b-75f4-11e8-bdd9-42010aa00072
spec:
RuntimeId: kmma
pytorchImage: pytorch/pytorch:v0.2
replicaSpecs:
- masterPort: 23456
replicaType: MASTER
apiVersion: kubeflow.org/v1beta1
kind: PyTorchJob
metadata:
clusterName: ""
creationTimestamp: 2018-12-16T21:39:09Z
generation: 1
name: pytorch-tcp-dist-mnist
namespace: default
resourceVersion: "15532"
selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
cleanPodPolicy: None
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
creationTimestamp: null
spec:
containers:
- image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
imagePullPolicy: IfNotPresent
name: pytorch
ports:
- containerPort: 23456
name: pytorchjob-port
resources: {}
restartPolicy: OnFailure
- masterPort: 23456
replicaType: WORKER
Worker:
replicas: 3
restartPolicy: OnFailure
template:
metadata:
creationTimestamp: null
spec:
containers:
- image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
imagePullPolicy: IfNotPresent
name: pytorch
ports:
- containerPort: 23456
name: pytorchjob-port
resources: {}
restartPolicy: OnFailure
terminationPolicy:
master:
replicaName: MASTER
replicaRank: 0
status:
phase: Done
reason: ""
replicaStatuses:
- ReplicasStates:
Succeeded: 1
replica_type: MASTER
state: Succeeded
- ReplicasStates:
Running: 1
Succeeded: 2
replica_type: WORKER
state: Running
state: Succeeded
kind: List
metadata:
resourceVersion: ""
selfLink: ""
status:
completionTime: 2018-12-16T21:43:27Z
conditions:
- lastTransitionTime: 2018-12-16T21:39:09Z
lastUpdateTime: 2018-12-16T21:39:09Z
message: PyTorchJob pytorch-tcp-dist-mnist is created.
reason: PyTorchJobCreated
status: "True"
type: Created
- lastTransitionTime: 2018-12-16T21:39:09Z
lastUpdateTime: 2018-12-16T21:40:45Z
message: PyTorchJob pytorch-tcp-dist-mnist is running.
reason: PyTorchJobRunning
status: "False"
type: Running
- lastTransitionTime: 2018-12-16T21:39:09Z
lastUpdateTime: 2018-12-16T21:43:27Z
message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
reason: PyTorchJobSucceeded
status: "True"
type: Succeeded
replicaStatuses:
Master: {}
Worker: {}
startTime: 2018-12-16T21:40:45Z
```
6 changes: 6 additions & 0 deletions content/docs/guides/components/seldon.md
@@ -25,6 +25,12 @@ ks param set seldon seldonVersion 0.1.8
ks generate seldon seldon
```

Deploy the Seldon cluster manager:

```
ks apply ${KF_ENV} -c seldon
```

### Seldon Deployment Graphs

Seldon allows you to deploy complex runtime graphs for model inference. Some example prototypes are provided to help you get started. Follow the [Seldon docs](https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/readme.md) to wrap your model code into an image that Seldon can manage. The examples below use the model image `seldonio/mock_classifier`; replace this with your actual model image. You also need to choose between the v1alpha2 and v1alpha1 prototype examples, depending on which version of Seldon you generated above. The following prototypes are available:
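Once you have generated one of those prototypes, deploying the graph might look roughly like the sketch below. The prototype name `seldon-serve-simple-v1alpha2` and its parameter names are assumptions; run `ks prototype list` and `ks prototype describe` to confirm the names for your Seldon version.

```
# Sketch: generate a simple serving graph for a wrapped model image and deploy it.
# The prototype and parameter names are illustrative; confirm them with `ks prototype describe`.
ks generate seldon-serve-simple-v1alpha2 mock-classifier-serve \
  --name=mock-classifier \
  --image=seldonio/mock_classifier:1.0
ks apply ${KF_ENV} -c mock-classifier-serve
```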
12 changes: 6 additions & 6 deletions content/docs/guides/components/tfserving_new.md
@@ -14,7 +14,7 @@ Generate the service(model) component

```
ks generate tf-serving-service mnist-service
ks param set mnist-service modelName mnist
ks param set mnist-service modelName mnist      # match the model name used in your deployment
ks param set mnist-service trafficRule v1:100   # optional; this is the default value
```

@@ -53,10 +53,10 @@ See [doc](https://cloud.google.com/docs/authentication/) for more detail.

To use S3, generate a different prototype:
```
ks generate tf-serving-aws ${MODEL_COMPONENT} --name=${MODEL_NAME}
ks generate tf-serving-deployment-aws ${MODEL_COMPONENT} --name=${MODEL_NAME}
```

First you need to create a secret that will contain the access credentials. Use base64 to encode your credentials, and see the details in the Kubernetes guide to [creating a secret manually](https://kubernetes.io/docs/concepts/configuration/secret/#creating-a-secret-manually).
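As a sketch, one way to create such a secret is shown below. The secret name matches the `s3SecretName` value used later on this page, and the key names `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` are assumptions to align with your component's configuration; `kubectl create secret` base64-encodes the values for you. The manifest that follows defines the Secret directly.

```
# Sketch: create a secret holding the S3 access credentials (values are placeholders).
# The key names are assumptions; align them with what the serving component expects.
kubectl create secret generic secretname \
  --from-literal=AWS_ACCESS_KEY_ID='YOUR_ACCESS_KEY_ID' \
  --from-literal=AWS_SECRET_ACCESS_KEY='YOUR_SECRET_ACCESS_KEY'
```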
```
apiVersion: v1
metadata:
@@ -71,8 +71,8 @@ Enable S3, set url and point to correct Secret

```
MODEL_PATH=s3://kubeflow-models/inception
ks param set ${MODEL_COMPONENT} modelPath ${MODEL_PATH}
ks param set ${MODEL_COMPONENT} s3Enable True
ks param set ${MODEL_COMPONENT} modelBasePath ${MODEL_PATH}
ks param set ${MODEL_COMPONENT} s3Enable true
ks param set ${MODEL_COMPONENT} s3SecretName secretname
```

Expand All @@ -82,7 +82,7 @@ Optionally you can also override default parameters of S3
# S3 region
ks param set ${MODEL_COMPONENT} s3AwsRegion us-west-1
# true Whether or not to use https for S3 connections
# Whether or not to use https for S3 connections
ks param set ${MODEL_COMPONENT} s3UseHttps true
# Whether or not to verify https certificates for S3 connections
14 changes: 13 additions & 1 deletion content/docs/guides/components/tftraining.md
@@ -160,7 +160,18 @@ VERSION=v0.2-branch

ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow-git/examples
```
Choose a tf-job prototype from the following list of available prototypes to match the CRD you're using:
> Type `ks prototype list` to list all available prototypes
* `io.ksonnet.pkg.tf-job-operator` - A TensorFlow job operator.
* `io.ksonnet.pkg.tf-job-simple` - A simple TFJob to run CNN benchmark
* `io.ksonnet.pkg.tf-job-simple-v1alpha1` - A simple TFJob to run CNN benchmark
* `io.ksonnet.pkg.tf-job-simple-v1beta1` - A simple TFJob to run CNN benchmark
Run the `generate` command:
```
ks generate tf-job-simple ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
```
@@ -240,6 +251,7 @@ To use GPUs your cluster must be configured to use GPUs.
* For more information:
* [K8s Instructions For Scheduling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
* [GKE Instructions](https://cloud.google.com/kubernetes-engine/docs/concepts/gpus)
* [EKS Instructions](https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html)
To attach GPUs, specify the GPU resource on the container in the replicas that should contain the GPUs; for example:
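A sketch of such a replica spec is shown below, assuming a TFJob on the v1beta1 API and one NVIDIA GPU per worker; the job name and image are placeholders.

```
# Sketch: request one NVIDIA GPU per worker replica (job name and image are placeholders).
cat <<'EOF' | kubectl create -f -
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: tfjob-gpu-example
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/your-project/your-training-image:latest
            resources:
              limits:
                nvidia.com/gpu: 1
EOF
```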
@@ -674,4 +686,4 @@ Events:
in the previous section.

## More information
* See how to [run a job with gang-scheduling](/docs/guides/job-scheduling).
* See how to [run a job with gang-scheduling](/docs/guides/job-scheduling).
