Merge branch 'master' into dashboard
wbuchwalter committed Dec 20, 2017
2 parents e8aca7e + cb1e053 commit 7ef434b
Showing 5,849 changed files with 2,305,067 additions and 663 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,9 +1,11 @@
# pkg and bin directories currently contain build artifacts
# only so we exclude them.
bin/

vendor/
node_modules/
build/

.vscode/

# Compiled python files.
10 changes: 7 additions & 3 deletions .travis.yml
@@ -1,8 +1,9 @@
language: go

go:
- 1.8.x

- 1.8
- 1.9

addons:
apt:
sources:
@@ -11,6 +12,8 @@ addons:
- glide

install:
# get coveralls.io support
- go get github.com/mattn/goveralls
# Install dependencies using glide
- glide install
# We need to remove the vendored dependencies of apiextensions otherwise we get type conflicts.
@@ -22,4 +25,5 @@ script:
# directory.
# With go 1.9 vendor will be automatically excluded.
# For now though we just run all tests in pkg.
- go test -v ./pkg/...
- $GOPATH/bin/goveralls -service=travis-ci -v -package ./pkg/...

11 changes: 8 additions & 3 deletions README.md
@@ -2,6 +2,8 @@

[![Build Status](https://travis-ci.org/tensorflow/k8s.svg?branch=master)](https://travis-ci.org/tensorflow/k8s)

[![Coverage Status](https://coveralls.io/repos/github/tensorflow/k8s/badge.svg?branch=master)](https://coveralls.io/github/tensorflow/k8s?branch=master)

[Prow Test Dashboard](https://k8s-testgrid.appspot.com/sig-big-data)

[Prow Jobs](https://prow.k8s.io/?repo=tensorflow%2Fk8s)
@@ -58,7 +60,9 @@ TfJob requires Kubernetes >= 1.8
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
helm install ${CHART} -n tf-job --wait --replace --set rbac.install=true,cloud=<gke or azure>
```


* If you aren't running on GKE or Azure don't set cloud.

For non-RBAC enabled clusters:
```
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
@@ -145,6 +149,7 @@ metadata:
data:
controller_config_file.yaml: |
accelerators:
grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
alpha.kubernetes.io/nvidia-gpu:
volumes:
- name: <volume-name> # Desired name of the volume, ex: nvidia-libs
@@ -161,7 +166,7 @@ data:
Then simply create the `ConfigMap` and install the Helm chart (**the order matters**) without specifying any cloud provider:

```
kubectl create configmap tf-job-operator-config --from-file <your-configmap-path>
kubectl create configmap tf-job-operator-config --from-file <your-configmap-path> --dry-run -o yaml | kubectl replace configmap tf-job-operator-config -f -
helm install ${CHART} -n tf-job --wait --replace
```

@@ -361,7 +366,7 @@ spec:

The TfJob operator will create a service named
**tensorboard-$RUNTIME_ID** for your job. You can connect to it
using the Kubernetes API Server porxy as follows
using the Kubernetes API Server proxy as follows

Start the K8s proxy
```
10 changes: 5 additions & 5 deletions cmd/tf_operator/main.go
@@ -13,16 +13,16 @@ import (
"github.com/tensorflow/k8s/pkg/controller"
"github.com/tensorflow/k8s/pkg/util"
"github.com/tensorflow/k8s/pkg/util/k8sutil"
"github.com/tensorflow/k8s/pkg/util/k8sutil/election"
"github.com/tensorflow/k8s/pkg/util/k8sutil/election/resourcelock"
"github.com/tensorflow/k8s/version"
election "k8s.io/client-go/tools/leaderelection"
"k8s.io/client-go/tools/leaderelection/resourcelock"

log "github.com/golang/glog"

"io/ioutil"

"github.com/tensorflow/k8s/pkg/spec"
"k8s.io/client-go/pkg/api/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/tools/record"
)

@@ -123,11 +123,11 @@ func main() {
// TODO: replace with to client-go once leader election pacakge is imported
// see https://github.com/kubernetes/client-go/issues/28
rl := &resourcelock.EndpointsLock{
EndpointsMeta: v1.ObjectMeta{
EndpointsMeta: metav1.ObjectMeta{
Namespace: namespace,
Name: "tf-operator",
},
Client: k8sutil.MustNewKubeClient(),
Client: k8sutil.MustNewKubeClient().CoreV1(),
LockConfig: resourcelock.ResourceLockConfig{
Identity: id,
EventRecorder: &record.FakeRecorder{},
43 changes: 30 additions & 13 deletions developer_guide.md
@@ -1,4 +1,3 @@

## Building the Operator

Create a symbolic link inside your GOPATH to the location you checked out the code
@@ -12,15 +11,11 @@ ln -sf ${GIT_TRAINING} ${GOPATH}/src/github.com/tensorflow/k8s

Resolve dependencies (if you don't have glide install, check how to do it [here](https://github.com/Masterminds/glide/blob/master/README.md#install))

install dependencies, `-v` will ignore subpackage vendor
```sh
glide install
rm -rf vendor/k8s.io/apiextensions-apiserver/vendor
glide install -v
```

* The **rm** is needed to remove the vendor directory of dependencies
that also vendor dependencies as these produce conflicts
with the versions vendored by mlkube

Build it

```sh
@@ -37,30 +32,52 @@ To build the following artifacts:
You can run

```sh
pip install -r py/requirements.txt
python -m py.release local --registry=${REGISTRY}
```

* The docker image will be tagged into your registry
* The helm chart will be created in **./bin**


## Running the Operator Locally

Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.

We can configure the operator to run locally using the configuration available in your kubeconfig to communicate with
a K8s cluster.
a K8s cluster. Set your environment:

Set your environment
```sh
export KUBECONFIG=$(echo ~/.kube/config)
export MY_POD_NAMESPACE=default
export MY_POD_NAME=my-pod
```

* MY_POD_NAMESPACE is used because the CRD is namespace scoped and we use the namespace of the controller to
set the corresponding namespace for the resource.
* MY_POD_NAMESPACE is used because the CRD is namespace scoped and we use the namespace of the controller to
set the corresponding namespace for the resource.
* TODO(jlewi): Do we still need to set MY_POD_NAME? Why?

Make a copy of `grpc_tensorflow_server.py` and create a config file named `controller_config_file.yaml`:

```
cp grpc_tensorflow_server/grpc_tensorflow_server.py /tmp/grpc_tensorflow_server.py
cat > /tmp/controller_config_file.yaml << EOL
grpcServerFilePath: /tmp/grpc_tensorflow_server.py
EOL
```

Now we are ready to run operator locally:

```
tf_operator -controller_config_file=/tmp/controller_config_file.yaml
```

The command creates a CRD `tfjobs` and block watching for creation of the resource kind. To verify local
operator is working, create an example job and you should see jobs created by it.

TODO(jlewi): Do we still need to set MY_POD_NAME? Why?
```
kubectl create -f https://raw.githubusercontent.com/tensorflow/k8s/master/examples/tf_job.yaml
```

## Go version

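The developer-guide steps for running the operator locally (export KUBECONFIG, MY_POD_NAMESPACE and MY_POD_NAME, copy grpc_tensorflow_server.py, write controller_config_file.yaml, then start tf_operator with -controller_config_file) can also be scripted. What follows is a minimal Python sketch under that assumption; the helper name prepare_local_operator and its workdir parameter are hypothetical, and the function only prepares the files, environment and command line without launching anything.

```python
import os
import shutil
import tempfile


def prepare_local_operator(grpc_server_src, workdir=None,
                           namespace="default", pod_name="my-pod"):
    """Prepare config, environment and argv for running tf_operator locally.

    Hypothetical helper mirroring the developer-guide shell steps; it does
    not start the operator.
    """
    # The guide uses /tmp; a fresh temporary directory is used by default here.
    workdir = workdir or tempfile.mkdtemp()

    # Equivalent of: cp grpc_tensorflow_server/grpc_tensorflow_server.py <workdir>/
    server_copy = os.path.join(workdir, "grpc_tensorflow_server.py")
    shutil.copyfile(grpc_server_src, server_copy)

    # Equivalent of the heredoc that writes controller_config_file.yaml.
    config_path = os.path.join(workdir, "controller_config_file.yaml")
    with open(config_path, "w") as f:
        f.write("grpcServerFilePath: %s\n" % server_copy)

    # Equivalent of the three `export` lines.
    env = dict(os.environ)
    env["KUBECONFIG"] = os.path.expanduser("~/.kube/config")
    env["MY_POD_NAMESPACE"] = namespace
    env["MY_POD_NAME"] = pod_name

    argv = ["tf_operator", "-controller_config_file=" + config_path]
    return env, argv
```

Once tf_operator is on your PATH, something like subprocess.call(argv, env=env) would start it with that configuration.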
2 changes: 1 addition & 1 deletion examples/tf_job_gpu.yaml
@@ -12,5 +12,5 @@ spec:
name: tensorflow
resources:
limits:
alpha.kubernetes.io/nvidia-gpu: 1
nvidia.com/gpu: 1
restartPolicy: OnFailure
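In examples/tf_job_gpu.yaml the GPU limit moves from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu. If you generate or patch TfJob manifests in code, the same rename can be applied with a small recursive helper. This is a sketch assuming the manifest has already been parsed into plain dicts and lists (for example with a YAML loader); migrate_gpu_limits is a hypothetical name, not part of this repository.

```python
LEGACY_GPU = "alpha.kubernetes.io/nvidia-gpu"
DEVICE_PLUGIN_GPU = "nvidia.com/gpu"


def migrate_gpu_limits(obj):
    """Rename the legacy GPU resource key wherever it appears in a
    `limits` or `requests` mapping of a parsed manifest.

    Hypothetical helper; mutates obj in place and also returns it.
    """
    if isinstance(obj, dict):
        for key in ("limits", "requests"):
            res = obj.get(key)
            if isinstance(res, dict) and LEGACY_GPU in res:
                res[DEVICE_PLUGIN_GPU] = res.pop(LEGACY_GPU)
        # Recurse into nested values (spec, template, containers, ...).
        for value in obj.values():
            migrate_gpu_limits(value)
    elif isinstance(obj, list):
        for item in obj:
            migrate_gpu_limits(item)
    return obj
```

For a container such as {"name": "tensorflow", "resources": {"limits": {LEGACY_GPU: 1}}}, the helper leaves its limits as {"nvidia.com/gpu": 1}.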
2 changes: 1 addition & 1 deletion examples/tf_sample/build_and_push.py
@@ -13,7 +13,7 @@
def GetGitHash():
# The image tag is based on the githash.
git_hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
git_hash = git_hash.strip()
git_hash = git_hash.strip().decode("utf-8")
modified_files = subprocess.check_output(["git", "ls-files", "--modified"])
untracked_files = subprocess.check_output(
["git", "ls-files", "--others", "--exclude-standard"])
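The build_and_push.py change adds .decode("utf-8") because subprocess.check_output returns bytes on Python 3, so without it the hash would be a bytes object rather than a str when the image tag is built from it. A minimal sketch of the same pattern as a reusable helper follows; check_output_text is a hypothetical name, not a function in the repository.

```python
import subprocess
import sys


def check_output_text(cmd):
    """Run cmd and return its stripped stdout as text.

    subprocess.check_output returns bytes on Python 3, so the output is
    stripped and then decoded, mirroring the .strip().decode("utf-8") change.
    Hypothetical helper, not part of the repository.
    """
    out = subprocess.check_output(cmd)
    return out.strip().decode("utf-8")


if __name__ == "__main__":
    # Stand-in for ["git", "rev-parse", "--short", "HEAD"] so the demo runs
    # outside a git checkout.
    print(check_output_text([sys.executable, "-c", "print('abc1234')"]))
```

Decoding explicitly keeps the check_output call itself unchanged; on Python 2, where check_output already returns str, .decode("utf-8") still succeeds and yields unicode.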
2 changes: 1 addition & 1 deletion examples/tf_sample/tf_sample/tf_smoke.py
@@ -59,7 +59,7 @@ def run(server, cluster_spec): # pylint: disable=too-many-statements, too-many-
c = tf.multiply(a, b)
results.append(c)

init_op = tf.initialize_all_variables()
init_op = tf.global_variables_initializer()

if server:
target = server.target