Dashboard V1 #125

Merged
merged 35 commits into from
Dec 23, 2017

Contributor

@wbuchwalter wbuchwalter commented Nov 5, 2017

Addresses #57

To test, create k8s/dashboard/dashboard/deploy.yaml in your cluster.

Left To Do for v1:

  • Add Volume/VolumeMounts options at creation
  • Branding: Use consistent name everywhere (tensorflow/k8s, or kubeflow?)
  • Rebase with latest master changes
  • Document installation process

@k8s-ci-robot

Hi @wbuchwalter. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wbuchwalter wbuchwalter changed the title [WIP] Dashboard [WIP] Dashboard V1 Nov 5, 2017
@wbuchwalter wbuchwalter requested a review from jlewi November 5, 2017 22:24
@jlewi
Contributor

jlewi commented Nov 5, 2017

/ok-to-test

Contributor

@jlewi jlewi left a comment

Thanks for doing this.
I don't know much JavaScript/FE, so I mostly looked at the non-JS code.

It might be useful if you could expand the README to include key points about the design of the FE.

@@ -128,7 +128,7 @@ func (c *TfJobRestClient) Update(ns string, j *spec.TfJob) (*spec.TfJob, error)
}

func (c *TfJobRestClient) Delete(ns, name string) error {
-  _, err := c.restcli.Delete().Resource(spec.CRDKindPlural).Namespace(ns).DoRaw()
+  _, err := c.restcli.Delete().Resource(spec.CRDKindPlural).Namespace(ns).Name(name).DoRaw()
Contributor

Does this change need to be rebased? I thought you already submitted the fix to the Delete method?

Contributor Author

Yes, I need to rebase this; it's a duplicated fix.

@@ -0,0 +1,146 @@
#!/usr/bin/python
Contributor

Can you reuse py/build_and_push_image.py rather than duplicating a lot of that code?
You can have a shim script, if needed, to call build_and_push with the appropriate arguments.
(I'm trying to get rid of that code duplication and have a set of reusable Python functions.)

Contributor Author

I didn't want to spend time on this before the point mentioned above (one or two Docker images) is settled.
If we go with two, I will clean this up.

@@ -0,0 +1,12 @@
# TODO(jlewi): How can we make it work with golang:1.8.2-alpine
Contributor

Do we want a separate Docker image for the dashboard?

Contributor Author

I don't have a strong opinion on this. Having a single image would make the release scripts easier to maintain, and I don't see any obvious drawbacks.

@@ -1,7 +1,7 @@
apiVersion: "mlkube.io/v1beta1"
Contributor

Can you sync with master please? mlkube.io has been replaced by tensorflow.org?

}

buildTensorBoardSpec() {
// "tensorboard": {
Contributor

Can we delete this commented out code?

Contributor Author

@wbuchwalter wbuchwalter Nov 6, 2017

This needs to be finished before merging. Missing Volumes and VolumeMounts.

@@ -0,0 +1,31 @@
apiVersion: v1
Contributor

Should we use a helm package to make it easy to configure? Should it be part of the existing helm package or have its own helm package?

Contributor Author

I was going to suggest integrating it as part of the current helm chart, with options to enable/disable the dashboard as well as whether it should be exposed on an internal or external IP.

@jlewi
Contributor

jlewi commented Nov 5, 2017

For branding, I'd suggest tensorflow/k8s. kubeflow doesn't exist yet.

What does the frontend need to mount volumes?

@jlewi
Contributor

jlewi commented Nov 5, 2017

I looked at the current demo. Regarding logging, is there a way we can take advantage of different logging backends? For example, GKE uses Stackdriver Logging, which offers a nice UI that supports a fairly rich filter syntax. So it might be nice if, on GKE, clicking on logs redirected to the Stackdriver UI with an appropriate log filter. I assume you could do something similar on other clouds?

Also, it looks like when you click on logs, the logs are displayed in a dialog box. Instead of using a dialog box, could you just use a text box? I.e., expand the box for master/worker/ps etc. and show a text box with the logs?

@wbuchwalter
Contributor Author

For branding, I'd suggest tensorflow/k8s. kubeflow doesn't exist yet.

👍

What does the frontend need to mount volumes?

This is for the creation of new TfJobs. You should be able to describe a volume backed by Azure Files, GlusterFS etc.

}

type TfJobList struct {
TfJobs []spec.TfJob `json:"tfjobs"`
Member

@jimexist jimexist Nov 7, 2017

tfJobs?

per above

import './App.css';
import Home from './Home.js'

let headerStyle = {
Member

nit: use const

);
}
return (
<div>
Member

in react 16 no need for wrapper

Contributor

Can we get rid of this per @jimexist's comment?

Contributor Author

I think @jimexist is mistaken here.
JSX still needs a wrapper element, see airbnb/javascript#1575 for example.

Member

You can wrap them in an array to get rid of one level of DOM element, but it's a minor issue I suppose.

Member

@jimexist jimexist Nov 29, 2017

Or, if you use React 16.2, you can use <> and </>: https://reactjs.org/blog/2017/11/28/react-v16.2.0-fragment-support.html

};

deleteTfJob(ns, name)
.catch(console.log);
Member

nit: use console.error
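
The suggestion can be sketched roughly as below. This is only an illustration: `deleteTfJob` here is a hypothetical stand-in that always rejects (the real helper is not shown in this diff), and `removeJob` with its injectable `onError` parameter is an assumed wrapper, added so the error path can be exercised without writing to the real console.

```javascript
// Hypothetical stand-in for the dashboard's deleteTfJob(ns, name) helper.
// It always rejects so that the error path runs.
function deleteTfJob(ns, name) {
  return Promise.reject(new Error(`could not delete ${ns}/${name}`));
}

// Same call shape as the snippet under review, but failures go to
// console.error by default instead of console.log.
// onError is an assumed parameter, not part of the PR.
function removeJob(ns, name, onError = console.error) {
  return deleteTfJob(ns, name).catch(onError);
}
```

Calling `removeJob("default", "example-job")` with the default handler reports the failure through `console.error`, which Node writes to stderr and browser dev tools flag as an error, whereas `console.log` would emit it as ordinary output.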

const JobList = ({jobs}) => {
let jobSummaries = []
for (let i = 0; i < jobs.length; i++) {
jobSummaries.push(<JobSummary key={i} job={jobs[i]} />)
Member

just using jobs.map would do, but it's a matter of personal taste
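
For comparison, a minimal sketch of the two forms side by side. It uses plain objects in place of JSX elements so that it runs under Node without a build step; `toSummary`, `summariesWithLoop`, and `summariesWithMap` are hypothetical names, with `toSummary` standing in for `<JobSummary key={i} job={jobs[i]} />`.

```javascript
// Hypothetical stand-in for <JobSummary key={i} job={jobs[i]} />:
// a plain object, so no JSX transform is needed.
const toSummary = (job, i) => ({ type: "JobSummary", key: i, job });

// Loop form, mirroring the snippet under review.
function summariesWithLoop(jobs) {
  const jobSummaries = [];
  for (let i = 0; i < jobs.length; i++) {
    jobSummaries.push(toSummary(jobs[i], i));
  }
  return jobSummaries;
}

// Equivalent map form: the callback receives (element, index).
const summariesWithMap = (jobs) => jobs.map((job, i) => toSummary(job, i));
```

Both functions build the same array for the same input, so the choice is stylistic, as the comment says.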

const ReplicaSpec = ({ spec, status, pods }) => {
// let podComponents = pods.map(p => <Pod pod={p} />);
return (
<div>
Member

no need for <div/>?

# TODO(jlewi): How can we make it work with golang:1.8.2-alpine
FROM golang:1.8.2

RUN mkdir -p /opt/dashboard
Member

combining multiple runs would be better

"private": true,
"dependencies": {
"material-ui": "^0.19.4",
"react": "^16.0.0",
Member

add prop-types if you are feeling good :-)

Member

and license?

Member

@jimexist jimexist Nov 7, 2017

personal suggestion: prettier.js and standard.js

@jlewi
Contributor

jlewi commented Dec 20, 2017

@jimexist A proposal would be great.

@wbuchwalter
Contributor Author

@jlewi Sorry for being slow on this, tangled in other commitments... I'll try to finish this today.

@coveralls

coveralls commented Dec 20, 2017

Coverage Status

Coverage decreased (-0.1%) to 37.782% when pulling 7ef434b on wbuchwalter:dashboard into cb1e053 on tensorflow:master.

@jlewi
Contributor

jlewi commented Dec 20, 2017

Great thanks!

@wbuchwalter
Contributor Author

@jlewi So the Airflow job still fails currently, as it is still using the release script from master and consequently not building the dashboard's backend.
If I'm not mistaken, that's the issue you tried to fix in #200, right?
Looking at the logs, it doesn't seem like this is being cloned before running build_and_push.py.

Trying to understand what's wrong.

@wbuchwalter
Contributor Author

wbuchwalter commented Dec 20, 2017

So, e2e_tests_dag.py runs:
python -m py.release build --src_dir=/var/lib/data/runs/tf_k8s_tests/2017-12-20T16_11_01/tensorflow_k8s ....
Shouldn't we also use the release script from the PR that was just cloned before in clone_op?

@jlewi
Contributor

jlewi commented Dec 21, 2017

You're right.

Reopened #189

Mailed out #237

/test all

@jlewi
Contributor

jlewi commented Dec 22, 2017

/ok-to-test

@jlewi
Contributor

jlewi commented Dec 22, 2017

So I submitted a fix for the issue with not using the checked-out version. Right now it looks like the Go program is failing to build:

[2017-12-22 00:39:52,346] {base_task_runner.py:98} INFO - Subtask: INFO|2017-12-22T00:39:52|/var/lib/data/runs/tf_k8s_tests/2017-12-22T00_38_57/tensorflow_k8s/py/util.py|69| Subprocess output:
[2017-12-22 00:39:52,347] {base_task_runner.py:98} INFO - Subtask: /var/lib/data/runs/tf_k8s_tests/2017-12-22T00_38_57/go/src/github.com/tensorflow/k8s/dashboard/backend/handler/api_handler.go:13:2: cannot find package "k8s.io/client-go/pkg/api/v1" in any of:
[2017-12-22 00:39:52,347] {base_task_runner.py:98} INFO - Subtask: 	/var/lib/data/runs/tf_k8s_tests/2017-12-22T00_38_57/go/src/github.com/tensorflow/k8s/vendor/k8s.io/client-go/pkg/api/v1 (vendor tree)
[2017-12-22 00:39:52,347] {base_task_runner.py:98} INFO - Subtask: 	/usr/local/go/src/k8s.io/client-go/pkg/api/v1 (from $GOROOT)
[2017-12-22 00:39:52,347] {base_task_runner.py:98} INFO - Subtask: 	/var/lib/data/runs/tf_k8s_tests/2017-12-22T00_38_57/go/src/k8s.io/client-go/pkg/api/v1 (from $GOPATH)
[2017-12-22 00:39:52,348] {base_task_runner.py:98} INFO - Subtask: 

Travis is passing though.

The E2E test is running glide install. Maybe we should stop doing that since dependencies should now be checked in.

.gitignore Outdated
@@ -2,6 +2,10 @@
# only so we exclude them.
bin/

vendor/
Contributor

Can you remove this? Now that we check in vendor, we don't want it in .gitignore.

@coveralls

coveralls commented Dec 22, 2017

Coverage Status

Coverage increased (+0.1%) to 37.915% when pulling fc63c29 on wbuchwalter:dashboard into e108d55 on tensorflow:master.

@coveralls

coveralls commented Dec 22, 2017

Coverage Status

Coverage remained the same at 37.782% when pulling 38c6de1 on wbuchwalter:dashboard into e108d55 on tensorflow:master.

@coveralls

coveralls commented Dec 22, 2017

Coverage Status

Coverage remained the same at 37.782% when pulling c2f8b39 on wbuchwalter:dashboard into e108d55 on tensorflow:master.

@wbuchwalter
Contributor Author

@jlewi I'm not exactly sure why the last Airflow run was marked as failed, as I can see that all steps were successful in Airflow.

[Screenshot, 2017-12-22 9:58 AM]

@jlewi
Contributor

jlewi commented Dec 22, 2017

Must be an issue with the test infrastructure. Possibly a timeout?
Let's try again.
/ok-to-test

@jlewi
Contributor

jlewi commented Dec 22, 2017

/test all

@jlewi
Contributor

jlewi commented Dec 22, 2017

On a different test, I noticed this error:

INFO|2017-12-22T16:44:57|/opt/tensorflow_k8s/py/airflow.py|182| Waiting for DAG tf_k8s_tests run 2017-12-22T16:32:53 to finish.
/usr/local/lib/python2.7/site-packages/urllib3-1.22-py2.7.egg/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/tensorflow_k8s/py/airflow.py", line 310, in <module>
    main()
  File "/opt/tensorflow_k8s/py/airflow.py", line 282, in main
    state = _run_dag_and_wait()
  File "/opt/tensorflow_k8s/py/airflow.py", line 215, in _run_dag_and_wait
    state = wait_for_tf_k8s_tests(client, run_id)
  File "/opt/tensorflow_k8s/py/airflow.py", line 169, in wait_for_tf_k8s_tests
    resp = client.get_task_status(E2E_DAG, run_id, "done")
  File "/opt/tensorflow_k8s/py/airflow.py", line 117, in get_task_status
    data = self._request(url, method="GET")
  File "/opt/tensorflow_k8s/py/airflow.py", line 65, in _request
    resp = getattr(requests, method.lower())(**params)  # pylint: disable=not-callable
  File "/usr/local/lib/python2.7/site-packages/requests-2.18.4-py2.7.egg/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests-2.18.4-py2.7.egg/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests-2.18.4-py2.7.egg/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests-2.18.4-py2.7.egg/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests-2.18.4-py2.7.egg/requests/adapters.py", line 521, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='35.202.163.166', port=443): Read timed out. (read timeout=10)

@jlewi
Contributor

jlewi commented Dec 22, 2017

I'll try adding retries as part of #241

@jlewi
Contributor

jlewi commented Dec 23, 2017

/test ok

@jlewi
Contributor

jlewi commented Dec 23, 2017

/test all

@coveralls

coveralls commented Dec 23, 2017

Coverage Status

Coverage remained the same at 37.782% when pulling 89aec3f on wbuchwalter:dashboard into 37af20d on tensorflow:master.

@jlewi
Contributor

jlewi commented Dec 23, 2017

/test all

@jlewi jlewi merged commit bcfc14f into kubeflow:master Dec 23, 2017
6 participants