Example CRD for katib operator #119

inc0 · 2018-06-15T20:12:52Z

I think operator-pattern would fit very nicely into Katib. This change shows example resource set that would be enough (I think) to create full set of katib-driven jobs.

This approach will also allow to leverage fully tf-operator, will help to reuse katib-provided models and with model provenance. Also operators will have easier time tracking and monitoring jobs.

This change is

I think operator-pattern would fit very nicely into Katib.

inc0 · 2018-06-15T20:13:30Z

/assign @jlewi
/assign @YujiOshima

k8s-ci-robot · 2018-06-15T20:28:35Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: jlewi

Assign the PR to them by writing /assign @jlewi in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2018-06-15T20:42:53Z

@inc0: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
kubeflow-katib-presubmit	`3a88119`	link	`/test kubeflow-katib-presubmit`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

gaocegege · 2018-06-16T01:51:16Z

ref #6

gaocegege · 2018-06-16T01:54:05Z

We have a backend DB MySQL, thus I am not sure if operator works for us. If the changes about study and trial are not frequent, we could try to use etcd in Kubernetes. If they are changed frequently, maybe custom apiserver is better than crd operator. IMO

inc0 · 2018-06-16T05:11:39Z

Problem I have with DB backend is that it's another (hard!) component to maintain. MySQL on k8s is non-trivial at best. Even if you'd use something like Vitess it's hard. ETCD is always there on the other hand. I don't see changes to a Trial being more frequent than couple times per training run, which, even at most sophisticated environments will grow to hundreds a day maybe? ETCD should handle it without breaking a sweat.

jlewi · 2018-06-19T21:59:35Z

This proposal is really two things

Creating a CRD for study
Creating a CRD for model

These seem to be addressing two different points and it might be better to split them.

For 1; what problem; benefit is a CRD providing? IIUC the way Katib works is you have two pieces of information

A StudyConfig spec that describes the experiment
A go program like client-example

So right now if a user wants to launch an HP job

They create a config map containing their StudyConfig
They write a go program to control their experiment
The create a deployment to launch their controller using their StudyConfig

So it seems like the biggest hurdle right now is that users need to write a go program to control their experiment. It looks like their is some logic that is different (see here) for each suggestion algorithm. Furthermore, I'm guessing there is work to add support for more advanced features like early stopping.

It seems like a CRD would offer a bit of syntatic sugar by letting people create a single resource rather than creating a deployment + config map. Of course that wouldn't help anyone who wanted to customize the logic or add more advanced HP logic.

Right now the code for the controller controller is pretty simple. So if we implement it as a CRD we should use meta controller or some other pattern to keep the amount of code fairly tractable and not incur all the boilerplate we do for other controllers like tf-operator.

IMO though the big win is making the go program more powerful so that it can cover more use cases. So I wouldn't prioritize creating a CRD ahead of making those improvements.

For 2

Ideally Katib shouldn't be overly prescriptive about the storage backend. Katib should define an interface that should support using a variety of storage solutions. What is the current interface? Is it SQL? Or does Katibe define some interface around SQL? If its SQL should we leave it as SQL? Should we wrap it in some higher level API (e.g. a CRUD) API so that its easier to add support for different backends?

/cc @YujiOshima @gaocegege

YujiOshima · 2018-06-22T18:03:19Z

This proposal is really two things

Right.

First point.
I agree with creating CRD for controller for experiments.
It is worked as StudyController see #86 #87 .
The CRD should include StudyConfig, WorkerConfig, and SuggestionParameter.
It enable to start experiments by only making CRD for typically case.
But I don't know the tf-operator is sutable for it.
We are very welcome your help about CRD for StudyController.

Second point.
We define go interface for DB
You can implement db plugin for CRD and improve the interface itself if you need.
User will be able to switch use DB backend.
Does it work for you?
Your suggestion about DB interface is also so helpful.

inc0 · 2018-06-28T16:09:24Z

I think, actually, that 2 is more important. Reason for that is that, after we're done with model training, we will want to test it or serve it. That could use some cooperation between other kubeflow components like tf-serving.
There are 2 parts to Model storage:

Model path, metadata, metrics etc
Model blob - actual potentially multi-GB file

I think at least 1 can be prescribed by us and set to ETCD, 2 shouldn't ever be. That being said, 1 can hold information regarding 2 including path and type of storage. That would allow multiple components to act on same model.
As for performance considerations, if you're running a lot of models and are concerned with ETCD - you can setup dedicated ETCD for just models, that'd still be accessible as k8s resource.

jlewi · 2018-06-28T23:39:55Z

@inc0 Is there a particular reason you are trying to persuade everyone that ETCD is the right storage? As discussed above I don't think we want to be overly prescriptive about the storage backend. So if you feel strongly that ETCD is a good backend then why not just create an implementation of the go interface for the DB that uses ETCD?

jlewi · 2018-06-30T19:55:21Z

It looks to me like we need to a bunch of work to allow controller programs e.g. git-issue-summarize-demo.go to be reusable across models. Until programs are reusable I'm not sure a CRD would help.

Two items and examples that come to mind are

We need a better way of injecting a template for a K8s resource to be created for each trial e.g. [GH Issue Summarization] HP Controller should launch TFJobs directly examples#162
We need reusable mechanisms for getting metrics for models e.g. [GH Issue Summarization] HP Controller Needs to Extract And Report Metrics; Use tf.Events files? examples#163

I think it would be really helpful to have a concrete example working E2E that would help illuminate what improvements (e.g. adding a CRD) would be useful.

inc0 · 2018-07-05T15:23:55Z

@inc0 Is there a particular reason you are trying to persuade everyone that ETCD is the right storage? As discussed above I don't think we want to be overly prescriptive about the storage backend. So if you feel strongly that ETCD is a good backend then why not just create an implementation of the go interface for the DB that uses ETCD?

@jlewi ad ETCD, reason I think ETCD is right storage is because everybody has it. Supporting relational databases is hard problem, especially on k8s. Our mission is for kubeflow to run on every k8s, that will make it much harder. Without external database or (very hard to support) database on k8s, you won't be able to use Katib. Not to mention that if you deploy modeldb you also get mongodb to support, also not easy.

We need a better way of injecting a template for a K8s resource to be created for each trial e.g. kubeflow/examples#162

That's one of reason for CRD, much like injecting pod spec to tfjob

We need reusable mechanisms for getting metrics for models e.g. kubeflow/examples#163

As you've said, tf.Events integration.

jlewi · 2018-07-07T01:51:26Z

I think this is best addressed by #138 a design doc for Katib.

@jlewi ad ETCD, reason I think ETCD is right storage is because everybody has it. Supporting relational databases is hard problem, especially on k8s. Our mission is for kubeflow to run on every k8s, that will make it much harder. Without external database or (very hard to support) database on k8s, you won't be able to use Katib. Not to mention that if you deploy modeldb you also get mongodb to support, also not easy.

Ubiquity is not the only consideration. Lifcycle, scalbility, etc... they all matter.

Storing it in the K8s master really doesn't strike me as the right solution for Cloud since in cloud K8s clusters can be ephemeral. That's why I'd like the storage backend to be pluggable so that on Cloud we can swap in appropriate storage solutions. Most Clouds offer a variety of managed storage options; I'm pretty sure every cloud has some managed SQL database. So on Cloud using an external DB is a very good option.

For OnPrem I'm not sure. That's why I think we should focus on getting the right interface and making the storage pluggable rather than pushing a single solution for everyone.

inc0 · 2018-07-09T15:59:03Z

Right now we'll have issue building Katib on OnPrem. Managed databases might work, but unless you do have managed database, you're screwed. We also deploy database in ksonnet which gives illusion of stability. I think if we write too many backends, we'll dig a grave for ourselves. Pluggability on that level will be very hard to maintain. If anything, let's try with MongoDB because it's used by ModelDB. I still think we can handle it with ETCD under K8s tho. I don't think it's ephemeral...

jlewi · 2018-08-08T22:18:35Z

@YujiOshima has a StudyController CRD in #141 Closing this issue in favor of that.

Example CRD for katib operator

e6b1347

I think operator-pattern would fit very nicely into Katib.

k8s-ci-robot requested review from ddysher and mitake June 15, 2018 20:12

k8s-ci-robot added the size/M label Jun 15, 2018

k8s-ci-robot assigned jlewi and YujiOshima Jun 15, 2018

add note about storage

3a88119

k8s-ci-robot requested review from gaocegege and YujiOshima June 19, 2018 21:59

jlewi mentioned this pull request Jul 7, 2018

[feature] Use vitess instead of MySQL #34

Closed

jlewi closed this Aug 8, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Example CRD for katib operator #119

Example CRD for katib operator #119

inc0 commented Jun 15, 2018 •

edited by jlewi

Loading

inc0 commented Jun 15, 2018

k8s-ci-robot commented Jun 15, 2018

k8s-ci-robot commented Jun 15, 2018

gaocegege commented Jun 16, 2018

gaocegege commented Jun 16, 2018

inc0 commented Jun 16, 2018

jlewi commented Jun 19, 2018

YujiOshima commented Jun 22, 2018

inc0 commented Jun 28, 2018

jlewi commented Jun 28, 2018

jlewi commented Jun 30, 2018 •

edited

Loading

inc0 commented Jul 5, 2018 •

edited

Loading

jlewi commented Jul 7, 2018

inc0 commented Jul 9, 2018

jlewi commented Aug 8, 2018

Example CRD for katib operator #119

Example CRD for katib operator #119

Conversation

inc0 commented Jun 15, 2018 • edited by jlewi Loading

inc0 commented Jun 15, 2018

k8s-ci-robot commented Jun 15, 2018

k8s-ci-robot commented Jun 15, 2018

gaocegege commented Jun 16, 2018

gaocegege commented Jun 16, 2018

inc0 commented Jun 16, 2018

jlewi commented Jun 19, 2018

YujiOshima commented Jun 22, 2018

inc0 commented Jun 28, 2018

jlewi commented Jun 28, 2018

jlewi commented Jun 30, 2018 • edited Loading

inc0 commented Jul 5, 2018 • edited Loading

jlewi commented Jul 7, 2018

inc0 commented Jul 9, 2018

jlewi commented Aug 8, 2018

inc0 commented Jun 15, 2018 •

edited by jlewi

Loading

jlewi commented Jun 30, 2018 •

edited

Loading

inc0 commented Jul 5, 2018 •

edited

Loading