Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example CRD for katib operator #119

Closed
wants to merge 2 commits into from
Closed

Example CRD for katib operator #119

wants to merge 2 commits into from

Conversation

inc0
Copy link

@inc0 inc0 commented Jun 15, 2018

I think operator-pattern would fit very nicely into Katib. This change shows example resource set that would be enough (I think) to create full set of katib-driven jobs.

This approach will also allow to leverage fully tf-operator, will help to reuse katib-provided models and with model provenance. Also operators will have easier time tracking and monitoring jobs.


This change is Reviewable

I think operator-pattern would fit very nicely into Katib.
@inc0
Copy link
Author

inc0 commented Jun 15, 2018

/assign @jlewi
/assign @YujiOshima

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: jlewi

Assign the PR to them by writing /assign @jlewi in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link

@inc0: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
kubeflow-katib-presubmit 3a88119 link /test kubeflow-katib-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gaocegege
Copy link
Member

ref #6

@gaocegege
Copy link
Member

We have a backend DB MySQL, thus I am not sure if operator works for us. If the changes about study and trial are not frequent, we could try to use etcd in Kubernetes. If they are changed frequently, maybe custom apiserver is better than crd operator. IMO

@inc0
Copy link
Author

inc0 commented Jun 16, 2018

Problem I have with DB backend is that it's another (hard!) component to maintain. MySQL on k8s is non-trivial at best. Even if you'd use something like Vitess it's hard. ETCD is always there on the other hand. I don't see changes to a Trial being more frequent than couple times per training run, which, even at most sophisticated environments will grow to hundreds a day maybe? ETCD should handle it without breaking a sweat.

@jlewi
Copy link
Contributor

jlewi commented Jun 19, 2018

This proposal is really two things

  1. Creating a CRD for study
  2. Creating a CRD for model

These seem to be addressing two different points and it might be better to split them.

For 1; what problem; benefit is a CRD providing? IIUC the way Katib works is you have two pieces of information

  1. A StudyConfig spec that describes the experiment
  2. A go program like client-example

So right now if a user wants to launch an HP job

  1. They create a config map containing their StudyConfig
  2. They write a go program to control their experiment
  3. The create a deployment to launch their controller using their StudyConfig

So it seems like the biggest hurdle right now is that users need to write a go program to control their experiment. It looks like their is some logic that is different (see here) for each suggestion algorithm. Furthermore, I'm guessing there is work to add support for more advanced features like early stopping.

It seems like a CRD would offer a bit of syntatic sugar by letting people create a single resource rather than creating a deployment + config map. Of course that wouldn't help anyone who wanted to customize the logic or add more advanced HP logic.

Right now the code for the controller controller is pretty simple. So if we implement it as a CRD we should use meta controller or some other pattern to keep the amount of code fairly tractable and not incur all the boilerplate we do for other controllers like tf-operator.

IMO though the big win is making the go program more powerful so that it can cover more use cases. So I wouldn't prioritize creating a CRD ahead of making those improvements.

For 2

Ideally Katib shouldn't be overly prescriptive about the storage backend. Katib should define an interface that should support using a variety of storage solutions. What is the current interface? Is it SQL? Or does Katibe define some interface around SQL? If its SQL should we leave it as SQL? Should we wrap it in some higher level API (e.g. a CRUD) API so that its easier to add support for different backends?

/cc @YujiOshima @gaocegege

@YujiOshima
Copy link
Contributor

This proposal is really two things

Right.

First point.
I agree with creating CRD for controller for experiments.
It is worked as StudyController see #86 #87 .
The CRD should include StudyConfig, WorkerConfig, and SuggestionParameter.
It enable to start experiments by only making CRD for typically case.
But I don't know the tf-operator is sutable for it.
We are very welcome your help about CRD for StudyController.

Second point.
We define go interface for DB
You can implement db plugin for CRD and improve the interface itself if you need.
User will be able to switch use DB backend.
Does it work for you?
Your suggestion about DB interface is also so helpful.

@inc0
Copy link
Author

inc0 commented Jun 28, 2018

I think, actually, that 2 is more important. Reason for that is that, after we're done with model training, we will want to test it or serve it. That could use some cooperation between other kubeflow components like tf-serving.
There are 2 parts to Model storage:

  1. Model path, metadata, metrics etc
  2. Model blob - actual potentially multi-GB file

I think at least 1 can be prescribed by us and set to ETCD, 2 shouldn't ever be. That being said, 1 can hold information regarding 2 including path and type of storage. That would allow multiple components to act on same model.
As for performance considerations, if you're running a lot of models and are concerned with ETCD - you can setup dedicated ETCD for just models, that'd still be accessible as k8s resource.

@jlewi
Copy link
Contributor

jlewi commented Jun 28, 2018

@inc0 Is there a particular reason you are trying to persuade everyone that ETCD is the right storage? As discussed above I don't think we want to be overly prescriptive about the storage backend. So if you feel strongly that ETCD is a good backend then why not just create an implementation of the go interface for the DB that uses ETCD?

@jlewi
Copy link
Contributor

jlewi commented Jun 30, 2018

It looks to me like we need to a bunch of work to allow controller programs e.g. git-issue-summarize-demo.go to be reusable across models. Until programs are reusable I'm not sure a CRD would help.

Two items and examples that come to mind are

  1. We need a better way of injecting a template for a K8s resource to be created for each trial e.g. [GH Issue Summarization] HP Controller should launch TFJobs directly examples#162

  2. We need reusable mechanisms for getting metrics for models e.g. [GH Issue Summarization] HP Controller Needs to Extract And Report Metrics; Use tf.Events files? examples#163

I think it would be really helpful to have a concrete example working E2E that would help illuminate what improvements (e.g. adding a CRD) would be useful.

@inc0
Copy link
Author

inc0 commented Jul 5, 2018

@inc0 Is there a particular reason you are trying to persuade everyone that ETCD is the right storage? As discussed above I don't think we want to be overly prescriptive about the storage backend. So if you feel strongly that ETCD is a good backend then why not just create an implementation of the go interface for the DB that uses ETCD?

@jlewi ad ETCD, reason I think ETCD is right storage is because everybody has it. Supporting relational databases is hard problem, especially on k8s. Our mission is for kubeflow to run on every k8s, that will make it much harder. Without external database or (very hard to support) database on k8s, you won't be able to use Katib. Not to mention that if you deploy modeldb you also get mongodb to support, also not easy.

We need a better way of injecting a template for a K8s resource to be created for each trial e.g. kubeflow/examples#162

That's one of reason for CRD, much like injecting pod spec to tfjob

We need reusable mechanisms for getting metrics for models e.g. kubeflow/examples#163

As you've said, tf.Events integration.

@jlewi
Copy link
Contributor

jlewi commented Jul 7, 2018

I think this is best addressed by #138 a design doc for Katib.

@jlewi ad ETCD, reason I think ETCD is right storage is because everybody has it. Supporting relational databases is hard problem, especially on k8s. Our mission is for kubeflow to run on every k8s, that will make it much harder. Without external database or (very hard to support) database on k8s, you won't be able to use Katib. Not to mention that if you deploy modeldb you also get mongodb to support, also not easy.

Ubiquity is not the only consideration. Lifcycle, scalbility, etc... they all matter.

Storing it in the K8s master really doesn't strike me as the right solution for Cloud since in cloud K8s clusters can be ephemeral. That's why I'd like the storage backend to be pluggable so that on Cloud we can swap in appropriate storage solutions. Most Clouds offer a variety of managed storage options; I'm pretty sure every cloud has some managed SQL database. So on Cloud using an external DB is a very good option.

For OnPrem I'm not sure. That's why I think we should focus on getting the right interface and making the storage pluggable rather than pushing a single solution for everyone.

@inc0
Copy link
Author

inc0 commented Jul 9, 2018

Right now we'll have issue building Katib on OnPrem. Managed databases might work, but unless you do have managed database, you're screwed. We also deploy database in ksonnet which gives illusion of stability. I think if we write too many backends, we'll dig a grave for ourselves. Pluggability on that level will be very hard to maintain. If anything, let's try with MongoDB because it's used by ModelDB. I still think we can handle it with ETCD under K8s tho. I don't think it's ephemeral...

@jlewi
Copy link
Contributor

jlewi commented Aug 8, 2018

@YujiOshima has a StudyController CRD in #141 Closing this issue in favor of that.

@jlewi jlewi closed this Aug 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants