Katib needs to use durable storage that outlives the pod. #137

Closed
jlewi opened this issue Jul 3, 2018 · 25 comments

Comments

@jlewi
Contributor

jlewi commented Jul 3, 2018

Katib uses two databases

  • MySQL (for Katib)
  • MongoDB (for ModelDB)

It doesn't look like we are using any sort of durable storage for either one.

Here's the MySQL deployment; it doesn't look like it's using a persistent volume:
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/katib/vizier.libsonnet#L185

Here's the MongoDB deployment:
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/katib/modeldb.libsonnet#L39

@YujiOshima @gaocegege Is this accurate or am I missing something?

@gaocegege
Member

gaocegege commented Jul 3, 2018

No, I don't think you're missing anything. It is accurate IMO.

We should add a PVC to support persistence.
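
For illustration only, here is a minimal sketch of what adding a PVC for the MySQL pod could look like, written as Python that builds the Kubernetes manifests as plain dictionaries. The claim name `katib-mysql`, the volume name `mysql-data`, the `10Gi` size, the `mysql` image, and the `/var/lib/mysql` mount path are all assumptions chosen for the example; none of them are taken from the Kubeflow manifests linked above.

```python
import json


def mysql_pvc(name="katib-mysql", size="10Gi"):
    """Build a PersistentVolumeClaim manifest.

    The default name and size are hypothetical example values.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def attach_pvc(pod_spec, claim_name,
               mount_path="/var/lib/mysql", volume_name="mysql-data"):
    """Return a copy of a pod spec whose first container mounts the claim.

    The mount path and volume name are assumptions for this sketch.
    """
    spec = json.loads(json.dumps(pod_spec))  # deep copy; input is left untouched
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "persistentVolumeClaim": {"claimName": claim_name}}
    )
    spec["containers"][0].setdefault("volumeMounts", []).append(
        {"name": volume_name, "mountPath": mount_path}
    )
    return spec


if __name__ == "__main__":
    # Hypothetical pod spec standing in for the MySQL deployment's template.
    pod_spec = {"containers": [{"name": "mysql", "image": "mysql"}]}
    print(json.dumps(mysql_pvc(), indent=2))
    print(json.dumps(attach_pvc(pod_spec, "katib-mysql"), indent=2))
```

With a claim mounted this way, the database files live on the volume rather than in the container filesystem, so they can survive a pod restart. The actual deployment is defined in jsonnet, so a real change would be made there.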

@YujiOshima
Contributor

@jlewi @gaocegege I agree we need a PVC.
For Katib, adding a PVC for MySQL is enough for persistence.
But for ModelDB, adding one only for Mongo may not be enough.
ModelDB's backend component (ModelDB has a frontend, a backend, and a DB) also has state.
We probably need to add PVCs for both ModelDB's backend and its DB.

@gaocegege
Member

@YujiOshima Since we have a DB for ModelDB, why is the backend stateful? Could we implement it as stateless?

@YujiOshima
Contributor

@gaocegege Sorry, I don't fully understand the ModelDB backend, but I couldn't persist model data using only the DB.
I don't think that is good design. We should make the DB the only stateful component.

@jlewi
Contributor Author

jlewi commented Sep 17, 2018

Any update on this? For Katib to be minimally viable the data should be resilient to pod restarts. What needs to happen to address this?

@YujiOshima
Contributor

@jlewi We need to persist ModelDB's data. Maybe we need to do something. I will work on this.

@YujiOshima
Contributor

/assign @YujiOshima

@jlewi
Contributor Author

jlewi commented Sep 22, 2018

@YujiOshima Any progress on this? I'd really like to include this in the 0.3 release.

@YujiOshima
Contributor

@jlewi Sorry, I need more time. The ModelDB backend app stores its data in SQLite, and it appears to reset that data when its process starts. We need to fix that.

@YujiOshima
Contributor

We should fix this first.

@jlewi
Contributor Author

jlewi commented Sep 24, 2018

VertaAI/modeldb#221 suggests dropping SQLite in favor of MongoDB.

This might require more thought about the storage/DB story for Katib.

I've heard from a number of folks that MongoDB is not the simplest DB to productionize/operationalize. So dropping SQLite in favor of MongoDB might not be a step in the right direction.

On the other hand, my understanding is that Katib is not that closely coupled with ModelDB/MongoDB.
Katib uses MySQL to store parameters during an experiment, and models are stored in ModelDB (MongoDB) only as a historical record.

So maybe the next step is to start separating out the short term storage that Katib needs from long term model tracking. We can then start to think about what the right story is for model tracking and potentially look at alternatives to ModelDB (e.g. StudioML).

Especially since ModelDB doesn't seem that active.

@jlewi
Contributor Author

jlewi commented Sep 24, 2018

@YujiOshima What would it take to make Katib (but not ModelDB) robust to pod failure? For example, if it's true that ModelDB is only used for long term storage and not during actual hyperparameter searches, what needs to happen so that if pods are preempted during the HP tuning job, the job can complete successfully, even if not all of the data is successfully persisted in ModelDB?

@YujiOshima
Contributor

@jlewi In Katib, the short-term storage needed for hp-tuning is already separated from the long-term storage (for model tracking). So even without ModelDB, we can persist Katib's data right now.

I agree with thinking about alternatives to ModelDB since it is not active.
I think MLFlow is one good choice (mentioned in kubeflow/kubeflow#136).
I'm not familiar with StudioML, but I know there are many tools for model management.
One idea is to build a model management operator that makes the tool selectable, like the DL framework.
I can make a POC for an MLFlow operator and integrate it with Katib.
WDYT?

@jlewi
Contributor Author

jlewi commented Sep 28, 2018

For #178, we will need to get those changes into our Katib prototype
https://github.com/kubeflow/kubeflow/tree/master/kubeflow/katib

I think for experiment / model tracking we need a database not an operator. Users could have 1000's of models but most of those will just be entries in a DB with data attached to them.

@YujiOshima
Contributor

YujiOshima commented Sep 28, 2018

@jlewi
> we will need to get those changes into our Katib prototype

I will try it.

> I think for experiment / model tracking we need a database not an operator. Users could have 1000's of models but most of those will just be entries in a DB with data attached to them.

OK, but Katib/Kubeflow users badly need experiment tracking and model management.
So I want to switch from ModelDB to MLFlow and enrich Katib's model management API so that users can use model management without hp-tuning in v0.04.
WDYT?

@jlewi
Contributor Author

jlewi commented Oct 1, 2018

@YujiOshima Why MLFlow? When I last looked, it didn't look like it was using a DB to track models. It looked like it was just using a filesystem. I'm not sure that's the best solution.

How about writing up a proposal and considering the various options?

@jlewi
Contributor Author

jlewi commented Oct 8, 2018

@inc0 Is this fixed?

@jlewi
Contributor Author

jlewi commented Oct 9, 2018

/area 0.4.0

@inc0

inc0 commented Oct 9, 2018

PVC is fixed. For model tracking, please include your feedback here: kubeflow/community#195. Let's design this thing properly.

@jlewi
Contributor Author

jlewi commented Oct 14, 2018

@inc0 Does that mean there's no way to make model tracking work reliably with ModelDB? We have to go and build a whole new model tracking system?

@YujiOshima
Contributor

I propose a new UI (#208) for Katib and want to remove ModelDB.
When we introduce another tool (MLFlow, StudioML, etc.), let's open a new issue.

@jlewi
Contributor Author

jlewi commented Oct 30, 2018

kubeflow/kubeflow#1678 added the PVC for MySQL for Katib.

@jlewi
Contributor Author

jlewi commented Dec 17, 2018

@YujiOshima What's the status of this? It looks like you removed ModelDB. Is Katib now deployed with durable storage?

/cc @richardsliu

@YujiOshima
Contributor

/close

@k8s-ci-robot

@YujiOshima: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
