Katib needs to use durable storage that outlives the pod. #137
Comments
No, I don't think you're missing anything. It is accurate IMO. We should add a PVC to support persistence.
@jlewi @gaocegege I agree we need a PVC.
@YujiOshima Since we have a DB for ModelDB, why is the backend stateful? Could we implement it as stateless?
@gaocegege Sorry, I don't understand the ModelDB backend perfectly, but I couldn't persist the model data with the DB alone.
Any update on this? For Katib to be minimally viable the data should be resilient to pod restarts. What needs to happen to address this?
@jlewi We need to persist ModelDB's data. Maybe we need to do something. I will work on this.
/assign @YujiOshima
@YujiOshima Any progress on this? I'd really like to include this in the 0.3 release.
@jlewi Sorry, I need more time, since the ModelDB-backend app stores SQLite data and it appears to reset that data at the start of its process. We need to fix it.
We should fix this first.
VertaAI/modeldb#221 suggests dropping SQLite in favor of MongoDB. This might require more thought about the storage/DB story for Katib. I've heard from a number of folks that MongoDB is not the simplest DB to productionize/operationalize, so dropping SQLite in favor of MongoDB might not be a step in the right direction. On the other hand, my understanding is that Katib is not that closely coupled with ModelDB/MongoDB. So maybe the next step is to start separating out the short-term storage that Katib needs from long-term model tracking. We can then start to think about what the right story is for model tracking and potentially look at alternatives to ModelDB (e.g. StudioML), especially since ModelDB doesn't seem that active.
@YujiOshima What would it take to make Katib (but not ModelDB) robust to pod failure? For example, if it's true that ModelDB is only used for long-term storage and not during actual hyperparameter searches, what needs to happen so that if pods are preempted during the HP tuning job, the job can complete successfully, even if not all of the data is successfully persisted in ModelDB?
@jlewi In Katib, the short-term storage that is needed for HP tuning is already separated from the long-term storage (for model tracking). So without ModelDB, we can persist data for Katib right now. I agree with thinking about alternatives to ModelDB since it is not active.
For #178, we will need to get those changes into our Katib prototype. I think for experiment / model tracking we need a database, not an operator. Users could have 1000's of models, but most of those will just be entries in a DB with data attached to them.
@jlewi >we will need to get those changes into our Katib prototype
OK, but experiment tracking and model management are badly needed by Katib/Kubeflow users.
@YujiOshima Why MLflow? When I last looked, it didn't look like it was using a DB to track models. It looked like it was just using a filesystem. I'm not sure that's the best solution. How about writing up a proposal and considering the various options?
@inc0 Is this fixed?
/area 0.4.0
PVC is fixed. For model tracking, please include your feedback here: kubeflow/community#195. Let's design this thing properly.
@inc0 Does that mean there's no way to make model tracking work reliably with ModelDB? We have to go and build a whole new model tracking system?
I propose a new UI (#208) for Katib and want to remove ModelDB.
kubeflow/kubeflow#1678 added the PVC for MySQL for Katib.
@YujiOshima What's the status of this? It looks like you removed ModelDB. Is Katib now deployed with durable storage? /cc @richardsliu
/close
@YujiOshima: Closing this issue.
Katib uses two databases: MySQL and MongoDB.
It doesn't look like we are using any sort of durable storage for either one.
Here's the MySQL deployment; it doesn't look like it's using a persistent volume:
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/katib/vizier.libsonnet#L185
Here's the MongoDB deployment:
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/katib/modeldb.libsonnet#L39
@YujiOshima @gaocegege Is this accurate or am I missing something?
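For anyone who wants to reproduce the check above on a rendered manifest, here is a minimal Python sketch. It is not code from the Katib or Kubeflow repositories: the function name `uses_persistent_volume` and both example manifests are hypothetical. It only looks at `spec.template.spec.volumes` of a Deployment, so it would not detect a StatefulSet that uses `volumeClaimTemplates`.

```python
def uses_persistent_volume(deployment: dict) -> bool:
    """Return True if any pod volume in the Deployment is backed by a PVC."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    volumes = pod_spec.get("volumes") or []
    # A PVC-backed volume carries a "persistentVolumeClaim" key.
    return any("persistentVolumeClaim" in volume for volume in volumes)


# Hypothetical manifest: the DB writes to an emptyDir, which is lost with the pod.
ephemeral_db = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "db", "image": "mysql"}],
                "volumes": [{"name": "data", "emptyDir": {}}],
            }
        }
    },
}

# Hypothetical manifest: the same DB with its data volume bound to a PVC.
durable_db = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "db", "image": "mysql"}],
                "volumes": [
                    {
                        "name": "data",
                        "persistentVolumeClaim": {"claimName": "db-data"},
                    }
                ],
            }
        }
    },
}

print(uses_persistent_volume(ephemeral_db))
print(uses_persistent_volume(durable_db))
```

A Deployment with no `volumes` at all is also reported as non-durable, which matches the situation described in this issue.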