[Proposal] Make external SQL-compatible database requirement for Kubeflow #188
Comments
Thanks; this is great. Are we really in a position to start talking about databases at the Kubeflow level, as opposed to in the context of individual microservices? It's unclear to me, at least, what the right database for individual services should be. For example, should Katib use SQL, NoSQL, key-value, etc.? Is SQL really a Kubeflow requirement? If a user doesn't have SQL, I would still expect them to be able to use large parts of the platform (pretty much everything except Katib at this point).
The reason I want to stick to one is that if someone wants to use all of Kubeflow, they would only need one database cluster, not 4 different ones. We can still note which components need a DB and which don't, but if a component needs a database, it should be this one. Think how many things we could do if we had a database to work with. Use cases I can think of:
As for picking SQL vs NoSQL etc., that's a valid discussion to have. I assumed SQL because we'd need to re-architect Katib to use a different backend, and frankly, SQL is not a bad solution; I'd probably vote for it myself. I'd totally get rid of Mongo if at all possible (I think it could be replaced with experiment tracking + TensorBoard, really...).
I don't doubt the benefit of minimizing the number of databases used, but I don't think we should mandate or require that each component only use SQL. That seems like a design decision best left to individual components. Uniformity seems like one consideration that should be taken into account.
The whole of k8s is built on top of a single k/v store. As for various databases (and why SQL):
k/v stores:
NoSQL:
SQL:
As for "require that each component only use SQL", it's rule we set for ourselves, which means we also can break it if we really need to. What I want to setup is a standard that "we already have database, so if you want to use different, you better have really good reason to". This is one of reasons why k8s succeed so much (from operators standpoint), you have one database - ETCD, and it's well maintained. We can pick db that's not SQL-based, that's fine, but as I said, we should pick 1 and try hard to keep 1. Currently all of cases I can think of can be modelled with SQL. It's also by far most familiar language with lots of high quality utility libs in any language. That does require us to maintain migrations and that's really hard (especially in projects which looks at having thousands of records, any blocking migration, like alter table, can lock out database for hours). I think we can avoid blocking migrations by clever deprecation (instead of changing datatype of a column, add new column with new datatype, in logic lookup both columns, provide tool to slowly migrate column1 to column2, remove column1 in next version). |
My two cents on the DB for experiment tracking: having built V1 of ModelDB, I can say fairly confidently that a relational DB is not optimal for storing metadata. Different users want to store very different data for their models (think authors, teams, descriptions, etc.), and the type of data might also change as they go along. So we found that keeping the DB schemaless was important. As for ModelDB, we are implementing the storage as an interface, so users can customize which DBs to use. But we have found document DBs to best fit the bill.
You can implement K/V storage in a SQL-based DB as well.
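For illustration (not from the thread), a schemaless key/value layout in a relational engine can be as simple as the following; the table and column names are hypothetical:

```sql
-- One row per metadata entry; `value` holds arbitrary JSON documents,
-- so the effective schema can evolve without further migrations.
CREATE TABLE metadata (
    entity_id VARCHAR(64)  NOT NULL,
    attr_key  VARCHAR(128) NOT NULL,
    value     JSON,
    PRIMARY KEY (entity_id, attr_key)
);
```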
@inc0 What are the next steps for this proposal?
This is really important for running production Kubeflow clusters. I can't imagine anybody wanting to run the stock MySQL that Katib uses on Kubernetes (without replication, backups, monitoring, etc.). The same is true for any database, though.
Any new information on this proposal?
SQL database as a requirement for Kubeflow
Currently only one component of Kubeflow requires SQL, and that's Katib. In various discussions we determined that another planned component, experiment tracking, will also require some sort of proper database (due to scale and the need for advanced querying).
A database running in Kubernetes is very hard to set up and maintain properly, and we shouldn't be in the business of writing operational code for databases. Currently in Katib we deploy a simplistic MySQL instance (single node, no persistence), which isn't viable for production use cases. The cost of building code that manages a fully clustered, persistent, stateful relational database is very high, and doing so is outside the scope of Kubeflow.
During discussions we determined that an external database is a reasonable requirement:
I'd like to propose that we make an external SQL database a hard requirement for those components of Kubeflow which require one, and make sure that its configuration is uniform across the project.
Additional cost to deployment
Adding a big requirement like that shouldn't be taken lightly. Kubeflow already struggles with the deployment experience, and this will require deployers (especially first-time deployers) to take a few additional steps. This problem can be partially solved by adding good documentation for all public clouds + bare metal. Besides docs, we could also add a small YAML manifest that deploys a PoC database, one that's not production ready but is enough to try out Kubeflow; a sketch of such a manifest follows.
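A minimal sketch of what such a PoC manifest might look like, assuming MySQL and a secret as described under "Uniform configuration" below (all names, images, and values are illustrative):

```yaml
# PoC only: single replica, ephemeral storage, no replication or backups.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubeflow-poc-mysql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubeflow-poc-mysql
  template:
    metadata:
      labels:
        app: kubeflow-poc-mysql
    spec:
      containers:
      - name: mysql
        image: mysql:5.7
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: katib-database   # illustrative secret name
              key: password
        ports:
        - containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
      volumes:
      - name: data
        emptyDir: {}   # data is lost when the pod goes away
---
apiVersion: v1
kind: Service
metadata:
  name: kubeflow-poc-mysql
spec:
  selector:
    app: kubeflow-poc-mysql
  ports:
  - port: 3306
```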
Uniform configuration
I think it should be a requirement across the project that all database configs are handled in the same way. One example approach would be to ask users to manually create a Kubernetes secret and set its name as a parameter (with the default being {kubeflow component name}-database). For example:
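(The original example was not preserved here; the following is a sketch of what such a secret might look like, with hypothetical key names and values:)

```yaml
# Hypothetical secret for the Katib component; key names are illustrative.
apiVersion: v1
kind: Secret
metadata:
  name: katib-database   # default: {kubeflow component name}-database
type: Opaque
stringData:
  host: mysql.example.com
  port: "3306"
  username: katib
  password: change-me
  database: katib
```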
After that we can mount the secret values into pods as environment variables and use them in code; a sketch follows.
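Continuing the sketch above (again with illustrative names), a component's pod spec fragment could reference the secret like this:

```yaml
# Fragment of a component's pod spec: expose every key of the secret as an
# environment variable, prefixed with DB_ (DB_host, DB_password, ...).
containers:
- name: katib-manager
  image: example/katib-manager   # illustrative image
  envFrom:
  - prefix: DB_
    secretRef:
      name: katib-database
```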
Note about schema upgrades
Since we'll assume an existing relational database, we will need to manage schema migrations. This can be handled by projects like Alembic in Python. Seamless upgrades will be a hard requirement for production projects. We will also need to take into account that certain types of schema upgrades are very expensive computationally and can lock the database for hours at a time; these should be avoided (the column-deprecation approach sketched above is one way to do so). Each schema migration should be detailed in the release notes, including a warning if it's one of the problematic ones.
We should also build CI infra that will test:
Cross-compatibility of SQL
Most SQL databases use the same set of instructions, but that's not always the case (for example, PostgreSQL supports materialized views, which MySQL does not). I think it's good for Kubeflow to avoid engine-specific features. I propose that we test compatibility against:
We should build test scenarios that deploy each of these engines and run our CI against them, periodically.
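As one illustration (not from the proposal, and assuming a GitHub-Actions-style CI rather than whatever CI Kubeflow actually uses), such a cross-engine matrix could look like:

```yaml
# Hypothetical CI matrix: run the test suite once per database engine.
jobs:
  sql-compat:
    strategy:
      matrix:
        engine: ["mysql:5.7", "postgres:11"]
    runs-on: ubuntu-latest
    services:
      db:
        image: ${{ matrix.engine }}
        env:
          MYSQL_ROOT_PASSWORD: test   # read by the mysql image
          POSTGRES_PASSWORD: test     # read by the postgres image
    steps:
    - uses: actions/checkout@v2
    - run: ./test/run-sql-compat.sh   # illustrative test entrypoint
```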