
[Proposal] Make external SQL-compatible database requirement for Kubeflow #188

Closed · inc0 opened this issue Sep 19, 2018 · 11 comments

inc0 commented Sep 19, 2018

SQL database as a requirement for Kubeflow

Currently only one component of Kubeflow requires SQL, and that's Katib. In various discussions we determined that another planned component, experiment tracking, will also require a proper database (due to its scale and the need for advanced querying).

Running a database in Kubernetes is very hard to set up and maintain properly, and we shouldn't be in the business of writing operational code for databases. Currently Katib deploys a simplistic MySQL instance (single node, no persistence), which isn't viable for production use cases. The cost of building code to manage a fully clustered, persistent, and stateful relational database is very high, and doing so is outside the scope of Kubeflow.

During these discussions we determined that an external database is a reasonable requirement:

  • Every major public cloud already has a managed database offering
  • On-prem users can deploy a SQL-compatible database on their own (inside or outside Kubernetes, for example with Vitess)

I'd like to propose that we make an external SQL database a hard requirement for the Kubeflow components that need one, and that we make its configuration uniform across the project.

Additional cost to deployment

Adding a big requirement like this shouldn't be taken lightly. Kubeflow already struggles with its deployment experience, and this will require deployers (especially first-time deployers) to take a few additional steps. The problem can be partially addressed with good documentation for all public clouds plus bare metal. Besides docs, we could also ship a small YAML manifest that deploys a PoC database, one that's not production ready but is enough to try out Kubeflow; a sketch of such a manifest follows.
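
For illustration, a minimal sketch of what that PoC manifest could look like, assuming MySQL (the kubeflow-poc-db name and credentials are hypothetical; a single replica backed by emptyDir, so all data is lost whenever the pod is rescheduled):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubeflow-poc-db
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubeflow-poc-db
  template:
    metadata:
      labels:
        app: kubeflow-poc-db
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: poc-only          # PoC credentials, never use in production
        - name: MYSQL_DATABASE
          value: kubeflow
        ports:
        - containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
      volumes:
      - name: data
        emptyDir: {}               # ephemeral storage: data is gone when the pod moves
---
apiVersion: v1
kind: Service
metadata:
  name: kubeflow-poc-db
spec:
  selector:
    app: kubeflow-poc-db
  ports:
  - port: 3306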

Uniform configuration

I think it should be a requirement across the project that all database configs are handled the same way. One example approach would be to ask users to manually create a Kubernetes secret and set its name as a parameter (with the default being {kubeflow component name}-database). For example:

apiVersion: v1
kind: Secret
metadata:
  name: katib-database
type: Opaque
data:
  database_username: YWRtaW4=              # base64("admin")
  database_password: MWYyZDFlMmU2N2Rm      # base64("1f2d1e2e67df")
  database_name: ZGF0YWJhc2VfbmFtZQ==      # base64("database_name"), encoded without a trailing newline
  database_host: MTkyLjE2OC4xLjE6MzMwNg==  # base64("192.168.1.1:3306"), encoded without a trailing newline

After that, each component can expose the secret values to its pods as environment variables and use them in code; for example:
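
A minimal sketch of the pod-spec fragment on the consuming side, using envFrom (the katib-manager name and image are hypothetical); each key in the secret becomes an environment variable of the same name:

spec:
  containers:
  - name: katib-manager             # hypothetical component
    image: example/katib-manager    # hypothetical image
    envFrom:
    - secretRef:
        name: katib-database        # exposes database_username, database_password,
                                    # database_name, and database_host as env vars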

Note about schema upgrades

Since we'll assume an existing relational database, we will need to manage schema migrations. This can be handled by projects like alembic in Python. Seamless upgrades will be a hard requirement for production deployments. We also need to account for the fact that certain types of schema upgrades are computationally very expensive and can lock the database for hours at a time; these should be avoided. Each schema migration should be detailed in the release notes, including a warning if it's one of the problematic ones.

We should also build CI infrastructure that will test the following:

  • Deploy version X of Kubeflow and fill it with fake data
  • Upgrade to version X+1 and verify the integrity of the old data

Cross-compatibility of SQL

Most SQL databases use the same core set of statements, but that's not always the case (for example, PostgreSQL supports materialized views while MySQL does not). I think it's good for Kubeflow to avoid engine-specific features. I propose that we test compatibility against:

  • MySQL / MariaDB
  • PostgreSQL
  • CockroachDB (this is, afaik, the best option for a Kubernetes-native relational database)

We should build test scenarios that deploy each of these engines and run our CI suite against them, and run this periodically; a sketch follows.
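
A sketch of what such a periodic compatibility matrix could look like, assuming a GitHub-Actions-style CI (the image tags and the make target are illustrative assumptions, not existing Kubeflow jobs):

name: sql-compatibility
on:
  schedule:
  - cron: "0 4 * * *"              # run the whole matrix periodically
jobs:
  compat:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
        - db: mysql
          start: docker run -d -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root mysql:8.0
        - db: postgres
          start: docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=root postgres:11
        - db: cockroach
          start: docker run -d -p 26257:26257 cockroachdb/cockroach start-single-node --insecure
    steps:
    - uses: actions/checkout@v2
    - name: Start ${{ matrix.db }}
      run: ${{ matrix.start }}
    - name: Run compatibility suite
      run: make test-sql-compat DB=${{ matrix.db }}   # hypothetical test entrypoint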

jlewi commented Sep 26, 2018

Thanks; this is great.

Are we really in a position to start talking about databases at the Kubeflow level, as opposed to in the context of individual microservices?

It's unclear to me, at least, what the right database for individual services should be. For example, should Katib use SQL, NoSQL, key-value, etc.?

Is SQL really a Kubeflow requirement? If a user doesn't have SQL, I would still expect them to be able to use large parts of the platform (pretty much everything except Katib at this point).

inc0 commented Sep 26, 2018

The reason I want to stick to one is that if someone wants to use all of Kubeflow, they would only need one database cluster, not four different ones. We can still note which components need a db and which don't, but if a component needs a database, it should be this one.

How many things could we do if we had a database to work with? Use cases I can think of:

  1. Experiment tracking - this one will be popular; it will tie all of Kubeflow together if we do it right, and most people will want it
  2. CI/CD for ML - same as above; I expect this to be quite popular too

As for picking SQL vs NoSQL etc., that's a valid discussion to have. I assumed SQL because otherwise we'd need to re-architect Katib to use a different backend, and frankly, SQL is not a bad solution; I'd probably vote for it myself. I'd totally get rid of Mongo if at all possible (I think it could be replaced with experiment tracking + TensorBoard, really...).

jlewi commented Sep 27, 2018

I don't doubt the benefit of minimizing the number of databases used, but I don't think we should mandate or require that each component only use SQL. That seems like a design decision best left to individual components. Uniformity is one consideration that should be taken into account.

inc0 commented Sep 27, 2018

All of k8s is built on top of a single k/v store. As for the various database options (and why SQL):

k/v stores:

  • etcd - my initial idea, but we'd run into issues with scale, and there is no good mechanism for complex queries
  • Redis - highly performant, and we can do really complex things via Lua (if we need to), but its fault tolerance leaves a lot to be desired; I'm not sure we want to deal with data loss there

nosql:

  • MongoDB - personal bias - it totally blew up on me. Its fault tolerance is, uhh... I also ran into a ton of scalability issues, but that might have been my config; it was a few years ago.
  • Cassandra - an option; I've never run it personally, but many people love it

SQL:
First of all, it's familiar. Now for the operational considerations:

  • MySQL and PostgreSQL - very popular, with a ton of hosted options on public clouds, and any operator worth their salary knows how to run them reasonably on-prem. Really bad on k8s though; they weren't designed with a "pods die, pods restart" mindset.
  • CockroachDB - the new kid on the block; works well on Kubernetes and looks promising

As for "require that each component only use SQL", it's rule we set for ourselves, which means we also can break it if we really need to. What I want to setup is a standard that "we already have database, so if you want to use different, you better have really good reason to". This is one of reasons why k8s succeed so much (from operators standpoint), you have one database - ETCD, and it's well maintained. We can pick db that's not SQL-based, that's fine, but as I said, we should pick 1 and try hard to keep 1.

Currently, all the use cases I can think of can be modelled with SQL. It's also by far the most familiar query language, with lots of high-quality utility libraries in every programming language. That does require us to maintain migrations, and that's really hard (especially in projects that expect to hold thousands of records, where any blocking migration, like an ALTER TABLE, can lock the database for hours at a time). I think we can avoid blocking migrations with careful deprecation: instead of changing the datatype of a column, add a new column with the new datatype, have the logic look up both columns, provide a tool to slowly migrate column1 to column2, and remove column1 in the next version.

@mpvartak

My two cents on the DB for experiment tracking: having built V1 of ModelDB, I can say fairly confidently that a relational DB is not optimal for storing metadata. Different users want to store very different data for their models -- think authors, teams, descriptions, etc. -- and the type of data might also change as they go along. So we found that keeping the DB schemaless was important. As for ModelDB, we are implementing the storage layer as an interface, so users can customize which DBs to use. But we have found document DBs to best fit the bill.

inc0 commented Oct 19, 2018

You can implement k/v storage in a SQL-based db as well.

jlewi commented Nov 13, 2018

@inc0 What are the next steps for this proposal?

boniek83 commented Aug 19, 2019

This is really important for running production Kubeflow clusters. I can't imagine anybody wanting to run the stock MySQL that Katib uses on Kubernetes (without replication, backups, monitoring, etc.). The same is true for any database, though.

@descampsk

Any new information on this proposal?
We really need the option of an external database to be able to use Kubeflow in production.

stale bot commented May 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Issue-Label Bot is automatically applying the labels:

Label: kind/feature (probability 0.71)

