
[Proposal] Make external SQL-compatible database requirement for Kubeflow #188

Closed · inc0 opened this issue Sep 19, 2018 · 11 comments

inc0 commented Sep 19, 2018

SQL database as a requirement for Kubeflow

Currently only one component of Kubeflow requires SQL, and that's Katib. In various discussions we determined that another planned component, experiment tracking, will also require a proper database (due to its scale and the need for advanced querying).

Running a database in Kubernetes is very hard to set up and maintain properly, and we shouldn't be in the business of writing operational code for databases. Currently Katib deploys a simplistic MySQL instance (single node, no persistence), which isn't viable for production use cases. The cost of building code to manage a fully clustered, persistent, and stateful relational database is very high, and doing so is outside the scope of Kubeflow.

During these discussions we determined that an external database is a reasonable requirement:

  • Every major public cloud already has a managed database offering
  • On-prem users can deploy a SQL-compatible database on their own (inside or outside Kubernetes, for example with Vitess)

I'd like to propose that we make an external SQL database a hard requirement for the Kubeflow components that need one, and that we make its configuration uniform across the project.

Additional cost to deployment

Adding a big requirement like this shouldn't be taken lightly. Kubeflow already struggles with its deployment experience, and this will require deployers (especially first-time deployers) to take a few additional steps. The problem can be partially addressed with good documentation for all public clouds plus bare metal. Besides docs, we could also ship a small YAML manifest that deploys a PoC database, one that's not production ready but is enough to try out Kubeflow; a sketch of such a manifest follows.
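
For illustration, a minimal sketch of what that PoC manifest could look like, assuming MySQL (the kubeflow-poc-db name and credentials are hypothetical; a single replica backed by emptyDir, so all data is lost whenever the pod is rescheduled):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubeflow-poc-db
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubeflow-poc-db
  template:
    metadata:
      labels:
        app: kubeflow-poc-db
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: poc-only          # PoC credentials, never use in production
        - name: MYSQL_DATABASE
          value: kubeflow
        ports:
        - containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
      volumes:
      - name: data
        emptyDir: {}               # ephemeral storage: data is gone when the pod moves
---
apiVersion: v1
kind: Service
metadata:
  name: kubeflow-poc-db
spec:
  selector:
    app: kubeflow-poc-db
  ports:
  - port: 3306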

Uniform configuration

I think it should be a requirement across the project that all database configs are handled the same way. One example approach would be to ask users to manually create a Kubernetes secret and set its name as a parameter (with the default being {kubeflow component name}-database). For example:

apiVersion: v1
kind: Secret
metadata:
  name: katib-database
type: Opaque
data:
  database_username: YWRtaW4=              # base64("admin")
  database_password: MWYyZDFlMmU2N2Rm      # base64("1f2d1e2e67df")
  database_name: ZGF0YWJhc2VfbmFtZQ==      # base64("database_name"), encoded without a trailing newline
  database_host: MTkyLjE2OC4xLjE6MzMwNg==  # base64("192.168.1.1:3306"), encoded without a trailing newline

After that, each component can expose the secret values to its pods as environment variables and use them in code; for example:
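
A minimal sketch of the pod-spec fragment on the consuming side, using envFrom (the katib-manager name and image are hypothetical); each key in the secret becomes an environment variable of the same name:

spec:
  containers:
  - name: katib-manager             # hypothetical component
    image: example/katib-manager    # hypothetical image
    envFrom:
    - secretRef:
        name: katib-database        # exposes database_username, database_password,
                                    # database_name, and database_host as env vars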

Note about schema upgrades

Since we'll assume an existing relational database, we will need to manage schema migrations. This can be handled by projects like alembic in Python. Seamless upgrades will be a hard requirement for production deployments. We also need to account for the fact that certain types of schema upgrades are computationally very expensive and can lock the database for hours at a time; these should be avoided. Each schema migration should be detailed in the release notes, including a warning if it's one of the problematic ones.

We should also build CI infrastructure that will test the following:

  • Deploy version X of Kubeflow and fill it with fake data
  • Upgrade to version X+1 and verify the integrity of the old data

Cross-compatibility of SQL

Most SQL databases use the same core set of statements, but that's not always the case (for example, PostgreSQL supports materialized views while MySQL does not). I think it's good for Kubeflow to avoid engine-specific features. I propose that we test compatibility against:

  • MySQL / MariaDB
  • PostgreSQL
  • CockroachDB (this is, afaik, the best option for a Kubernetes-native relational database)

We should build test scenarios that deploy each of these engines and run our CI suite against them, and run this periodically; a sketch follows.
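
A sketch of what such a periodic compatibility matrix could look like, assuming a GitHub-Actions-style CI (the image tags and the make target are illustrative assumptions, not existing Kubeflow jobs):

name: sql-compatibility
on:
  schedule:
  - cron: "0 4 * * *"              # run the whole matrix periodically
jobs:
  compat:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
        - db: mysql
          start: docker run -d -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root mysql:8.0
        - db: postgres
          start: docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=root postgres:11
        - db: cockroach
          start: docker run -d -p 26257:26257 cockroachdb/cockroach start-single-node --insecure
    steps:
    - uses: actions/checkout@v2
    - name: Start ${{ matrix.db }}
      run: ${{ matrix.start }}
    - name: Run compatibility suite
      run: make test-sql-compat DB=${{ matrix.db }}   # hypothetical test entrypoint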

jlewi commented Sep 26, 2018

Thanks; this is great.

Are we really in a position to start talking about databases at the Kubeflow level, as opposed to in the context of individual microservices?

It's unclear to me, at least, what the right database for individual services should be. For example, should Katib use SQL, NoSQL, key-value, etc.?

Is SQL really a Kubeflow requirement? If a user doesn't have SQL, I would still expect them to be able to use large parts of the platform (pretty much everything except Katib at this point).

inc0 commented Sep 26, 2018

The reason I want to stick to one is that if someone wants to use all of Kubeflow, they would only need one database cluster, not four different ones. We can still note which components need a db and which don't, but if a component needs a database, it should be this one.

How many things could we do if we had a database to work with? Use cases I can think of:

  1. Experiment tracking - this one will be popular; it will tie all of Kubeflow together if we do it right, and most people will want it
  2. CI/CD for ML - same as above; I expect this to be quite popular too

As for picking SQL vs NoSQL etc., that's a valid discussion to have. I assumed SQL because otherwise we'd need to re-architect Katib to use a different backend, and frankly, SQL is not a bad solution; I'd probably vote for it myself. I'd totally get rid of Mongo if at all possible (I think it could be replaced with experiment tracking + TensorBoard, really...).

jlewi commented Sep 27, 2018

I don't doubt the benefit of minimizing the number of databases used, but I don't think we should mandate or require that each component only use SQL. That seems like a design decision best left to individual components. Uniformity is one consideration that should be taken into account.

inc0 commented Sep 27, 2018

All of k8s is built on top of a single k/v store. As for the various database options (and why SQL):

k/v stores:

  • etcd - my initial idea, but we'd run into issues with scale, and there is no good mechanism for complex queries
  • Redis - highly performant, and we can do really complex things via Lua (if we need to), but its fault tolerance leaves a lot to be desired; I'm not sure we want to deal with data loss there

nosql:

  • MongoDB - personal bias - it totally blew up on me. Its fault tolerance is, uhh... I also ran into a ton of scalability issues, but that might have been my config; it was a few years ago.
  • Cassandra - an option; I've never run it personally, but many people love it

SQL:
First of all, it's familiar. Now for the operational considerations:

  • MySQL and PostgreSQL - very popular, with a ton of hosted options on public clouds, and any operator worth their salary knows how to run them reasonably on-prem. Really bad on k8s though; they weren't designed with a "pods die, pods restart" mindset.
  • CockroachDB - the new kid on the block; works well on Kubernetes and looks promising

As for "require that each component only use SQL", it's rule we set for ourselves, which means we also can break it if we really need to. What I want to setup is a standard that "we already have database, so if you want to use different, you better have really good reason to". This is one of reasons why k8s succeed so much (from operators standpoint), you have one database - ETCD, and it's well maintained. We can pick db that's not SQL-based, that's fine, but as I said, we should pick 1 and try hard to keep 1.

Currently, all the use cases I can think of can be modelled with SQL. It's also by far the most familiar query language, with lots of high-quality utility libraries in every programming language. That does require us to maintain migrations, and that's really hard (especially in projects that expect to hold thousands of records, where any blocking migration, like an ALTER TABLE, can lock the database for hours at a time). I think we can avoid blocking migrations with careful deprecation: instead of changing the datatype of a column, add a new column with the new datatype, have the logic look up both columns, provide a tool to slowly migrate column1 to column2, and remove column1 in the next version.

@mpvartak

My two cents on the DB for experiment tracking: having built V1 of ModelDB, I can say fairly confidently that a relational DB is not optimal for storing metadata. Different users want to store very different data for their models -- think authors, teams, descriptions, etc. -- and the type of data might also change as they go along. So we found that keeping the DB schemaless was important. As for ModelDB, we are implementing the storage layer as an interface, so users can customize which DBs to use. But we have found document DBs to best fit the bill.

inc0 commented Oct 19, 2018

You can implement k/v storage in a SQL-based db as well.

jlewi commented Nov 13, 2018

@inc0 What are the next steps for this proposal?

boniek83 commented Aug 19, 2019

This is really important for running production Kubeflow clusters. I can't imagine anybody wanting to run the stock MySQL that Katib uses on Kubernetes (without replication, backups, monitoring, etc.). The same is true for any database, though.

@descampsk

Any new information on this proposal?
We really need the option of an external database to be able to use Kubeflow in production.

stale bot commented May 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Issue-Label Bot is automatically applying the labels:

Label: kind/feature (probability 0.71)

