
Improve repo structure and delete outdated manifests #1554

Closed
thesuperzapper opened this issue Sep 14, 2020 · 26 comments

Comments

@thesuperzapper
Member

thesuperzapper commented Sep 14, 2020

This repository is extremely messy and makes for a horrible user experience. The mess has caused us real issues: we botched the 1.1 release due to a lack of visibility into what is even in the manifests (see #1553).

When paired with the almost complete lack of docs relating to Kubeflow 1.1, this leaves Kubeflow nearly impossible to install. Just take a look at recent issues (across all the repos) and our Slack and you will see what I mean.

My proposal is to make a new manifests repository and make some changes as we do. Specifically, we should:

  1. separate each of the 'distributions' into its own repository (similar to gcp-blueprints),
  2. only have the actual Kubeflow manifests in the core manifests repo (that is, no Istio, cert-manager, or Knative, as they are ostensibly part of a distribution, not Kubeflow),
  3. actually keep the docs up to date

EDIT: see this comment below for my proposal of a 'Generic' distribution (to replace most of the current ones)

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

| Label | Probability |
| --- | --- |
| kind/feature | 0.60 |
| platform/gcp | 0.92 |


@jlewi
Contributor

jlewi commented Sep 14, 2020

I think this is relevant to the discussion in kubeflow/community#402; in particular, clarifying the responsibility between application owners, the proposed deployments WG, and the platform owners.

I think the lack of structure and code accumulation in kubeflow/manifests is an outcome of a lack of clear ownership/responsibilities.

@thesuperzapper
Member Author

/priority p1

@thesuperzapper
Member Author

I also believe we no longer need the concept of a 'distribution' (in the form we currently have it), as this was primarily made to support GCP deploying a full K8S stack with Kubeflow on top.

If you look at Slack/GitHub-Issues you will see most users are installing Kubeflow on existing clusters of all types: GKE, AKS, EKS, native-k8s. We need to focus on supporting them!

To facilitate this, I propose we:

  1. Do the above migration to a new kubeflow/manifests for all ACTUAL Kubeflow manifests
    • INCLUDING: pipelines, jupyter-web-app, etc
    • NOT INCLUDING: authentication (like dex), istio, argo-workflows, cert-manager, knative, etc
  2. Consolidate all distributions (that are willing) into a 'Generic' one:
    • this distribution would:
      • assume the user at least already has a K8S cluster running
      • not depend on any vendor specific feature
      • live in its own git repo (as proposed above)
      • include everything you would need to deploy Kubeflow on an empty cluster: auth (dex), istio, cert-manager, argo-workflows, knative, mysql, etc
        • NOTE: we would let users pick the pieces they need (potentially with kfctl; a sketch of this follows after the list below)
        • NOTE: we would clearly document which versions/configs of these tools are supported (if they don't choose to install them with Kubeflow)
      • get its actual Kubeflow manifests from the new kubeflow/manifests repo (as proposed above)
      • be intended to be used with GitOps (but not tied to any specific GitOps tool)
    • I believe this is good for EVERYBODY:
      • as currently, all distributions are pretty much the same
      • vendors can still validate this distribution for their customers
      • the overhead for releasing a new Kubeflow version is reduced to a single place
      • it means that Kubeflow is an OPEN STANDARD (which can be run anywhere)
      • users will actually be able to install/upgrade Kubeflow!
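
To make the "pick the pieces" idea above concrete, here is a minimal sketch of what a top-level kustomization for such a 'Generic' distribution could look like. The folder names (common/..., apps/...) are made up for the example and do not describe an existing repo layout:

```yaml
# Hypothetical top-level kustomization for a 'Generic' distribution.
# Folder names are illustrative only.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Cluster prerequisites: comment out anything already installed on the cluster
  - common/cert-manager/base
  - common/istio/base
  - common/dex/base
  - common/knative/base
  # Actual Kubeflow components, consumed from the new kubeflow/manifests repo
  - apps/pipelines/base
  - apps/jupyter-web-app/base
  - apps/centraldashboard/base
```

A user with Istio or cert-manager already installed would simply drop those lines and keep only the Kubeflow components they need.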

@mameshini

The Agile Stacks distribution of Kubeflow is available here: https://github.com/agilestacks/stack-ml-eks
It is open source and it can be merged with the Kubeflow repository in the future. We have taken a path of GitOps and customized the Kubeflow manifests to get more control over all components that are part of the Kubernetes platform: dex, istio, letsencrypt, cert-manager, mysql, knative, ceph, csi plugin. The above example is available for AWS EKS, and we are working on similar distributions for GKE, AKS, and on-prem.

@thesuperzapper
Member Author

Another option (instead of Part 1) is that we discontinue the manifests repo entirely, and store the manifests for each component in the same repository as the component code.

However, I see a few issues with that approach:

  • Non-Kubeflow components, like seldon-operator and spark-operator, will be homeless.
    • (But I guess you could argue those should just go in the 'Generic' distribution repo, as not all users will need/want them)
  • The kubeflow/kubeflow git-repo has many parts of Kubeflow in the same repo:
    • Example: centraldashboard, jupyter-web-app, etc
    • (We could consider splitting them up, but that seems like a lot of work/overhead)
  • There is a need for cross-component resources/configs, and this would force those to live in the 'Generic'
    distribution (where they probably don't belong)

@alfsuse

alfsuse commented Sep 15, 2020

What if we do split non-Kubeflow components and group them into a sort of "extension/plugin" repo? This way we would have the following:

  • Main Core Repo: the core KF components
  • Core Extension Repo: centraldashboard, jupyter-web-app
  • Other Extension Repo: seldon-op, spark-op, vendor-specific ops

This way the experience would result in a simpler and lighter installation for the core components, plus a specific customized part based on requirements.
Resources/configs would be part of the extension configuration, as patches or overrides to the core components.

@erikerlandson

erikerlandson commented Sep 15, 2020

A bit of a nit, but what is the benefit of a per-release repo, compared to just branching/tagging for releases?

@Jeffwan
Member

Jeffwan commented Sep 15, 2020

I don't think we need to refactor the entire repo. This repo has been used since 0.6, and we have had so many breaking changes in each version. If users have some problems, we can address them separately. The major reason it looks messy recently is that components are growing and the v3 stack migration is underway.

It seems all the problems you describe can be resolved by having a KFDef manifest with minimum components? From the user's point of view, they are not supposed to look at all components; the entry point should always be the KFDef or other manifest files.
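
For context, a "minimum components" KFDef could look something like the sketch below; the application list, paths, and version are illustrative only (real KFDef files in this repo list many more applications):

```yaml
# Illustrative minimal KFDef: only the applications this deployment needs.
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: kubeflow
  namespace: kubeflow
spec:
  repos:
    - name: manifests
      uri: https://github.com/kubeflow/manifests/archive/v1.1.0.tar.gz
  applications:
    - name: jupyter-web-app
      kustomizeConfig:
        repoRef:
          name: manifests
          path: jupyter/jupyter-web-app   # path is illustrative
    - name: pipeline
      kustomizeConfig:
        repoRef:
          name: manifests
          path: pipeline                  # path is illustrative
  version: v1.1.0
```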

@jzhoucliqr

I totally agree that we need better support for existing clusters. From what I can tell, most of the errors happen when there are conflicts with components already installed on the existing cluster, like cert-manager. In this case, the admin who installs Kubeflow does need to look at all components to do customizations.

On the other hand, I'm thinking maybe another reason we only see issues when installing into existing clusters is that creating a brand new cluster and installing onto fresh k8s mostly goes smoothly?

@connorlwilkes

On this topic, is there a clear list of which components are core, which are optional, which of those are managed by Kubeflow, and, for the ones that aren't, which versions of the external components are compatible with which version of Kubeflow?

@rmgogogo

/cc @Bobgy

@thesuperzapper
Member Author

I am seeing some confusion about why we need to create a new repo, rather than just use this existing one.

It's because this new repo will be completely different in both code and layout. It will ONLY HAVE Kubeflow component manifests, which are organised cleanly into folders. Distributions like GCP/AWS would live in their own repos, but the Kubeflow project would only be responsible for the 'Generic distribution'.

Given these changes are so significant, if we want to keep current releases (Kubeflow 1.0 and Kubeflow 1.1) available, we need a new repo.


The main outcomes we can achieve with this refactor:

  1. The release process is massively simplified:
    • we only depend on the 'Generic distribution' to be ready for release
    • non-Kubeflow components don't get in the way (as they live in the 'distribution' repos)
  2. We can make patch releases:
    • currently the release-model is too cumbersome for 1.1.1 type releases
  3. Kubeflow component maintainers can safely update their manifests.

@veggiemonk

These are all good points @thesuperzapper made. I agree 👍

It's best for the project to focus on KubeFlow rather than half the CNCF Landscape + KubeFlow.

Also, my experience trying to install KubeFlow on GKE was a total disaster, as I did not need half of what was in the manifest, and for the other half I'm still trying to figure out why I need it.

As far as I understood, Kubeflow is an architecture that gathers other projects as plugins.

It would be nice to allow people to understand which project manifest is tied to which features instead of asking people to install a custom build tool (kfctl) that spits out 40+k lines of YAML, hoping that they will suddenly understand what everything is doing.

I love the pick-and-choose approach, but if I don't understand what I'm choosing, is the goal reached?

I would be so happy to have a way to deploy KubeFlow on a Kubernetes cluster with Istio already installed, because I would understand what I am installing.

@alfsuse

alfsuse commented Sep 16, 2020

To @veggiemonk's and @thesuperzapper's comments: a user who installs KF over an existing cluster may have specific requirements, like particular Istio or Dex versions, so there is a need to customize the manifests ahead of time to add/remove/change components. From a UX perspective that is not great, because doing it that way greatly raises the risk of human error.

I personally like @Jeffwan's approach here:

It seems all the problems you describe can be resolved by having a KFDef manifest with minimum components? From the user's point of view, they are not supposed to look at all components; the entry point should always be the KFDef or other manifest files.

So I think we need to define a list of:

  • What is a core component
  • What is an optional component
  • What is an "extension/vendor-specific" component

@swiftdiaries
Member

swiftdiaries commented Sep 16, 2020

To give some context for the approach with kfctl and KFDef: initially we had a ksonnet-based deployment that configured Kubeflow in an entirely transparent manner. By that I mean it was multi-step and you could modify and customize individual components in any way a user wanted to.

There was some friction with the tooling itself and how it used ksonnet/jsonnet over helm / plain yaml (which was what the k8s community was used to).
And so came kfctl.sh, a bash script to automate some of that into three broad steps:

  • download -> modify jsonnet
  • build
  • deploy

That change was meant to paper over some of the difficulties in understanding ksonnet/jsonnet and to abstract away the tool's complexity with a layer on top.
Over time the bash script accrued some debt and it was decided to move it to a Go binary. Around the same time, ksonnet was archived. And our move to kustomize wasn't easy because the feature set wasn't a 1-1 mapping. So we had to build some custom logic into the kfctl Go binary and structure the manifests repo in a way that wasn't "standard". As with any project, we had deadlines to meet and were dealing with a deprecated/archived project, and so we cut corners to deliver a "working" implementation.

Between these changes, we lost the ability to configure individual applications, and kfctl with manifests became the only way to deploy Kubeflow.

After this, we introduced Istio as part of the stack to move away from Ambassador as the API Gateway. This meant that we now had two "versions/flavors" of Kubeflow: one without Istio and one with Istio, plus a transition over to Istio. This introduced the concept of kustomize overlays. But we also needed to order the install now, because unless Istio and its CRDs are installed first, the applications using Istio CRDs (VirtualServices) would fail. This was done with the help of the application controller and meta-controller (I think). But this meant that we needed two overlays: one for Istio-related virtual services and one for ordering the applications with the Application CRD. But kustomize at the time didn't have the feature to merge multiple overlays and order applications. So we introduced the concept of a KFDef to flesh out information on which overlays need to be merged together and to order the application installs (Istio first, tf-operator next and so on...).
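
To illustrate the overlay merging described above, a single entry from a KFDef applications list selects which overlays get combined for one component. This is only a sketch following the pattern of the KFDef files from that era; the exact path and overlay names may differ:

```yaml
# One entry from a KFDef spec.applications list (sketch).
# The 'istio' overlay adds the VirtualService resources; the 'application'
# overlay wraps the component in an Application CR so installs can be ordered.
- name: jupyter-web-app
  kustomizeConfig:
    repoRef:
      name: manifests
      path: jupyter/jupyter-web-app
    overlays:
      - istio
      - application
```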

Now, notice how we started out with something to paper over complexity by simply using a bash script that calls the individual CLIs (ksonnet, kubectl, ...), then replaced that with a maintainable Go binary, but ended up adding more complexity. This was not intentional but driven by the need of the hour and being put on the spot with difficult choices. Maybe we could've made better choices but here we are.

Anywhoo, my point is that ANY project, OSS or otherwise, accrues debt over time for whatever reason. And I'm seeing a rising number of rants in comments and issues; these are NOT helpful at all. Be kind. Offer something constructive. This is an open community, so open a PR, propose a design. For example, "this repo is a mess" -> "rearchitecting manifests to remove tech debt" or something would go a long way. The maintainers did their best at the time to take the time to contribute and build this project into what it is today.

@swiftdiaries
Member

swiftdiaries commented Sep 16, 2020

Now to addressing the issue, I see two requests:

  • Modular applications
  • Removing debt in the repo

I think the work done with v3 kustomize is a good starting point for us to build towards manifests which are configurable.

And also, there will always be people who want a simple installation process. I think in the past we went for a "low bar, high ceiling" approach where (most) people can install with a simple set of commands, and a set of people (power users) can customize / modify / extend. And we want to enable all users to be successful with Kubeflow.

[high ceiling] In order to work with existing installations of Istio+Dex, we need to make applications more modular.
[low bar] We need to make the default install experience more reliable.

@jlewi changed the title from "this repo is a mess" to "Improve repo structure and delete outdated manifests" on Sep 17, 2020
@jlewi
Contributor

jlewi commented Sep 17, 2020

I also believe we no longer need the concept of a 'distribution' (in the form we currently have it), as this was primarily made to support GCP deploying a full K8S stack with Kubeflow on top.

The concept of a "distribution" arises because a one size fits all (or even most) approach won't work.

One of the problems we encountered early on is that trying to create a configuration that works every where leads to a suboptimal experience.

A simple example: not every K8s cluster has a default storage class. So how should we configure Jupyter by default? Should Jupyter use the default storage class to provide durable storage for notebooks? That will lead to a non-functional deployment on clusters without a default storage class. Do we use ephemeral storage, which leads to data loss if the pod is preempted?
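
To make the trade-off concrete, it comes down to whether a notebook volume claim like the hypothetical one below can omit storageClassName (relying on a cluster default) or must name a class explicitly:

```yaml
# Hypothetical notebook workspace volume (names are illustrative).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-my-notebook
  namespace: kubeflow
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  # storageClassName: standard
  # On clusters without a default StorageClass, omitting storageClassName
  # leaves the PVC Pending and the notebook never starts.
```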

Multiple distributions are necessary in order to prevent regression to the lowest common denominator.

Distributions allow us to embrace diversity of use cases and opinions by trying to make it easy for folks to create opinionated distributions of Kubeflow rather than trying to achieve consensus.

Even for individual applications there is a growing number of ways to configure them; e.g.

  • KFP can be backed by a SQL instance running in cluster or an external Cloud SQL service
  • KFP can use Argo or potentially use the Tekton variant

Likewise everyone's existing clusters will be configured differently

  • Some won't have ISTIO
  • Some will have ISTIO but not KNative
  • Some will have Argo

It would be great if Kubernetes had a mechanism for application dependency management but it doesn't (or at least not one considered to be a standard).

Creating such a dependency management system is outside the scope of Kubeflow.

Distributions are a way for folks to try to create an opinionated solution to that problem.

As an example, IBM is creating a distribution for Kubeflow based on OLM, in part because OLM provides that kind of dependency management.

With respect to upgrades I would suggest looking at kubeflow/kfctl#304 for some helpful context about some of the tech debt @swiftdiaries is referring to that is impeding upgrades.

Likewise #1007 provides context about vars and the v3 migration. vars originated as a way to handle necessary customizations for different Kubernetes installs; e.g. the ClusterDomain.

Per @swiftdiaries's suggestion above, a great way to clean up tech debt would be to figure out if we can start removing the old legacy manifests.

Regarding outdated images (#1553), this looks to me like a process issue; in particular, ensuring application owners keep their manifests up to date. I think this is something we should try to address as part of the wg-deployments charter (kubeflow/community#402) by clarifying responsibilities and expectations.

@thesuperzapper
Member Author

@jlewi

Just a few points:

  • Nothing I have proposed will prevent vendors from making their own "distribution", but I believe the Kubeflow project itself should focus on making a "generic distribution" as its primary asset, rather than fragmenting like we have now.
    • NOTE: I am not saying this "generic distribution" would be one-size-fits-all (quite the opposite); I am saying it would be modular and configurable (but focused on existing K8S clusters)
  • I believe it is possible for us to make a "generic" distribution which will work in all environments (provided some requirements are met):
    • Specifically, things like DefaultStorageClass could be one of these requirements. (But we would document setting up DefaultStorageClass for common situations, like NFS, GKE, EKS; see the sketch below.)
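
As an example of how such a requirement could be documented, marking a class as the cluster default only needs the standard annotation; the class name and provisioner below are placeholders:

```yaml
# Hypothetical StorageClass marked as the cluster default.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client   # placeholder class name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: cluster.local/nfs-client-provisioner   # placeholder provisioner
```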

Finally, do you really think that the current structure of this repo is acceptable? I think starting again is the best option, with clear requirements for basic things like preventing spider webs of Kustomize scripts.

@thesuperzapper
Member Author

@jlewi Also, regarding outdated images (see #1553), at least part of this issue was clearly caused by the fact that component owners are overwhelmed by the current state of this repo.

@jlewi
Contributor

jlewi commented Sep 17, 2020

Repositories and projects need OWNERs. Any new project/repo would require WG sponsorship. The most likely WG would be the deployments WG. So any progress on a new repo is blocked on formation of that WG.

Of course anyone is free to fork or create their own repo outside the Kubeflow org.

In the interim, we can make progress towards overhauling the structure of this repo through a series of incremental refactorings.

@thesuperzapper given you have expressed an interest in notebooks, the jupyter manifests seem like an obvious starting point.

@yanniszark
Contributor

@thesuperzapper thanks for putting this together!

I agree that the current manifests repo leaves a lot to be desired. We have managed to make it work by creating an end-to-end pure GitOps process building on top of this repository, but not depending on it, and this is how we are now building both MiniKF and our multi-node, enterprise deployments. But note that this was a significant effort that we had to undertake internally.

With the deployment working group just formed, I believe we should use this momentum and tackle this as the first project of wg-deployment.

In my experience, some major pain points of the current manifests are:

  • Two versions of manifests co-exist (base and base_v3) and one frequently gets left behind (the base_v3 one). We should only have one version and deprecate the other.
  • Manifests exist in application repos (e.g., KFServing, Jupyter) and are then rewritten for kubeflow/manifests. This means that the configurations in kubeflow/manifests frequently get left behind. This was mainly due to kfctl needing a specific form. We need to streamline this process by re-using the upstream application's configurations. For example, pipelines has started importing a part of their tree and using it to define their manifests: https://github.com/kubeflow/manifests/tree/master/pipeline/upstream (a sketch of this pattern follows after this list). @jlewi has also underlined this issue.
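
A rough sketch of that "reuse upstream" pattern, assuming a synced upstream/ tree and a hypothetical patch file: the kubeflow/manifests copy becomes a thin kustomization layered on the imported upstream manifests instead of a rewritten fork.

```yaml
# Hypothetical kubeflow/manifests kustomization for a component that
# imports its upstream manifests instead of rewriting them.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - upstream/env/platform-agnostic        # tree synced from the application repo (illustrative path)
patchesStrategicMerge:
  - patches/kubeflow-integration.yaml     # Kubeflow-specific tweaks kept separate (hypothetical file)
```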

This will also allow us to clean up a lot of the old code and practices and end up with a much more focused repo.
So my proposal is to:

  1. Create wg-deployment by merging the proposal (create wg-deployment, kubeflow/community#402)
  2. Create the meetings for wg-deployments
  3. Discuss this as our first project

@PatrickXYS
Member

Let's fix this before 1.3.

@stale

stale bot commented Jun 4, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jun 4, 2021
@thesuperzapper
Member Author

I would say after #1735 was merged, this issue is mostly resolved.
/close

@google-oss-robot

@thesuperzapper: Closing this issue.

In response to this:

I would say after #1735 was merged, this issue is mostly resolved.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
