Improve repo structure and delete outdated manifests #1554
I think this is relevant to the discussion in kubeflow/community#402, particularly clarifying the responsibilities between application owners, the proposed deployments WG, and the platform owners. I think the lack of structure, and the code accumulation in kubeflow/manifests, is an outcome of a lack of clear ownership/responsibilities.
/priority p1
I also believe we no longer need the concept of a 'distribution' (in the form we currently have it), as this was primarily made to support GCP deploying a full K8S stack with Kubeflow on top. If you look at Slack/GitHub issues, you will see most users are installing Kubeflow on existing clusters of all types: GKE, AKS, EKS, native K8s. We need to focus on supporting them! To facilitate this, I propose we:
The Agile Stacks distribution of Kubeflow is available here: https://github.com/agilestacks/stack-ml-eks
Another option (instead of Part 1) is that we discontinue the manifests repo entirely, and store the manifests for each component in the same repository as the component code. However, I see a few issues with that approach:
What if we do split non-Kubeflow components and group them into a sort of "extension/plugin" repo? This way we would have the following:
This way the experience would result in a simpler and lighter installation for the core components, plus a specific customized part based on requirements.
A bit of a nit, but what is the benefit of a per-release repo, compared to just branching/tagging for releases?
I don't think we need to refactor the entire repo. This repo has been used since 0.6, and we have so many breaking changes in each version; if users have problems, we can address them separately. The major reason it looks messy recently is that components are growing and because of the v3 stack migration. It seems all the problems you describe can be resolved by having a KFDef manifest with minimum components? From the user's point of view, they are not supposed to look at all components; the entry point should always be the KFDef or other manifest files.
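A minimal sketch of what such a trimmed-down KFDef could look like (the application names and repo paths below are illustrative, not an exact listing from this repo):

```yaml
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: kubeflow-minimal
  namespace: kubeflow
spec:
  repos:
    - name: manifests
      uri: https://github.com/kubeflow/manifests/archive/master.tar.gz
  applications:
    # Only the components this deployment actually needs;
    # everything else is simply left out of the list.
    - name: notebook-controller
      kustomizeConfig:
        repoRef:
          name: manifests
          path: jupyter/notebook-controller   # illustrative path
    - name: jupyter-web-app
      kustomizeConfig:
        repoRef:
          name: manifests
          path: jupyter/jupyter-web-app       # illustrative path
```

Under this model the user never has to read the dozens of component folders; the KFDef is the single entry point that says exactly what gets installed.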
I totally agree that we need better support for existing clusters. From what I can tell, most of the errors happen when there are conflicts with components already installed on the existing cluster, like cert-manager. For this case, the admin who installs Kubeflow does need to look at all components to do customizations. On the other hand, I'm thinking maybe another reason we only see issues when installing into existing clusters is that creating a brand-new cluster and installing into it mostly goes smoothly?
On this topic, is there a clear list of which components are core, which components are optional, and, of those, which are managed by Kubeflow? And for those that aren't, which versions of the external components are compatible with which version of Kubeflow?
/cc @Bobgy
I am seeing some confusion about why we need to create a new repo, rather than just use this existing one. It's because this new repo will be completely different in both code and layout. It will ONLY HAVE Kubeflow component manifests, which are organised cleanly into folders. Distributions like GCP/AWS would live in their own repos, but the Kubeflow project would only be responsible for the 'Generic distribution'. Given these changes are so significant, if we want to keep current releases (…). The main outcomes we can achieve with this refactor:
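For illustration, a repo containing only cleanly organised component manifests might look something like this (a hypothetical layout, not necessarily the structure that was ultimately adopted):

```
manifests/
├── common/           # shared dependencies every install needs (e.g. cert-manager, istio, dex)
├── apps/             # one folder per Kubeflow component
│   ├── jupyter/
│   ├── katib/
│   └── pipelines/
└── example/          # a kustomization composing a full 'Generic distribution'
```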
These are all good points @thesuperzapper made. I agree 👍 It's best for the project to focus on KubeFlow rather than half the CNCF Landscape + KubeFlow. Also, my experience trying to install KubeFlow on GKE was a total disaster, as I did not need half of what was in the manifest, and for the other half I'm still trying to figure out why I need it. As far as I understood, Kubeflow is an architecture that gathers other projects as plugins. It would be nice to allow people to understand which project manifest is tied to which features, instead of asking people to install a custom build tool (kfctl) that spits out 40k+ lines of YAML, hoping that they will suddenly understand what everything is doing. I love the pick-and-choose approach, but if I don't understand what I'm choosing, is the goal reached? I would be so happy to have a way to deploy KubeFlow on a Kubernetes cluster with Istio already installed, because I would understand what I am installing.
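The "Istio already installed" case is a good test of modularity: with cleanly separated per-component manifests, a user could write their own top-level kustomization and simply leave the Istio entries out. A sketch under that assumption (all paths here are hypothetical):

```yaml
# kustomization.yaml: a pick-and-choose install.
# Istio is deliberately absent, because the cluster already provides it.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - common/dex        # hypothetical component paths
  - apps/jupyter
  - apps/pipelines
```

The value is that what gets installed is exactly the list you can read, rather than the output of a tool you have to reverse-engineer.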
To @veggiemonk's and @thesuperzapper's comments: when a user installs KF over an existing cluster, they may have specific requirements like particular Istio or Dex versions, so there is a real need to customize the manifests ahead of time to add/remove/change things. From a UX perspective this is not great, because doing it that way greatly raises the risk of human error. I personally like @Jeffwan's approach here.
So I think we need to define a list of:
To give some context for the approach with …: there was some friction with the tooling itself and how it used …
That change was to paper over some of the difficulties in understanding …. Between these changes, we lost the ability to configure individual applications and …. After this, we introduced Istio as part of the stack to move from ….

Now, notice how we started out with something to paper over complexity, simply using a bash script that calls the individual CLIs (ksonnet, kubectl, ...), replaced that with a maintainable Go binary, but ended up adding more complexity. This was not intentional; it was driven by the need of the hour and being put on the spot with difficult choices. Maybe we could've made better choices, but here we are.

Anywhoo, my point is that ANY project, OSS or otherwise, accrues debt over time for whatever reason. And I'm seeing a rising amount of rants in comments and issues; these are NOT helpful at all. Be kind. Offer something constructive. This is an open community. So open a PR, propose a design. For example, …
Now, to address the issue, I see two requests:
I think the work done with … And also, there will always be people who want a simple installation process. I think in the past we went for a "low bar, high ceiling" approach, where (most) people can install with a simple set of commands, and a set of people (power users) can customize / modify / extend [high ceiling]. We want to enable all users to be successful with Kubeflow. In order to work with existing installations of Istio+Dex, we need to make applications more modular.
The concept of a "distribution" arises because a one-size-fits-all (or even fits-most) approach won't work. One of the problems we encountered early on is that trying to create a configuration that works everywhere leads to a suboptimal experience. A simple example: not every K8s cluster has a default storage class. So how should we configure Jupyter by default? Should Jupyter use a default storage class to provide durable storage for notebooks? That will lead to a non-functional deployment on clusters without a default storage class. Do we use ephemeral storage, which leads to data loss if the pod is preempted? Multiple distributions are necessary in order to prevent regression to the lowest common denominator. Distributions allow us to embrace a diversity of use cases and opinions by trying to make it easy for folks to create opinionated distributions of Kubeflow, rather than trying to achieve consensus. Even for individual applications there is a growing number of ways to configure them; e.g. …
Likewise everyone's existing clusters will be configured differently
It would be great if Kubernetes had a mechanism for application dependency management, but it doesn't (or at least not one considered to be a standard). Creating such a dependency management system is outside the scope of Kubeflow. Distributions are a way for folks to try to create an opinionated solution to that problem. As an example, IBM is creating a distribution of Kubeflow based on OLM, in part because OLM ….

With respect to upgrades, I would suggest looking at kubeflow/kfctl#304 for some helpful context about some of the tech debt @swiftdiaries is referring to that is impeding upgrades. Likewise, #1007 provides context about vars and the v3 migration. vars originated as a way to handle necessary customizations for different Kubernetes installs, e.g. the ClusterDomain. Per @swiftdiaries's suggestion above, a great way to clean up tech debt would be to figure out whether we can start removing the old legacy manifests.

Regarding outdated images (#1553): this looks to me like a process issue, in particular ensuring application owners keep their manifests up to date. I think this is something we should try to address as part of the wg-deployments charter (kubeflow/community#402) by clarifying responsibilities and expectations.
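To make the storage-class example above concrete, a distribution could express its choice as a small overlay instead of forcing one default on everyone. A sketch, assuming a kustomize v4-style inline patch and a hypothetical PVC name:

```yaml
# overlays/durable-storage/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: PersistentVolumeClaim
      name: notebook-workspace   # hypothetical PVC defined in the base
    patch: |-
      - op: replace
        path: /spec/storageClassName
        value: standard          # whatever class this cluster actually has
```

A cluster without any StorageClass would instead select an overlay that uses ephemeral storage, accepting the data-loss trade-off explicitly rather than discovering it at runtime.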
Just a few points:
Finally, do you really think that the current structure of this repo is acceptable? I think starting again is the best option, with clear requirements for basic things like preventing spider webs of Kustomize scripts.
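One concrete rule that prevents such spider webs is to cap the layering: each component gets one base and at most one level of overlays, and bases never reference each other. An illustrative overlay under that rule (the path is hypothetical):

```yaml
# apps/jupyter/overlays/standalone/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base   # the only other layer; nothing here points sideways
                 # at another app, so the dependency graph stays a tree
commonLabels:
  app.kubernetes.io/component: jupyter
```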
Repositories and projects need OWNERs. Any new project/repo would require WG sponsorship. The most likely WG would be the deployments WG, so any progress on a new repo is blocked on the formation of that WG. Of course, anyone is free to fork or create their own repo outside the Kubeflow org. In the interim, we can make progress towards overhauling the structure of this repo through a series of incremental refactorings. @thesuperzapper, given you have expressed an interest in notebooks, the jupyter manifests seem like an obvious starting point.
@thesuperzapper thanks for putting this together! I agree that the current manifests repo leaves a lot to be desired. We have managed to make it work by creating an end-to-end pure GitOps process building on top of this repository, but not depending on it, and this is how we are now building both MiniKF and our multi-node, enterprise deployments. But note that this was a significant effort that we had to undertake internally. With the deployment working group just formed, I believe we should use this momentum and tackle this as the first project of wg-deployment. In my experience, some major pain points of the current manifests are:
This will also allow us to clean up a lot of the old code and practices and end up with a much more focused repo.
Let's fix this before 1.3.
I would say that after #1735 was merged, this issue is mostly resolved.
@thesuperzapper: Closing this issue.
This repository is extremely messy and is a horrible user experience. The mess has caused us real issues, as we botched the 1.1 release due to a lack of visibility of what is even in the manifests (See: #1553).
When paired with the almost complete lack of docs relating to Kubeflow 1.1, this leaves Kubeflow near impossible to install; just take a look at recent issues (across all the repos) and our Slack and you will see what I mean. My proposal is to make a new manifests repository and make some changes as we do; we should:
EDIT: see this comment below for my proposal of a 'Generic' distribution (to replace most of the current ones)