WG Data proposal #673

tarilabs · 2023-12-14T18:21:10Z

I'm following up on action item: raise WG proposal to Kubeflow per yesterday's Model Registry meeting (recording timestamp).

As discussed in KF community meeting.

Main links:

👉 I'm raising a draft PR to "seed/bootstrap" the work of requesting to form the WG. Using a draft PR gives us a branch we can collaborate on between stakeholders @andreyvelich @Tomcli @dhirajsb @rimolive

This also gives us a medium we can keep tabs on, so we can report back on progress during the Tuesday community plenary meetings, wdyt?

thesuperzapper · 2023-12-14T20:07:34Z

I am very strongly opposed to using the name WG-Lifecycle, because that implies that the working group is related to the lifecycle of Kubeflow itself.

My proposal for the name is: WG-Data

Where "data" can mean both actual data (spark) and metadata (model registry). We can also split it up in the future, if the members who are maintaining these components diverge.

tarilabs · 2023-12-14T20:22:14Z

My proposal for the name is: WG-Data

very well noted @thesuperzapper , as also marked here:
https://github.com/kubeflow/community/pull/673/files#diff-11b55409b3d27f083915bd4b910672caaf0e9550cf34d77fe76e8b6b9515023dR524

I just wanted to have a branch in which to start collecting this kind of feedback in one place, and also to report back to you and the group on the progress at the Tuesday meetings.

wgs.yaml

dhirajsb · 2023-12-14T20:26:18Z

@thesuperzapper how about we make it more explicit: WG ML Model Data?

thesuperzapper · 2023-12-14T20:33:08Z

As it currently stands, this WG does not meet the requirement for diverse leadership, given that all chairs come from one company (IBM, which owns Red Hat).

dhirajsb · 2023-12-14T20:36:27Z

@thesuperzapper Andrey is listed as a Chair, he's from Apple

tarilabs · 2023-12-14T20:37:04Z

Noticing only now that it was not marked as a Draft PR, despite that being my intent:

using a draft PR gives us a branch we can collaborate on

My sincerest apologies.

Marked as Draft PR per the original message in the thread.

rimolive · 2023-12-14T20:37:07Z

@thesuperzapper Is there a minimum number of companies to compose the chair to make the WG eligible?

thesuperzapper · 2023-12-14T20:52:25Z

While there is no specific number requirement, the steering committee must approve the new WG (currently, @jbottum @james-jwu) in line with the community's interests. I would expect at least some concern with having 4 leads from one company and only 1 from another.

For reference, here is the lifecycle and other info about forming a working group:

Also, there are only meant to be 2-3 chairs. Some other WGs have more, but in most cases there are 2 active members and we just need to formally clean up the inactive chairs.

thesuperzapper · 2023-12-14T20:59:06Z

Also, some of the proposed chairs are not even current Kubeflow org members, so are ineligible unless they go through that process first:

rimolive · 2023-12-14T21:03:52Z

Thank you for the references! Those are valid points though, and I'll see how we can work on the eligibility topic as well as your concerns.

tarilabs · 2023-12-14T21:08:56Z

As Ricardo noted, thanks !

Is there guidance on having deputies to keep WG work going during leaves, please?
The reason for >3 chairs is that I was going through this point earlier today and, seeing that other WGs have >3, I assumed it was for that purpose.

As noted, I will work to account for all the feedback received; thank you, those are very helpful.

andreyvelich · 2023-12-15T15:56:48Z

Thank you for starting this @tarilabs! Let's collaborate together on this PR for the WG Charter and Name.

Please provide your suggestions on how we should name this WG, which will initially have the Spark Operator and Model Registry components.

A few initial suggestions if WG Lifecycle is too ambitious:

WG Data
WG ML Data
WG ML Lifecycle

I would expect at least some concern with having 4 leads from one company and only 1 from another.

This is a valid concern @thesuperzapper. We can add folks from the Spark Operator maintainers to this WG
cc @mwielgus @vara-bonthu @yuchaoran2011

andreyvelich · 2023-12-15T15:59:26Z

cc @kubeflow/wg-training-leads
@kubeflow/wg-pipeline-leads
@kubeflow/wg-deployment-leads
@kubeflow/wg-notebooks-leads
@kubeflow/wg-manifests-leads

bigsur0 · 2023-12-15T18:20:37Z

I would request "WG ML Lifecycle" if the purpose of the group is to house things in the MLOps orbit that don't have a more specific working group yet so they can "incubate". Data Preparation, Feature Store, and Model Registry being 3 examples that have been recently discussed that likely aren't big enough yet to have their own working group. I guess one key aspect here is to consider how new efforts can happen without the overhead of setting-up a new working group for each one until it is truly merited and bandwidth is available.

Is there a process that exists for refactoring a topic out of one working group to a new working group?

jbottum · 2023-12-18T23:02:11Z

Kubeflow seems to be entering a new growth phase. The community needs a structure to support add-on components (Spark, Ray, Model Registry, Feature Store, etc.). We want to encourage contributors and users to meet, discuss, experiment, decide, store code and produce documentation with the goal that integrations will help both Kubeflow and the add-on projects. We need to minimize overhead. We need to set expectations (of support...to/from Kubeflow and for users) especially if we are experimenting and trying to find market acceptance. Most importantly, we need active user participation, comment and leadership. I want to move this forward. I am a +1 to adding a single umbrella WG for all of these projects to get things moving. @james-jwu would you please provide your thoughts?

thesuperzapper · 2023-12-19T04:44:27Z

I think that the name WG Data will happily encompass the various categories proposed:

distributed processing (spark, Ray, etc.)
model registry (unnamed Red Hat proposal)
feature store (potentially feast)

Also, WG Data follows the convention of being a single word, like all other working group names.

I am still very much against WG Lifecycle. At best it's like calling it WG Other, because the whole point of Kubeflow is to map across the MLOps lifecycle, so it's just confusing.

Separately from the discussion around names, I think we should confirm that the maintainers of these various components are actually overlapping, otherwise it will make it difficult for this "mega working group" to function.

vara-bonthu · 2023-12-19T13:11:40Z

+1 to @thesuperzapper

I would suggest voting for WG Data, as it seems most appropriate for the Spark Operator. This is because it is primarily used for data processing, both batch and streaming, as well as some ML processing.

tarilabs · 2023-12-19T14:06:45Z

New commit ae188fe incorporates some feedback received around:

made it even more prominent that the name is provisional. More recent feedback here and here suggests this will eventually converge on WG Data, but while the PR is still a draft there is a chance to account for all proposals, like here
reflected that the name is provisional in the PR title
reworked designated chairs

I will keep you posted on any further updates during the KF Community meeting.

thesuperzapper · 2023-12-19T17:45:50Z

Just so we are clear, I think WG Data should be the name, not WG ML Data as the PR currently stands.

wg-data/README.md

juliusvonkohout · 2024-08-13T15:05:52Z

Would this working group be relevant for the minio replacement (seaweedfs) as well?

I am currently working on a PoC in Kubeflow/manifests.

tarilabs · 2024-08-27T09:21:32Z

I've added all comments pertaining to Feast in a single commit, fa3c318, to make it easier to manage that addition to this WG charter if required or based on feedback from the SC.

tarilabs · 2024-08-27T12:46:00Z

Would this working group be relevant for the minio replacement (seaweedfs) as well?

Not entirely sure; to me that is more of a "storage"-related concern, while the "data"-related concerns expressed here are orthogonal to the actual medium.

I am currently working on a PoC in Kubeflow/manifests.

I'm very happy however to engage in discussions, since "storage" is also a dimension we're exploring for Model Registry (bringing in OCI as first class, but potentially others with an abstraction layer). Let me know your thoughts!

andreyvelich · 2024-08-27T15:29:22Z

Thank you for addressing the feedback @tarilabs!

Given that we still have a discussion around WG governance and what projects WGs should maintain: #673 (comment), should we include the Feast addition as a separate PR after a follow-up discussion?

From my point of view, initially we should just establish the Data WG with 2 Kubeflow components: Spark Operator and Model Registry, and after that we can update the charter to include Feast and other projects that we want to maintain under this WG.

Any thoughts @franciscojavierarceo @kubeflow/kubeflow-steering-committee @tarilabs ?

franciscojavierarceo · 2024-08-27T15:35:02Z

I would love for Feast to be included as I think the Data WG is a great opportunity to validate Feast's relevance and drive some urgency to closing the discussion on adding new projects, but I'll respect the outcome either way, of course.

See PR here: #741

CC @jbottum

andreyvelich · 2024-08-27T15:54:15Z

I would love for Feast to be included as I think the Data WG is a great opportunity to validate Feast's relevance and drive some urgency to closing the discussion on adding new projects

I agree with you @franciscojavierarceo, but should we include Feast in the Data WG once we make Feast part of the Kubeflow core components?

jbottum · 2024-08-27T16:06:17Z

Per my comment in the Community meeting, I support Feast as part of the WG Data and as a core KF component. I am glad to pursue that path or another, if that cannot be accomplished (as I believe a defined relationship would help both communities).

franciscojavierarceo · 2024-08-27T17:24:14Z

@andreyvelich I am okay with including Feast before making it a core component. :)

juliusvonkohout · 2024-08-28T08:24:59Z

I'm very happy however to engage in discussions, since "storage" is also a dimension we're exploring for Model Registry (bringing in OCI as first class, but potentially others with an abstraction layer). Let me know your thoughts!

Then kubeflow/manifests#2826 and kubeflow/pipelines#10998 might be interesting for you.

andreyvelich · 2024-08-28T15:18:40Z

Would this working group be relevant for the minio replacement (seaweedfs) as well?

I am currently working on a PoC in Kubeflow/manifests.

@juliusvonkohout This issue is related to Kubeflow Pipelines (e.g. Pipelines WG), isn't it?

juliusvonkohout · 2024-08-28T15:23:30Z

Would this working group be relevant for the minio replacement (seaweedfs) as well?

I am currently working on a PoC in Kubeflow/manifests.

@juliusvonkohout This issue is related to Kubeflow Pipelines (e.g. Pipelines WG), isn't it?

Anyone who needs S3 storage in Kubeflow, but especially pipelines.

rimolive · 2024-10-09T13:11:48Z

Bumping this PR. What is missing to get this merged?

andreyvelich · 2024-10-09T13:23:09Z

Bumping this PR. What is missing to get this merged?

I think we need to make a decision on Feast.
@kubeflow/kubeflow-steering-committee What are your thoughts on this?

varodrig · 2025-01-10T23:46:33Z

Bumping this PR. What is missing to get this merged?

I think we need to make a decision on Feast. @kubeflow/kubeflow-steering-committee What are your thoughts on this?

I'm picking this up, @andreyvelich, and will follow up with the rest of the KSC.

andreyvelich · 2025-01-11T00:47:44Z

I would love to get this finally merged.

@franciscojavierarceo What do you think about keeping the existing charter of the Data WG with the Model Registry and Spark Operator projects, given that they are already part of the Kubeflow ecosystem?
Then, in the future, we can expand the scope of this WG to include Feast as well.

franciscojavierarceo · 2025-01-11T01:31:02Z

My preference is to include Feast in this working group.

Feast now supports RAG (alpha) and I believe that would help boost Kubeflow's place in the GenAI conversation.

Updating the PR on adding new projects to KF (https://github.com/kubeflow/community/pull/741/files) is on my to-do list; I've just been heads down working on the Milvus integration.

Of course, I don't want to block anything but, based on my knowledge of the field and my conversations with startups, Kubeflow not having a solution for RAG (and for Data Scientists to feel empowered with RAG) is a reasonably large gap as most AI Applications rely on some form of RAG.

franciscojavierarceo · 2025-01-11T01:32:55Z

Given the KSC is being decided this weekend, can we wait until next week?

andreyvelich · 2025-01-12T23:12:10Z

My preference is to include Feast in this working group.

Feast now supports RAG (alpha) and I believe that would help boost Kubeflow's place in the GenAI conversation.

Updating the PR on adding new projects to KF (https://github.com/kubeflow/community/pull/741/files) is on my to-do list; I've just been heads down working on the Milvus integration.

Of course, I don't want to block anything but, based on my knowledge of the field and my conversations with startups, Kubeflow not having a solution for RAG (and for Data Scientists to feel empowered with RAG) is a reasonably large gap as most AI Applications rely on some form of RAG.

I totally agree with you @franciscojavierarceo, but it will take time for the Kubeflow Community to migrate Feast into Kubeflow Core Components, won't it? Nothing stops us from updating the Data WG charter once Feast is officially part of Kubeflow Core Components.

Given that this Working Group will already be established, we won't need to create a new Working Group for that.

franciscojavierarceo · 2025-01-13T02:01:53Z

My preference is to include Feast in this working group.
Feast now supports RAG (alpha) and I believe that would help boost Kubeflow's place in the GenAI conversation.
Updating the PR on adding new projects to KF (https://github.com/kubeflow/community/pull/741/files) is on my to-do list; I've just been heads down working on the Milvus integration.
Of course, I don't want to block anything but, based on my knowledge of the field and my conversations with startups, Kubeflow not having a solution for RAG (and for Data Scientists to feel empowered with RAG) is a reasonably large gap as most AI Applications rely on some form of RAG.

I totally agree with you @franciscojavierarceo, but it will take time for the Kubeflow Community to migrate Feast into Kubeflow Core Components, won't it? Nothing stops us from updating the Data WG charter once Feast is officially part of Kubeflow Core Components.

Given that this Working Group will already be established, we won't need to create a new Working Group for that.

Yeah, that sounds good. 👍

Can Feast be included in the WG before it is officially part of the Core Components (since it's an add-on)?

andreyvelich · 2025-01-13T12:39:59Z

Can Feast be included in the WG before it is officially part of the Core Components (since it's an add-on)?

The main concern that I see is that Working Groups don't maintain add-on components, since they live outside of the Kubeflow organization. Given the limited number of active contributors, working groups were created to maintain components living under the Kubeflow GitHub organization. KServe is an exception, given that this project was part of the Kubeflow GitHub org before.

As I mentioned before, my personal opinion is that we should remove the concept of add-ons so as not to confuse our users.
That doesn't mean that add-on projects can't be migrated to the Kubeflow GitHub organization if they want to.

We don't even have a clear explanation of what add-ons are in the WG governance docs: https://github.com/kubeflow/community/tree/master/wgs

The only reference that I found is this proposal from @kubeflow/wg-manifests-leads to introduce contrib components: https://github.com/kubeflow/manifests/blob/master/proposals/20220926-contrib-component-guidelines.md

thesuperzapper · 2025-01-13T16:42:33Z

It's important that Kubeflow Working Groups only try and manage components that we own as an organization. It would be inappropriate for us to claim ownership of external projects.

Previously, we had the concept of a SIG (special interest group) for anything which was not directly maintaining one of the components of Kubeflow but which still had a community of users in Kubeflow.

PS: we should still consider splitting WG Data (transformation - Spark) and WG Metadata (Model Registry). However, I am not sure which one Feast will fit into once it joins Kubeflow officially.

Also, the concept of "external add-ons" allows us to create an "ecosystem" of tools, beyond our own, for things that we don't have a competitor to. It's important we don't get rid of that concept.

We should even expand it to include model frameworks (PyTorch, TensorFlow), and other things that we already integrate with. This lets us create formal pages on the website for them, and help new users better understand our ecosystem.

franciscojavierarceo · 2025-01-13T16:50:00Z

Also, the concept of "external add-ons" allows us to create an "ecosystem" of tools, beyond our own, for things that we don't have a competitor to. It's important we don't get rid of that concept.

We should even expand it to include model frameworks (PyTorch, TensorFlow), and other things that we already integrate with. This lets us create formal pages on the website for them, and help new users better understand our ecosystem.

💯

andreyvelich · 2025-01-13T17:30:35Z

Also, the concept of "external add-ons" allows us to create an "ecosystem" of tools, beyond our own, for things that we don't have a competitor to. It's important we don't get rid of that concept.
We should even expand it to include model frameworks (PyTorch, TensorFlow), and other things that we already integrate with. This lets us create formal pages on the website for them, and help new users better understand our ecosystem.

@thesuperzapper Shouldn't it be the responsibility of distributions to allow users to deploy these additional tools on top of Kubeflow Core Components? If a Kubeflow Core Component needs integration with 3rd-party tools, it can always have a dedicated page in its documentation.

google-oss-prow bot requested review from james-jwu and theadactyl December 14, 2023 18:21

google-oss-prow bot added the size/M label Dec 14, 2023

tarilabs marked this pull request as draft December 14, 2023 20:35

google-oss-prow bot added the do-not-merge/work-in-progress label Dec 14, 2023

tarilabs changed the title ~~WG Lifecycle proposal~~ WG Data(name provisional) proposal Dec 19, 2023

tarilabs mentioned this pull request Jan 3, 2024

Model Registry proposal (ref KF community meeting 20240102) #682

Open

rareddy mentioned this pull request Jan 5, 2024

Action items for adoption of Model Registry in Kubeflow #685

Open

9 tasks

google-oss-prow bot added size/L and removed size/M labels Jan 23, 2024

tarilabs changed the title ~~WG Data(name provisional) proposal~~ WG Data proposal Aug 13, 2024

terrytangyuan reviewed Aug 13, 2024

implement review feedback

f8f544a

address: kubeflow#673 (comment) Signed-off-by: tarilabs <[email protected]>

tarilabs requested a review from terrytangyuan August 22, 2024 07:52

add Feast-related review comments

fa3c318

as suggested. Signed-off-by: tarilabs <[email protected]> Co-authored-by: Francisco Javier Arceo <[email protected]>

terrytangyuan mentioned this pull request Oct 10, 2024

Update meetings info in the community page kubeflow/website#3902

Merged

terrytangyuan mentioned this pull request Nov 6, 2024

add Model Registry line in KF 1.9 release kubeflow/website#3918

Open
