WG Data proposal #673

Open
wants to merge 15 commits into
base: master
7 changes: 7 additions & 0 deletions OWNERS_ALIASES
@@ -11,6 +11,13 @@ aliases:
    - gaocegege
    - johnugeorge
    - tenzen-y
  wg-data-leads:
    - ChenYi015
    - Tomcli
    - andreyvelich
    - franciscojavierarceo
    - rareddy
    - tarilabs
  wg-deployment-leads:
    - PatrickXYS
    - animeshsingh
33 changes: 33 additions & 0 deletions wg-data/README.md
@@ -0,0 +1,33 @@
<!---
This is an autogenerated file!

Please do not edit this file directly, but instead make changes to the
sigs.yaml file in the project root.

To understand how this file is generated, see https://github.com/kubeflow/community/blob/master/generator/README.md
--->
# Data Working Group

The WG "Data" is focused on enhancing the support for data/metadata-related tasks within Kubeflow, with a specific focus on the Spark operator and Model Registry. The group aims to simplify and improve data processing between the various stages of the ML lifecycle, for example from Data Preparation to model training and fine-tuning. The group also aims to facilitate the management of ML model metadata, while ensuring seamless integration with other Kubeflow components. The goal of the Spark on Kubernetes Operator is to make it simpler to run Apache Spark on Kubernetes. It automates deployment and simplifies lifecycle management of Spark Jobs on Kubernetes. The goal of Model Registry is to gather, analyze, and develop model registry requirements of Kubeflow community users.

The [charter](charter.md) defines the scope and governance of the Data Working Group.

## Meetings
* KF Model Registry community meeting (US/EMEA): [Mondays at 7:00PM-8:00PM Europe/Madrid]() (biweekly - every other Monday of the month). [Convert to your timezone](http://www.thetimezoneconverter.com/?t=7:00PM-8:00PM&tz=Europe%2FMadrid).
  * [Meeting notes and Agenda](https://docs.google.com/document/d/1DmMhcae081SItH19gSqBpFtPfbkr9dFhSMCgs-JKzNo/edit?usp=sharing).

## Organizers

* Tommy Li (**[@Tomcli](https://github.com/Tomcli)**), IBM
* Andrey Velichkevich (**[@andreyvelich](https://github.com/andreyvelich)**), Apple
* Ramesh Reddy (**[@rareddy](https://github.com/rareddy)**), Red Hat

## Contact
- Slack: [https://cloud-native.slack.com/archives/C073W572LA2](https://cloud-native.slack.com/archives/C073W572LA2)
- [Mailing list](https://groups.google.com/forum/#!forum/kubeflow-discuss)
- [Open Community Issues/PRs](https://github.com/kubeflow/community/labels/wg%2Farea/wg-data)
- GitHub Teams:
  - [@kubeflow/wg-data-leads](https://github.com/orgs/kubeflow/teams/wg-data-leads) - Team of Data Working Group leads
<!-- BEGIN CUSTOM CONTENT -->

<!-- END CUSTOM CONTENT -->
70 changes: 70 additions & 0 deletions wg-data/charter.md
@@ -0,0 +1,70 @@
# WG Data Charter

This charter adheres to the conventions, roles, and organization management outlined in [wg-governance] for the Working Group "Data".

## Scope

The WG "Data" is focused on enhancing the support for data/metadata-related tasks within Kubeflow, with a specific focus on the [Spark operator](https://github.com/kubeflow/community/pull/672) and [Model Registry](https://github.com/kubeflow/kubeflow/issues/7396).
The group aims to simplify and improve data processing between the various stages of the ML lifecycle, for example from Data Preparation to model training and fine-tuning.
The group also aims to facilitate the management of ML model metadata, while ensuring seamless integration with other Kubeflow components.

An additional goal of the group is to offer common ground for data/metadata-related topics in the MLOps orbit that do not yet have a more specific working group, so they can incubate together as one coherent effort.

Member:

When we say data/metadata, what exactly do we mean here? What would be the differences from the ML perspective?

Member Author (@tarilabs, Aug 13, 2024):

I believe the intent was to distinguish between:

  • Training data(-set)
    used for instance in a Spark job/task
    for example the MNIST or Iris dataset used to train a neural network, or a Spark job that produces such a dataset from the enterprise data sources
  • Metadata
    as managed/indexed by the Model Registry
    for example the author, model format, accuracy metrics, etc., resulting from or intended for an ML training run, which the engineer wants to index in Model Registry while pointing at the resulting ML model

Member:

I think it is good to have a short description in the charter, e.g. add examples for ML data (model, dataset) and ML metadata (artifact location, model training metrics).


For example, Data Preparation, Feature Store, and Model Registry have recently been discussed in the Kubeflow community. While not yet mature enough to have their own working groups, they can be nurtured together as part of this WG.

Member Author:

Data Preparation: not 100% sure whether this might be confused with Notebooks? 🤔

Member Author:

@rimolive suggested to s/Data Preparation/Big Data Processing/ or something like that


### In scope

#### Code, Binaries, and Other relevant assets

- Onboarding and maintenance of the Spark operator for scalable and distributed data processing.
  [See also](https://github.com/kubeflow/spark-operator)
- Continued development of the Model Registry to manage and version machine learning models efficiently.
  [See also](https://github.com/kubeflow/model-registry)
  - Model Registry REST server
  - Model Registry Python client
  - Deployment manifests
  - BFF (backend for frontend) for Model Registry
  - UI front-end for Model Registry
- SDKs and REST APIs for interacting with Kubeflow APIs related to data processing and ML model metadata management.
- CI/CD pipelines for Kubeflow subproject repositories in the scope of this WG.
- Documentation, in the form of Kubeflow website sections and, as necessary, in each repository.

#### Cross-cutting and Externally Facing Processes

- Ensuring seamless integration of these WG subprojects with the rest of the Kubeflow platform. For example:
  - Coordinating with WG Pipelines for integrations of Model Registry with KFP.
  - Coordinating with WG Serving for integrations of Model Registry with KServe and ModelMesh.
- Coordinating with release teams to ensure that the capabilities and subprojects in scope of this WG can be released properly.
- Offering mentorship to support contributors working on data-centric projects that want to integrate with Kubeflow.

### Out of scope

- APIs and components related to:
  - ML exploration and experimentation (covered in Notebooks/Pipelines),
  - ML training (covered in Training),
  - serving ML models for inference (covered in Serving).
- Anything else not explicitly outlined in the scope of this WG.

## Roles and Organization Management

This WG adheres to the Roles and Organization Management outlined in [wg-governance] and opts in to updates and modifications to [wg-governance].

### Additional responsibilities of Chairs

- Coordinating and facilitating discussions on Data-related topics in scope of the WG, within the WG itself and the Kubeflow community.
- Ensuring alignment with overall Kubeflow goals and objectives in the context of data processing and ML model metadata management.

### Additional responsibilities of Tech Leads

- Providing technical guidance and mentorship to contributors working on the Spark operator, Model Registry, and the other projects in scope of this WG.
- Overseeing the technical direction of the subprojects and ensuring consistency with Kubeflow's vision for data processing and metadata management.

### Deviations from [wg-governance]

This WG follows the outlined roles and governance in [wg-governance].

### Subproject Creation

Subprojects are created by WG Technical Leads.

[wg-governance]: ../wgs/wg-governance.md
1 change: 1 addition & 0 deletions wg-list.md
@@ -23,6 +23,7 @@ When the need arises, a [new WG can be created](wgs/wg-lifecycle.md)
| Name | Label | Chairs | Contact | Meetings |
|------|-------|--------|---------|----------|
|[AutoML](wg-automl/README.md)|area/wg-automl|* [Andrey Velichkevich](https://github.com/andreyvelich), Apple<br>* [Ce Gao](https://github.com/gaocegege), Caicloud<br>* [Johnu George](https://github.com/johnugeorge), Nutanix<br>|* [Slack](https://kubeflow.slack.com/messages/wg-automl)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubeflow-discuss)|* Kubeflow AutoML Working Group Meeting (Asia & Europe friendly): [Wednesdays at 11:00am UTC (Coordinated Universal Time) (every 4 weeks on Wednesday from the 10th of March 2021)](https://calendar.google.com/calendar/u/0/r?cid=ZDQ5bnNpZWZzbmZna2Y5MW8wdThoMmpoazRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ)<br>* Kubeflow AutoML Working Group Meeting (US friendly): [Wednesdays at 5:00pm UTC (Coordinated Universal Time) (every 4 weeks on Wednesday from the 24th of March 2021)](https://calendar.google.com/calendar/u/0/r?cid=ZDQ5bnNpZWZzbmZna2Y5MW8wdThoMmpoazRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ)<br>
|[Data](wg-data/README.md)|area/wg-data|* [Tommy Li](https://github.com/Tomcli), IBM<br>* [Andrey Velichkevich](https://github.com/andreyvelich), Apple<br>* [Ramesh Reddy](https://github.com/rareddy), Red Hat<br>|* [Slack](https://cloud-native.slack.com/archives/C073W572LA2)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubeflow-discuss)|* KF Model Registry community meeting (US/EMEA): [Mondays at 7:00PM-8:00PM Europe/Madrid (biweekly - every other Monday of the month)]()<br>
|[Deployment](wg-deployment/README.md)|area/wg-deployment|* [Yao Xiao](https://github.com/PatrickXYS), AWS<br>* [Animesh Singh](https://github.com/animeshsingh), IBM<br>* [Igor Mameshin](https://github.com/mameshini), Agile Stacks<br>* [Vaclav Pavlin](https://github.com/vpavlin), Red Hat<br>* [Yannis Zarkadas](https://github.com/yanniszark), Arrikto<br>|* [Slack](https://kubeflow.slack.com/messages/wg-deployment)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubeflow-discuss)|* Regular WG Meeting (Pacific PM): [Wednesdays at 17:30 PT (Pacific Time) (biweekly - every other Wednesday)]()<br>
|[Manifests](wg-manifests/README.md)|area/wg-manifests|* [Julius von Kohout](https://github.com/juliusvonkohout), DHL<br>* [Kimonas Sotirchos](https://github.com/kimwnasptd), Canonical<br>|* [Slack](https://kubeflow.slack.com/messages/wg-manifests)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubeflow-discuss)|* Regular WG Meeting (Pacific AM): [Thursdays at 08:00 PT (Pacific Time) (biweekly - every other Thursday)]()<br>
|[Notebooks](wg-notebooks/README.md)|area/wg-notebooks|* [Stefano Fioravanzo](https://github.com/StefanoFioravanzo), Arrikto<br>* [Ilias Katsakioris](https://github.com/elikatsis), Arrikto<br>* [Kimonas Sotirchos](https://github.com/kimwnasptd), Canonical<br>* [Mathew Wicks](https://github.com/thesuperzapper)<br>* [Yannis Zarkadas](https://github.com/yanniszark), Arrikto<br>|* [Slack](https://kubeflow.slack.com/messages/wg-notebooks)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubeflow-discuss)|* Regular Notebooks Meeting (Australia & Europe friendly): [Thursdays at 11:00 pm PT (Pacific Time) (weekly)]()<br>
60 changes: 60 additions & 0 deletions wgs.yaml
@@ -135,6 +135,66 @@ workinggroups:
      - name: katib
        owners:
          - https://raw.githubusercontent.com/kubeflow/katib/master/OWNERS
  - dir: wg-data
    name: Data
    mission_statement: >
      The WG "Data" is focused on enhancing the support for data/metadata-related tasks
      within Kubeflow, with a specific focus on the Spark operator and Model Registry.
      The group aims to simplify and improve data processing between the various stages
      of the ML lifecycle, for example from Data Preparation to model training and fine-tuning.
      The group also aims to facilitate the management of ML model metadata, while ensuring
      seamless integration with other Kubeflow components. The goal of the Spark on Kubernetes
      Operator is to make it simpler to run Apache Spark on Kubernetes.
      It automates deployment and simplifies lifecycle management of Spark Jobs on Kubernetes.
      The goal of Model Registry is to gather, analyze, and develop model registry requirements
      of Kubeflow community users.

    charter_link: charter.md
    label: area/wg-data
    leadership:
      chairs:
        - github: Tomcli
          name: Tommy Li
          company: IBM
        - github: andreyvelich
          name: Andrey Velichkevich
          company: Apple
        - github: rareddy
          name: Ramesh Reddy
          company: Red Hat
      tech_leads:
        - github: ChenYi015
          name: Yi Chen
          company: Alibaba Cloud
        - github: andreyvelich
          name: Andrey Velichkevich
          company: Apple
        - github: franciscojavierarceo
          name: Francisco Javier Arceo
          company: Red Hat
        - github: tarilabs
          name: Matteo Mortari
          company: Red Hat
    meetings:
      - description: KF Model Registry community meeting (US/EMEA)
        day: Monday
        time: 7:00PM-8:00PM
        tz: Europe/Madrid
        frequency: biweekly - every other Monday of the month
        archive_url: https://docs.google.com/document/d/1DmMhcae081SItH19gSqBpFtPfbkr9dFhSMCgs-JKzNo/edit?usp=sharing
    contact:
      slack: https://cloud-native.slack.com/archives/C073W572LA2
      mailing_list: https://groups.google.com/forum/#!forum/kubeflow-discuss
      teams:
        - name: wg-data-leads
          description: Team of Data Working Group leads
    subprojects:
      - name: model-registry
        owners:
          - https://raw.githubusercontent.com/kubeflow/model-registry/main/OWNERS
      - name: spark-operator

@franciscojavierarceo (Aug 13, 2024):

Suggested change
- name: spark-operator
- name: feast
owners:
- https://raw.githubusercontent.com/feast-dev/feast/master/OWNERS
- name: spark-operator

Member Author:

I believe I should not add subprojects from outside github.com/kubeflow here; what is the @kubeflow/kubeflow-steering-committee view on this?

        owners:
          - https://raw.githubusercontent.com/kubeflow/spark-operator/master/OWNERS
  - dir: wg-deployment
    name: Deployment
    mission_statement: >
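
The `owners` entries in the wgs.yaml hunk point at raw OWNERS files. As a reviewer-side aid, here is a minimal sketch of a check that such a URL follows the `raw.githubusercontent.com/<org>/<repo>/<ref>/<path>` layout. The helper name `check_raw_owners_url` is hypothetical and not part of the community generator, and the heuristic assumes no branch is literally named `blob`. The two example URLs are illustrative: one follows the raw layout, the other carries a `blob` segment as github.com page URLs do.

```python
from urllib.parse import urlparse


def check_raw_owners_url(url: str) -> list[str]:
    """Return a list of problems found in a raw OWNERS URL (empty if none).

    Hypothetical helper for illustration only; it inspects the URL text
    and does not perform any network request.
    """
    problems = []
    parsed = urlparse(url)

    if parsed.scheme != "https":
        problems.append("scheme is not https")
    if parsed.netloc != "raw.githubusercontent.com":
        problems.append("host is not raw.githubusercontent.com")

    # Expected path layout: <org>/<repo>/<ref>/<path to file>
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 4:
        problems.append("path is shorter than <org>/<repo>/<ref>/<file>")
    elif parts[2] == "blob":
        # Assumption: 'blob' here is the github.com page-URL segment,
        # not a real branch name.
        problems.append("'blob' segment belongs to github.com page URLs")

    if not parts or parts[-1] != "OWNERS":
        problems.append("path does not end in OWNERS")

    return problems


if __name__ == "__main__":
    examples = [
        "https://raw.githubusercontent.com/kubeflow/model-registry/main/OWNERS",
        "https://raw.githubusercontent.com/kubeflow/spark-operator/blob/master/OWNERS",
    ]
    for example in examples:
        print(example, check_raw_owners_url(example))
```

A check like this only validates the URL shape; whether the file exists on the named branch still has to be confirmed against the repository.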