# Conduct market research on versioning #3933
We should also review #1871
Note that the goal of this is not to assess current Kedro versioning capabilities, but rather to provide an outward-looking perspective on what other systems are doing. That should ideally inform next steps in https://github.com/kedro-org/kedro/milestone/63
## Data versioning (Miro Board)

### Why is "data versioning" important?

Data versioning is the practice of tracking and managing changes to datasets over time. This includes capturing versions of data as it evolves, enabling reproducibility, rollback capabilities, and auditability. Data versioning is crucial for maintaining data integrity and for ensuring that data pipelines and machine learning models are reproducible and reliable.

### Feature Comparison Matrix

### 1. Delta Lake

Click here to see Delta Lake's versioning workflow

Delta Lake, by Databricks, is an open-source storage layer that enables building a Lakehouse architecture on top of data lakes. It is designed to provide ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake is built on top of Apache Spark and enhances the capabilities of data lakes by addressing common challenges such as data reliability and consistency.

**Strengths**
**Weaknesses**
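For context on how Delta Lake's versioning surfaces to users, here is a minimal PySpark sketch of its "time travel" reads; the table path and version number are illustrative, and it assumes a Spark session with the Delta Lake package on the classpath:

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with the Delta Lake package,
# e.g. `--packages io.delta:delta-spark_2.12:3.1.0`.
spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

path = "/tmp/delta/events"  # illustrative table location

# Every write records a new version of the table in the transaction log.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

# Read the latest version.
latest = spark.read.format("delta").load(path)

# Time travel: read the table as it existed at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```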
### 2. DVC

Click here to see DVC's versioning workflow

DVC, or Data Version Control, is an open-source tool specifically designed for data science and machine learning projects. It combines the version-control power of Git with functionality tailored to large datasets, allowing users to track data changes, collaborate efficiently, and ensure project reproducibility by referencing specific data versions. Think of DVC as a special organizer for your data science projects: just as Git keeps track of changes you make to your code, DVC keeps track of changes you make to your data. DVC is your "Git for data"!

**Strengths**
**Weaknesses**
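To give a flavor of how DVC pins data to Git revisions, here is a minimal sketch using its Python API; the repository URL, file path, and tag are illustrative:

```python
import dvc.api

# Read a file as it existed at a given Git revision (tag, branch, or commit).
# DVC resolves the .dvc metafile at that revision and fetches the matching
# data from remote storage.
data = dvc.api.read(
    "data/raw/iris.csv",                        # illustrative tracked path
    repo="https://github.com/example/project",  # illustrative repo
    rev="v1.0",                                 # illustrative Git tag
)
```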
### 3. Apache Hudi

Click here to see Hudi's versioning workflow

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework that helps manage large datasets stored in data lakes. It brings core warehouse and database functionality directly to a data lake. Hudi is designed to provide efficient data ingestion, storage, and query capabilities, with strong support for incremental data processing. It enables data engineers to build near real-time data pipelines with support for transactions, indexing, and upserts (updates and inserts).

**Strengths**
**Weaknesses**
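To show what Hudi's commit timeline looks like from user code, here is a minimal PySpark upsert sketch; the table name, paths, and schema are illustrative, and it assumes a Spark session with the Hudi bundle on the classpath:

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with the Hudi bundle,
# e.g. `--packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0`.
spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

path = "/tmp/hudi/trips"  # illustrative table location
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

cols = ["trip_id", "ts", "fare"]

# The initial write creates the table and the first commit on the timeline.
spark.createDataFrame([(1, "2024-01-01", 9.5)], cols) \
    .write.format("hudi").options(**hudi_options).mode("overwrite").save(path)

# Upserting the same record key rewrites that row in a new commit, which is
# what incremental and point-in-time queries are built on.
spark.createDataFrame([(1, "2024-01-02", 11.0)], cols) \
    .write.format("hudi").options(**hudi_options).mode("append").save(path)
```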
### 4. Apache Iceberg

Click here to see Iceberg's versioning workflow

Apache Iceberg is an open-source table format for managing large-scale datasets in data lakes, designed for petabyte-scale data. It ensures data consistency, integrity, and performance, and works efficiently with big data processing engines like Apache Spark, Apache Flink, and Apache Hive. Iceberg combines the reliability and simplicity of SQL tables with high performance, enabling multiple engines to safely work with the same tables simultaneously.

**Strengths**
**Weaknesses**
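For a sense of Iceberg's snapshot-based versioning, here is a minimal PySpark SQL sketch; the catalog (`local`), namespace, and table names are illustrative, and it assumes a Spark session configured with the Iceberg runtime and that catalog:

```python
from pyspark.sql import SparkSession

# Assumes Spark is configured with the Iceberg runtime and a catalog named
# `local`, e.g. spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog.
spark = SparkSession.builder.appName("iceberg-snapshots").getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1), (2)")
spark.sql("INSERT INTO local.db.events VALUES (3)")

# Every commit creates an immutable snapshot, exposed via metadata tables.
snaps = spark.sql(
    "SELECT snapshot_id, committed_at FROM local.db.events.snapshots"
)
first_id = snaps.first()["snapshot_id"]

# Time travel: query the table as of a specific snapshot.
spark.sql(f"SELECT * FROM local.db.events VERSION AS OF {first_id}").show()
```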
### 5. Pachyderm

Click here to see Pachyderm's versioning workflow

Pachyderm is an open-source data engineering platform that provides data versioning, pipeline management, and reproducibility for large-scale data processing. It combines data lineage and version control with the ability to manage complex data pipelines, making it an ideal tool for data science and machine learning workflows.

**Strengths**
**Weaknesses**
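Pachyderm's versioning model is Git-like (repos, branches, commits). As a rough sketch only: the following uses the older `python_pachyderm` client, and method names differ in the newer `pachyderm_sdk`, so treat it as indicative rather than authoritative:

```python
import python_pachyderm

# Connects to a local pachd instance (localhost:30650 by default).
client = python_pachyderm.Client()

client.create_repo("images")  # illustrative repo name

# Data lands in a repo as a commit, much like a Git commit.
with client.commit("images", "master") as commit:
    client.put_file_bytes(commit, "/readme.txt", b"hello pachyderm")

# Each commit is an immutable, addressable snapshot of the repo.
for info in client.list_commit("images"):
    print(info.commit.id)
```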
## Code versioning (Miro Board)

### Why is "code versioning" important?

Code versioning is the practice of managing changes to source code over time. It involves tracking and controlling modifications to the codebase to ensure that all changes are recorded, identifiable, and reversible. Code versioning is a fundamental practice in software development and is typically facilitated by version control systems (VCS).

### Key Aspects of Code Versioning
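The key aspects boil down to changes being recorded, identifiable, and reversible. Here is a minimal GitPython sketch of those three properties; the paths, file names, and commit messages are illustrative:

```python
from pathlib import Path
from git import Repo  # GitPython

repo_dir = Path("/tmp/demo-repo")  # illustrative path
repo = Repo.init(repo_dir)

# Recorded: write a file, stage it, and commit it.
(repo_dir / "model.py").write_text("VERSION = 1\n")
repo.index.add(["model.py"])
commit = repo.index.commit("Add model v1")

# Identifiable: every commit has a unique, content-addressed hash.
print(commit.hexsha)

# Reversible: any recorded state of the file can be restored.
repo.git.checkout(commit.hexsha, "--", "model.py")
```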
### Feature Comparison Matrix
## Model versioning (Miro Board)

### Why is "model versioning" important?

Model versioning refers to the practice of managing different versions of machine learning models to track changes, ensure reproducibility, and manage deployments. It involves maintaining records of model parameters, architecture, training data, and performance metrics for each version of the model. This practice is crucial for model experimentation, collaboration, auditability, and continuous integration/continuous deployment (CI/CD) processes in machine learning workflows.

### Feature Comparison Matrix

### 1. MLflow

Click here to see MLflow's versioning workflow

### 2. DVC

Click here to see DVC's versioning workflow

### 3. Weights & Biases

Click here to see W&B's versioning workflow

### 4. TensorBoard

Click here to see TensorBoard's versioning workflow

### 5. Neptune.ai
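To make "model versioning" concrete, here is a minimal MLflow sketch; the model, metric, and registry name are illustrative:

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering under the same name on each run creates a new model
    # version (v1, v2, ...) in the MLflow Model Registry.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="iris-classifier",  # illustrative name
    )
```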
Thanks for the summary, this is a great start. I found some of the use cases a bit odd; for example, "real-time tracking is essential" is highlighted as a strength of TensorBoard, but I believe W&B and MLflow have both supported this for a long time. I'd also classify TensorBoard as an experiment tracking tool rather than an artifact versioning tool. These points are not too important for the purpose, so I'd focus my questions on versioning:
Thanks @noklam!
The rationale for classifying tools like MLflow and W&B as "Limited" in data versioning is based on their primary design goals and features, compared with tools like DVC that are built specifically for comprehensive data versioning:
Happy to look into Dagster and other DAG-oriented tools!
Moved this to https://github.com/kedro-org/kedro/wiki/Market-research-on-versioning-tools; closing as done!
In https://github.com/kedro-org/kedro/milestone/63 there are several linked issues related to Kedro's Dataset Versioning.
Before we start working on it, we'd want to do a bit of market research on other tools and formats that support versioning. At a minimum, it should include:
The objectives are:
The end goal is to inform decision making around Kedro Dataset Versioning.