
Conduct market research on versioning #3933

Closed

astrojuanlu opened this issue Jun 6, 2024 · 10 comments

Comments
@astrojuanlu
Member

In https://github.com/kedro-org/kedro/milestone/63 there are several linked issues related to Kedro's Dataset Versioning.

Before we start working on it, we'd want to do a bit of market research on other tools and formats that support versioning. At a minimum, it should include:

The objectives are to:

  • Assess the different ways these tools fulfill the goal of versioning,
  • Determine whether they treat different artifacts (data, models, generic outputs like images) differently (similar to how Kubeflow separates artifact types),
  • Clarify overlaps between the tools.

The end goal is to inform decision making around Kedro Dataset Versioning.

@noklam
Contributor

noklam commented Jun 10, 2024

We should also review #1871

@astrojuanlu
Member Author

Note that the goal of this is not to assess current Kedro versioning capabilities, but rather to provide an outward-looking perspective on what other systems are doing. Ideally, that should inform next steps in https://github.com/kedro-org/kedro/milestone/63

@astrojuanlu astrojuanlu moved this from To Do to In Progress in Kedro Framework Jun 10, 2024
@merelcht merelcht moved this from In Progress to In Review in Kedro Framework Jul 8, 2024
@iamelijahko

iamelijahko commented Jul 9, 2024

Data versioning (Miro Board)

Why "data versioning" is important?

Data versioning is the practice of tracking and managing changes to datasets over time. This includes capturing versions of data as it evolves, enabling reproducibility, rollback capabilities, and auditability. Data versioning is crucial for maintaining data integrity and ensuring that data pipelines and machine learning models are reproducible and reliable.

Feature Comparison Matrix

(image: feature comparison matrix)

1. Delta Lake

Click here to see Delta Lake's versioning workflow

Delta Lake, by Databricks, is an open-source storage layer that enables building a Lakehouse architecture on top of data lakes. It is designed to provide ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake is built on top of Apache Spark and enhances the capabilities of data lakes by addressing common challenges like data reliability and consistency.
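
As a minimal sketch of the time travel capability described above (assuming the delta-spark package is available and a Delta table already exists at the hypothetical path below), reading older versions from PySpark looks roughly like this:

```python
from pyspark.sql import SparkSession

# Assumption: delta-spark is installed and the table below already exists.
spark = (
    SparkSession.builder.appName("delta-time-travel")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "/tmp/delta/events"  # hypothetical table location

# Current state of the table.
current = spark.read.format("delta").load(table_path)

# Time travel: read the table as it existed at an earlier version number...
v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)

# ...or as of a timestamp.
snapshot = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-06-01")
    .load(table_path)
)
```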

Strengths

  1. ACID Transactions: Delta Lake provides strong consistency guarantees through ACID transactions, ensuring data integrity and reliability.
  2. Unified Batch and Streaming Processing: Delta Lake supports both batch and streaming data processing in a unified manner.
  3. Time Travel: Delta Lake's time travel feature allows users to query historical versions of data.
  4. Schema Enforcement and Evolution: Delta Lake enforces schemas at write time and supports schema evolution, allowing changes to the schema without breaking existing queries.
  5. Scalability and Performance: Delta Lake optimizes storage and querying through techniques like data compaction and Z-Ordering.
  6. Integration with Spark: Built on top of Apache Spark, Delta Lake integrates seamlessly with the Spark ecosystem, enabling powerful data processing capabilities.
  7. Rich Ecosystem and Enterprise Support: Backed by Databricks, Delta Lake benefits from a mature ecosystem and commercial support.

Weaknesses

  1. Limited Direct Support for Unstructured Data: Delta Lake is primarily designed for structured and semi-structured data.
  2. Complexity in Setup and Management: Setting up and managing Delta Lake can be complex, particularly for teams not familiar with Spark.
  3. Tight Coupling with Apache Spark: Delta Lake is heavily dependent on Apache Spark for its operations.

2. DVC

Click here to see DVC's versioning workflow

DVC, or Data Version Control, is an open-source tool specifically designed for data science and machine learning projects. It combines the version control power of Git with functionalities tailored for large datasets, allowing users to track data changes, collaborate efficiently, and ensure project reproducibility by referencing specific data versions. Imagine DVC as a special organizer for your data science projects. Just like how Git keeps track of changes you make to your code, DVC keeps track of changes you make to your data. DVC is your “Git for data”!
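
As a hedged sketch of what this looks like in practice, DVC's Python API can read a file as it existed at a given Git revision; the repository URL, file path, and tag below are hypothetical:

```python
import dvc.api

# Hypothetical repository, file path, and Git tag.
repo_url = "https://github.com/example/project"
path = "data/raw/train.csv"

# Read the file contents as tracked at Git revision "v1.0"
# (the actual bytes are fetched from the configured DVC remote).
data_v1 = dvc.api.read(path, repo=repo_url, rev="v1.0")

# Resolve the remote-storage URL backing that version of the file.
url = dvc.api.get_url(path, repo=repo_url, rev="v1.0")
print(url)
```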

Strengths

  1. Integration with Git: DVC seamlessly integrates with Git, leveraging familiar version control workflows for managing datasets and models. This integration makes it easy for teams already using Git to adopt DVC without significant changes to their workflow.
  2. Efficient Large File Management: DVC efficiently handles large files by storing them in remote storage backends and only keeping metadata in the Git repository. This avoids bloating the Git repository and ensures efficient data management.
  3. Reproducibility: DVC's pipeline management and experiment tracking features ensure that data workflows are reproducible. Users can recreate specific experiment runs by tracking versions of data, models, and code.
  4. Flexible Remote Storage: DVC supports various remote storage options, including AWS S3, Google Cloud Storage, Azure Blob Storage, and more. This flexibility allows users to choose storage solutions that best fit their needs.
  5. Experiment Management: DVC's experiment management capabilities, including checkpointing and comparing experiment runs, provide a robust framework for tracking and optimizing machine learning experiments.
  6. Open Source and Community Support: DVC is open source, with an active community contributing to its development and providing support. This ensures continuous improvement and a wealth of shared knowledge and resources.

Weaknesses

  1. CLI Focused: DVC introduces new concepts and CLI commands that users need to learn, which can be a barrier for those not familiar with command-line tools or version control systems.
  2. Limited Scalability for Large, Complex Projects: Managing very large projects with complex data pipelines can become cumbersome with DVC, as it requires careful organization and management of DVC files and configurations.
  3. Limited Native UI: While DVC provides a powerful CLI, its native graphical user interface (UI) options are limited. Users often rely on third-party tools or custom-built interfaces for visualization and management.
  4. Dependency on Git: DVC's strong dependency on Git means that it might not be suitable for environments where Git is not the primary version control system, or where users are not familiar with Git workflows.
  5. Complexity of Collaborative Configurations: Collaboration with others requires multiple configurations such as setting up remote storage, defining roles, and providing access to each contributor, which can be frustrating and time-consuming.
  6. Inefficient Data Addition Process: Adding new data to the storage requires pulling the existing data and recalculating hashes before pushing the whole dataset back.
  7. Lack of Relational Database Features: DVC lacks crucial relational database features, making it an unsuitable choice for those familiar with relational databases.

3. Apache Hudi

Click here to see Hudi's versioning workflow

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework that helps manage large datasets stored in data lakes. It brings core warehouse and database functionality directly to a data lake. Hudi is designed to provide efficient data ingestion, storage, and query capabilities with strong support for incremental data processing. It enables data engineers to build near real-time data pipelines with support for transactions, indexing, and upserts (updates and inserts).
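
A hedged sketch of the upsert and time-travel behaviour described above, assuming Spark is launched with the Hudi bundle on the classpath; the table name, record key, and path are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumption: Spark is started with the Hudi bundle available.
spark = SparkSession.builder.appName("hudi-versioning").getOrCreate()

base_path = "/tmp/hudi/trips"  # hypothetical table location
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

updates = spark.createDataFrame(
    [(1, "2024-06-01 10:00:00", 9.5)], ["trip_id", "ts", "fare"]
)

# Upsert: rows with an existing key are updated, new keys are inserted.
updates.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

# Time travel: query the table as of an earlier commit instant.
old = (
    spark.read.format("hudi")
    .option("as.of.instant", "2024-06-01 09:00:00")
    .load(base_path)
)
```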

Strengths

  1. Efficient Incremental Processing: Hudi excels at incremental data processing, allowing for efficient upserts (updates and inserts) and deletes.
  2. ACID Transactions: Hudi supports ACID transactions, ensuring data consistency and reliability.
  3. Real-Time Data Ingestion: Hudi is designed to support near real-time data ingestion and processing, making it suitable for streaming data applications.
  4. Time Travel and Historical Queries: Hudi supports time travel queries, allowing users to access historical versions of data efficiently.
  5. Schema Evolution: Supports schema evolution, allowing for changes to the schema without significant overhead.
  6. Integration with Big Data Ecosystem: Hudi integrates seamlessly with Apache Spark, Apache Hive, Presto, and other big data tools.

Weaknesses

  1. Complexity in Setup and Management: Hudi can be complex to set up and manage, particularly for teams not familiar with the Hadoop ecosystem.
  2. Limited Support for Unstructured Data: Hudi is primarily focused on structured and semi-structured data.
  3. Performance Overhead: Managing frequent updates and maintaining indexes can introduce performance overhead.
  4. Maturity and Ecosystem: While rapidly maturing, Hudi’s ecosystem may not be as mature as some traditional data management tools.

4. Apache Iceberg

Click here to see Iceberg's versioning workflow

Apache Iceberg is an open-source table format for managing large-scale datasets in data lakes, designed for petabyte-scale data. It ensures data consistency, integrity, and performance, and works efficiently with big data processing engines like Apache Spark, Apache Flink, and Apache Hive. Iceberg combines the reliability and simplicity of SQL tables with high performance, enabling multiple engines to safely work with the same tables simultaneously.
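
A hedged sketch of snapshot inspection and time travel from PySpark, assuming a Spark session configured with an Iceberg catalog; the catalog name, table, warehouse path, snapshot id, and timestamp are all hypothetical:

```python
from pyspark.sql import SparkSession

# Assumption: the Iceberg runtime jar is available to Spark.
spark = (
    SparkSession.builder.appName("iceberg-snapshots")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Current state of the (hypothetical) table.
current = spark.table("demo.db.events")

# Inspect the table's snapshot history via Iceberg's metadata table.
spark.table("demo.db.events.snapshots").show()

# Time travel: read the table at a specific snapshot id...
at_snapshot = (
    spark.read.option("snapshot-id", 1234567890123456789)  # hypothetical id
    .table("demo.db.events")
)

# ...or as of a timestamp (milliseconds since the epoch).
as_of = (
    spark.read.option("as-of-timestamp", 1717200000000)
    .table("demo.db.events")
)
```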

Strengths

  1. Schema and Partition Evolution: Supports non-disruptive schema changes and partition evolution, allowing tables to adapt to changing requirements without data rewriting.
  2. Snapshot Isolation and Time Travel: Offers robust snapshot isolation, enabling time travel to query historical versions of data.
  3. Hidden Partitioning: Abstracts partitioning details from users, simplifying query writing while ensuring efficient data access.
  4. Integration with Multiple Big Data Engines: Supports integration with Apache Spark, Flink, Hive, and other big data processing engines.
  5. Atomic Operations: Ensures atomicity for operations like appends, deletes, and updates, providing strong consistency guarantees.

Weaknesses

  1. Complexity in Setup and Management: Setting up and managing Iceberg tables can be complex, particularly for teams not familiar with big data ecosystems.
  2. Limited Direct Support for Unstructured Data: Primarily designed for structured and semi-structured data.
  3. Ecosystem Maturity: While rapidly maturing, Apache Iceberg's ecosystem is newer compared to some competitors like Delta Lake.

5. Pachyderm

Click here to see Pachyderm's versioning workflow

Pachyderm is an open-source data engineering platform that provides data versioning, pipeline management, and reproducibility for large-scale data processing. It combines data lineage and version control with the ability to manage complex data pipelines, making it an ideal tool for data science and machine learning workflows.

Strengths

  1. Comprehensive Data Lineage: Automatically tracks data transformations, making it easy to audit and trace the source of any data.
  2. Robust Versioning: Provides Git-like version control for data, ensuring all changes are tracked and reproducible.
  3. Scalability and Performance: Built to handle large datasets and complex workflows efficiently.
  4. Integration with Kubernetes: Benefits from Kubernetes’ powerful orchestration capabilities for scaling and managing resources.
  5. Reproducibility: Ensures that every step in a data pipeline can be reproduced exactly, which is critical for reliable data science and machine learning workflows.

Weaknesses

  1. Complexity: Can be complex to set up and manage, especially for users unfamiliar with Kubernetes.
  2. Learning Curve: Has a steep learning curve due to its powerful but intricate features.
  3. Resource Intensive: Requires significant computational resources, particularly for large-scale data processing tasks.

@iamelijahko

iamelijahko commented Jul 9, 2024

Code versioning (Miro Board)

Why "code versioning" is important?

Code versioning is the practice of managing changes to source code over time. It involves tracking and controlling modifications to the codebase to ensure that all changes are recorded, identifiable, and reversible. Code versioning is a fundamental practice in software development and is typically facilitated by version control systems (VCS).

Key Aspects of Code Versioning

  1. Version Control Systems (VCS)
  • Centralized VCS: A single central repository where all versions of the code are stored.
  • Distributed VCS: Each developer has a local copy of the repository, including its full history.
  2. Repositories: A repository is a storage location for the codebase, including all versions of the code and its history.
  3. Commits: A commit is a record of changes made to the code. Each commit includes a unique identifier, a message describing the changes, and metadata such as the author and timestamp.
  4. Branches: Branches allow developers to work on different features, bug fixes, or experiments in parallel without affecting the main codebase. Branches can be merged back into the main branch once the changes are ready.
  5. Tags: Tags are used to mark specific points in the repository's history as significant, such as releases or milestones.
  6. Merging: Merging combines changes from different branches into a single branch, resolving any conflicts that arise from simultaneous modifications.
  7. Conflict Resolution: When changes from different branches conflict, developers must resolve these conflicts to integrate the changes.
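
To make these concepts concrete, here is a minimal sketch using the GitPython library; the repository path and file name are hypothetical, and the equivalent git CLI commands would work just as well:

```python
from pathlib import Path

from git import Repo  # GitPython

# Hypothetical repository path and file name.
repo_path = Path("/tmp/demo-repo")
repo_path.mkdir(parents=True, exist_ok=True)
repo = Repo.init(repo_path)

# Commit: record a change with an author, timestamp, message, and unique id.
(repo_path / "pipeline.py").write_text("print('v1')\n")
repo.index.add(["pipeline.py"])
first = repo.index.commit("Add first version of pipeline.py")
print(first.hexsha)

default = repo.active_branch  # e.g. master/main, depending on git config

# Branch: develop a change in parallel with the main line.
feature = repo.create_head("feature/tuning")
feature.checkout()
(repo_path / "pipeline.py").write_text("print('v2')\n")
repo.index.add(["pipeline.py"])
repo.index.commit("Tune pipeline")

# Merge: bring the feature branch back into the default branch.
default.checkout()
repo.git.merge("feature/tuning")

# Tag: mark a significant point in history, e.g. a release.
repo.create_tag("v1.0", message="First release")
```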

Feature Comparison Matrix

(image: feature comparison matrix)

Click here to see Git's versioning workflow

@iamelijahko

iamelijahko commented Jul 9, 2024

Model versioning (Miro Board)

Why "model versioning" is important?

Model versioning refers to the practice of managing different versions of machine learning models to track changes, ensure reproducibility, and manage deployments. It involves maintaining records of model parameters, architecture, training data, and performance metrics for each version of the model. This practice is crucial for model experimentation, collaboration, auditability, and continuous integration/continuous deployment (CI/CD) processes in machine learning workflows.
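
As a hedged illustration of what this looks like in practice, the sketch below uses MLflow's Model Registry (one of the tools compared below); it assumes a registry-capable MLflow tracking backend, and the experiment and model names are hypothetical:

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: the tracking backend supports the Model Registry
# (e.g. a database-backed tracking server).
mlflow.set_experiment("versioning-demo")  # hypothetical experiment name

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)

    # Track the parameters and metrics that describe this model version.
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Registering under a name creates a new version (1, 2, ...) each time.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="iris-classifier",  # hypothetical model name
    )
```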

Feature Comparison Matrix

(image: feature comparison matrix)

1. MLflow

Click here to see MLflow's versioning workflow

2. DVC

Click here to see DVC's versioning workflow

3. WeightS & Biases

Click here to see W&B's versioning workflow

4. TensorBoard

Click here to see TensorBoard's versioning workflow

5. Neptune.ai

Click here to see Neptune.ai's versioning workflow

@noklam
Contributor

noklam commented Jul 9, 2024

Thanks for the summary, this is a great start.

I found some of the use cases a bit odd; for example, "real-time tracking is essential" is highlighted as a strength of TensorBoard, but I believe W&B and MLflow have both supported this for a long time. I'd also classify TensorBoard as an experiment tracking tool rather than an artifact versioning tool. These points are not too important for the purpose here, so I'd focus my questions on versioning:

  1. How did you calculate the Market Share numbers? They seem to add up to 100%, but what about other solutions, like just using S3 or an equivalent? How many people are using a dedicated versioning tool versus a general-purpose tool?
  2. Why is W&B data versioning "Limited"? What's the reasoning behind this conclusion? https://wandb.ai/site/artifacts
  3. Have we considered research on Dagster? I think it shares some similarity with Kedro (as an asset-oriented DAG tool) and has added a lot more features around versioning, which I am most curious about.

@iamelijahko

iamelijahko commented Jul 9, 2024

Thanks @noklam!
I have just updated the Market Share sections to make the figures self-explanatory. They are basically based on GitHub activity (stars / forks), as well as the developer surveys conducted by Stack Overflow in 2022/23.

@iamelijahko

iamelijahko commented Jul 9, 2024

The rationale for marking tools like MLflow and W&B as "Limited" in data versioning is based on their primary design goals and features compared to tools, like DVC, that are specifically built for comprehensive data versioning:

  1. Specialized Data Versioning Tools: Tools like DVC are specifically designed to manage large datasets, with features like data versioning, data pipeline management, and integration with various storage backends. (source / source)

  2. Integration and Workflow: While MLflow and W&B integrate data versioning as part of their broader ML lifecycle management, their capabilities are more limited when it comes to handling the complexities of large-scale data management independently from the model artifacts. (source)

  3. Community and Documentation: Both MLflow's and W&B's documentation and community discussions emphasize their strengths in model experiment tracking and visualization rather than data versioning. (source)

@iamelijahko

Happy to look into Dagster and other DAG-oriented tools!

@astrojuanlu
Member Author
