From f2bc63bb3a6ae7973f464e69d929915428f87f74 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Georges=20Lorr=C3=A9?= <35808396+GeorgesLorre@users.noreply.github.com>
Date: Mon, 25 Sep 2023 18:01:57 +0200
Subject: [PATCH] Update roadmap in readme (#462)

Co-authored-by: Matthias Richter
---
 README.md | 4 +-
 docs/getting_started.md | 49 ++++++++++------------
 docs/guides/build_a_simple_pipeline.md | 10 +++--
 docs/guides/implement_custom_components.md | 8 ++--
 mkdocs.yml | 2 +-
 5 files changed, 34 insertions(+), 39 deletions(-)

diff --git a/README.md b/README.md
index 482e5a498..ad995c15b 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@
 Explore the docs »

- Discord
+ Hello
 PyPI version
 License
 GitHub Workflow Status
@@ -307,10 +307,8 @@ expect to run into rough edges, the foundations are ready and Fondant should alr
 speed up your data preparation work.
 **The following topics are on our roadmap**
-- Local pipeline execution
 - Non-linear pipeline DAGs
 - LLM-focused example pipelines and reusable components
-- Static validation, caching, and partial execution of pipelines
 - Data lineage and experiment tracking
 - Distributed execution, both on and off cluster
 - Support other dataframe libraries such as HF Datasets, Polars, Spark
diff --git a/docs/getting_started.md b/docs/getting_started.md
index 273f2156e..39ca36c1b 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -1,46 +1,41 @@
 # Getting started
-Have a look at this page to learn how to run your first Fondant pipeline. It provides instructions for installing, executing a sample pipeline, and visually exploring the pipeline results using Fondant on your local machine.
+Note: To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git installed on your system.
-## Prerequisite
-In this example, we will utilise Fondant's LocalRunner, which leverages docker compose for the pipeline execution. Therefore, it's important to ensure that docker compose is correctly installed.
+Note: For Apple M1/M2 ship users: - Make sure that Docker uses linux/amd64 platform and not arm64. - In Docker Dashboards’ Settings=3.8.
-To install Fondant via Pip, run:
+This pipeline loads an image dataset and reduces the dataset to png files. For more details on how you can build this pipeline from scratch, check out our [guide](/docs/guides/build_a_simple_pipeline.md). 
+Install Fondant by running:
 ```
 pip install fondant
 ```
-You can validate the installation of fondant by running its root CLI command:
-
+Clone the Fondant GitHub repository
 ```
-fondant --help
+git clone https://github.com/ml6team/fondant.git
 ```
-
-## Demo
-For demonstration purposes, we provide sample pipelines in the Fondant GitHub repository. A great starting point is the pipeline that loads and filters creative commons images. To follow along with the upcoming instructions, you can clone the [repository](https://github.com/ml6team/fondant) and navigate to the `examples/pipelines/filter-cc-25m` folder.
-
-This pipeline loads an image dataset and reduces the dataset to png files. For more details on how you can build this pipeline from scratch, check out our [guide](/docs/guides/build_a_simple_pipeline.md).
-
-## Running the sample pipeline and explore the data
-After navigating to the pipeline directory, we can run the pipeline by using the LocalRunner as follow:
+Make sure that Docker Compose is running, navigate to fondant/examples/pipelines/filter-cc-25m, and initiate the pipeline by executing:
 ```
 fondant run pipeline --local
 ```
-
-The sample pipeline will run and execute three steps, which you can monitor in the logs. It will load data from HuggingFace, filter out images, and then download those images. The pipeline results will be saved to parquet files. If you wish to visually explore the results, you can use the data explorer.
-The following command will start the data explorer:
+Note: For local testing purposes, the pipeline will only download the first 100,000 images. 
If you want to download the full dataset, you will need to modify the component arguments in the pipeline.py file, specifically the following part:
+```python
+load_from_hf_hub = ComponentOp(
+    component_dir="components/load_from_hf_hub",
+    arguments={
+        "dataset_name": "fondant-ai/fondant-cc-25m",
+        "column_name_mapping": load_component_column_mapping,
+        "n_rows_to_load":
+    },
+)
+```
+To visually inspect the results quickly, you can use:
 ```
-fondant explore --base_path
+fondant explore --base_path ./data
 ```
 ### Custom pipelines
-Fondant enables you to leverage existing reusable components and integrate them with custom components. To delve deeper into creating your own pipelines, please explore our [guide](/docs/guides/build_a_simple_pipeline.md). There, you will gain insights into components, various component types, and how to effectively utilise them.
+Fondant enables you to leverage existing reusable components and integrate them with custom components. To delve deeper into creating your own pipelines, please explore our [guide](/docs/guides/build_a_simple_pipeline.md). There, you will gain insights into components, various component types, and how to effectively utilise them.
\ No newline at end of file
diff --git a/docs/guides/build_a_simple_pipeline.md b/docs/guides/build_a_simple_pipeline.md
index 814db430c..cb0d46d23 100644
--- a/docs/guides/build_a_simple_pipeline.md
+++ b/docs/guides/build_a_simple_pipeline.md
@@ -2,9 +2,9 @@
 We present a walkthrough to build by yourself the pipeline presented in the Getting Started section. Have fun!
-**Level**: Beginner
-**Time**: 20min
-**Goal**: After completing this tutorial with Fondant, you will be able to understand the different elements of a pipeline, build, and execute your first pipeline by using existing components.
+**Level**: Beginner
+**Time**: 20min
+**Goal**: After completing this tutorial with Fondant, you will be able to understand the different elements of a pipeline, build, and execute your first pipeline by using existing components.
 **Prerequisite**: Make sure docker compose is installed on your local system
@@ -30,6 +30,7 @@ base_path="./data" # The directory that will be used to store the data
 All you need to initialise a Fondant pipeline are two key parameters:
+
 - **pipeline_name**: This is a name you can use to reference your pipeline. In this example, we've named it after the creative commons-licensed dataset used in the pipeline.
 - **base_path**: This is the base path that Fondant should use for storing artifacts and data. In our case, it's a local directory path. However, it can also be a path to a remote storage bucket provided by a cloud service. Please note that the directory you reference must exist; if it doesn't, make sure to create it.
@@ -38,6 +39,7 @@ All you need to initialise a Fondant pipeline are two key parameters:
 Now it's time to incrementally build our pipeline by adding different execution steps. We refer to these steps as `Components`. Components are executable elements of a pipeline that consume and produce dataframes. The components are defined by a component specification. The component specification is a YAML file that outlines the input and output data structures, along with the arguments utilised by the component and a reference the the docker image used to run the component.
 Fondant offers three distinct component types:
+
 - **Reusable components**: These can be readily used without modification.
 - **Generic components**: They provide the business logic but may require adjustments to the component spec.
 - **Custom components**: The component implementation is user-dependent.
@@ -165,7 +167,7 @@ Finally, we add the component to the pipeline using the `add_op` method. Notably
 Now, you can proceed to execute your pipeline once more and explore the results. In the explorer, you will be able to view the images that have been downloaded. 
-![explorer](/docs/art/guides/explorer.png)
+![explorer](/art/guides/explorer.png)
diff --git a/docs/guides/implement_custom_components.md b/docs/guides/implement_custom_components.md
index 70d0360a2..00ea769b2 100644
--- a/docs/guides/implement_custom_components.md
+++ b/docs/guides/implement_custom_components.md
@@ -1,8 +1,8 @@
 # Guide - Implement custom components
-**Level**: Beginner
-**Time**: 20min
-**Goal**: After completing this tutorial with Fondant, you will be able to build your own custom component and integrate it into a fondant pipeline.
+**Level**: Beginner
+**Time**: 20min
+**Goal**: After completing this tutorial with Fondant, you will be able to build your own custom component and integrate it into a fondant pipeline.
 **Prerequisite**: Make sure docker compose is installed on your local system. We recommend completing the [first tutorial](/docs/guides/build_a_simple_pipeline.md) before proceeding with this one, as this tutorial builds upon the knowledge gained in the previous one.
@@ -22,7 +22,7 @@ This pipeline is an extension of the one introduced in the first tutorial. After
 A component comprises several key elements. First, there's the ComponentSpec YAML file, serving as a blueprint for the component. It defines crucial aspects such as input and output dataframes, along with component arguments.
-![component architecture](/docs/art/guides/component.png)
+![component architecture](/art/guides/component.png)
 The second essential part is a python class, which encapsulates the business logic that operates on the input dataframe.
diff --git a/mkdocs.yml b/mkdocs.yml
index a2b279754..b2a0d9a51 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -41,7 +41,6 @@ nav:
   - Data explorer: data_explorer.md
   - Infrastructure: infrastructure.md
   - Manifest: manifest.md
-  - Contributing: contributing.md
 plugins:
   - mkdocstrings
@@ -61,3 +60,4 @@ markdown_extensions:
       emoji_index: !!python/name:materialx.emoji.twemoji
       emoji_generator: !!python/name:materialx.emoji.to_svg
   - admonition
+  - def_list
\ No newline at end of file