diff --git a/README.md b/README.md index 400db5915..54ad6d611 100644 --- a/README.md +++ b/README.md @@ -24,50 +24,64 @@ It offers: ## 🪤 Why Fondant? -In the age of Foundation Models, control over your data is key and building pipelines -for large-scale data processing is costly, especially when they require advanced -machine learning-based operations. This need not be the case, however, if processing -components would be reusable and exchangeable and pipelines were easily composable. -Realizing this is the main vision behind Fondant. +In the age of Foundation Models, control over your data is key and building pipelines for +large-scale data processing is costly, especially when they require advanced machine learning-based operations. +This need not be the case, however, if processing components would be reusable and exchangeable and pipelines were +easily composable. Realizing this is the main vision behind Fondant.

(back to top)

## 💨 Getting Started -Anxious to get started? Here's is a [step by step guide](https://fondant.readthedocs.io/en/latest/getting_started) to get your first pipeline up and running. +Eager to get started? Here is a [step by step guide](https://fondant.readthedocs.io/en/latest/getting_started) to get your first pipeline up and running. ## 🪄 Example pipelines @@ -80,28 +94,37 @@ We have created several ready-made example pipelines for you to use as a startin ## 🧩 Reusable components -Fondant comes with a library of reusable components, which can jumpstart your pipeline. - -| COMPONENT | DESCRIPTION | -| -------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- | -| **Data loading / writing** | | -| [load_from_hf_hub](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub) | Load a dataset from the Hugging Face Hub | -| [write_to_hf_hub](https://github.com/ml6team/fondant/tree/main/components/write_to_hf_hub) | Write a dataset to the Hugging Face Hub | -| [prompt_based_laion_retrieval](https://github.com/ml6team/fondant/tree/main/components/prompt_based_laion_retrieval) | Retrieve images-text pairs from LAION using prompt similarity | -| [embedding_based_laion_retrieval](https://github.com/ml6team/fondant/tree/main/components/embedding_based_laion_retrieval) | Retrieve images-text pairs from LAION using embedding similarity | -| [download_images](https://github.com/ml6team/fondant/tree/main/components/download_images) | Download images from urls | -| **Image processing** | | -| [embed_images](https://github.com/ml6team/fondant/tree/main/components/embed_images) | Create embeddings for images using a model from the HF Hub | -| [image_resolution_extraction](https://github.com/ml6team/fondant/tree/main/components/image_resolution_extraction) | Extract the resolution from images | -| [filter_image_resolution](https://github.com/ml6team/fondant/tree/main/components/filter_image_resolution) | Filter images based on their resolution | -| [caption images](https://github.com/ml6team/fondant/tree/main/components/caption_images) | Generate captions for images using a model from the HF Hub | -| [segment_images](https://github.com/ml6team/fondant/tree/main/components/segment_images) | Generate segmentation maps for images using a model from the HF Hub | -| [image_cropping](https://github.com/ml6team/fondant/tree/main/components/image_cropping) | Intelligently crop out image borders | -| **Language processing** | Coming soon | -| **Clustering** | Coming soon | +Fondant comes with a library of reusable components, which can jumpstart your pipeline, here are a selected few: + +| COMPONENT | DESCRIPTION | +|----------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------| +| **Data loading** | | +| [load_from_hf_hub](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub) | Load a dataset from the Hugging Face Hub | +| [load_from_parquet](https://github.com/ml6team/fondant/tree/main/components/load_from_parquet) | Load a dataset from a parquet file stored on remotely | +| **Data Retrieval** | | +| [prompt_based_laion_retrieval](https://github.com/ml6team/fondant/tree/main/components/prompt_based_laion_retrieval) | Retrieve images-text pairs from LAION using prompt similarity | +| 
[embedding_based_laion_retrieval](https://github.com/ml6team/fondant/tree/main/components/embedding_based_laion_retrieval) | Retrieve images-text pairs from LAION using embedding similarity |
+| [download_images](https://github.com/ml6team/fondant/tree/main/components/download_images) | Download images from URLs |
+| **Data Writing** | |
+| [write_to_hf_hub](https://github.com/ml6team/fondant/tree/main/components/write_to_hf_hub) | Write a dataset to the Hugging Face Hub |
+| [index_weaviate](https://github.com/ml6team/fondant/tree/main/components/index_weaviate) | Index text and write it to a [Weaviate](https://weaviate.io/) database |
+| **Image processing** | |
+| [embed_images](https://github.com/ml6team/fondant/tree/main/components/embed_images) | Create embeddings for images using a model from the HF Hub |
+| [image_resolution_extraction](https://github.com/ml6team/fondant/tree/main/components/image_resolution_extraction) | Extract the resolution from images |
+| [filter_image_resolution](https://github.com/ml6team/fondant/tree/main/components/filter_image_resolution) | Filter images based on their resolution |
+| [caption_images](https://github.com/ml6team/fondant/tree/main/components/caption_images) | Generate captions for images using a model from the HF Hub |
+| [segment_images](https://github.com/ml6team/fondant/tree/main/components/segment_images) | Generate segmentation maps for images using a model from the HF Hub |
+| [image_cropping](https://github.com/ml6team/fondant/tree/main/components/image_cropping) | Intelligently crop out image borders |
+| **Text processing** | |
+| [embed_text](https://github.com/ml6team/fondant/tree/main/components/embed_text) | Create embeddings for text using a model from the HF Hub |
+| [chunk_text](https://github.com/ml6team/fondant/tree/main/components/chunk_text) | Extract chunks from long text paragraphs |
+| [normalize_text](https://github.com/ml6team/fondant/tree/main/components/normalize_text) | Implements several normalization techniques to clean and preprocess textual data |
+| [filter_text_length](https://github.com/ml6team/fondant/tree/main/components/filter_text_length) | Filter text based on character and word length |
+
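To illustrate how one of these reusable components slots into a pipeline, the snippet below adapts the usage example from the `filter_text_length` README further down in this diff; the pipeline name, base path, and argument values are illustrative only.

```python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="my_pipeline",  # illustrative name
    base_path="./data",           # illustrative base path
)

# Pull a reusable component from the Fondant registry and configure it
filter_text_length_op = ComponentOp.from_registry(
    name="filter_text_length",
    arguments={
        "min_characters_length": 20,  # example values, mirroring the component's tests
        "min_words_length": 4,
    },
)

# Register the component; pass dependencies=[...] to chain it after a previous component
pipeline.add_op(filter_text_length_op)
```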

(back to top)

+Check out the [components](https://github.com/ml6team/fondant/tree/main/components) section for a full list of available components. ## ⚒️ Installation Fondant can be installed using pip: @@ -118,22 +141,45 @@ pip install git+https://github.com/ml6team/fondant.git ### 🧱 Running Fondant pipelines -There are 3 ways to run fondant pipelines: - -- [**Local runner**](https://github.com/ml6team/fondant/blob/main/docs/pipeline.md#local-runner): leverages [docker compose](https://docs.docker.com/compose/). The local runner is mainly aimed -at helping you develop fondant pipelines and components faster since it allows you to develop on your local machine or a Virtual Machine. -- This enables you to quickly iterate on development.Once you have a pipeline developed, you can use the other runners mentioned below -for better scaling, monitoring and reproducibility. -- [**Vertex runner**](https://github.com/ml6team/fondant/blob/main/docs/pipeline.md#vertex-runner): Uses Google cloud's [Vertex AI pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) to help you -orchestrate your Fondant pipelines in a serverless manner. This makes it easy to scale up your pipelines without worrying about infrastructure -deployment. -- [**Kubeflow runner**](https://github.com/ml6team/fondant/blob/main/docs/pipeline.md#kubeflow): Leverages [Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) on any Kubernetes cluster. -All Fondant needs is a url pointing to the Kubeflow pipeline host and an Object Storage provider (S3, GCS, etc) to store data produced in the pipeline between steps. -We have compiled some references and created some scripts to [get you started](https://fondant.readthedocs.io/en/latest/infrastructure) with setting up the required infrastructure. - - -It is worth noting that the same pipeline can be used across all runners allowing you to quickly develop and iterate using the local -runner and then using the Vertex or Kubeflow runner to run a large scale pipeline. +Fondant pipelines can be run on different platforms. + + + + + + + + + + +
+- **LocalRunner**: Uses Docker Compose to run locally on your machine – great for developing, testing, and debugging.
+- **VertexRunner**: Runs on Vertex AI Pipelines.
+- **KubeflowRunner**: Runs on Kubeflow Pipelines.
+- 🚧 **SageMakerRunner** 🚧: Runs on SageMaker Pipelines.
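The same pipeline definition can be reused across runners. As a sketch based on the runner documentation added later in this diff (the `<...>` placeholders are illustrative):

```bash
# Local development run (docker compose)
fondant run local <pipeline_ref>

# Serverless run on Vertex AI Pipelines
fondant run vertex \
  --project-id <project_id> \
  --project-region <project_region> \
  --service-account <service_account>

# Run on an existing Kubeflow Pipelines deployment
fondant run kubeflow \
  --host <host>
```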

(back to top)

diff --git a/components/text_length_filter/Dockerfile b/components/filter_text_length/Dockerfile similarity index 100% rename from components/text_length_filter/Dockerfile rename to components/filter_text_length/Dockerfile diff --git a/components/text_length_filter/README.md b/components/filter_text_length/README.md similarity index 86% rename from components/text_length_filter/README.md rename to components/filter_text_length/README.md index 01ee0ba1a..ed89dd128 100644 --- a/components/text_length_filter/README.md +++ b/components/filter_text_length/README.md @@ -29,15 +29,15 @@ You can add this component to your pipeline using the following code: from fondant.pipeline import ComponentOp -text_length_filter_op = ComponentOp.from_registry( - name="text_length_filter", +filter_text_length_op = ComponentOp.from_registry( + name="filter_text_length", arguments={ # Add arguments # "min_characters_length": 0, # "min_words_length": 0, } ) -pipeline.add_op(text_length_filter_op, dependencies=[...]) #Add previous component as dependency +pipeline.add_op(filter_text_length_op, dependencies=[...]) #Add previous component as dependency ``` ### Testing diff --git a/components/text_length_filter/fondant_component.yaml b/components/filter_text_length/fondant_component.yaml similarity index 100% rename from components/text_length_filter/fondant_component.yaml rename to components/filter_text_length/fondant_component.yaml diff --git a/components/text_length_filter/requirements.txt b/components/filter_text_length/requirements.txt similarity index 100% rename from components/text_length_filter/requirements.txt rename to components/filter_text_length/requirements.txt diff --git a/components/text_length_filter/src/main.py b/components/filter_text_length/src/main.py similarity index 95% rename from components/text_length_filter/src/main.py rename to components/filter_text_length/src/main.py index 90ce14883..3e2f472a4 100644 --- a/components/text_length_filter/src/main.py +++ b/components/filter_text_length/src/main.py @@ -8,7 +8,7 @@ logger = logging.getLogger(__name__) -class TextLengthFilterComponent(PandasTransformComponent): +class FilterTextLengthComponent(PandasTransformComponent): """A component that filters out text based on their length.""" def __init__(self, *_, min_characters_length: int, min_words_length: int): diff --git a/components/text_length_filter/tests/text_length_filter_test.py b/components/filter_text_length/tests/text_length_filter_test.py similarity index 54% rename from components/text_length_filter/tests/text_length_filter_test.py rename to components/filter_text_length/tests/text_length_filter_test.py index fbbbb1aba..eea98864e 100644 --- a/components/text_length_filter/tests/text_length_filter_test.py +++ b/components/filter_text_length/tests/text_length_filter_test.py @@ -1,8 +1,7 @@ """Unit test for text length filter component.""" import pandas as pd -from fondant.core.component_spec import ComponentSpec -from components.text_length_filter.src.main import TextLengthFilterComponent +from components.filter_text_length.src.main import FilterTextLengthComponent def test_run_component_test(): @@ -16,17 +15,10 @@ def test_run_component_test(): dataframe = pd.concat({"text": pd.DataFrame(data)}, axis=1, names=["text", "data"]) - # When: The text filter component proceed the dataframe - spec = ComponentSpec.from_file("../fondant_component.yaml") - - component = TextLengthFilterComponent( - spec, - input_manifest_path="./dummy_input_manifest.json", - 
output_manifest_path="./dummy_input_manifest.json", - metadata={}, - user_arguments={"min_characters_length": 20, "min_words_length": 4}, + component = FilterTextLengthComponent( + min_characters_length=20, + min_words_length=4, ) - component.setup(min_characters_length=20, min_words_length=4) dataframe = component.transform(dataframe=dataframe) # Then: dataframe only contains one row diff --git a/components/text_normalization/Dockerfile b/components/normalize_text/Dockerfile similarity index 100% rename from components/text_normalization/Dockerfile rename to components/normalize_text/Dockerfile diff --git a/components/text_normalization/README.md b/components/normalize_text/README.md similarity index 91% rename from components/text_normalization/README.md rename to components/normalize_text/README.md index 6ae6fb97f..edc955a79 100644 --- a/components/text_normalization/README.md +++ b/components/normalize_text/README.md @@ -44,8 +44,8 @@ You can add this component to your pipeline using the following code: from fondant.pipeline import ComponentOp -text_normalization_op = ComponentOp.from_registry( - name="text_normalization", +normalize_text_op = ComponentOp.from_registry( + name="normalize_text", arguments={ # Add arguments # "remove_additional_whitespaces": False, @@ -55,7 +55,7 @@ text_normalization_op = ComponentOp.from_registry( # "remove_punctuation": , } ) -pipeline.add_op(text_normalization_op, dependencies=[...]) #Add previous component as dependency +pipeline.add_op(normalize_text_op, dependencies=[...]) #Add previous component as dependency ``` ### Testing diff --git a/components/text_normalization/fondant_component.yaml b/components/normalize_text/fondant_component.yaml similarity index 97% rename from components/text_normalization/fondant_component.yaml rename to components/normalize_text/fondant_component.yaml index cff40afb0..1bc01dc2f 100644 --- a/components/text_normalization/fondant_component.yaml +++ b/components/normalize_text/fondant_component.yaml @@ -1,5 +1,5 @@ name: Normalize text -image: fndnt/text_normalization:latest +image: fndnt/normalize_text:latest description: | This component implements several text normalization techniques to clean and preprocess textual data: diff --git a/components/text_normalization/requirements.txt b/components/normalize_text/requirements.txt similarity index 100% rename from components/text_normalization/requirements.txt rename to components/normalize_text/requirements.txt diff --git a/components/text_normalization/src/main.py b/components/normalize_text/src/main.py similarity index 98% rename from components/text_normalization/src/main.py rename to components/normalize_text/src/main.py index 7b527cbb8..47220fba4 100644 --- a/components/text_normalization/src/main.py +++ b/components/normalize_text/src/main.py @@ -41,7 +41,7 @@ def any_condition_met(line, discard_condition_functions): ) -class TextNormalizationComponent(PandasTransformComponent): +class NormalizeTextComponent(PandasTransformComponent): """Component that normalizes text.""" def __init__( diff --git a/components/text_normalization/src/utils.py b/components/normalize_text/src/utils.py similarity index 100% rename from components/text_normalization/src/utils.py rename to components/normalize_text/src/utils.py diff --git a/components/text_normalization/test_requirements.txt b/components/normalize_text/test_requirements.txt similarity index 100% rename from components/text_normalization/test_requirements.txt rename to components/normalize_text/test_requirements.txt diff 
--git a/components/text_normalization/tests/component_test.py b/components/normalize_text/tests/component_test.py similarity index 90% rename from components/text_normalization/tests/component_test.py rename to components/normalize_text/tests/component_test.py index 179e1433c..1c10e94ea 100644 --- a/components/text_normalization/tests/component_test.py +++ b/components/normalize_text/tests/component_test.py @@ -1,6 +1,6 @@ import pandas as pd -from src.main import TextNormalizationComponent +from src.main import NormalizeTextComponent def test_transform_custom_componen_test(): @@ -12,7 +12,7 @@ def test_transform_custom_componen_test(): "do_lowercase": True, "remove_punctuation": True, } - component = TextNormalizationComponent(**user_arguments) + component = NormalizeTextComponent(**user_arguments) input_dataframe = pd.DataFrame( [ diff --git a/components/text_normalization/tests/utils_test.py b/components/normalize_text/tests/utils_test.py similarity index 100% rename from components/text_normalization/tests/utils_test.py rename to components/normalize_text/tests/utils_test.py diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 000000000..5f863d2ad --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,10 @@ +# Architecture + +An overview of the architecture of Fondant + +### Coming soon + + +## Conceptual overview + +#### TODO: Add a diagram here \ No newline at end of file diff --git a/docs/art/runners/docker_compose.png b/docs/art/runners/docker_compose.png new file mode 100644 index 000000000..3b1aaafea Binary files /dev/null and b/docs/art/runners/docker_compose.png differ diff --git a/docs/art/runners/kubeflow_pipelines.png b/docs/art/runners/kubeflow_pipelines.png new file mode 100644 index 000000000..739597c8d Binary files /dev/null and b/docs/art/runners/kubeflow_pipelines.png differ diff --git a/docs/art/runners/sagemaker.png b/docs/art/runners/sagemaker.png new file mode 100644 index 000000000..cda0cbb17 Binary files /dev/null and b/docs/art/runners/sagemaker.png differ diff --git a/docs/art/runners/vertex_ai.png b/docs/art/runners/vertex_ai.png new file mode 100644 index 000000000..ceb1b23b2 Binary files /dev/null and b/docs/art/runners/vertex_ai.png differ diff --git a/docs/caching.md b/docs/caching.md new file mode 100644 index 000000000..3eef3ea04 --- /dev/null +++ b/docs/caching.md @@ -0,0 +1,39 @@ +When Fondant runs a pipeline, it checks to see whether an execution exists in the base path based on +the cache key of each component. + +The cache key is defined as the combination of the following: + +1) The **pipeline step's inputs.** These inputs include the input arguments' value (if any). + +2) **The component's specification.** This specification includes the image tag and the fields + consumed and produced by each component. + +3) **The component resources.** Defines the hardware that was used to run the component (GPU, + nodepool). + +If there is a matching execution in the base path (checked based on the output manifests), +the outputs of that execution are used and the step computation is skipped. +This helps to reduce costs by skipping computations that were completed in a previous pipeline run. + +Additionally, only the pipelines with the same pipeline name will share the cache. Caching for +components +with the `latest` image tag is disabled by default. This is because using "latest" image tags can +lead to unpredictable behavior due to +image updates. 
Moreover, if one component in the pipeline is not cached then caching will be +disabled for all +subsequent components. + +### Disabling caching +You can turn off execution caching at component level by setting the following: + +```python +from fondant.pipeline.pipeline import ComponentOp + +caption_images_op = ComponentOp( + component_dir="...", + arguments={ + ... + }, + cache=False, +) +``` \ No newline at end of file diff --git a/docs/components/hub.md b/docs/components/hub.md index 9b49b00a2..26fc74f2a 100644 --- a/docs/components/hub.md +++ b/docs/components/hub.md @@ -76,11 +76,11 @@ Below you can find the reusable components offered by Fondant. ??? "text_length_filter" - --8<-- "components/text_length_filter/README.md:1" + --8<-- "components/filter_text_length/README.md:1" ??? "text_normalization" - --8<-- "components/text_normalization/README.md:1" + --8<-- "components/normalize_text/README.md:1" ??? "write_to_hf_hub" diff --git a/docs/components/publishing_components.md b/docs/components/publishing_components.md new file mode 100644 index 000000000..388bdd891 --- /dev/null +++ b/docs/components/publishing_components.md @@ -0,0 +1 @@ +# Coming Soon diff --git a/docs/documentation_guide.md b/docs/documentation_guide.md new file mode 100644 index 000000000..0702eb2cb --- /dev/null +++ b/docs/documentation_guide.md @@ -0,0 +1,105 @@ +# Documentation Guide + +## Getting started with Fondant + +Learn about the Fondant project and how to get started with it. + +→ Start with the official guide on how to [install](guides/installation.md) Fondant. +→ Get started by running your first fondant [pipeline](guides/first_pipeline.md) using the [local +runner](runners/local.md). +→ Learn how to build your own [Fondant Pipeline](guides/build_a_simple_pipeline.md) and implement your +own [custom components](guides/implement_custom_components.md). +→ Learn how to use the [data explorer](data_explorer.md) to explore the outputs of your pipeline. + +## Fondant fundamentals + +Learn how to use Fondant to build your own data processing pipeline. + +-> Design your own fondant [pipeline](pipeline.md) using the Fondant pipeline API. +-> Use existing [reusable components](components/hub.md) to build your pipeline. +-> Use [generic components](components/generic_component.md) to load/write your custom data format +to/from Fondant. +-> Build your own [custom component](components/custom_component.md) using the Fondant component +API. +-> Learn how to publish your own [components](components/publishing_components.md) to a container +registry so that you can reuse them in your pipelines. + +## Components hub + +Have a look at the [components hub](components/hub.md) to see what components are available. + +## Fondant Runners + +Learn how to run your Fondant pipeline on different platforms. + + + + + + + +
+LocalRunner · VertexRunner · KubeflowRunner · 🚧 SageMakerRunner 🚧
+ + + +-> [LocalRunner](runners/local.md): ideal for developing fondant pipelines and components faster. +-> [VertexRunner](runners/vertex.md): used for running a fondant pipeline on Vertex AI. +-> [KubeflowRunner](runners/kfp.md): used for running a fondant pipeline on a Kubeflow cluster. +-> [SageMakerRunner](runners/kfp.md): used for running a fondant pipeline on a SageMaker pipelines ( +🚧 Coming Soon 🚧). + +## Fondant Explorer + +Discover how to utilize the Fondant [data explorer](data_explorer.md) to navigate your pipeline +outputs, including visualizing intermediary steps between components. + +## Advanced Concepts + +Learn about some of the more advanced concepts in Fondant. + +-> Learn more about the [architecture](architecture.md) of Fondant and how it works under the +hood. +-> Understand how Fondant passes data between components with the [manifest](manifest.md). +-> Learn how Fondant uses [caching](caching.md) to speed up your pipeline development. +-> Find out how Fondant uses [partitions](partitions.md) to parallelize and scale your pipeline and +how you can use it to your advantage. + +## Contributing + +Learn how to contribute to the Fondant project through +our [contribution guidelines](contributing.md). + +## FAQ + +Browse through the [frequently asked questions](faq.md) about Fondant. + +## Announcements + +Check out our latest [announcements] about Fondant. + +-> 25 million Creative Commons image dataset released. Read more about it [here](announcements/CC_25M_press_release.md). + diff --git a/docs/faq.md b/docs/faq.md new file mode 100644 index 000000000..94eea9941 --- /dev/null +++ b/docs/faq.md @@ -0,0 +1 @@ +# Coming Soon \ No newline at end of file diff --git a/docs/guides/build_a_simple_pipeline.md b/docs/guides/build_a_simple_pipeline.md index 5ebce4a67..02fdf6876 100644 --- a/docs/guides/build_a_simple_pipeline.md +++ b/docs/guides/build_a_simple_pipeline.md @@ -12,18 +12,22 @@ We present a walkthrough to build by yourself the pipeline presented in the Gett The sample pipeline that is going to be built in this tutorial demonstrates how to effectively utilise a creative commons image dataset within a fondant pipeline. This dataset comprises images from diverse sources and is available in various data formats. -The pipeline starts with the initialization of the image dataset sourced from HuggingFace and it proceeds with the downloading of these carefully selected images. Accomplishing these tasks necessitates the use of a pre-built generic component (HuggingFace dataset loading) and a reusable component (image downloading). +The pipeline starts with the initialization of the image dataset sourced from HuggingFace and proceeds with the downloading of these carefully selected images. Accomplishing these tasks necessitates the use of: + +* [load_from_hf_hub](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub): A [generic component](../components/generic_component.md) that loads the initial dataset from the Huggingface hub. +* [download_images](https://github.com/ml6team/fondant/tree/main/components/download_images): A [reusable component](../components/components.md) that downloades images from urls. ## Setting up the environment -To set up your local environment, please refer to our getting started documentation. There, you will find the necessary steps to configure your environment. +We will be using the [local runner](../runners/local.md) to run this pipelines. 
To set up your local environment, please refer to our [installation](installation.md) documentation. ## Building the pipeline -Everything begins with the pipeline definition. Start by creating a 'pipeline.py' file and adding the following code. +Everything begins with the pipeline definition. Start by creating a `pipeline.py` file and adding the following script. ```python from fondant.pipeline import ComponentOp, Pipeline + pipeline = Pipeline( pipeline_name="creative_commons_pipline", # This is the name of your pipeline base_path="./data" # The directory that will be used to store the data @@ -49,7 +53,7 @@ If you want to learn more about components, you can check out the [components do ### First component to load the dataset -For every pipeline, the initial step is data initialization. In our case, we aim to load the dataset into our pipeline base from HuggingFace. Fortunately, there is already a generic component available called `load_from_hub`. +For every pipeline, the initial step is data initialization. In our case, we aim to load the dataset into our pipeline base from HuggingFace. Fortunately, we already have a generic component available called `load_from_hf_hub`. This component is categorised as a generic component because the structure of the datasets we load from HuggingFace can vary from one dataset to another. While we can leverage the implemented business logic of the component, we must customise the component spec. This customization is necessary to inform the component about the specific columns it will produce. To utilise this component, it's time to create your first component spec. Create a folder `component/load_from_hub` and create a `fondant_component.yaml` with the following content: @@ -96,8 +100,8 @@ args: default: None ``` -As mentioned earlier, the component spec specifies the data structure consumed and/or produced by the component. In this case, the component solely produces data, and this structure is defined within the `produces` section. Fondant operates with hierarchical column structures. In our example, we are defining a column called images with several subset fields. -Now that we have created the component spec, we can incorporate the component into our python code. The next steps involve initialising the component from the component spec and adding it to our pipeline using the following code: +As mentioned earlier, the component spec specifies the data structure consumed and/or produced by the component. In this case, the component solely produces data, and this structure is defined within the `produces` section. Fondant operates with hierarchical column structures. In our example, we are defining a column called `images` with several subset fields. +Now that we have created the component specification file, we can incorporate the component into our python code. The next steps involve initialising the component from the component spec and adding it to our pipeline using the following code: ```python from fondant.pipeline import ComponentOp @@ -126,18 +130,20 @@ Two key actions are taking place here: 1. We create a ComponentOp from the registry, configuring the component with specific arguments. In this process, we override default arguments as needed. If we don't provide an argument override, the default values are used. Notably, we are modifying the dataset to be loaded, specifying the number of rows to load (which can be a small number for testing purposes), and mapping columns from the HuggingFace dataset to columns in our dataframe. -2. 
The add_op method registers the configured component into the pipeline. +2. We provide a column mapping argument for the component to change the column names from the [initial dataset](https://huggingface.co/datasets/mrchtr/cc-test) to ones that match the component specification. + +3. The add_op method registers the configured component into the pipeline. To test the pipeline, you can execute the following command within the pipeline directory: -``` +```bash fondant run local pipeline.py ``` The pipeline execution will start, initiating the download of the dataset from HuggingFace. After the pipeline has completed, you can explore the pipeline result using the fondant explorer: -``` +```bash fondant explore --base_path ./data ``` @@ -158,17 +164,17 @@ download_images = ComponentOp.from_registry( arguments={} ) -pipeline.add_op(download_images, dependencies=[load_from_hf_hub]) +pipeline.add_op(download_images, dependencies=load_from_hf_hub) ``` -The reusable component requires a specific dataset input format to function effectively. Referring to the ComponentHub documentation, this component downloads images based on the URLs provided in the `image_url` column. Fortunately, the column generated by the first component is already named correctly for this purpose. +The reusable component requires a specific dataset input format to function effectively. Referring to the [component's documentation](https://hub.docker.com/r/fndnt/download_images), this component downloads images based on the URLs provided in the `image_url` column. Fortunately, the column generated by the first component is already named correctly for this purpose. Instead of initialising the component from a YAML file, we'll use the method `ComponentOp.from_registry(...)` where we can easily specify the name of the reusable component. This is arguably the simplest way to start using a Fondant component. -Finally, we add the component to the pipeline using the `add_op` method. Notably, we define `dependencies=[load_from_hf_hub]` in this step. This command ensures that we chain both components together. Specifically, the `download_images` component awaits the execution input from the `load_from_hf_hub` component. +Finally, we add the component to the pipeline using the `add_op` method. Notably, we define `dependencies=load_from_hf_hub` in this step. This command ensures that we chain both components together. Specifically, the `download_images` component awaits the execution input from the `load_from_hf_hub` component. Now, you can proceed to execute your pipeline once more and explore the results. In the explorer, you will be able to view the images that have been downloaded. ![explorer](https://github.com/ml6team/fondant/blob/main/docs/art/guides/explorer.png?raw=true) -Well done! You have now acquired the skills to construct a simple Fondant pipeline by leveraging generic and reusable components. In our [upcoming tutorial](../guides//implement_custom_components.md), we'll demonstrate how you can customise the pipeline by implementing a custom component. +Well done! You have now acquired the skills to construct a simple Fondant pipeline by leveraging generic and reusable components. In the [following tutorial](implement_custom_components.md), we'll demonstrate how you can customise the pipeline by implementing a custom component. 
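For reference, the two-component pipeline assembled in this guide could look roughly as follows once the pieces are put together. This is a sketch only: the `load_from_hf_hub` argument names and values are assumptions based on the guide's description, and the column mapping discussed above is only hinted at in a comment.

```python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="creative_commons_pipline",  # pipeline name used in the guide
    base_path="./data",                        # local directory used to store the data
)

# Generic component: load the initial dataset from the Hugging Face Hub,
# using the customised fondant_component.yaml in component/load_from_hub
load_from_hf_hub = ComponentOp(
    component_dir="component/load_from_hub",
    arguments={
        "dataset_name": "mrchtr/cc-test",  # dataset referenced in the guide
        "n_rows_to_load": 100,             # illustrative value; keep it small for testing
        # the guide also passes a column mapping from the dataset's columns
        # to the columns defined in the component spec
    },
)

# Reusable component: download the images behind the loaded URLs
download_images = ComponentOp.from_registry(
    name="download_images",
    arguments={},
)

pipeline.add_op(load_from_hf_hub)
pipeline.add_op(download_images, dependencies=load_from_hf_hub)
```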
diff --git a/docs/getting_started.md b/docs/guides/first_pipeline.md similarity index 60% rename from docs/getting_started.md rename to docs/guides/first_pipeline.md index afdf35dd9..1bcb2710a 100644 --- a/docs/getting_started.md +++ b/docs/guides/first_pipeline.md @@ -1,26 +1,9 @@ # Getting started -!!! note - - To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git - installed on your system. - -!!! note - - For Apple M1/M2 ship users:
- - Make sure that Docker uses linux/amd64 platform and not arm64.
- - In Docker Dashboards’ Settings **Prerequisite**: Make sure docker compose is installed on your local system. -We recommend completing the [first tutorial](../guides/build_a_simple_pipeline.md) before proceeding with this one, as this tutorial builds upon the knowledge gained in the previous one. +We recommend completing the [first tutorial](/build_a_simple_pipeline.md) before proceeding with this one, as this tutorial builds upon the knowledge gained in the previous one. ## Overview -In the [initial tutorial](../guides/build_a_simple_pipeline.md), you learned how to create your first Fondant pipeline. While the example demonstrates initialising the dataset from HuggingFace and using a reusable component to download images, this is just the beginning. +In the [initial tutorial](/build_a_simple_pipeline.md), you learned how to create your first Fondant pipeline. While the example demonstrates initialising the dataset from HuggingFace and using a reusable component to download images, this is just the beginning. The true power of Fondant lies in its capability to enable you to construct your own data pipelines to create high-quality datasets. To achieve this, we need to implement custom components. @@ -29,12 +29,12 @@ In addition to these core components, there are a few other necessary items, inc ### Creating the ComponentSpec -First of all we create the following ComponentSpec (fondant_component.yaml) file in the folder `components/filter_images`: +First of all we create the following ComponentSpec as a `fondant_component.yaml` file in the folder `components/filter_images`: ```yaml name: Filter file type description: Component that filters on mime types -image: filter_images +image: filter_images:latest consumes: images: @@ -56,7 +56,7 @@ args: It begins by specifying the component name, providing a brief description, and naming the component's Docker image url. -Following this, we define the structure of input and output dataframes, consumes and `produces`, which dictate the columns and subset fields the component will operate on. In this example, our goal is to filter images based on file types. For the sake of simplicity, we will work with image URLs, assuming that the file type is identifiable within the URL (e.g., \*.png). Consequently, our component consumes image_urls and produces image_urls as well. +Following this, we define the structure of input and output dataframes specified by the `consumes` and `produces`, which dictate the columns and subset fields the component will operate on. In this example, our goal is to filter images based on file types. For the sake of simplicity, we will work with image URLs, assuming that the file type is identifiable within the URL (e.g., \*.png). Consequently, our component consumes and produces `image_urls`. Lastly, we define custom arguments that the component will support. In our case, we include the `mime_type argument`, which allows us to filter images by different file types in the future. @@ -98,11 +98,11 @@ class FileTypeFilter(PandasTransformComponent): By doing this, we create a custom component that inherits from a `PandasTransformComponent`. This specialised component works with pandas dataframes, allowing us the freedom to modify them as needed before returning the resulting dataframe. In this particular example, our component guesses the MIME type of each image based on its URL. Subsequently, it adds this information to the `images` subset of the dataframe and returns the filtered dataset based on the desired MIME type. 
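Before moving on to the Dockerfile, here is a minimal sketch of what such a component's `src/main.py` could look like. The `fondant.component` import path, the MultiIndex column name, and the `mimetypes`-based check are assumptions made for illustration, not the guide's exact code.

```python
import mimetypes

import pandas as pd
from fondant.component import PandasTransformComponent


class FileTypeFilter(PandasTransformComponent):
    """Filter out rows whose image URL does not match the requested mime type."""

    def __init__(self, *_, mime_type: str):
        self.mime_type = mime_type

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        # Guess the mime type from the URL, e.g. "*.png" -> "image/png"
        guessed_types = dataframe[("images", "image_url")].map(
            lambda url: mimetypes.guess_type(url)[0],
        )
        # Keep only the rows whose guessed type matches the requested one
        return dataframe[guessed_types == self.mime_type]
```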
-### Build the component +### Building the component To use the component, Fondant must package it into an executable Docker image. To achieve this, we need to define a Dockerfile. You can create this file within the `components/filter_images` folder using the following content: -``` +```bash FROM --platform=linux/amd64 python:3.8-slim as base # System dependencies @@ -128,11 +128,11 @@ COPY src/ . ENTRYPOINT ["fondant", "execute", "main"] ``` -As part of the Dockerfile build process, we install necessary dependencies. Consequently, we must create a `requirements.txt` file in the `components/filter_images` folder. If your components logic demands custom libraries, you can include them in the requirements.txt file but for this example it can be empty since we don't need any extra libraries. +As part of the Dockerfile build process, we install necessary dependencies. Consequently, we must create a `requirements.txt` file in the `components/filter_images` folder. If your components logic demands custom libraries, you can include them in the `requirements.txt` file but for this example it can be empty since we don't need any additional libraries. ## Make use of your component -To utilise your component, you can incorporate it into the pipeline created in [this guide](../guides/build_a_simple_pipeline.md). To do this, you'll need to add the following code to the `pipeline.py` file: +To utilise your component, you can incorporate it into the pipeline created in [this guide](/build_a_simple_pipeline.md). To do this, you'll need to add the following code to the `pipeline.py` file: ```python # Filter mime type component @@ -142,14 +142,14 @@ filter_mime_type = ComponentOp( ) ``` -We initialise the component from a local path, similar to the generic component. However, in this case, the component will be built entirely based on your local files, as the folders contain additional information beyond the ComponentSpec. +We initialise the component from a local path, similar to the generic component. However, in this case, the component will be built using your local files. Lastly, we need to make adjustments to the pipeline. The step for downloading images can be network-intensive since it involves actual downloads. As a result, we want to pre-filter the files before proceeding with the downloads. To achieve this, we'll modify the pipeline as follows: ```python pipeline.add_op(load_from_hf_hub) -pipeline.add_op(filter_mime_type, dependencies=[load_from_hf_hub]) -pipeline.add_op(download_images, dependencies=[filter_mime_type]) +pipeline.add_op(filter_mime_type, dependencies=load_from_hf_hub) +pipeline.add_op(download_images, dependencies=filter_mime_type) ``` We are inserting our custom component as an intermediary step within our pipeline. diff --git a/docs/guides/installation.md b/docs/guides/installation.md new file mode 100644 index 000000000..4b4b17212 --- /dev/null +++ b/docs/guides/installation.md @@ -0,0 +1,69 @@ +## Installing Fondant + +Install Fondant by running: + +```bash +pip install fondant +``` + +Fondant also includes extra dependencies for specific runners, storage integrations and publishing components to registries. 
+
+### Runner specific dependencies
+
+For the Kubeflow runner:
+```bash
+pip install fondant[kfp]
+```
+
+For the SageMaker runner:
+```bash
+pip install fondant[SageMaker]
+```
+
+For the Vertex runner:
+```bash
+pip install fondant[Vertex]
+```
+
+### Storage integration dependencies
+
+For Google Cloud Storage (GCS):
+```bash
+pip install fondant[gcp]
+```
+
+For S3 storage:
+```bash
+pip install fondant[aws]
+```
+
+For Azure storage:
+```bash
+pip install fondant[azure]
+```
+
+### Publishing components dependencies
+
+For publishing components to registries:
+```bash
+pip install fondant[docker]
+```
+
+Check out the [guide](../components/publishing_components.md) on publishing components to registries.
+
+## Docker installation
+
+To execute pipelines locally, you must have
+[Docker Compose](https://docs.docker.com/compose/install/) and Python >=3.8
+installed on your system.
+
+#### TODO: Modify/extend considerations for Docker-compose/desktop
+
+For Apple M1/M2 chip users:
+ +- Make sure that Docker uses linux/amd64 platform and not arm64.
+- In Docker Dashboards’ Settings -For the **Vertex** and **Kubeflow** runners, make sure that the service account attached to those runners has read/write access. -* **A local directory**: only valid for the local runner, points to a local directory. This is useful for local development. - -Next, we define two operations: `load_from_hub_op`, which is a based from a reusable component loaded from the Fondant registry, and `caption_images_op`, which is a custom component defined by you. We add these components to the pipeline using the `.add_op()` method and specify the dependencies between components to build the DAG. +* **A remote cloud location (S3, GCS, Azure Blob storage)**: valid across all runners. + For the **local runner**, make sure that your local credentials or service account have read/write + access to the + designated base path and that they are mounted.
+ For the **Vertex** and **Kubeflow** runners, make sure that the service account attached to those + runners has read/write access. +* **A local directory**: only valid for the local runner, points to a local directory. This is + useful for local development. +Next, we define two operations: `load_from_hub_op`, which is a based from a reusable component +loaded from the Fondant registry, and `caption_images_op`, which is a custom component defined by +you. We add these components to the pipeline using the `.add_op()` method and specify the +dependencies between components to build the DAG. !!! note "IMPORTANT" - Currently Fondant supports linear DAGs with single dependencies. Support for non-linear DAGs will be available in future releases. +Currently Fondant supports linear DAGs with single dependencies. Support for non-linear DAGs will be +available in future releases. ## Compiling and Running a pipeline -Once all your components are added to your pipeline you can use different compilers to run your pipeline: +Once all your components are added to your pipeline you can use different compilers to run your +pipeline: !!! note "IMPORTANT" - When using other runners you will need to make sure that your new environment has access to: - - The base path of your pipeline (as mentioned above) - - The images used in your pipeline (make sure you have access to the registries where the images are stored) +When using other runners you will need to make sure that your new environment has access to: + +- The base path of your pipeline (as mentioned above) +- The images used in your pipeline (make sure you have access to the registries where the images are + stored) ```bash fondant compile ``` -The pipeline ref is reference to a fondant pipeline (e.g. `pipeline.py`) where a pipeline instance exists (see above). +The pipeline ref is reference to a fondant pipeline (e.g. `pipeline.py`) where a pipeline instance +exists (see above). This will produce a pipeline spec file associated with a given runner. -To run the pipeline you can use the following command: +To run the pipeline you can use the following command: ```bash fondant run ``` -Here, the pipeline ref can be either be a path to a compiled pipeline spec or a reference to fondant pipeline (e.g. `pipeline.py`) in which case -the pipeline will first be compiled to the corresponding runner specification before running the pipeline. -### Local Runner - -The local runner is mainly aimed at local development and quick iterations, it only scales to the machine that is running the pipeline. -Switching to either the Vertex or Kubeflow runners offers many advantages such as the ability to assign specific hardware requirements, better monitoring and pipeline reproducibility. - -In order to use the local runner, you need to have a recent version of [docker-compose](https://docs.docker.com/compose/install/) installed. - -#### Running a Docker compiled pipeline - - -```bash -fondant run local -``` - -NOTE: that the pipeline ref is the path to the compiled pipeline spec OR a reference to a fondant pipeline in which case a Docker compiler will compile the pipeline -to a docker compose specification before running the pipeline.This will start the pipeline and provide logs per component (service). - -Components that are not located in the registry (local custom components) will be built on runtime. This allows for quicker iteration -during component development. - -The local runner will try to check if the `base_path` of the pipeline is a local or remote storage. 
If it's local, the `base_path` will be mounted as a bind volume on every service/component. - -If you want to use remote paths (GCS, S3, etc.) you can use the `--auth-gcp`, `--auth-aws` or `--auth-azure`. -This will mount your default local cloud credentials to the pipeline. Make sure you are authenticated locally before running the pipeline and -that you have the correct permissions to access the `base_path` of the pipeline (read/write/create). - -You can also use the `--extra_volumes` argument to mount extra credentials or additional files. -This volumes will be mounted to every component/service of the docker-compose spec. - -```bash -fondant run local --auth-gcp -``` - -### Vertex Runner - -Vertex AI pipelines leverages Kubeflow pipelines under the hood. The Vertex compiler will take your pipeline and compile it to a Kubeflow pipeline spec. -This spec can be used to run your pipeline on Vertex. - -### Running a Vertex compiled pipeline - -You will first need to make sure that your Google Cloud environment is properly setup. More info [here](https://codelabs.developers.google.com/vertex-pipelines-intro#2) - -```bash -fondant run vertex \ ---project-id \ ---project-region \ ---service-account -``` - -Once your pipeline is running you can monitor it using the Vertex UI - -### Kubeflow Runner - -You will need a Kubeflow cluster to run your pipeline on and specify the host of that cluster. More info on setting up a Kubeflow pipelines deployment and the host path can be found in the [kubeflow infrastructure documentation](kubeflow_infrastructure.md). - -```bash -fondant run kubeflow \ - --host -``` - -Once your pipeline is running you can monitor it using the Kubeflow UI. - -### Assigning custom resources to components - - - - - - - - - - - -
Local runner | Vertex Runner | Kubeflow Runner
- -```python -component = ComponentOp( - component_dir="...", - arguments={ - ..., - }, - resources=Resources( - accelerator_number=1, - accelerator_name="GPU", - ) -) - -``` - - - -```python -component = ComponentOp( - component_dir="...", - arguments={ - ..., - }, - resources=Resources( - accelerator_number=1, - accelerator_name="NVIDIA_TESLA_K80", - memory_limit="512M", - cpu_limit="4", - ) -) -``` - - - -```python -component = ComponentOp( - component_dir="...", - arguments={ - ..., - }, - resources=Resources( - accelerator_number=1, - accelerator_name="GPU", - node_pool_label="node_pool", - node_pool_name="n2-standard-128-pool", -) -``` - -
- -* **Local Runner**: The local runner uses the computation resources (RAM, CPU) of the host machine. In case a GPU is available and is needed for a component, -it needs to be assigned explicitly. - - -* **Vertex Runner**: The computation resources needs to be assigned explicitly, Vertex will then randomly attempt to allocate -a machine that fits the resources. The GPU name needs to be assigned explicitly. Check this [link](https://github.com/googleapis/python-aiplatform/blob/main/google/cloud/aiplatform_v1/types/accelerator_type.py) -for a list of available GPU resources. Make sure to check that the chosen GPU is available in the region where the pipeline will be run. - - -* **Kubeflow Runner**: Each component can optionally be constrained to run on particular node(s) using `node_pool_label` and `node_pool_name`. You can find these under the Kubernetes labels of your cluster. -You can use the default node label provided by Kubernetes or attach your own. Note that the value of these labels is cloud provider specific. Make sure to assign a GPU if required, the specified node needs to -have an available GPU. Note that you can also setup a component to use a preemptible VM by setting `preemptible` to `True`. -This Requires the setup and assignment of a preemptible node pool. Note that preemptibles only work -when KFP is setup on GCP. More info [here](https://v1-6-branch.kubeflow.org/docs/distributions/gke/pipelines/preemptible/). - - - -## Caching pipeline runs - -When Fondant runs a pipeline, it checks to see whether an execution exists in the base path based on the cache key of each component. - -The cache key is defined as the combination of the following: - -1) The **pipeline step's inputs.** These inputs include the input arguments' value (if any). - -2) **The component's specification.** This specification includes the image tag and the fields consumed and produced by each component. - -3) **The component resources.** Defines the hardware that was used to run the component (GPU, nodepool). - -If there is a matching execution in the base path (checked based on the output manifests), -the outputs of that execution are used and the step computation is skipped. -This helps to reduce costs by skipping computations that were completed in a previous pipeline run. - - -Additionally, only the pipelines with the same pipeline name will share the cache. Caching for components -with the `latest` image tag is disabled by default. This is because using "latest" image tags can lead to unpredictable behavior due to -image updates. Moreover, if one component in the pipeline is not cached then caching will be disabled for all -subsequent components. - -You can turn off execution caching at component level by setting the following: - -```python -caption_images_op = ComponentOp( - component_dir="...", - arguments={ - ... - }, - cache=False, -) -``` - -## Setting Custom partitioning parameters - -When working with Fondant, each component deals with datasets. Fondant leverages [Dask](https://www.dask.org/) internally -to handle datasets larger than the available memory. To achieve this, the data is divided -into smaller chunks called "partitions" that can be processed in parallel. Ensuring a sufficient number of partitions -enables parallel processing, where multiple workers process different partitions simultaneously, -and smaller partitions ensure they fit into memory. 
- -Check this [link](https://docs.dask.org/en/latest/dataframe-design.html#:~:text=dd.from_delayed.-,Partitions%C2%B6,-Internally%2C%20a%20Dask) for more info on Dask partitions. -### How Fondant handles partitions - -Fondant repartitions the loaded dataframe if the number of partitions is fewer than the available workers on the data processing instance. -By repartitioning, the maximum number of workers can be efficiently utilized, leading to faster -and parallel processing. - - -### Customizing Partitioning - -By default, Fondant automatically handles the partitioning, but you can disable this and create your -own custom partitioning logic if you have specific requirements. - -Here's an example of disabling the automatic partitioning: - -```python - -caption_images_op = ComponentOp( - component_dir="components/captioning_component", - arguments={ - "model_id": "Salesforce/blip-image-captioning-base", - "batch_size": 2, - "max_new_tokens": 50, - }, - input_partition_rows='disable', -) -``` - -The code snippet above disables automatic partitions for both the loaded and written dataframes, -allowing you to define your own partitioning logic inside the components. - -Moreover, you have the flexibility to set your own custom partitioning parameters to override the default settings: - -```python - -caption_images_op = ComponentOp( - component_dir="components/captioning_component", - arguments={ - "model_id": "Salesforce/blip-image-captioning-base", - "batch_size": 2, - "max_new_tokens": 50, - }, - input_partition_rows=100, -) -``` - -In the example above, each partition of the loaded dataframe will contain approximately one hundred rows, -and the size of the output partitions will be around 10MB. This capability is useful in scenarios -where processing one row significantly increases the number of rows in the dataset -(resulting in dataset explosion) or causes a substantial increase in row size (e.g., fetching images from URLs). -By setting a lower value for input partition rows, you can mitigate issues where the processed data -grows larger than the available memory before being written to disk. \ No newline at end of file +Here, the pipeline ref can be either be a path to a compiled pipeline spec or a reference to fondant +pipeline (e.g. `pipeline.py`) in which case +the pipeline will first be compiled to the corresponding runner specification before running the +pipeline. diff --git a/docs/runners/kfp.md b/docs/runners/kfp.md new file mode 100644 index 000000000..19dd85e85 --- /dev/null +++ b/docs/runners/kfp.md @@ -0,0 +1,55 @@ +### Kubeflow Runner + +Leverages [Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) +on any Kubernetes cluster. +All Fondant needs is a url pointing to the Kubeflow pipeline host and an Object Storage provider ( +S3, GCS, etc) to store data produced in the pipeline between steps. +We have compiled some references and created some scripts +to [get you started](https://fondant.readthedocs.io/en/latest/infrastructure) with setting up the +required infrastructure. + +### Installing the Vertex runner + +Make sure to install Fondant with the Vertex runner extra. + +```bash +pip install fondant[kfp] +``` + +### Running a pipeline with Kubeflow + +You will need a Kubeflow cluster to run your pipeline on and specify the host of that cluster. More +info on setting up a Kubeflow pipelines deployment and the host path can be found in +the [kubeflow infrastructure documentation](kfp_infrastructure.md). 
+ +```bash +fondant run kubeflow \ + --host +``` + +Once your pipeline is running you can monitor it using the Kubeflow UI. + +#### Assigning custom resources to the pipeline + +Each component can optionally be constrained to run on particular node(s) using `node_pool_label` +and `node_pool_name`. You can find these under the Kubernetes labels of your cluster. +You can use the default node label provided by Kubernetes or attach your own. Note that the value of +these labels is cloud provider specific. Make sure to assign a GPU if required, the specified node +needs to +have an available GPU. + +```python +from fondant.pipeline.pipeline import ComponentOp, Resources + +component = ComponentOp( + component_dir="...", + arguments={ + ..., + }, + resources=Resources( + accelerator_number=1, + accelerator_name="GPU", + node_pool_label="node_pool", + node_pool_name="n2-standard-128-pool", + ) +``` \ No newline at end of file diff --git a/docs/kubeflow_infrastructure.md b/docs/runners/kfp_infrastructure.md similarity index 66% rename from docs/kubeflow_infrastructure.md rename to docs/runners/kfp_infrastructure.md index dc6dabade..e8467b350 100644 --- a/docs/kubeflow_infrastructure.md +++ b/docs/runners/kfp_infrastructure.md @@ -1,13 +1,18 @@ -# Setting up kubeflow +# Setting up kubeflow ## Introduction -In order to run Fondant on [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/), we'll need: + +In order to run Fondant +on [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/), we'll +need: - A kubernetes cluster - Kubeflow pipelines installed on the cluster - A registry to store custom component images like (docker hub, Github Container Registry, etc) -This can be on any kubernetes cluster, if you don't have access to a setup like this or you feel uncomfortable to setup your own we have provided some basic scripts to get you started on GCP or on a small scale locally. +This can be on any kubernetes cluster, if you don't have access to a setup like this or you feel +uncomfortable to setup your own we have provided some basic scripts to get you started on GCP or on +a small scale locally. !!! note "IMPORTANT" - These script serve just a kickstart to help you setup Kubeflow for running Fondant, these are not production ready environments. @@ -15,9 +20,10 @@ This can be on any kubernetes cluster, if you don't have access to a setup like - You should never run a script without inspecting it so please familiarize yourself with the commands defined in the Makefiles and adapt it to your own needs. 
## If you already have a kubernetes cluster

-If you already have setup a kubernetes cluster and you have configured kubectl you can install kubeflow pipelines following this [guide](https://www.kubeflow.org/docs/components/pipelines/v1/installation/standalone-deployment/#deploying-kubeflow-pipelines)
+If you already have a Kubernetes cluster set up and have configured kubectl, you can install
+Kubeflow Pipelines by following
+this [guide](https://www.kubeflow.org/docs/components/pipelines/v1/installation/standalone-deployment/#deploying-kubeflow-pipelines).

## Kubeflow on AWS

@@ -27,10 +33,12 @@ There are multiple guides on how to setup kubeflow pipelines on AWS:
- [Kubeflow Pipelines on AWS](https://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-install.html)
- [deployment guide by kubeflow](https://awslabs.github.io/kubeflow-manifests/docs/deployment/)

-Fondant needs the host url of kubeflow pipelines which you can [fetch](https://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-install.html#:~:text=.-,Access%20the%20KFP%20UI%20(Kubeflow%20Dashboard),-The%20Kubeflow%20Pipelines) (depending on your setup).
-
-The BASE_PATH can be an [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html)
+Fondant needs the host URL of Kubeflow Pipelines, which you can
+[fetch](https://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-install.html#:~:text=.-,Access%20the%20KFP%20UI%20(Kubeflow%20Dashboard),-The%20Kubeflow%20Pipelines)
+(depending on your setup).
+The BASE_PATH can be an [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).

## Kubeflow on Google Cloud

@@ -42,26 +50,35 @@ There are several ways to get up and running with kubeflow pipelines on Google C

### OR you can use the scripts we provide to get a simple setup going

-1. If you don't already have a google cloud project ready you can follow this [guide](https://v1-5-branch.kubeflow.org/docs/distributions/gke/deploy/project-setup/) to set one up, you will need to have set up billing.
+1. If you don't already have a Google Cloud project ready, you can follow
+   this [guide](https://v1-5-branch.kubeflow.org/docs/distributions/gke/deploy/project-setup/) to
+   set one up; you will need to have billing set up.

-2. Make sure you have the [gcloud cli](https://cloud.google.com/sdk/docs/install) installed (and it is the latest version) and that you have it configured to use your project by using `gcloud init`.
+2. Make sure you have the latest version of the [gcloud cli](https://cloud.google.com/sdk/docs/install)
+   installed and that you have configured it to use your project by running `gcloud init`.

-3. Setup [Default compute Region and Zone](https://cloud.google.com/compute/docs/gcloud-compute#default-region-zone)
+3. Set up a [Default compute Region and Zone](https://cloud.google.com/compute/docs/gcloud-compute#default-region-zone)

-3. Install [kubectl](https://kubernetes.io/docs/tasks/tools/)
+3. Install [kubectl](https://kubernetes.io/docs/tasks/tools/)

4. Run gcp.mk Makefile (located in the `scripts/` folder) which will do the following:
-- Setup all gcp services needed
+
+- Setup all gcp services needed
- Start a GKE cluster
- Create a google storage bucket for data artifact storage
-- Authenticate the local machine
+- Authenticate the local machine
- Install kubeflow pipelines on the cluster

To run the complete makefile use (note this might take some time to complete):
+
```
make -f gcp.mk
```
+
Or run specific steps:
+
```
make -f gcp.mk authenticate-gcp-cluster
```
@@ -69,12 +86,16 @@ make -f gcp.mk authenticate-gcp-cluster

### Getting the variables for your pipeline

Running the following command:
+
```
make -f gcp.mk kubeflow-ui
```
-Will print out the BASE_PATH and HOST which you can use to configure your pipeline. The HOST url will also allow you to use the kubeflow ui when opened in a browser.
+
+This will print out the BASE_PATH and HOST, which you can use to configure your pipeline. The HOST
+URL will also let you open the Kubeflow UI in a browser.

### In order to delete the setup:
+
```
make -f gcp.mk delete
```
@@ -84,8 +105,6 @@ make -f gcp.mk delete

- [Official documentation on cluster creation](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-zonal-cluster)
- [Provision a GKE cluster with terraform](https://developer.hashicorp.com/terraform/tutorials/kubernetes/gke)
- [Use kubespray to setup a cluster](https://github.com/kubernetes-sigs/kubespray)
-
-### More Information
- [Standalone deployments](https://www.kubeflow.org/docs/components/pipelines/v1/installation/standalone-deployment/)
- [Other local cluster installations](https://www.kubeflow.org/docs/components/pipelines/v1/installation/localcluster-deployment/)
- [Authenticating google cloud resources like storage and artifact registry](https://minikube.sigs.k8s.io/docs/handbook/addons/gcp-auth/)
diff --git a/docs/runners/local.md b/docs/runners/local.md
new file mode 100644
index 000000000..addc9a2d7
--- /dev/null
+++ b/docs/runners/local.md
@@ -0,0 +1,61 @@
+### Local Runner
+
+Leverages [docker compose](https://docs.docker.com/compose/). The local runner is mainly aimed at
+helping you develop fondant pipelines and components faster, since it allows you to develop on your
+local machine or a virtual machine. This enables you to quickly iterate on development. Once you
+have developed a pipeline, switching to either the [Vertex](vertex.md) or [Kubeflow](kfp.md) runner
+offers many advantages, such as the ability to assign specific hardware requirements, better
+monitoring, and pipeline reproducibility.
+
+In order to use the local runner, you need to have a recent version of [docker-compose](https://docs.docker.com/compose/install/) installed.
+
+### Installing the Local runner
+
+Make sure that you have installed Docker Compose on your system. You can find more information
+about this in the [installation](../guides/installation.md) guide.
+
+#### Running a Docker compiled pipeline
+
+```bash
+fondant run local <pipeline_ref>
+```
+
+Note that the pipeline ref is either the path to a compiled pipeline spec or a reference to a
+fondant pipeline, in which case the Docker compiler will first compile the pipeline to a docker
+compose specification before running it. This will start the pipeline and provide logs per
+component (service).
+
+Components that are not located in the registry (local custom components) will be built at runtime.
+This allows for quicker iteration during component development.
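+
+To make the two forms of pipeline ref concrete, here is a small sketch. The file names
+`pipeline.py` and `docker-compose.yml` are illustrative assumptions; use whatever your pipeline
+definition and compiled spec are actually called:
+
+```bash
+# Run directly from a Fondant pipeline definition; it is compiled to a
+# docker compose specification on the fly before the run starts.
+fondant run local pipeline.py
+
+# Or run a pipeline spec that was already compiled to docker compose.
+fondant run local docker-compose.yml
+```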
+
+The local runner will check whether the `base_path` of the pipeline is local or remote storage. If
+it is local, the `base_path` will be mounted as a bind volume on every service/component.
+
+If you want to use remote paths (GCS, S3, etc.) you can use the `--auth-gcp`, `--auth-aws` or
+`--auth-azure` flags.
+This will mount your default local cloud credentials into the pipeline. Make sure you are
+authenticated locally before running the pipeline and that you have the correct permissions to
+access the `base_path` of the pipeline (read/write/create).
+
+You can also use the `--extra_volumes` argument to mount extra credentials or additional files.
+These volumes will be mounted to every component/service of the docker-compose spec.
+
+```bash
+fondant run local <pipeline_ref> --auth-gcp
+```
+
+#### Assigning custom resources to the pipeline
+
+The local runner uses the computation resources (RAM, CPU) of the host machine. If a GPU is
+available and needed for a component, it needs to be assigned explicitly.
+
+```python
+from fondant.pipeline.pipeline import ComponentOp, Resources
+
+component = ComponentOp(
+    component_dir="...",
+    arguments={
+        ...,
+    },
+    resources=Resources(
+        accelerator_number=1,
+        accelerator_name="GPU",
+    )
+)
+```
\ No newline at end of file
diff --git a/docs/runners/sagmaker.md b/docs/runners/sagmaker.md
new file mode 100644
index 000000000..816280ddd
--- /dev/null
+++ b/docs/runners/sagmaker.md
@@ -0,0 +1 @@
+# 🚧 Coming Soon 🚧
\ No newline at end of file
diff --git a/docs/runners/vertex.md b/docs/runners/vertex.md
new file mode 100644
index 000000000..3319e0f66
--- /dev/null
+++ b/docs/runners/vertex.md
@@ -0,0 +1,60 @@
+### Vertex Runner
+
+Uses Google Cloud's [Vertex AI pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction)
+to help you orchestrate your Fondant pipelines in a serverless manner. This makes it easy to scale
+up your pipelines without worrying about infrastructure deployment.
+
+Vertex AI pipelines leverages Kubeflow pipelines under the hood. The Vertex compiler will take your
+pipeline and compile it to a Kubeflow pipeline spec.
+This spec can be used to run your pipeline on Vertex.
+
+### Installing the Vertex runner
+
+Make sure to install Fondant with the Vertex runner extra.
+
+```bash
+pip install fondant[vertex]
+```
+
+### Running a pipeline with Vertex
+
+You will first need to make sure that your Google Cloud environment is properly set up. More
+info [here](https://codelabs.developers.google.com/vertex-pipelines-intro#2).
+
+```bash
+fondant run vertex <pipeline_ref> \
+ --project-id <project_id> \
+ --project-region <project_region> \
+ --service-account <service_account>
+```
+
+Once your pipeline is running you can monitor it using the Vertex UI.
+
+#### Assigning custom resources to the pipeline
+
+The computation resources need to be assigned explicitly; Vertex will then attempt to allocate a
+machine that fits the requested resources. The GPU name also needs to be assigned explicitly. Check
+this [link](https://github.com/googleapis/python-aiplatform/blob/main/google/cloud/aiplatform_v1/types/accelerator_type.py)
+for a list of available GPU resources. Make sure to check that the chosen GPU is available in the
+region where the pipeline will be run.
+
+```python
+from fondant.pipeline.pipeline import ComponentOp, Resources
+
+component = ComponentOp(
+    component_dir="...",
+    arguments={
+        ...,
+    },
+    resources=Resources(
+        accelerator_number=1,
+        accelerator_name="NVIDIA_TESLA_K80",
+        memory_limit="512M",
+        cpu_limit="4",
+    )
+)
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index ba49ec693..31024553f 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -32,21 +32,33 @@ extra_css:
  - stylesheets/extra.css
nav:
  - Home: index.md
-  - Getting Started: getting_started.md
-  - Contributing: contributing.md
-  - Guides:
-      - Build a simple pipeline: guides/build_a_simple_pipeline.md
-      - Implement custom components: guides/implement_custom_components.md
-  - Building a pipeline: pipeline.md
+  - Using the documentation: documentation_guide.md
+  - Getting Started:
+      - Installation: guides/installation.md
+      - Running your first pipeline: guides/first_pipeline.md
+      - Building your own pipeline: guides/build_a_simple_pipeline.md
+      - Implement custom components: guides/implement_custom_components.md
+  - Pipeline: pipeline.md
  - Components:
-      - Components: components/components.md
-      - Creating custom components: components/custom_component.md
-      - Read / write components: components/generic_component.md
-      - Component spec: components/component_spec.md
-      - Hub: components/hub.md
-  - Data explorer: data_explorer.md
-  - Infrastructure: infrastructure.md
-  - Manifest: manifest.md
+      - Components: components/components.md
+      - Using generic components: components/generic_component.md
+      - Creating custom components: components/custom_component.md
+      - Component spec: components/component_spec.md
+      - Publishing components: components/publishing_components.md
+      - Components hub: components/hub.md
+  - Runners:
+      - Local: runners/local.md
+      - Vertex: runners/vertex.md
+      - Kubeflow: runners/kfp.md
+  - Explorer: data_explorer.md
+  - Advanced:
+      - Architecture: architecture.md
+      - Manifest: manifest.md
+      - Caching: caching.md
+      - Handling partitions: partitions.md
+      - Setting up Kubeflow: runners/kfp_infrastructure.md
+  - FAQ: faq.md
+  - Contributing: contributing.md
  - Announcements:
    - announcements/CC_25M_community.md