Large-scale data processing made easy and reusable
Explore the docs »
🍫 Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making containerized components reusable across pipelines and execution environments, and shareable within the community. It offers:
- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
  - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
  - Large language model fine-tuning (LLaMA, Falcon)
  - Code generation model fine-tuning (StarCoder)
- 🧱 Library of off-the-shelf reusable components for
  - Extracting data from public sources such as Common Crawl, LAION, ...
  - Filtering on
    - Content, e.g. language, visual style, topic, format, aesthetics, etc.
    - Context, e.g. copyright license, origin
    - Metadata
  - Removal of unwanted data such as toxic, NSFW or generated content
  - Removal of unwanted data patterns such as societal bias
  - Transforming data (resizing, cropping, reformatting, …)
  - Tuning the data for model performance (normalization, deduplication, …)
  - Enriching data (captioning, metadata generation, synthetics, …)
  - Transparency, auditability, compliance
- 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
- 🐍 Standardized, Python/Pandas-based way of creating custom components
- 🏭 Production-ready, scalable deployment
- ☁️ Multi-cloud integrations
In the age of Foundation Models, control over your data is key, and building pipelines for large-scale data processing is costly, especially when they require advanced machine learning-based operations. This need not be the case, however, if processing components are reusable and exchangeable and pipelines are easily composable. Realizing this is the main vision behind Fondant.
Eager to get started? Here is a step-by-step guide to get your first pipeline up and running.
Curious to see what Fondant can do? Have a look at our example pipelines:
We have published an image dataset containing 25 million images. Alongside it, we provide a sample pipeline that demonstrates downloading and filtering these images. In the pipeline folder, you will find detailed instructions on how to execute the pipeline and explore the images.
Our example pipeline to generate data for ControlNet fine-tuning allows you to create models that you can control using inpainting, segmentation, and regeneration. All you need to get started is a set of prompts describing the type of images to generate.
For instance, our ControlNet model fine-tuned on interior design images allows you to generate the room of your dreams:
| Input image | Output image |
| --- | --- |
Want to try out the resulting model yourself? Head over to our Hugging Face space!
Using our example pipeline to fine-tune Stable Diffusion allows you to create models that generate better images within a specific domain. All you need to get started is a small seed dataset of example images.
E.g. generating logos:
| Stable Diffusion 1.5 | Fine-tuned Stable Diffusion 1.5 |
| --- | --- |
Using our example pipeline to train StarCoder provides a starting point to create datasets for training code assistants.
Fondant comes with a library of reusable components, which can jumpstart your pipeline.
| COMPONENT | DESCRIPTION |
| --- | --- |
| **Data loading / writing** | |
| load_from_hf_hub | Load a dataset from the Hugging Face Hub |
| write_to_hf_hub | Write a dataset to the Hugging Face Hub |
| prompt_based_laion_retrieval | Retrieve image-text pairs from LAION using prompt similarity |
| embedding_based_laion_retrieval | Retrieve image-text pairs from LAION using embedding similarity |
| download_images | Download images from URLs |
| **Image processing** | |
| embed_images | Create embeddings for images using a model from the HF Hub |
| image_resolution_extraction | Extract the resolution from images |
| filter_image_resolution | Filter images based on their resolution |
| caption_images | Generate captions for images using a model from the HF Hub |
| segment_images | Generate segmentation maps for images using a model from the HF Hub |
| image_cropping | Intelligently crop out image borders |
| **Code processing** | |
| pii_redaction | Redact Personally Identifiable Information (PII) |
| filter_comments | Filter code based on code-to-comment ratio |
| filter_line_length | Filter code based on line length |
| **Language processing** | Coming soon |
| **Clustering** | Coming soon |
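As a sketch of how such a component drops into a pipeline, you can pull it from the registry by name with `ComponentOp.from_registry` (shown in full in the pipeline example further below). Note that the `arguments` used here are hypothetical placeholders, not the component's verified signature:

```python
from fondant.pipeline import ComponentOp

# Pull a reusable component from the registry by name.
# NOTE: the argument below is an illustrative placeholder; check the
# component's own documentation for its real parameters.
download_images_op = ComponentOp.from_registry(
    name="download_images",
    arguments={"retries": 2},  # hypothetical argument
)
```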
Fondant can be installed using pip:
```bash
pip install fondant
```
For the latest development version, you might want to install from source instead:
```bash
pip install git+https://github.com/ml6team/fondant.git
```
There are two ways of using Fondant:

- Leveraging Kubeflow Pipelines on any Kubernetes cluster. All Fondant needs is a URL pointing to the Kubeflow pipeline host and an object storage provider (S3, GCS, etc.) to store data produced in the pipeline between steps. We have compiled some references and created some scripts to get you started with setting up the required infrastructure.
- Locally, using Docker Compose. This mode is mainly aimed at helping you develop Fondant pipelines and components faster by making it easier to run things on a smaller scale.

The same pipeline can be used in both variants, allowing you to quickly develop and iterate using the local Docker Compose implementation and then leverage Kubeflow Pipelines to run the pipeline at scale.
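For example, a typical iteration loop uses the CLI (covered later in this README) against the local runner first:

```bash
# Develop against a small data sample with the local Docker Compose runner
fondant run pipeline.py --local
```

Once the pipeline behaves as expected, the same `pipeline.py` can be submitted unchanged to a Kubeflow Pipelines host, e.g. via the Python `Client` shown in the next section.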
Fondant allows you to easily define data pipelines comprised of both reusable and custom components. The following pipeline, for instance, uses the reusable `load_from_hf_hub` component to load a dataset from the Hugging Face Hub, and processes it using a custom component:
```python
from fondant.pipeline import ComponentOp, Pipeline, Client


def build_pipeline():
    pipeline = Pipeline(pipeline_name="example pipeline", base_path="fs://bucket")

    load_from_hub_op = ComponentOp.from_registry(
        name="load_from_hf_hub",
        arguments={"dataset_name": "lambdalabs/pokemon-blip-captions"},
    )
    pipeline.add_op(load_from_hub_op)

    custom_op = ComponentOp(
        component_dir="components/custom_component",
        arguments={
            "min_width": 600,
            "min_height": 600,
        },
    )
    pipeline.add_op(custom_op, dependencies=load_from_hub_op)

    return pipeline


if __name__ == "__main__":
    client = Client(host="https://kfp-host.com/")
    pipeline = build_pipeline()
    client.compile_and_run(pipeline=pipeline)
```
To create a custom component, you first need to describe its contract as a YAML specification. It defines the data consumed and produced by the component and any arguments it takes.
```yaml
name: Custom component
description: This is a custom component
image: custom_component:latest

consumes:
  images:
    fields:
      data:
        type: binary

produces:
  captions:
    fields:
      data:
        type: utf8

args:
  argument1:
    description: An argument passed to the component at runtime
    type: str
  argument2:
    description: Another argument passed to the component at runtime
    type: str
```
Once you have your component specification, all you need to do is implement a constructor and a single `.transform` method, and Fondant will do the rest. You will get the data defined in your specification partition by partition as a Pandas dataframe.
```python
import pandas as pd
from fondant.component import PandasTransformComponent
from fondant.executor import PandasTransformExecutor


class ExampleComponent(PandasTransformComponent):

    def __init__(self, *args, argument1, argument2) -> None:
        """
        Args:
            argumentX: An argument passed to the component
        """
        # Initialize your component here based on the arguments

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Implement your custom logic in this single method

        Args:
            dataframe: A Pandas dataframe containing the data

        Returns:
            A pandas dataframe containing the transformed data
        """
```
For more advanced use cases, you can use the `DaskTransformComponent` instead.
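A minimal sketch of the Dask variant, assuming it mirrors the Pandas interface but receives the whole dataset as a lazy `dask.dataframe.DataFrame` instead of one partition at a time:

```python
import dask.dataframe as dd
from fondant.component import DaskTransformComponent


class ExampleDaskComponent(DaskTransformComponent):
    def transform(self, dataframe: dd.DataFrame) -> dd.DataFrame:
        # Operate lazily on the full dataset; Dask handles partitioning.
        return dataframe
```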
Once you have a pipeline, you can easily run (and compile) it using the built-in CLI:

```bash
fondant run pipeline.py --local
```

To see all available arguments, check the fondant CLI help pages:

```bash
fondant --help
```

Or for a subcommand:

```bash
fondant <subcommand> --help
```
Fondant is currently in the alpha stage, offering a minimal viable interface. While you should expect to run into rough edges, the foundations are ready and Fondant should already be able to speed up your data preparation work.
The following topics are on our roadmap:
- Local pipeline execution
- Non-linear pipeline DAGs
- LLM-focused example pipelines and reusable components
- Static validation, caching, and partial execution of pipelines
- Data lineage and experiment tracking
- Distributed execution, both on and off cluster
- Support for other dataframe libraries such as HF Datasets, Polars, Spark
- Move reusable components into a decentralized component registry
- Create datasets of copyright-free data for fine-tuning
- Create reusable components for bias detection and mitigation
The roadmap and priorities are defined based on community feedback. To provide input, you can join our Discord or submit an idea in our GitHub Discussions.
For a detailed view of the roadmap and day-to-day development, you can check our GitHub project board.
We welcome contributions of different kinds:
| | |
| --- | --- |
| **Issues** | If you encounter any issue or bug, please submit it as a GitHub issue. You can also submit a pull request directly to fix any clear bugs. |
| **Suggestions and feedback** | If you have any suggestions or feedback, please reach out via our Discord server or GitHub Discussions! |
| **Framework code contributions** | If you want to help with the development of the Fondant framework, have a look at the issues marked with the good first issue label. If you want to add additional functionality, please submit an issue for it first. |
| **Reusable components** | Extending our library of reusable components is a great way to contribute. If you built a component which would be useful for other users, please submit a PR adding it to the components/ directory. You can find a list of possible components here, but your own ideas are also welcome! |
| **Example pipelines** | If you built a pipeline with Fondant which can serve as an example to other users, please submit a PR adding it to the examples/ directory. |
We use poetry and pre-commit to enable a smooth developer flow. Run the following commands to set up your development environment:
```bash
pip install poetry
poetry install
pre-commit install
```