From c2f60fa8f933df3c5ab9f2ca170dfcfb91ed3247 Mon Sep 17 00:00:00 2001
From: Matthias Richter
Date: Mon, 25 Sep 2023 14:10:16 +0200
Subject: [PATCH] Update readme

---
 README.md | 72 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/README.md b/README.md
index abd97ac82..bca1bf4ac 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

- Sweet data-centric foundation model fine-tuning + Large-scale data processing made easy and reusable
Explore the docs »
@@ -15,36 +15,37 @@

--- -**Fondant helps you create high quality datasets to train or fine-tune foundation models such as:** - -- 🎨 Stable Diffusion -- 📄 GPT-like Large Language Models (LLMs) -- 🔎 CLIP -- ✂️ Segment Anything (SAM) -- ➕ And many more +🍫**Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making +containerized components reusable across pipelines and execution environments and shareable within the community.** +It offers: +- 🔧 Plug ‘n’ play composable pipelines for creating datasets for + - AI image generation model fine-tuning (Stable Diffusion, ControlNet) + - Large language model fine-tuning (LLaMA, Falcon) + - Code generation model fine-tuning (StarCoder) +- 🧱 Library of off-the-shelf reusable components for + - Extracting data from public sources such as Common Crawl, LAION, ... + - Filtering on + - Content, e.g. language, visual style, topic, format, aesthetics, etc. + - Context, e.g. copyright license, origin + - Metadata + - Removal of unwanted data such as toxic, NSFW or generated content + - Removal of unwanted data patterns such as societal bias + - Transforming data (resizing, cropping, reformatting, …) + - Tuning the data for model performance (normalization, deduplication, …) + - Enriching data (captioning, metadata generation, synthetics, …) + - Transparency, auditability, compliance +- 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc. +- 🐍 Standardized, Python/Pandas-based way of creating custom components +- 🏭 Production-ready, scalable deployment +- ☁️ Multi-cloud integrations ## 🪀 Why Fondant? -Foundation models simplify inference by solving multiple tasks across modalities with a simple -prompt-based interface. But what they've gained in the front, they've lost in the back. -**These models require enormous amounts of data, moving complexity towards data preparation**, and -leaving few parties able to train their own models. 
- -We believe that **innovation is a group effort**, requiring collaboration. While the community has -been building and sharing models, everyone is still building their data preparation from scratch. -**Fondant is the platform where we meet to build and share data preparation workflows.** - -Fondant offers a framework to build **composable data preparation pipelines, with reusable -components, optimized to handle massive datasets**. Stop building from scratch, and start -reusing components to: - -- Extend your data with public datasets -- Generate new modalities using captioning, segmentation, translation, image generation, ... -- Distill knowledge from existing foundation models -- Filter out low quality data -- Deduplicate data - -And create high quality datasets to fine-tune your own foundation models. +In the age of Foundation Models, control over your data is key and building pipelines +for large-scale data processing is costly, especially when they require advanced +machine learning-based operations. This need not be the case, however, if processing +components would be reusable and exchangeable and pipelines were easily composable. +Realizing this is the main vision behind Fondant.

(back to top)

@@ -56,6 +57,13 @@ Anxious to get started? Here's is a [step by step guide](https://fondant.readthe Curious to see what Fondant can do? Have a look at our example pipelines: +### Filtering creative commons image dataset + +We have published an image dataset containing 25 million images. +As a result, we have provided a [sample pipeline](examples/pipelines/filter-cc-25m) that +demonstrates the download and filtering of these images. In the pipeline folder, +you will find detailed instructions on how to execute the pipeline and explore the images. + ### Fine-tuning ControlNet Our @@ -94,6 +102,12 @@ point to create datasets for training code assistants.

(back to top)

+### Filtering creative commons image dataset + +We have published an image dataset containing 25 million images. +As a result, we have provided a [sample pipeline](examples/pipelines/filter-cc-25m) that +demonstrates the download and filtering of these images. In the pipeline folder, +you will find detailed instructions on how to execute the pipeline and explore the images. ## 🧩 Reusable components @@ -326,4 +340,4 @@ poetry install pre-commit install ``` -

(back to top)

+

(back to top)

\ No newline at end of file