From c2f60fa8f933df3c5ab9f2ca170dfcfb91ed3247 Mon Sep 17 00:00:00 2001
From: Matthias Richter
Date: Mon, 25 Sep 2023 14:10:16 +0200
Subject: [PATCH] Update readme
---
README.md | 72 +++++++++++++++++++++++++++++++++----------------------
1 file changed, 43 insertions(+), 29 deletions(-)
diff --git a/README.md b/README.md
index abd97ac82..bca1bf4ac 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
- Sweet data-centric foundation model fine-tuning
+ Large-scale data processing made easy and reusable
Explore the docs »
@@ -15,36 +15,37 @@
---
-**Fondant helps you create high quality datasets to train or fine-tune foundation models such as:**
-
-- 🎨 Stable Diffusion
-- GPT-like Large Language Models (LLMs)
-- CLIP
-- Segment Anything (SAM)
-- And many more
+🍫**Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
+containerized components reusable across pipelines and execution environments and shareable within the community.**
+It offers:
+- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
+ - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
+ - Large language model fine-tuning (LLaMA, Falcon)
+ - Code generation model fine-tuning (StarCoder)
+- 🧱 Library of off-the-shelf reusable components for
+ - Extracting data from public sources such as Common Crawl, LAION, ...
+ - Filtering on
+ - Content, e.g. language, visual style, topic, format, aesthetics
+ - Context, e.g. copyright license, origin
+ - Metadata
+ - Removal of unwanted data such as toxic, NSFW or generated content
+ - Removal of unwanted data patterns such as societal bias
+ - Transforming data (resizing, cropping, reformatting, …)
+ - Tuning the data for model performance (normalization, deduplication, …)
+ - Enriching data (captioning, metadata generation, synthetics, …)
+ - Transparency, auditability, compliance
+- 🖼️ ♾️ Out-of-the-box multimodal capabilities: text, images, video, etc.
+- Standardized, Python/Pandas-based way of creating custom components
+- Production-ready, scalable deployment
+- ☁️ Multi-cloud integrations
## 🪤 Why Fondant?
-Foundation models simplify inference by solving multiple tasks across modalities with a simple
-prompt-based interface. But what they've gained in the front, they've lost in the back.
-**These models require enormous amounts of data, moving complexity towards data preparation**, and
-leaving few parties able to train their own models.
-
-We believe that **innovation is a group effort**, requiring collaboration. While the community has
-been building and sharing models, everyone is still building their data preparation from scratch.
-**Fondant is the platform where we meet to build and share data preparation workflows.**
-
-Fondant offers a framework to build **composable data preparation pipelines, with reusable
-components, optimized to handle massive datasets**. Stop building from scratch, and start
-reusing components to:
-
-- Extend your data with public datasets
-- Generate new modalities using captioning, segmentation, translation, image generation, ...
-- Distill knowledge from existing foundation models
-- Filter out low quality data
-- Deduplicate data
-
-And create high quality datasets to fine-tune your own foundation models.
+In the age of foundation models, control over your data is key, but building pipelines
+for large-scale data processing is costly, especially when they require advanced
+machine learning-based operations. This need not be the case, however, if processing
+components are reusable and exchangeable and pipelines are easily composable.
+Realizing this is the main vision behind Fondant.
(back to top)
@@ -56,6 +57,13 @@ Anxious to get started? Here's a [step by step guide](https://fondant.readthe
Curious to see what Fondant can do? Have a look at our example pipelines:
+### Filtering a Creative Commons image dataset
+
+We have published an image dataset containing 25 million images.
+To accompany it, we provide a [sample pipeline](examples/pipelines/filter-cc-25m) that
+demonstrates how to download and filter these images. In the pipeline folder,
+you will find detailed instructions on how to execute the pipeline and explore the images.
+
### Fine-tuning ControlNet
Our
@@ -94,6 +102,12 @@ point to create datasets for training code assistants.
(back to top)
+### Filtering a Creative Commons image dataset
+
+We have published an image dataset containing 25 million images.
+To accompany it, we provide a [sample pipeline](examples/pipelines/filter-cc-25m) that
+demonstrates how to download and filter these images. In the pipeline folder,
+you will find detailed instructions on how to execute the pipeline and explore the images.
## 🧩 Reusable components
@@ -326,4 +340,4 @@ poetry install
pre-commit install
```
-(back to top)
+(back to top)
\ No newline at end of file