ml6team · GeorgesLorre · Sep 25, 2023 · Sep 25, 2023 · PhilippeMoussalli · Sep 25, 2023
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
     <img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
 </p>
 <p align="center">
-    <i>Sweet data-centric foundation model fine-tuning</i>
+    <i>Large-scale data processing made easy and reusable</i>
     <br>
     <a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
     <br>
@@ -15,36 +15,37 @@
 </p>
 
 ---
-**Fondant helps you create high quality datasets to train or fine-tune foundation models such as:**
-
-- 🎨 Stable Diffusion  
-- 📄 GPT-like Large Language Models (LLMs)  
-- 🔎 CLIP  
-- ✂️ Segment Anything (SAM)  
-- ➕ And many more
+🍫 **Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
+containerized components reusable across pipelines and execution environments, and shareable within the community.**
+It offers:
+- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
+    - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
+    - Large language model fine-tuning (LLaMA, Falcon)
+    - Code generation model fine-tuning (StarCoder)
+- 🧱 Library of off-the-shelf reusable components for
+    - Extracting data from public sources such as Common Crawl, LAION, ...
+    - Filtering on 
+        - Content, e.g. language, visual style, topic, format, aesthetics, etc.
+        - Context, e.g. copyright license, origin
+        - Metadata
+    - Removal of unwanted data such as toxic, NSFW or generated content
+    - Removal of unwanted data patterns such as societal bias
+    - Transforming data (resizing, cropping, reformatting, …)
+    - Tuning the data for model performance (normalization, deduplication, …)
+    - Enriching data (captioning, metadata generation, synthetics, …)
+    - Transparency, auditability, compliance
+- 📖 🖼️ 🎞️ ♾️ Out-of-the-box multimodal capabilities: text, images, video, etc.
+- 🐍 Standardized, Python/Pandas-based way of creating custom components
+- 🏭 Production-ready, scalable deployment
+- ☁️ Multi-cloud integrations
 
 ## 🪤 Why Fondant?
 
-Foundation models simplify inference by solving multiple tasks across modalities with a simple
-prompt-based interface. But what they've gained in the front, they've lost in the back. 
-**These models require enormous amounts of data, moving complexity towards data preparation**, and 
-leaving few parties able to train their own models.
-
-We believe that **innovation is a group effort**, requiring collaboration. While the community has 
-been building and sharing models, everyone is still building their data preparation from scratch.
-**Fondant is the platform where we meet to build and share data preparation workflows.**
-
-Fondant offers a framework to build **composable data preparation pipelines, with reusable 
-components, optimized to handle massive datasets**. Stop building from scratch, and start 
-reusing components to:
-
-- Extend your data with public datasets
-- Generate new modalities using captioning, segmentation, translation, image generation, ...
-- Distill knowledge from existing foundation models
-- Filter out low quality data
-- Deduplicate data
-
-And create high quality datasets to fine-tune your own foundation models.
+In the age of foundation models, control over your data is key, yet building pipelines
+for large-scale data processing is costly, especially when they require advanced
+machine learning-based operations. This need not be the case if processing
+components were reusable and exchangeable and pipelines were easily composable.
+Realizing this is the main vision behind Fondant.
 
 <p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
 
@@ -56,6 +57,13 @@ Anxious to get started? Here's is a [step by step guide](https://fondant.readthe
 
 Curious to see what Fondant can do? Have a look at our example pipelines:
 
+### Filtering a Creative Commons image dataset
+
+We have published an image dataset containing 25 million images.
+To accompany it, we provide a [sample pipeline](examples/pipelines/filter-cc-25m) that
+demonstrates how to download and filter these images. In the pipeline folder,
+you will find detailed instructions on how to execute the pipeline and explore the images.
+
 ### Fine-tuning ControlNet
 
 Our 
@@ -94,6 +102,12 @@ point to create datasets for training code assistants.
 
 <p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
 
+### Filtering a Creative Commons image dataset
+
+We have published an image dataset containing 25 million images.
+To accompany it, we provide a [sample pipeline](examples/pipelines/filter-cc-25m) that
+demonstrates how to download and filter these images. In the pipeline folder,
+you will find detailed instructions on how to execute the pipeline and explore the images.
 
 ## 🧩 Reusable components
 
@@ -326,4 +340,4 @@ poetry install
 pre-commit install
 ```
 
-<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
+<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>