Update readme (#459)
mrchtr authored Sep 25, 2023
1 parent 0522483 commit dd50f02
Showing 1 changed file with 43 additions and 29 deletions.
72 changes: 43 additions & 29 deletions README.md
@@ -2,7 +2,7 @@
<img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
</p>
<p align="center">
-<i>Sweet data-centric foundation model fine-tuning</i>
+<i>Large-scale data processing made easy and reusable</i>
<br>
<a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
<br>
@@ -15,36 +15,37 @@
</p>

---
-**Fondant helps you create high quality datasets to train or fine-tune foundation models such as:**
-
-- 🎨 Stable Diffusion
-- 📄 GPT-like Large Language Models (LLMs)
-- 🔎 CLIP
-- ✂️ Segment Anything (SAM)
-- ➕ And many more
+🍫**Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
+containerized components reusable across pipelines and execution environments and shareable within the community.**
+It offers:
+- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
+    - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
+    - Large language model fine-tuning (LLaMA, Falcon)
+    - Code generation model fine-tuning (StarCoder)
+- 🧱 Library of off-the-shelf reusable components for
+    - Extracting data from public sources such as Common Crawl, LAION, ...
+    - Filtering on
+        - Content, e.g. language, visual style, topic, format, aesthetics, etc.
+        - Context, e.g. copyright license, origin
+        - Metadata
+    - Removal of unwanted data such as toxic, NSFW or generated content
+    - Removal of unwanted data patterns such as societal bias
+    - Transforming data (resizing, cropping, reformatting, …)
+    - Tuning the data for model performance (normalization, deduplication, …)
+    - Enriching data (captioning, metadata generation, synthetics, …)
+    - Transparency, auditability, compliance
+- 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
+- 🐍 Standardized, Python/Pandas-based way of creating custom components
+- 🏭 Production-ready, scalable deployment
+- ☁️ Multi-cloud integrations

## 🪤 Why Fondant?

-Foundation models simplify inference by solving multiple tasks across modalities with a simple
-prompt-based interface. But what they've gained in the front, they've lost in the back.
-**These models require enormous amounts of data, moving complexity towards data preparation**, and
-leaving few parties able to train their own models.
-
-We believe that **innovation is a group effort**, requiring collaboration. While the community has
-been building and sharing models, everyone is still building their data preparation from scratch.
-**Fondant is the platform where we meet to build and share data preparation workflows.**
-
-Fondant offers a framework to build **composable data preparation pipelines, with reusable
-components, optimized to handle massive datasets**. Stop building from scratch, and start
-reusing components to:
-
-- Extend your data with public datasets
-- Generate new modalities using captioning, segmentation, translation, image generation, ...
-- Distill knowledge from existing foundation models
-- Filter out low quality data
-- Deduplicate data
-
-And create high quality datasets to fine-tune your own foundation models.
+In the age of Foundation Models, control over your data is key and building pipelines
+for large-scale data processing is costly, especially when they require advanced
+machine learning-based operations. This need not be the case, however, if processing
+components would be reusable and exchangeable and pipelines were easily composable.
+Realizing this is the main vision behind Fondant.

<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>

@@ -56,6 +57,13 @@ Anxious to get started? Here's is a [step by step guide](https://fondant.readthe

Curious to see what Fondant can do? Have a look at our example pipelines:

+### Filtering creative commons image dataset
+
+We have published an image dataset containing 25 million images.
+As a result, we have provided a [sample pipeline](examples/pipelines/filter-cc-25m) that
+demonstrates the download and filtering of these images. In the pipeline folder,
+you will find detailed instructions on how to execute the pipeline and explore the images.
+
### Fine-tuning ControlNet

Our
@@ -94,6 +102,12 @@ point to create datasets for training code assistants.

<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>

+### Filtering creative commons image dataset
+
+We have published an image dataset containing 25 million images.
+As a result, we have provided a [sample pipeline](examples/pipelines/filter-cc-25m) that
+demonstrates the download and filtering of these images. In the pipeline folder,
+you will find detailed instructions on how to execute the pipeline and explore the images.

## 🧩 Reusable components

@@ -326,4 +340,4 @@ poetry install
pre-commit install
```

-<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
+<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
