Update readme #459

Merged
merged 1 commit into from
Sep 25, 2023
72 changes: 43 additions & 29 deletions README.md
@@ -2,7 +2,7 @@
<img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
</p>
<p align="center">
<i>Sweet data-centric foundation model fine-tuning</i>
<i>Large-scale data processing made easy and reusable</i>
<br>
<a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
<br>
@@ -15,36 +15,37 @@
</p>

---
**Fondant helps you create high quality datasets to train or fine-tune foundation models such as:**

- 🎨 Stable Diffusion
- 📄 GPT-like Large Language Models (LLMs)
- 🔎 CLIP
- ✂️ Segment Anything (SAM)
- ➕ And many more
🍫**Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
containerized components reusable across pipelines and execution environments and shareable within the community.**
It offers:
- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
    - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
    - Large language model fine-tuning (LLaMA, Falcon)
    - Code generation model fine-tuning (StarCoder)
- 🧱 Library of off-the-shelf reusable components for
    - Extracting data from public sources such as Common Crawl, LAION, ...
    - Filtering on
        - Content, e.g. language, visual style, topic, format, aesthetics, etc.
        - Context, e.g. copyright license, origin
        - Metadata
    - Removal of unwanted data such as toxic, NSFW or generated content
    - Removal of unwanted data patterns such as societal bias
    - Transforming data (resizing, cropping, reformatting, …)
    - Tuning the data for model performance (normalization, deduplication, …)
    - Enriching data (captioning, metadata generation, synthetics, …)
    - Transparency, auditability, compliance
- 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
- 🐍 Standardized, Python/Pandas-based way of creating custom components
- 🏭 Production-ready, scalable deployment
- ☁️ Multi-cloud integrations

## 🪤 Why Fondant?

Foundation models simplify inference by solving multiple tasks across modalities with a simple
prompt-based interface. But what they've gained in the front, they've lost in the back.
**These models require enormous amounts of data, moving complexity towards data preparation**, and
leaving few parties able to train their own models.

We believe that **innovation is a group effort**, requiring collaboration. While the community has
been building and sharing models, everyone is still building their data preparation from scratch.
**Fondant is the platform where we meet to build and share data preparation workflows.**

Fondant offers a framework to build **composable data preparation pipelines, with reusable
components, optimized to handle massive datasets**. Stop building from scratch, and start
reusing components to:

- Extend your data with public datasets
- Generate new modalities using captioning, segmentation, translation, image generation, ...
- Distill knowledge from existing foundation models
- Filter out low quality data
- Deduplicate data

And create high quality datasets to fine-tune your own foundation models.
In the age of Foundation Models, control over your data is key, and building pipelines
for large-scale data processing is costly, especially when they require advanced
machine learning-based operations. This need not be the case, however, if processing
components were reusable and exchangeable and pipelines were easily composable.
Realizing this is the main vision behind Fondant.

<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>

@@ -56,6 +57,13 @@ Anxious to get started? Here's is a [step by step guide](https://fondant.readthe

Curious to see what Fondant can do? Have a look at our example pipelines:

### Filtering creative commons image dataset

We have published an image dataset containing 25 million images.
To go with it, we provide a [sample pipeline](examples/pipelines/filter-cc-25m) that
demonstrates how to download and filter these images. The pipeline folder
contains detailed instructions on how to execute the pipeline and explore the images.

### Fine-tuning ControlNet

Our
@@ -94,6 +102,12 @@ point to create datasets for training code assistants.

<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>

### Filtering creative commons image dataset
Contributor review comment: This seems duplicated


We have published an image dataset containing 25 million images.
To go with it, we provide a [sample pipeline](examples/pipelines/filter-cc-25m) that
demonstrates how to download and filter these images. The pipeline folder
contains detailed instructions on how to execute the pipeline and explore the images.

## 🧩 Reusable components

@@ -326,4 +340,4 @@ poetry install
pre-commit install
```

<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>