Update readme with dataset focus #928

Merged
merged 3 commits into from
Apr 16, 2024
191 changes: 85 additions & 106 deletions README.md
@@ -20,138 +20,62 @@
<a href="https://github.com/ml6team/fondant/actions/workflows/pipeline.yaml"><img alt="GitHub Workflow Status" src="https://img.shields.io/github/actions/workflow/status/ml6team/fondant/pipeline.yaml?style=flat-square"></a>
<a href="https://coveralls.io/github/ml6team/fondant?branch=main"><img alt="Coveralls" src="https://img.shields.io/coverallsCoverage/github/ml6team/fondant?style=flat-square"></a>
</p>

---

<table>
<thead>
<tr>
<th width="33%">🚀 Production-ready</th>
<th width="33%">👶 Easy</th>
<th width="33%">👫 Shareable</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Benefit from built-in features such as autoscaling, data lineage, and pipeline caching, and deploy to (managed) platforms such as <i>Vertex AI</i>, <i>Sagemaker</i>, and <i>Kubeflow Pipelines</i>.
</td>
<td>
Implement your custom data processing code using data structures you know, such as <i>Pandas</i> dataframes.
Move from local development to remote deployment without any code changes.
</td>
<td>
Fondant components are defined by a clear interface, which makes them reusable and shareable.<br>
Compose your own pipeline using components available on <a href="https://fondant.ai/en/latest/components/hub/"><b>our hub</b></a>.
</td>
</tr>
</tbody>
</table>
<br>

## 🪤 Why Fondant?

With the advent of transfer learning and now foundation models, everyone has started sharing and
reusing machine learning models. Most of the work now goes into building data processing
pipelines, which everyone still does from scratch.
This doesn't need to be the case, though, if processing components were shareable and pipelines
composable. Realizing this is the main vision behind Fondant.
Fondant is a data framework that enables collaborative dataset building. It is designed for developing and crafting datasets together, sharing reusable operations and complete data processing trees.

Towards that end, Fondant offers:
Fondant enables you to initialize datasets, apply various operations on them, and load datasets from other users. It assists in executing operations on managed services, sharing operations with others, and keeping track of your dataset versions. Fondant makes this all possible without moving the source data.

- 🔧 Plug ‘n’ play composable data processing pipelines
- 🧩 Library containing off-the-shelf reusable components
- 🐼 A simple Pandas-based interface for creating custom components
- 📊 Built-in lineage, caching, and data explorer
- 🚀 Production-ready, scalable deployment
- ☁️ Integration with runners across different clouds (Vertex, Sagemaker, Kubeflow)

<p align="right">(<a href="#top">back to top</a>)</p>

## 💨 Getting Started

Eager to get started? Follow our [**step by step guide**](https://fondant.ai/en/latest/guides/first_dataset/) to get your first pipeline up and running.

<p align="right">(<a href="#top">back to top</a>)</p>

## 🧩 Reusable components

Fondant comes with a library of reusable components that you can leverage to compose your own
pipeline:

- Data ingestion: _S3, GCS, ABS, Hugging Face, local file system, ..._
- Data Filtering: _Duplicates, language, visual style, topic, format, aesthetics, NSFW, license,
..._
- Data Enrichment: _Captions, segmentations, embeddings, ..._
- Data Transformation: _Image cropping, image resizing, text chunking, ..._
- Data retrieval: _Common Crawl, LAION, ..._
Fondant allows you to easily define workflows composed of both reusable and custom components. The following example uses the reusable `load_from_hf_hub` component to load a dataset from the Hugging Face Hub and processes it using a custom component that resizes the images, resulting in a new dataset.

👉 **Check our [Component Hub](https://fondant.ai/en/latest/components/hub/) for an overview of all
available components**

<p align="right">(<a href="#top">back to top</a>)</p>
```pipeline.py
import pyarrow as pa

## 🪄 Example pipelines

We have created several ready-made example pipelines for you to use as a starting point for exploring Fondant.
They are hosted as separate repositories containing a notebook tutorial so you can easily clone them and get started:

📖 [**RAG tuning pipeline**](https://github.com/ml6team/fondant-usecase-RAG)
End-to-end Fondant pipelines to index and evaluate RAG (Retrieval-Augmented Generation) systems.

🛋️ [**ControlNet Interior Design Pipeline**](https://github.com/ml6team/fondant-usecase-controlnet)
An end-to-end Fondant pipeline to collect and process data for the fine-tuning of a ControlNet model, focusing on images related to interior design.
from fondant.dataset import Dataset

🖼️ [**Filter Creative Commons license images**](https://github.com/ml6team/fondant-usecase-filter-creative-commons)
An end-to-end Fondant pipeline that starts from our Fondant-CC-25M Creative Commons image dataset and filters and downloads the desired images.

## ⚒️ Installation

First, run the minimal Fondant installation:

```
pip install fondant
```

Fondant also includes extra dependencies for specific runners, storage integrations and publishing
components to registries. The dependencies for the local runner (docker) are included by default.

For more detailed installation options, check the [**installation page**](https://fondant.ai/en/latest/guides/installation/) in our documentation.


## 👨‍💻 Usage

#### Pipeline

Fondant allows you to easily define data pipelines comprised of both reusable and custom
components. The following pipeline for instance uses the reusable `load_from_hf_hub` component
to load a dataset from the Hugging Face Hub and process it using a custom component:

**_pipeline.py_**
```python

from fondant.pipeline import Pipeline

pipeline = Pipeline(name="example pipeline", base_path="./data")

dataset = pipeline.read(
raw_data = Dataset.create(
"load_from_hf_hub",
arguments={
"dataset_name": "lambdalabs/pokemon-blip-captions"
"dataset_name": "fondant-ai/fondant-cc-25m",
"n_rows_to_load": 100,
},
produces={
"alt_text": pa.string(),
"image_url": pa.string(),
"license_location": pa.string(),
"license_type": pa.string(),
"webpage_url": pa.string(),
"surt_url": pa.string(),
"top_level_domain": pa.string(),
},
)

dataset = dataset.apply(
images = raw_data.apply(
"download_images",
arguments={
"input_partition_rows": 100,
"resize_mode": "no",
},
)

dataset = images.apply(
"resize_images",
arguments={
"resize_width": 128,
"resize_height": 128,
},
)

```
Custom use cases require the creation of custom components. Check out our [**step by step guide**](https://fondant.ai/en/latest/guides/first_dataset/) to learn more about how to build custom pipelines and components.

Custom use cases require the creation of custom components. Check out our [getting started page](https://fondant.ai/en/latest/guides/first_dataset/) to learn
more about how to build custom pipelines and components.
<p align="right">(<a href="#top">back to top</a>)</p>

### Running your pipeline

@@ -175,6 +99,60 @@ fondant <subcommand> --help

<p align="right">(<a href="#top">back to top</a>)</p>


## 🪄 How Fondant works

- **Dataset**: The building block of Fondant. A dataset is a collection of columns, and Fondant operates exclusively on datasets: we start with a dataset, augment it into a new dataset, and end with a dataset. Fondant optimizes data transfer by storing and loading only the columns that are needed, and it processes data based on the available partitions. The aim is to make these datasets shareable and to allow users to create their own datasets based on those of others.
- **Operation**: A transformation applied to a dataset, resulting in a new dataset. An operation loads the columns it needs and produces new or altered columns. A transformation can be anything from loading or filtering data to adding a column or writing the result. Fondant also makes operations shareable, so you can easily reuse an operation in your workflow.
- **Shareable trees**: Datasets are the result of applying operations to other datasets, and the full lineage is baked in. This allows sharing not just the end product but the full history. Users can also easily continue from a dataset or branch off an existing graph.
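
To make the three concepts above concrete, here is a minimal, self-contained sketch in plain Python. It is not Fondant's API: `ToyDataset`, its `apply` signature, and the operation names are hypothetical and exist only to illustrate how an operation can load just the columns it needs, produce a new dataset, and keep the lineage attached so that others can branch off.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Columns = Dict[str, list]


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset: columns plus a link to its parent."""

    columns: Columns
    operation: str = "source"
    parent: Optional["ToyDataset"] = None

    def apply(
        self,
        name: str,
        consumes: List[str],
        fn: Callable[[Columns], Columns],
    ) -> "ToyDataset":
        # Hand the operation only the columns it declares it needs.
        loaded = {col: self.columns[col] for col in consumes}
        produced = fn(loaded)
        # The result is a new dataset; the original is left untouched.
        return ToyDataset({**self.columns, **produced}, operation=name, parent=self)

    def lineage(self) -> List[str]:
        # Walk back through the parents to recover the full history.
        ops, node = [], self
        while node is not None:
            ops.append(node.operation)
            node = node.parent
        return ops[::-1]


raw = ToyDataset({"image_url": ["a.png", "b.jpg"]})

# One operation adds a column ...
resized = raw.apply(
    "add_width",
    ["image_url"],
    lambda cols: {"width": [128] * len(cols["image_url"])},
)

# ... and another branches off the same starting dataset.
branch = raw.apply(
    "add_extension",
    ["image_url"],
    lambda cols: {"extension": [u.rsplit(".", 1)[-1] for u in cols["image_url"]]},
)

print(resized.lineage())
print(sorted(branch.columns))
```

Because every dataset in this sketch keeps a reference to its parent, sharing `resized` implicitly shares how it was built, which is the idea behind shareable trees.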

![overview](docs/art/fondant_overview.png)

<p align="right">(<a href="#top">back to top</a>)</p>

## 🧩 Key Features

Here's what Fondant brings to the table:
- 🔧 Plug ‘n’ play composable data processing workflows
- 🧩 Library containing off-the-shelf reusable components
- 🐼 A simple Pandas-based interface for creating custom components
- 📊 Built-in lineage, caching, and data explorer
- 🚀 Production-ready, scalable deployment
- ☁️ Integration with runners across different clouds (Vertex, Sagemaker, Kubeflow)

👉 **Check our [Component Hub](https://fondant.ai/en/latest/components/hub/) for an overview of all
available components**

<p align="right">(<a href="#top">back to top</a>)</p>

## 🪄 Example pipelines

We have created several ready-made example pipelines for you to use as a starting point for exploring Fondant.
They are hosted as separate repositories containing a notebook tutorial so you can easily clone them and get started:

📖 [**RAG tuning pipeline**](https://github.com/ml6team/fondant-usecase-RAG)
End-to-end Fondant pipelines to index and evaluate RAG (Retrieval-Augmented Generation) systems.

🛋️ [**ControlNet Interior Design Pipeline**](https://github.com/ml6team/fondant-usecase-controlnet)
An end-to-end Fondant pipeline to collect and process data for the fine-tuning of a ControlNet model, focusing on images related to interior design.

🖼️ [**Filter Creative Commons license images**](https://github.com/ml6team/fondant-usecase-filter-creative-commons)
An end-to-end Fondant pipeline that starts from our Fondant-CC-25M Creative Commons image dataset and filters and downloads the desired images.

## ⚒️ Installation

First, run the minimal Fondant installation:

```
pip install fondant
```

Fondant also includes extra dependencies for specific runners, storage integrations and publishing
components to registries. The dependencies for the local runner (docker) are included by default.

For more detailed installation options, check the [**installation page**](https://fondant.ai/en/latest/guides/installation/) in our documentation.


## 👭 Contributing

We welcome contributions of different kinds:
@@ -203,3 +181,4 @@ pre-commit install
```

<p align="right">(<a href="#top">back to top</a>)</p>

Binary file added docs/art/fondant_overview.png