Restructure documentation (#597)
PR that restructures the documentation to take into account the user's journey (#572).

Includes other changes:
* Updating main README documentation (components, some text tweaks,
runner visuals, ...)
* Renaming some components to match the format
<component_function>_<modality> (e.g. text_normalization ->
normalize_text)

There are many other changes that still need to be made, but it's best to track them in separate
PRs to make reviewing them easier. The created issues can be found in #572.
PhilippeMoussalli authored Nov 9, 2023
1 parent 1961b1a commit 92bf041
Showing 38 changed files with 731 additions and 422 deletions.
158 changes: 102 additions & 56 deletions README.md

Large diffs are not rendered by default.

File renamed without changes.
@@ -29,15 +29,15 @@ You can add this component to your pipeline using the following code:
from fondant.pipeline import ComponentOp


text_length_filter_op = ComponentOp.from_registry(
name="text_length_filter",
filter_text_length_op = ComponentOp.from_registry(
name="filter_text_length",
arguments={
# Add arguments
# "min_characters_length": 0,
# "min_words_length": 0,
}
)
pipeline.add_op(text_length_filter_op, dependencies=[...]) #Add previous component as dependency
pipeline.add_op(filter_text_length_op, dependencies=[...]) #Add previous component as dependency
```
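
For example, to keep only texts of at least 20 characters and 4 words (the same thresholds used in
the component's tests; the values are illustrative and should be tuned to your dataset):

```python
from fondant.pipeline import ComponentOp

filter_text_length_op = ComponentOp.from_registry(
    name="filter_text_length",
    arguments={
        "min_characters_length": 20,  # illustrative threshold
        "min_words_length": 4,        # illustrative threshold
    },
)
pipeline.add_op(filter_text_length_op, dependencies=[...])  # Add previous component as dependency
```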

### Testing
@@ -8,7 +8,7 @@
logger = logging.getLogger(__name__)


class TextLengthFilterComponent(PandasTransformComponent):
class FilterTextLengthComponent(PandasTransformComponent):
"""A component that filters out text based on their length."""

def __init__(self, *_, min_characters_length: int, min_words_length: int):
@@ -1,8 +1,7 @@
"""Unit test for text length filter component."""
import pandas as pd
from fondant.core.component_spec import ComponentSpec

from components.text_length_filter.src.main import TextLengthFilterComponent
from components.filter_text_length.src.main import FilterTextLengthComponent


def test_run_component_test():
@@ -16,17 +15,10 @@ def test_run_component_test():

dataframe = pd.concat({"text": pd.DataFrame(data)}, axis=1, names=["text", "data"])

# When: The text filter component processes the dataframe
spec = ComponentSpec.from_file("../fondant_component.yaml")

component = TextLengthFilterComponent(
spec,
input_manifest_path="./dummy_input_manifest.json",
output_manifest_path="./dummy_input_manifest.json",
metadata={},
user_arguments={"min_characters_length": 20, "min_words_length": 4},
component = FilterTextLengthComponent(
min_characters_length=20,
min_words_length=4,
)
component.setup(min_characters_length=20, min_words_length=4)
dataframe = component.transform(dataframe=dataframe)

# Then: dataframe only contains one row
File renamed without changes.
@@ -44,8 +44,8 @@ You can add this component to your pipeline using the following code:
from fondant.pipeline import ComponentOp


text_normalization_op = ComponentOp.from_registry(
name="text_normalization",
normalize_text_op = ComponentOp.from_registry(
name="normalize_text",
arguments={
# Add arguments
# "remove_additional_whitespaces": False,
@@ -55,7 +55,7 @@ text_normalization_op = ComponentOp.from_registry(
# "remove_punctuation": ,
}
)
pipeline.add_op(text_normalization_op, dependencies=[...]) #Add previous component as dependency
pipeline.add_op(normalize_text_op, dependencies=[...]) #Add previous component as dependency
```
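
For example, a filled-in call could look as follows; only arguments that appear in the README
above and in the component's tests are shown, and the values are illustrative:

```python
from fondant.pipeline import ComponentOp

normalize_text_op = ComponentOp.from_registry(
    name="normalize_text",
    arguments={
        "remove_additional_whitespaces": True,  # illustrative value
        "do_lowercase": True,                   # illustrative value
        "remove_punctuation": True,             # illustrative value
    },
)
pipeline.add_op(normalize_text_op, dependencies=[...])  # Add previous component as dependency
```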

### Testing
@@ -1,5 +1,5 @@
name: Normalize text
image: fndnt/text_normalization:latest
image: fndnt/normalize_text:latest
description: |
This component implements several text normalization techniques to clean and preprocess textual
data:
File renamed without changes.
@@ -41,7 +41,7 @@ def any_condition_met(line, discard_condition_functions):
)


class TextNormalizationComponent(PandasTransformComponent):
class NormalizeTextComponent(PandasTransformComponent):
"""Component that normalizes text."""

def __init__(
File renamed without changes.
@@ -1,6 +1,6 @@
import pandas as pd

from src.main import TextNormalizationComponent
from src.main import NormalizeTextComponent


def test_transform_custom_componen_test():
@@ -12,7 +12,7 @@ def test_transform_custom_componen_test():
"do_lowercase": True,
"remove_punctuation": True,
}
component = TextNormalizationComponent(**user_arguments)
component = NormalizeTextComponent(**user_arguments)

input_dataframe = pd.DataFrame(
[
10 changes: 10 additions & 0 deletions docs/architecture.md
@@ -0,0 +1,10 @@
# Architecture

An overview of the architecture of Fondant

### Coming soon


## Conceptual overview

#### TODO: Add a diagram here
Binary file added docs/art/runners/docker_compose.png
Binary file added docs/art/runners/kubeflow_pipelines.png
Binary file added docs/art/runners/sagemaker.png
Binary file added docs/art/runners/vertex_ai.png
39 changes: 39 additions & 0 deletions docs/caching.md
@@ -0,0 +1,39 @@
When Fondant runs a pipeline, it checks to see whether an execution exists in the base path based on
the cache key of each component.

The cache key is defined as the combination of the following:

1) The **pipeline step's inputs.** These include the values of the input arguments (if any).

2) **The component's specification.** This specification includes the image tag and the fields
consumed and produced by each component.

3) **The component resources.** Defines the hardware that was used to run the component (GPU,
nodepool).

If there is a matching execution in the base path (checked based on the output manifests),
the outputs of that execution are used and the step computation is skipped.
This helps to reduce costs by skipping computations that were completed in a previous pipeline run.

Additionally, only the pipelines with the same pipeline name will share the cache. Caching for
components with the `latest` image tag is disabled by default. This is because using "latest"
image tags can lead to unpredictable behavior due to image updates. Moreover, if one component in
the pipeline is not cached, then caching will be disabled for all subsequent components.
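
Conceptually, the cache key behaves like a hash over these three ingredients. The sketch below is
purely illustrative and is **not** Fondant's actual implementation; the function and field names
are assumptions used only to show the idea that changing any ingredient invalidates the cache:

```python
import hashlib
import json


def illustrative_cache_key(arguments: dict, component_spec: dict, resources: dict) -> str:
    """Hypothetical sketch: any change to the arguments, the component spec (image tag,
    consumed/produced fields) or the resources yields a different key, so the previous
    execution can no longer be matched and the component is recomputed."""
    payload = json.dumps(
        {"arguments": arguments, "spec": component_spec, "resources": resources},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```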

### Disabling caching
You can turn off execution caching at component level by setting the following:

```python
from fondant.pipeline.pipeline import ComponentOp

caption_images_op = ComponentOp(
component_dir="...",
arguments={
...
},
cache=False,
)
```
4 changes: 2 additions & 2 deletions docs/components/hub.md
@@ -76,11 +76,11 @@ Below you can find the reusable components offered by Fondant.

??? "text_length_filter"

--8<-- "components/text_length_filter/README.md:1"
--8<-- "components/filter_text_length/README.md:1"

??? "text_normalization"

--8<-- "components/text_normalization/README.md:1"
--8<-- "components/normalize_text/README.md:1"

??? "write_to_hf_hub"

1 change: 1 addition & 0 deletions docs/components/publishing_components.md
@@ -0,0 +1 @@
# Coming Soon
105 changes: 105 additions & 0 deletions docs/documentation_guide.md
@@ -0,0 +1,105 @@
# Documentation Guide

## Getting started with Fondant

Learn about the Fondant project and how to get started with it.

→ Start with the official guide on how to [install](guides/installation.md) Fondant.
→ Get started by running your first fondant [pipeline](guides/first_pipeline.md) using the [local
runner](runners/local.md).
→ Learn how to build your own [Fondant Pipeline](guides/build_a_simple_pipeline.md) and implement your
own [custom components](guides/implement_custom_components.md).
→ Learn how to use the [data explorer](data_explorer.md) to explore the outputs of your pipeline.

## Fondant fundamentals

Learn how to use Fondant to build your own data processing pipeline.

-> Design your own fondant [pipeline](pipeline.md) using the Fondant pipeline API.
-> Use existing [reusable components](components/hub.md) to build your pipeline.
-> Use [generic components](components/generic_component.md) to load/write your custom data format
to/from Fondant.
-> Build your own [custom component](components/custom_component.md) using the Fondant component
API.
-> Learn how to publish your own [components](components/publishing_components.md) to a container
registry so that you can reuse them in your pipelines.

## Components hub

Have a look at the [components hub](components/hub.md) to see what components are available.

## Fondant Runners

Learn how to run your Fondant pipeline on different platforms.

<table class="images" width="100%" style="border: 0px solid white; width: 100%;">
<tr style="border: 0px;">
<td width="25%" style="border: 0px; width: 28.33%">
<figure>
<img src="https://github.com/ml6team/fondant/blob/main/docs/art/runners/docker_compose.png?raw=true" />
<figcaption class="caption"><strong>LocalRunner</strong></figcaption>
</figure>
</td>
<td width="25%" style="border: 0px; width: 30.33%">
<figure>
<img src="https://github.com/ml6team/fondant/blob/main/docs/art/runners/vertex_ai.png?raw=true" />
<figcaption class="caption"><strong>VertexRunner</strong></figcaption>
</figure>
</td>
<td width="25%" style="border: 0px; width: 30.33%">
<figure>
<img src="https://github.com/ml6team/fondant/blob/main/docs/art/runners/kubeflow_pipelines.png?raw=true" />
<figcaption class="caption"><strong>KubeflowRunner</strong></figcaption>
</figure>
</td>
<td width="25%" style="border: 0px; width: 33.33%">
<figure>
<img src="https://github.com/ml6team/fondant/blob/main/docs/art/runners/sagemaker.png?raw=true" />
<figcaption class="caption"><strong>🚧SageMakerRunner🚧</strong></figcaption>
</figure>
</td>
</tr>
</table>

<style>
.caption {
text-align: center; /* Adjust the alignment as needed */
}
</style>

-> [LocalRunner](runners/local.md): ideal for developing fondant pipelines and components faster.
-> [VertexRunner](runners/vertex.md): used for running a fondant pipeline on Vertex AI.
-> [KubeflowRunner](runners/kfp.md): used for running a fondant pipeline on a Kubeflow cluster.
-> [SageMakerRunner](runners/kfp.md): used for running a fondant pipeline on SageMaker Pipelines
(🚧 Coming Soon 🚧).

## Fondant Explorer

Discover how to utilize the Fondant [data explorer](data_explorer.md) to navigate your pipeline
outputs, including visualizing intermediary steps between components.

## Advanced Concepts

Learn about some of the more advanced concepts in Fondant.

-> Learn more about the [architecture](architecture.md) of Fondant and how it works under the
hood.
-> Understand how Fondant passes data between components with the [manifest](manifest.md).
-> Learn how Fondant uses [caching](caching.md) to speed up your pipeline development.
-> Find out how Fondant uses [partitions](partitions.md) to parallelize and scale your pipeline and
how you can use it to your advantage.

## Contributing

Learn how to contribute to the Fondant project through
our [contribution guidelines](contributing.md).

## FAQ

Browse through the [frequently asked questions](faq.md) about Fondant.

## Announcements

Check out our latest announcements about Fondant.

-> 25 million Creative Commons image dataset released. Read more about it [here](announcements/CC_25M_press_release.md).

1 change: 1 addition & 0 deletions docs/faq.md
@@ -0,0 +1 @@
# Coming Soon
32 changes: 19 additions & 13 deletions docs/guides/build_a_simple_pipeline.md
@@ -12,18 +12,22 @@ We present a walkthrough to build by yourself the pipeline presented in the Gett

The sample pipeline that is going to be built in this tutorial demonstrates how to effectively utilise a creative commons image dataset within a fondant pipeline. This dataset comprises images from diverse sources and is available in various data formats.

The pipeline starts with the initialization of the image dataset sourced from HuggingFace and it proceeds with the downloading of these carefully selected images. Accomplishing these tasks necessitates the use of a pre-built generic component (HuggingFace dataset loading) and a reusable component (image downloading).
The pipeline starts with the initialization of the image dataset sourced from HuggingFace and proceeds with the downloading of these carefully selected images. Accomplishing these tasks necessitates the use of:

* [load_from_hf_hub](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub): A [generic component](../components/generic_component.md) that loads the initial dataset from the Huggingface hub.
* [download_images](https://github.com/ml6team/fondant/tree/main/components/download_images): A [reusable component](../components/components.md) that downloads images from URLs.
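
As a preview, a minimal sketch of the pipeline this guide builds is shown below. The empty
argument dicts, the `ComponentOp(component_dir=...)` initialisation of the custom component, and
the exact `add_op` calls are assumptions here; the following sections fill them in properly.

```python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="creative_commons_pipeline",
    base_path="./data",  # The directory that will be used to store the data
)

# Generic component: loads the initial dataset from the HuggingFace hub
# using the custom component spec created later in this guide.
load_from_hf_hub = ComponentOp(
    component_dir="component/load_from_hub",
    arguments={},  # dataset name, number of rows, column mapping, ... (see below)
)
pipeline.add_op(load_from_hf_hub)

# Reusable component: downloads the images behind the `image_url` column.
download_images = ComponentOp.from_registry(
    name="download_images",
    arguments={},
)
pipeline.add_op(download_images, dependencies=load_from_hf_hub)
```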

## Setting up the environment

To set up your local environment, please refer to our getting started documentation. There, you will find the necessary steps to configure your environment.
We will be using the [local runner](../runners/local.md) to run this pipeline. To set up your local environment, please refer to our [installation](installation.md) documentation.

## Building the pipeline

Everything begins with the pipeline definition. Start by creating a 'pipeline.py' file and adding the following code.
Everything begins with the pipeline definition. Start by creating a `pipeline.py` file and adding the following script.

```python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
pipeline_name="creative_commons_pipline", # This is the name of your pipeline
base_path="./data" # The directory that will be used to store the data
@@ -49,7 +53,7 @@ If you want to learn more about components, you can check out the [components do

### First component to load the dataset

For every pipeline, the initial step is data initialization. In our case, we aim to load the dataset into our pipeline base from HuggingFace. Fortunately, there is already a generic component available called `load_from_hub`.
For every pipeline, the initial step is data initialization. In our case, we aim to load the dataset into our pipeline base from HuggingFace. Fortunately, we already have a generic component available called `load_from_hf_hub`.
This component is categorised as a generic component because the structure of the datasets we load from HuggingFace can vary from one dataset to another. While we can leverage the implemented business logic of the component, we must customise the component spec. This customization is necessary to inform the component about the specific columns it will produce.
To utilise this component, it's time to create your first component spec.
Create a folder `component/load_from_hub` and create a `fondant_component.yaml` with the following content:
@@ -96,8 +100,8 @@ args:
default: None
```
As mentioned earlier, the component spec specifies the data structure consumed and/or produced by the component. In this case, the component solely produces data, and this structure is defined within the `produces` section. Fondant operates with hierarchical column structures. In our example, we are defining a column called images with several subset fields.
Now that we have created the component spec, we can incorporate the component into our python code. The next steps involve initialising the component from the component spec and adding it to our pipeline using the following code:
As mentioned earlier, the component spec specifies the data structure consumed and/or produced by the component. In this case, the component solely produces data, and this structure is defined within the `produces` section. Fondant operates with hierarchical column structures. In our example, we are defining a column called `images` with several subset fields.
Now that we have created the component specification file, we can incorporate the component into our python code. The next steps involve initialising the component from the component spec and adding it to our pipeline using the following code:

```python
from fondant.pipeline import ComponentOp
@@ -126,18 +130,20 @@ Two key actions are taking place here:

1. We create a ComponentOp from the registry, configuring the component with specific arguments. In this process, we override default arguments as needed. If we don't provide an argument override, the default values are used. Notably, we are modifying the dataset to be loaded, specifying the number of rows to load (which can be a small number for testing purposes), and mapping columns from the HuggingFace dataset to columns in our dataframe.

2. The add_op method registers the configured component into the pipeline.
2. We provide a column mapping argument for the component to change the column names from the [initial dataset](https://huggingface.co/datasets/mrchtr/cc-test) to ones that match the component specification.

3. The add_op method registers the configured component into the pipeline.
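
A hedged sketch of what such a configuration could look like is shown below. The argument names
(`dataset_name`, `n_rows_to_load`, `column_name_mapping`) and the target column name are
assumptions for illustration only and should be checked against the component specification:

```python
from fondant.pipeline import ComponentOp

load_from_hf_hub = ComponentOp(
    component_dir="component/load_from_hub",
    arguments={
        "dataset_name": "mrchtr/cc-test",  # the initial dataset linked above
        "n_rows_to_load": 100,             # keep this small for testing purposes
        "column_name_mapping": {
            # hypothetical mapping: HuggingFace column -> column from the component spec
            "image_url": "images_url",
        },
    },
)
pipeline.add_op(load_from_hf_hub)
```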

To test the pipeline, you can execute the following command within the pipeline directory:

```
```bash
fondant run local pipeline.py
```

The pipeline execution will start, initiating the download of the dataset from HuggingFace.
After the pipeline has completed, you can explore the pipeline result using the fondant explorer:

```
```bash
fondant explore --base_path ./data
```

@@ -158,17 +164,17 @@ download_images = ComponentOp.from_registry(
arguments={}
)
pipeline.add_op(download_images, dependencies=[load_from_hf_hub])
pipeline.add_op(download_images, dependencies=load_from_hf_hub)
```

The reusable component requires a specific dataset input format to function effectively. Referring to the ComponentHub documentation, this component downloads images based on the URLs provided in the `image_url` column. Fortunately, the column generated by the first component is already named correctly for this purpose.
The reusable component requires a specific dataset input format to function effectively. Referring to the [component's documentation](https://hub.docker.com/r/fndnt/download_images), this component downloads images based on the URLs provided in the `image_url` column. Fortunately, the column generated by the first component is already named correctly for this purpose.

Instead of initialising the component from a YAML file, we'll use the method `ComponentOp.from_registry(...)` where we can easily specify the name of the reusable component. This is arguably the simplest way to start using a Fondant component.

Finally, we add the component to the pipeline using the `add_op` method. Notably, we define `dependencies=[load_from_hf_hub]` in this step. This command ensures that we chain both components together. Specifically, the `download_images` component awaits the execution input from the `load_from_hf_hub` component.
Finally, we add the component to the pipeline using the `add_op` method. Notably, we define `dependencies=load_from_hf_hub` in this step. This command ensures that we chain both components together. Specifically, the `download_images` component awaits the execution input from the `load_from_hf_hub` component.

Now, you can proceed to execute your pipeline once more and explore the results. In the explorer, you will be able to view the images that have been downloaded.

![explorer](https://github.com/ml6team/fondant/blob/main/docs/art/guides/explorer.png?raw=true)

Well done! You have now acquired the skills to construct a simple Fondant pipeline by leveraging generic and reusable components. In our [upcoming tutorial](../guides//implement_custom_components.md), we'll demonstrate how you can customise the pipeline by implementing a custom component.
Well done! You have now acquired the skills to construct a simple Fondant pipeline by leveraging generic and reusable components. In the [following tutorial](implement_custom_components.md), we'll demonstrate how you can customise the pipeline by implementing a custom component.
