Don't use from_registry for generic components (#285)
Fixes #251, #252, #253

Before this PR, the operation for a generic component (a reusable
component that dynamically takes a component specification) had to be
created as follows:

```python
component_op = ComponentOp.from_registry(
    name="generic_component",
    component_spec_path="components/custom_generic_component/fondant_component.yaml",
)
```

But the provided name wasn't actually used, since the component
specification already contains a reference to the reusable image that it
should use. Now we can define both custom and generic components as
follows:

```python
component_op = ComponentOp(
    component_dir="components/custom_generic_component",
)
```

We still want to handle custom and generic components differently,
though. Custom components should be built by the local runner, while
generic components should not, since they use a reusable image. To tell
them apart, we now simply check whether a `Dockerfile` is present in the
provided `component_dir`. This is the case for a custom component, but
not for a generic component.
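
The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual `fondant` code: `find_dockerfile` and `compose_service` are hypothetical helper names, and the returned dict only mimics the `build` / `image` keys of a docker compose service.

```python
from pathlib import Path
from typing import Optional


def find_dockerfile(component_dir: str) -> Optional[Path]:
    """Return the Dockerfile inside `component_dir`, or None if there is none.

    Hypothetical helper, used here only to illustrate the check.
    """
    dockerfile = Path(component_dir) / "Dockerfile"
    return dockerfile if dockerfile.is_file() else None


def compose_service(component_dir: str, image: str) -> dict:
    """Build a minimal docker compose service entry for one component."""
    if find_dockerfile(component_dir) is not None:
        # Custom component: a Dockerfile is present, so the local runner
        # should build the image from the component directory.
        return {"build": str(component_dir)}
    # Generic component: no Dockerfile, so run the reusable image that the
    # component specification refers to.
    return {"image": image}
```

With this shape, a directory that only holds a `fondant_component.yaml` yields an `image` entry, while adding a `Dockerfile` next to it switches the entry to `build`.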

---

This has a nice implicit result for reusable components as well, which
can still be defined as:

```python
component_op = ComponentOp.from_registry(
    name="reusable_component",
)
```

Which `fondant` now resolves to:
```python
component_op = ComponentOp(
    component_dir="{fondant_install_dir}/components/reusable_component",
)
```
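
As a rough sketch of that resolution step, assuming the installed package keeps one directory per reusable component under `components/`: the function name `registry_component_dir` and the explicit `install_dir` parameter are hypothetical and only serve the illustration; the real lookup in `fondant` may differ.

```python
from pathlib import Path


def registry_component_dir(name: str, install_dir: Path) -> Path:
    """Map a reusable component name to its directory in the installed package.

    Hypothetical helper: assumes the layout `<install_dir>/components/<name>/`
    with a `fondant_component.yaml` inside.
    """
    component_dir = install_dir / "components" / name
    if not (component_dir / "fondant_component.yaml").is_file():
        # Fail early instead of handing a non-existent component to the pipeline.
        raise ValueError(f"No reusable component named {name!r} in {install_dir}")
    return component_dir
```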

I also added a change to this PR so that the reusable component code is
no longer packaged with the `fondant` package, only the component
specifications. Those are all that is needed, since they contain a
reference to the reusable image on the registry. This means that the
`component_dir` above doesn't contain a `Dockerfile` when `fondant` is
installed from PyPI, but does when `fondant` is installed locally using
`poetry install`. So the local runner doesn't build reusable components
for users who install `fondant` from PyPI, but it does when working on
the `fondant` repo, which is useful for us `fondant` developers.
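
The packaging step in this commit is a shell one-liner in `scripts/pre-build.sh` (`find` piped into `cp --parents`). A rough Python analogue, assuming only files ending in `.yaml` count as specifications, could look like the sketch below. `copy_component_specs` is a hypothetical name, and the script's `grep -i yaml$` filter is slightly broader than the `*.yaml` glob used here.

```python
import shutil
from pathlib import Path
from typing import List


def copy_component_specs(source: Path, target: Path) -> List[Path]:
    """Copy only the YAML files under `source` into `target`.

    Relative paths are preserved, so each specification ends up in the same
    sub-directory it came from. Dockerfiles and source code are left behind.
    """
    copied = []
    # sorted() materialises the matches first, so files created during the
    # copy are never picked up by the traversal.
    for spec in sorted(source.rglob("*.yaml")):
        if not spec.is_file():
            continue
        destination = target / spec.relative_to(source)
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(spec, destination)
        copied.append(destination)
    return copied
```

Because no `Dockerfile` is copied, a component directory produced this way is treated like a generic component by the `Dockerfile` check described earlier.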

---

The only thing we still need is an option on the runner to provide
`build_args`, so we can pass in the `fondant` version to build into the
image. But I'll open a separate PR for that.
RobbeSneyders authored Jul 18, 2023
1 parent fd90c2c commit 85b49ac
Showing 31 changed files with 225 additions and 128 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -174,7 +174,7 @@ def build_pipeline():
pipeline.add_op(load_from_hub_op)

custom_op = ComponentOp(
- component_spec_path="components/custom_component/fondant_component.yaml",
+ component_dir="components/custom_component",
arguments={
"min_width": 600,
"min_height": 600,
2 changes: 1 addition & 1 deletion docs/component_spec.md
@@ -143,7 +143,7 @@ If an argument is not explicitly provided, the default value will be used instea
from fondant.pipeline import ComponentOp
custom_op = ComponentOp(
- component_spec_path="components/custom_component/fondant_component.yaml",
+ component_dir="components/custom_component",
arguments={
"custom_argument": "foo"
},
148 changes: 148 additions & 0 deletions docs/components.md
@@ -0,0 +1,148 @@
# Components

Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
provides a lot of components out of the box
([overview](https://github.com/ml6team/fondant/tree/main/components)), but you can also define your
own custom components.

## The anatomy of a component

A component is completely defined by its [component specification](component_spec.md) and a
docker image. The specification defines the docker image fondant should run to execute the
component, which data it consumes and produces, and which arguments it takes.

## Component types

We can distinguish three different types of components:

- **Reusable components** can be used out of the box and can be loaded from the fondant
component registry
- **Custom components** are completely defined and implemented by the user
- **Generic components** leverage a reusable implementation, but require a custom component
specification

### Reusable components

Reusable components are completely defined and implemented by fondant. You can easily add them
to your pipeline by creating an operation using `ComponentOp.from_registry()`.

```python
from fondant.pipeline import ComponentOp

component_op = ComponentOp.from_registry(
    name="reusable_component",
    arguments={
        "arg": "value"
    }
)
```

??? "fondant.pipeline.ComponentOp.from_registry"

    ::: fondant.pipeline.ComponentOp.from_registry
        handler: python
        options:
            show_source: false

You can find an overview of the reusable components offered by fondant
[here](https://github.com/ml6team/fondant/tree/main/components). Check their
`fondant_component.yaml` file for information on which arguments they accept and which data they
consume and produce.

### Custom components

To define your own custom component, you can build your code into a docker image and write an
accompanying component specification that refers to it.

A typical file structure for a custom component looks like this:
```
|- components
|  |- custom_component
|     |- src
|     |  |- main.py
|     |- Dockerfile
|     |- fondant_component.yaml
|- pipeline.py
```

The `Dockerfile` is used to build the code into a docker image, which is then referred to in the
`fondant_component.yaml`.

```yaml title="components/custom_component/fondant_component.yaml"
name: Custom component
description: This is a custom component
image: custom_component:latest
```
You can add a custom component to your pipeline by creating a `ComponentOp` and passing in the path
to the directory containing your `fondant_component.yaml`.

```python title="pipeline.py"
from fondant.pipeline import ComponentOp
component_op = ComponentOp(
    component_dir="components/custom_component",
    arguments={
        "arg": "value"
    }
)
```

??? "fondant.pipeline.ComponentOp"

    ::: fondant.pipeline.ComponentOp
        handler: python
        options:
            members: []
            show_source: false

See our [best practices on creating a custom component](custom_component.md).

### Generic components

A generic component is a component leveraging a reusable docker image, but requiring a custom
`fondant_component.yaml` specification.

Since a generic component only requires a custom `fondant_component.yaml`, its file structure
looks like this:
```
|- components
|  |- generic_component
|     |- fondant_component.yaml
|- pipeline.py
```

The `fondant_component.yaml` refers to the reusable image it leverages:

```yaml title="components/generic_component/fondant_component.yaml"
name: Generic component
description: This is a generic component
image: reusable_component:latest
```

You can add a generic component to your pipeline by creating a `ComponentOp` and passing in the path
to the directory containing your custom `fondant_component.yaml`.

```python title="pipeline.py"
from fondant.pipeline import ComponentOp
component_op = ComponentOp(
    component_dir="components/generic_component",
    arguments={
        "arg": "value"
    }
)
```

??? "fondant.pipeline.ComponentOp"

    ::: fondant.pipeline.ComponentOp
        handler: python
        options:
            members: []
            show_source: false

An example of a generic component is the
[`load_from_hf_hub`](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub)
component. It can read any dataset from the HuggingFace hub, but it requires the user to define
the schema of the produced dataset in a custom `fondant_component.yaml` specification.
7 changes: 3 additions & 4 deletions docs/getting_started.md
@@ -48,9 +48,8 @@ Now that we have a pipeline, we can add components to it. Components are the bui
Let's add a reusable component to our pipeline. We will use the `load_from_hf_hub` component to read data from huggingface. Add the following code to your `pipeline.py` file:

```Python
- load_from_hf_hub = ComponentOp.from_registry(
- name='load_from_hf_hub',
- component_spec_path='components/load_from_hf_hub/fondant_component.yml',
+ load_from_hf_hub = ComponentOp(
+ component_dir='components/load_from_hf_hub',
arguments={
'dataset_name': 'huggan/pokemon',
'n_rows_to_load': 100,
@@ -278,7 +277,7 @@ With our component complete we can now add it to our pipeline definition (`pipel

```python
extract_resolution = ComponentOp(
- component_spec_path='components/extract_resolution/fondant_component.yml',
+ component_dir='components/extract_resolution',
)
my_pipeline.add_op(load_from_hf_hub) # this line was already there
2 changes: 1 addition & 1 deletion docs/pipeline.md
@@ -24,7 +24,7 @@ def build_pipeline():
pipeline.add_op(load_from_hub_op)

caption_images_op = ComponentOp(
- component_spec_path="components/captioning_component/fondant_component.yaml",
+ component_dir="components/captioning_component",
arguments={
"model_id": "Salesforce/blip-image-captioning-base",
"batch_size": 2,
7 changes: 3 additions & 4 deletions examples/pipelines/controlnet-interior-design/pipeline.py
@@ -15,7 +15,7 @@

# Define component ops
generate_prompts_op = ComponentOp(
- component_spec_path="components/generate_prompts/fondant_component.yaml",
+ component_dir="components/generate_prompts",
arguments={"n_rows_to_load": None},
)
laion_retrieval_op = ComponentOp.from_registry(
@@ -59,9 +59,8 @@
node_pool_name="model-inference-pool",
)

- write_to_hub_controlnet = ComponentOp.from_registry(
- name="write_to_hf_hub",
- component_spec_path="components/write_to_hub_controlnet/fondant_component.yaml",
+ write_to_hub_controlnet = ComponentOp(
+ component_dir="components/write_to_hub_controlnet",
arguments={
"username": "test-user",
"dataset_name": "segmentation_kfp",
12 changes: 5 additions & 7 deletions examples/pipelines/datacomp/pipeline.py
@@ -34,9 +34,8 @@
"clip_l14_similarity_score": "image_text_clip_l14_similarity_score",
}

- load_from_hub_op = ComponentOp.from_registry(
- name="load_from_hf_hub",
- component_spec_path="components/load_from_hf_hub/fondant_component.yaml",
+ load_from_hub_op = ComponentOp(
+ component_dir="components/load_from_hf_hub",
arguments={
"dataset_name": "nielsr/datacomp-small-with-embeddings",
"column_name_mapping": load_component_column_mapping,
@@ -48,17 +47,16 @@
arguments={"min_image_dim": 200, "max_aspect_ratio": 3},
)
filter_complexity_op = ComponentOp(
- component_spec_path="components/filter_text_complexity/fondant_component.yaml",
+ component_dir="components/filter_text_complexity",
arguments={
"spacy_pipeline": "en_core_web_sm",
"batch_size": 1000,
"min_complexity": 1,
"min_num_actions": 1,
},
)
- cluster_image_embeddings_op = ComponentOp.from_registry(
- name="cluster_image_embeddings",
- component_spec_path="components/cluster_image_embeddings/fondant_component.yaml",
+ cluster_image_embeddings_op = ComponentOp(
+ component_dir="components/cluster_image_embeddings",
arguments={
"sample_ratio": 0.3,
"num_clusters": 3,
10 changes: 4 additions & 6 deletions examples/pipelines/finetune_stable_diffusion/pipeline.py
@@ -19,9 +19,8 @@
value: key for key, value in load_component_column_mapping.items()
}
# Define component ops
- load_from_hub_op = ComponentOp.from_registry(
- name="load_from_hf_hub",
- component_spec_path="components/load_from_hf_hub/fondant_component.yaml",
+ load_from_hub_op = ComponentOp(
+ component_dir="components/load_from_hf_hub",
arguments={
"dataset_name": "logo-wizard/modern-logo-dataset",
"column_name_mapping": load_component_column_mapping,
@@ -72,9 +71,8 @@
node_pool_name="model-inference-pool",
)

- write_to_hub = ComponentOp.from_registry(
- name="write_to_hf_hub",
- component_spec_path="components/write_to_hf_hub/fondant_component.yaml",
+ write_to_hub = ComponentOp(
+ component_dir="components/write_to_hf_hub",
arguments={
"username": "test-user",
"dataset_name": "stable_diffusion_processed",
5 changes: 2 additions & 3 deletions examples/pipelines/starcoder/pipeline.py
@@ -36,9 +36,8 @@
)

# define ops
- load_from_hub_op = ComponentOp.from_registry(
- name="load_from_hub",
- component_spec_path="components/load_from_hub/fondant_component.yaml",
+ load_from_hub_op = ComponentOp(
+ component_dir="components/load_from_hub",
arguments={
"dataset_name": "ml6team/the-stack-smol-python",
"column_name_mapping": load_component_column_mapping,
11 changes: 8 additions & 3 deletions mkdocs.yml
@@ -30,13 +30,18 @@ nav:
- Home: index.md
- Getting Started: getting_started.md
- Building a pipeline: pipeline.md
- - Creating custom components: custom_component.md
- - Read / write components: generic_component.md
- - Component spec: component_spec.md
+ - Components:
+   - Components: components.md
+   - Creating custom components: custom_component.md
+   - Read / write components: generic_component.md
+   - Component spec: component_spec.md
- Data explorer: data_explorer.md
- Infrastructure: infrastructure.md
- Manifest: manifest.md

+ plugins:
+   - mkdocstrings

markdown_extensions:
- pymdownx.snippets:
check_paths: true
1 change: 1 addition & 0 deletions pyproject.toml
@@ -69,6 +69,7 @@ coveralls = "^3.3.1"

[tool.poetry.group.docs.dependencies]
mkdocs-material = "^9.1.8"
+ mkdocstrings = { version = "^0.20", extras = ["python"]}

[tool.poetry.scripts]
fondant = "fondant.cli:entrypoint"
2 changes: 1 addition & 1 deletion scripts/pre-build.sh
@@ -10,5 +10,5 @@ root_path=$(dirname "$scripts_path")

pushd "$root_path"
rm -rf src/fondant/components
- cp -r components src/fondant/
+ find components/ -type f | grep -i yaml$ | xargs -i cp --parents {} src/fondant/
popd
9 changes: 5 additions & 4 deletions src/fondant/compiler.py
@@ -170,10 +170,11 @@ def _generate_spec(self, pipeline: Pipeline, extra_volumes: list) -> dict:
"volumes": volumes,
}

- if component_op.local_component:
- services[safe_component_name][
- "build"
- ] = f"./{Path(component_op.component_spec_path).parent}"
+ if component_op.dockerfile_path is not None:
+ logger.info(
+ f"Found Dockerfile for {component_name}, adding build step.",
+ )
+ services[safe_component_name]["build"] = str(component_op.component_dir)
else:
services[safe_component_name][
"image"