Generate READMEs for all components using a script #484

Merged on Oct 5, 2023 (5 commits)
12 changes: 10 additions & 2 deletions .pre-commit-config.yaml
@@ -18,7 +18,6 @@ repos:
"--exit-non-zero-on-fix",
]


- repo: https://github.com/PyCQA/bandit
rev: 1.7.4
hooks:
@@ -55,4 +54,13 @@ repos:
- types-jsonschema
- types-PyYAML
- types-requests
pass_filenames: false

  - repo: local
    hooks:
      - id: generate_component_readmes
        name: Generate component READMEs
        language: python
        entry: python scripts/component_readme/generate_readme.py
        files: ^components/.*/fondant_component.yaml
        additional_dependencies: ["fondant"]
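
The hook above reruns `scripts/component_readme/generate_readme.py` whenever a `fondant_component.yaml` under `components/` changes. The script itself is not shown in this diff, so the following is only a minimal sketch of how the arguments table in the generated READMEs could be rendered from an already parsed spec. The function name `render_arguments_table` and the `example_args` dictionary are hypothetical, and printing `/` for a missing default is inferred from the tables in this PR rather than from the script.

```python
def render_arguments_table(args: dict) -> str:
    """Render a component's `args` mapping as a Markdown table (illustrative only)."""
    lines = [
        "| argument | type | description | default |",
        "| -------- | ---- | ----------- | ------- |",
    ]
    for name, spec in args.items():
        # Collapse multi-line descriptions so each argument stays on one table row.
        description = " ".join(spec.get("description", "").split())
        # Assumption: arguments without a default are rendered as "/".
        default = spec.get("default", "/")
        lines.append(f"| {name} | {spec['type']} | {description} | {default} |")
    return "\n".join(lines)


# Hand-written stand-in for the parsed `args` section of a fondant_component.yaml.
example_args = {
    "model_id": {
        "description": "Id of the BLIP model on the Hugging Face hub",
        "type": "str",
        "default": "Salesforce/blip-image-captioning-base",
    },
    "batch_size": {
        "description": "Batch size to use for inference",
        "type": "int",
        "default": 8,
    },
}

table = render_arguments_table(example_args)
print(table)
```

Keeping the rendering a pure function of the parsed spec makes it easy to check in isolation, independent of how the YAML is loaded.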
50 changes: 46 additions & 4 deletions components/caption_images/README.md
@@ -1,9 +1,51 @@
# caption_images
# Caption images

### Description
This component captions inputted images using [BLIP](https://huggingface.co/docs/transformers/model_doc/blip).
This component captions images using a BLIP model from the Hugging Face hub

### **Inputs/Outputs**
### Inputs / outputs

See [`fondant_component.yaml`](fondant_component.yaml) for a more detailed description on all the input/output parameters.
**This component consumes:**
- images
  - data: binary

**This component produces:**
- captions
  - text: string

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| model_id | str | Id of the BLIP model on the Hugging Face hub | Salesforce/blip-image-captioning-base |
| batch_size | int | Batch size to use for inference | 8 |
| max_new_tokens | int | Maximum token length of each caption | 50 |
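
To illustrate what `batch_size` controls, here is a minimal sketch of splitting a stream of inputs into fixed-size batches before inference. It does not reproduce the component's code; the helper name `batched` is hypothetical and the integers stand in for images.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 20 placeholder inputs with the default batch size of 8.
batches = list(batched(range(20), batch_size=8))
print([len(batch) for batch in batches])  # [8, 8, 4]
```

A larger batch size generally improves throughput at the cost of more memory per inference step.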

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import ComponentOp


caption_images_op = ComponentOp.from_registry(
name="caption_images",
arguments={
# Add arguments
# "model_id": "Salesforce/blip-image-captioning-base",
# "batch_size": 8,
# "max_new_tokens": 50,
}
)
pipeline.add_op(caption_images_op, dependencies=[...]) #Add previous component as dependency
```

### Testing

You can run the tests using docker with BuildKit. From this directory, run:
```
docker build . --target test
```
8 changes: 4 additions & 4 deletions components/caption_images/fondant_component.yaml
@@ -1,5 +1,5 @@
name: Caption images
description: Component that captions images using a model from the Hugging Face hub
description: This component captions images using a BLIP model from the Hugging Face hub
image: ghcr.io/ml6team/caption_images:dev

consumes:
@@ -16,14 +16,14 @@ produces:

args:
  model_id:
    description: id of the model on the Hugging Face hub
    description: Id of the BLIP model on the Hugging Face hub
    type: str
    default: "Salesforce/blip-image-captioning-base"
  batch_size:
    description: batch size to use
    description: Batch size to use for inference
    type: int
    default: 8
  max_new_tokens:
    description: maximum token length of each caption
    description: Maximum token length of each caption
    type: int
    default: 50
65 changes: 58 additions & 7 deletions components/download_images/README.md
@@ -1,19 +1,70 @@
# download_images
# Download images

### Description
This component takes in image URLs as input and downloads the images, along with some metadata (like their height and width).
The images are stored in a new column as bytes objects. This component also resizes the images using the [resizer](https://github.com/rom1504/img2dataset/blob/main/img2dataset/resizer.py) function from the img2dataset library.
Component that downloads images from a list of URLs.

If the component is unable to retrieve the image at a URL (for any reason), it will return `None` for that particular URL.
This component takes in image URLs as input and downloads the images, along with some metadata
(like their height and width). The images are stored in a new column as bytes objects. This
component also resizes the images using the
[resizer](https://github.com/rom1504/img2dataset/blob/main/img2dataset/resizer.py) function
from the img2dataset library.
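
As a rough illustration of what the resize modes mean for the output dimensions, the sketch below computes the expected size per mode. It assumes the semantics described in the img2dataset README (`keep_ratio` scales the smallest side to `image_size`, while `center_crop` and `border` produce a square image). The function `output_size` is hypothetical, ignores `resize_only_if_bigger`, and its rounding may differ from the library's.

```python
from typing import Tuple


def output_size(
    width: int, height: int, image_size: int = 256, resize_mode: str = "border"
) -> Tuple[int, int]:
    """Return the (width, height) expected after resizing, under the stated assumptions."""
    if resize_mode == "no":
        # No resizing: dimensions are unchanged.
        return width, height
    if resize_mode == "keep_ratio":
        # Scale so that the smallest side equals image_size, preserving the aspect ratio.
        scale = image_size / min(width, height)
        return round(width * scale), round(height * scale)
    if resize_mode in ("center_crop", "border"):
        # Both modes yield a square image, by cropping or by padding respectively.
        return image_size, image_size
    raise ValueError(f"Unknown resize_mode: {resize_mode!r}")


for mode in ("no", "keep_ratio", "center_crop", "border"):
    print(mode, output_size(640, 480, image_size=256, resize_mode=mode))
```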

### **Inputs/Outputs**

See [`fondant_component.yaml`](fondant_component.yaml) for a more detailed description on all the input/output parameters.
### Inputs / outputs

**This component consumes:**
- images
  - url: string

**This component produces:**
- images
  - data: binary
  - width: int32
  - height: int32

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| timeout | int | Maximum time (in seconds) to wait when trying to download an image. | 10 |
| retries | int | Number of times to retry downloading an image if it fails. | 0 |
| n_connections | int | Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput. | 100 |
| image_size | int | Size of the images after resizing. | 256 |
| resize_mode | str | Resize mode to use. One of "no", "keep_ratio", "center_crop", "border". | border |
| resize_only_if_bigger | bool | If True, resize only if image is bigger than image_size. | False |
| min_image_size | int | Minimum size of the images. | / |
| max_aspect_ratio | float | Maximum aspect ratio of the images. | inf |
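
The interplay of `timeout` and `retries` can be sketched as follows. This is not the component's implementation: `download_with_retries` and the injected `fetch` callable are hypothetical, and the example uses fake fetchers instead of real network calls. Returning `None` after the last failed attempt follows the behavior described in the earlier README text.

```python
from typing import Callable, List, Optional


def download_with_retries(
    url: str,
    fetch: Callable[[str, int], bytes],
    timeout: int = 10,
    retries: int = 0,
) -> Optional[bytes]:
    """Try `fetch` once plus `retries` more times; return None if every attempt fails."""
    for _attempt in range(retries + 1):
        try:
            return fetch(url, timeout)
        except Exception:
            # Any failure (timeout, HTTP error, ...) triggers the next attempt.
            continue
    return None


calls: List[str] = []


def flaky_fetch(url: str, timeout: int) -> bytes:
    """Fake fetcher that fails twice before succeeding."""
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return b"image-bytes"


def always_failing_fetch(url: str, timeout: int) -> bytes:
    """Fake fetcher that never succeeds."""
    raise ConnectionError("simulated failure")


result = download_with_retries("https://example.com/a.jpg", flaky_fetch, retries=2)
missing = download_with_retries("https://example.com/b.jpg", always_failing_fetch, retries=1)
print(result, missing)
```

Injecting the fetcher keeps the retry logic testable without network access.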

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import ComponentOp


download_images_op = ComponentOp.from_registry(
    name="download_images",
    arguments={
        # Add arguments
        # "timeout": 10,
        # "retries": 0,
        # "n_connections": 100,
        # "image_size": 256,
        # "resize_mode": "border",
        # "resize_only_if_bigger": "False",
        # "min_image_size": 0,
        # "max_aspect_ratio": "inf",
    }
)
pipeline.add_op(download_images_op, dependencies=[...])  # Add previous component as dependency
```

### Testing

You can run the tests using docker with BuildKit. From this directory, run:
```
docker build . --target test
```
17 changes: 14 additions & 3 deletions components/download_images/fondant_component.yaml
@@ -1,5 +1,13 @@
name: Download images
description: Component that downloads images based on URLs
description: |
  Component that downloads images from a list of URLs.

  This component takes in image URLs as input and downloads the images, along with some metadata
  (like their height and width). The images are stored in a new column as bytes objects. This
  component also resizes the images using the
  [resizer](https://github.com/rom1504/img2dataset/blob/main/img2dataset/resizer.py) function
  from the img2dataset library.

image: ghcr.io/ml6team/download_images:dev

consumes:
@@ -21,15 +29,18 @@ produces:

args:
  timeout:
    description: Maximum time (in seconds) to wait when trying to download an image
    description: Maximum time (in seconds) to wait when trying to download an image.
    type: int
    default: 10
  retries:
    description: Number of times to retry downloading an image if it fails.
    type: int
    default: 0
  n_connections:
    description: Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput.
    description: |
      Number of concurrent connections opened per process. Decrease this number if you are running
      into timeout errors. A lower number of connections can increase the success rate but lower
      the throughput.
    type: int
    default: 100
  image_size:
42 changes: 38 additions & 4 deletions components/embed_images/README.md
@@ -1,9 +1,43 @@
# Embed images

### Description
This component takes images as input and embeds them using a CLIP model from Hugging Face.
The embeddings are stored in a new column as arrays of floats.
Component that generates CLIP embeddings from images

### **Inputs/Outputs**
### Inputs / outputs

**This component consumes:**
- images
  - data: binary

**This component produces:**
- embeddings
  - data: list<item: float>

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| model_id | str | Model id of a CLIP model on the Hugging Face hub | openai/clip-vit-large-patch14 |
| batch_size | int | Batch size to use when embedding | 8 |

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import ComponentOp


embed_images_op = ComponentOp.from_registry(
    name="embed_images",
    arguments={
        # Add arguments
        # "model_id": "openai/clip-vit-large-patch14",
        # "batch_size": 8,
    }
)
pipeline.add_op(embed_images_op, dependencies=[...])  # Add previous component as dependency
```

See [`fondant_component.yaml`](fondant_component.yaml) for a more detailed description on all the input/output parameters.
2 changes: 1 addition & 1 deletion components/embed_images/fondant_component.yaml
@@ -1,5 +1,5 @@
name: Embed images
description: Component that embeds images using CLIP
description: Component that generates CLIP embeddings from images
image: ghcr.io/ml6team/embed_images:dev

consumes:
47 changes: 47 additions & 0 deletions components/embedding_based_laion_retrieval/README.md
@@ -0,0 +1,47 @@
# Embedding based LAION retrieval

### Description
This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be
used to find images similar to the embedded images / captions.


### Inputs / outputs

**This component consumes:**
- embeddings
  - data: list<item: float>

**This component produces:**
- images
  - url: string

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| num_images | int | Number of images to retrieve for each prompt | / |
| aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
| aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |
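
One plausible reading of how `aesthetic_weight` enters the query is a weighted sum of the query embedding and the aesthetic embedding, followed by re-normalization to unit length. The sketch below assumes exactly that; the function `blend_with_aesthetic` and the two-dimensional toy vectors are hypothetical and are not taken from the component's code.

```python
import math
from typing import List, Sequence


def blend_with_aesthetic(
    query: Sequence[float],
    aesthetic: Sequence[float],
    aesthetic_weight: float = 0.5,
) -> List[float]:
    """Add the weighted aesthetic embedding to the query and rescale to unit length."""
    if not 0.0 <= aesthetic_weight <= 1.0:
        raise ValueError("aesthetic_weight must be between 0 and 1")
    if len(query) != len(aesthetic):
        raise ValueError("embeddings must have the same dimension")
    blended = [q + aesthetic_weight * a for q, a in zip(query, aesthetic)]
    norm = math.sqrt(sum(value * value for value in blended))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [value / norm for value in blended]


# Toy embeddings: the query points along the first axis, the aesthetic one along the second.
blended = blend_with_aesthetic([1.0, 0.0], [0.0, 1.0], aesthetic_weight=0.5)
print(blended)
```

With a weight of 0 the query is returned unchanged (up to normalization); higher weights pull the query further toward the aesthetic direction.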

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import ComponentOp


embedding_based_laion_retrieval_op = ComponentOp.from_registry(
    name="embedding_based_laion_retrieval",
    arguments={
        # Add arguments
        # "num_images": 0,
        # "aesthetic_score": 9,
        # "aesthetic_weight": 0.5,
    }
)
pipeline.add_op(embedding_based_laion_retrieval_op, dependencies=[...])  # Add previous component as dependency
```

@@ -1,5 +1,7 @@
name: LAION retrieval
description: A component that retrieves image URLs from LAION-5B based on a set of CLIP embeddings
name: Embedding based LAION retrieval
description: |
  This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be
  used to find images similar to the embedded images / captions.
image: ghcr.io/ml6team/embedding_based_laion_retrieval:dev

consumes:
41 changes: 41 additions & 0 deletions components/filter_comments/README.md
@@ -0,0 +1,41 @@
# Filter comments

### Description
Component that filters code based on the code to comment ratio

### Inputs / outputs

**This component consumes:**
- code
  - content: string

**This component produces no data.**

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| min_comments_ratio | float | The minimum code to comment ratio | 0.1 |
| max_comments_ratio | float | The maximum code to comment ratio | 0.9 |
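
The README does not spell out how the ratio is computed. As a simplified sketch, the code below assumes Python source and defines the ratio as the share of non-empty lines that start with `#`. The names `comment_ratio` and `keep_sample` are hypothetical, and the actual component may measure the ratio differently (for example per character or by including docstrings).

```python
def comment_ratio(source: str) -> float:
    """Share of non-empty lines that are `#` comments (simplifying assumption)."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comment_lines = sum(1 for line in lines if line.startswith("#"))
    return comment_lines / len(lines)


def keep_sample(
    source: str,
    min_comments_ratio: float = 0.1,
    max_comments_ratio: float = 0.9,
) -> bool:
    """Keep the sample only if its ratio lies within the inclusive bounds."""
    return min_comments_ratio <= comment_ratio(source) <= max_comments_ratio


commented = "# add two numbers\ndef add(a, b):\n    return a + b\n"
uncommented = "x = 1\ny = 2\n"

print(keep_sample(commented), keep_sample(uncommented))
```

Samples with no comments at all, or consisting almost entirely of comments, fall outside the default bounds and would be dropped.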

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import ComponentOp


filter_comments_op = ComponentOp.from_registry(
    name="filter_comments",
    arguments={
        # Add arguments
        # "min_comments_ratio": 0.1,
        # "max_comments_ratio": 0.9,
    }
)
pipeline.add_op(filter_comments_op, dependencies=[...])  # Add previous component as dependency
```
