ml6team · RobbeSneyders · Oct 5, 2023 · Oct 4, 2023 · Oct 4, 2023 · Oct 4, 2023
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -18,7 +18,6 @@ repos:
           "--exit-non-zero-on-fix",
         ]
 
-
   - repo: https://github.com/PyCQA/bandit
     rev: 1.7.4
     hooks:
@@ -55,4 +54,13 @@ repos:
           - types-jsonschema
           - types-PyYAML
           - types-requests
-        pass_filenames: false
+        pass_filenames: false
+
+  - repo: local
+    hooks:
+      - id: generate_component_readmes
+        name: Generate component READMEs
+        language: python
+        entry: python scripts/component_readme/generate_readme.py
+        files: ^components/.*/fondant_component\.yaml$
+        additional_dependencies: ["fondant"]
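The hook above regenerates a component's README whenever its `fondant_component.yaml` changes. The generator script itself is not part of this excerpt; as a rough sketch only, the arguments table could be rendered along the lines below. The function name `render_arguments_table` and the dict layout are assumptions made for illustration, not the actual contents of `generate_readme.py`.

```python
from typing import Any, Dict


def render_arguments_table(args: Dict[str, Dict[str, Any]]) -> str:
    """Render the `args` section of a component spec as a markdown table.

    Hypothetical helper: `args` maps each argument name to a dict with
    `type`, `description` and an optional `default`.
    """
    lines = [
        "| argument | type | description | default |",
        "| -------- | ---- | ----------- | ------- |",
    ]
    for name, spec in args.items():
        default = spec.get("default")
        # Compare against None explicitly so falsy defaults (0, False) are still shown.
        shown = "/" if default is None else str(default)
        # Collapse multi-line YAML descriptions into a single table cell.
        description = " ".join(str(spec.get("description", "")).split())
        lines.append(f"| {name} | {spec['type']} | {description} | {shown} |")
    return "\n".join(lines)
```

Comparing against `None` rather than relying on truthiness keeps a default of `0` or `False` visible in the table; only arguments without any default fall back to `/`.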
diff --git a/components/caption_images/README.md b/components/caption_images/README.md
@@ -1,9 +1,51 @@
-# caption_images
+# Caption images
 
 ### Description
-This component captions inputted images using [BLIP](https://huggingface.co/docs/transformers/model_doc/blip).
+This component captions images using a BLIP model from the Hugging Face hub
 
-### **Inputs/Outputs**
+### Inputs / outputs
 
-See [`fondant_component.yaml`](fondant_component.yaml) for a more detailed description on all the input/output parameters. 
+**This component consumes:**
+- images
+  - data: binary
 
+**This component produces:**
+- captions
+  - text: string
+
+### Arguments
+
+The component takes the following arguments to alter its behavior:
+
+| argument | type | description | default |
+| -------- | ---- | ----------- | ------- |
+| model_id | str | Id of the BLIP model on the Hugging Face hub | Salesforce/blip-image-captioning-base |
+| batch_size | int | Batch size to use for inference | 8 |
+| max_new_tokens | int | Maximum token length of each caption | 50 |
+
+### Usage
+
+You can add this component to your pipeline using the following code:
+
+```python
+from fondant.pipeline import ComponentOp
+
+
+caption_images_op = ComponentOp.from_registry(
+    name="caption_images",
+    arguments={
+        # Add arguments
+        # "model_id": "Salesforce/blip-image-captioning-base",
+        # "batch_size": 8,
+        # "max_new_tokens": 50,
+    }
+)
+pipeline.add_op(caption_images_op, dependencies=[...])  # Add previous component as dependency
+```
+
+### Testing
+
+You can run the tests using docker with BuildKit. From this directory, run:
+```
+docker build . --target test
+```
diff --git a/components/caption_images/fondant_component.yaml b/components/caption_images/fondant_component.yaml
@@ -1,5 +1,5 @@
 name: Caption images
-description: Component that captions images using a model from the Hugging Face hub
+description: This component captions images using a BLIP model from the Hugging Face hub
 image: ghcr.io/ml6team/caption_images:dev
 
 consumes:
@@ -16,14 +16,14 @@ produces:
 
 args:
   model_id:
-    description: id of the model on the Hugging Face hub
+    description: Id of the BLIP model on the Hugging Face hub
     type: str
     default: "Salesforce/blip-image-captioning-base"
   batch_size:
-    description: batch size to use
+    description: Batch size to use for inference
     type: int
     default: 8
   max_new_tokens:
-    description: maximum token length of each caption
+    description: Maximum token length of each caption
     type: int
     default: 50
diff --git a/components/download_images/README.md b/components/download_images/README.md
@@ -1,19 +1,70 @@
-# download_images
+# Download images
 
 ### Description
-This component takes in image URLs as input and downloads the images, along with some metadata (like their height and width).
-The images are stored in a new colum as bytes objects. This component also resizes the images using the [resizer](https://github.com/rom1504/img2dataset/blob/main/img2dataset/resizer.py) function from the img2dataset library.
+Component that downloads images from a list of URLs.
 
-If the component is unable to retrieve the image at a URL (for any reason), it will return `None` for that particular URL.
+This component takes in image URLs as input and downloads the images, along with some metadata 
+(like their height and width). The images are stored in a new column as bytes objects. This 
+component also resizes the images using the 
+[resizer](https://github.com/rom1504/img2dataset/blob/main/img2dataset/resizer.py) function 
+from the img2dataset library.
 
-### **Inputs/Outputs**
 
-See [`fondant_component.yaml`](fondant_component.yaml) for a more detailed description on all the input/output parameters. 
+### Inputs / outputs
 
+**This component consumes:**
+- images
+  - url: string
+
+**This component produces:**
+- images
+  - data: binary
+  - width: int32
+  - height: int32
+
+### Arguments
+
+The component takes the following arguments to alter its behavior:
+
+| argument | type | description | default |
+| -------- | ---- | ----------- | ------- |
+| timeout | int | Maximum time (in seconds) to wait when trying to download an image. | 10 |
+| retries | int | Number of times to retry downloading an image if it fails. | 0 |
+| n_connections | int | Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput. | 100 |
+| image_size | int | Size of the images after resizing. | 256 |
+| resize_mode | str | Resize mode to use. One of "no", "keep_ratio", "center_crop", "border". | border |
+| resize_only_if_bigger | bool | If True, resize only if image is bigger than image_size. | False |
+| min_image_size | int | Minimum size of the images. | / |
+| max_aspect_ratio | float | Maximum aspect ratio of the images. | inf |
+
+### Usage
+
+You can add this component to your pipeline using the following code:
+
+```python
+from fondant.pipeline import ComponentOp
+
+
+download_images_op = ComponentOp.from_registry(
+    name="download_images",
+    arguments={
+        # Add arguments
+        # "timeout": 10,
+        # "retries": 0,
+        # "n_connections": 100,
+        # "image_size": 256,
+        # "resize_mode": "border",
+        # "resize_only_if_bigger": False,
+        # "min_image_size": 0,
+        # "max_aspect_ratio": float("inf"),
+    }
+)
+pipeline.add_op(download_images_op, dependencies=[...])  # Add previous component as dependency
+```
 
 ### Testing
 
 You can run the tests using docker with BuildKit. From this directory, run:
 ```
 docker build . --target test
-```
+```
diff --git a/components/download_images/fondant_component.yaml b/components/download_images/fondant_component.yaml
@@ -1,5 +1,13 @@
 name: Download images
-description: Component that downloads images based on URLs
+description: |
+  Component that downloads images from a list of URLs.
+
+  This component takes in image URLs as input and downloads the images, along with some metadata 
+  (like their height and width). The images are stored in a new column as bytes objects. This 
+  component also resizes the images using the 
+  [resizer](https://github.com/rom1504/img2dataset/blob/main/img2dataset/resizer.py) function 
+  from the img2dataset library.
+
 image: ghcr.io/ml6team/download_images:dev
 
 consumes:
@@ -21,15 +29,18 @@ produces:
 
 args:
   timeout:
-    description: Maximum time (in seconds) to wait when trying to download an image
+    description: Maximum time (in seconds) to wait when trying to download an image.
     type: int
     default: 10
   retries:
     description: Number of times to retry downloading an image if it fails.
     type: int
     default: 0
   n_connections:
-    description: Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput.
+    description: |
+      Number of concurrent connections opened per process. Decrease this number if you are running 
+      into timeout errors. A lower number of connections can increase the success rate but lower 
+      the throughput.
     type: int
     default: 100
   image_size:
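To make the interplay of `timeout`, `retries` and `n_connections` concrete, here is a minimal sketch using only the standard library. It is not the component's implementation: `fetch` is a hypothetical coroutine standing in for the real HTTP request, and `download_with_retries` and `download_all` are names invented for this example. The sketch returns `None` for a URL whose attempts all fail.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Optional

# Hypothetical download coroutine: takes a URL, returns the image bytes or raises.
Fetch = Callable[[str], Awaitable[bytes]]


async def download_with_retries(
    fetch: Fetch, url: str, *, timeout: float = 10, retries: int = 0
) -> Optional[bytes]:
    """Call `fetch(url)` up to `retries + 1` times; return None if every attempt fails."""
    for _ in range(retries + 1):
        try:
            # Abort a single attempt once it exceeds `timeout` seconds.
            return await asyncio.wait_for(fetch(url), timeout=timeout)
        except Exception:  # timeouts and download errors alike trigger a retry
            continue
    return None


async def download_all(
    fetch: Fetch,
    urls: Iterable[str],
    *,
    n_connections: int = 100,
    timeout: float = 10,
    retries: int = 0,
) -> List[Optional[bytes]]:
    """Download all URLs with at most `n_connections` requests in flight at once."""
    semaphore = asyncio.Semaphore(n_connections)

    async def bounded(url: str) -> Optional[bytes]:
        async with semaphore:
            return await download_with_retries(
                fetch, url, timeout=timeout, retries=retries
            )

    # gather preserves the input order of the URLs in its result list.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

Lowering `n_connections` shrinks the semaphore, so fewer requests compete for bandwidth and fewer hit the timeout, at the cost of throughput.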

diff --git a/components/embed_images/README.md b/components/embed_images/README.md
@@ -1,9 +1,43 @@
 # Embed images
 
 ### Description
-This component takes images as input and embeds them using a CLIP model from Hugging Face.
-The embeddings are stored in a new colum as arrays of floats.
+Component that generates CLIP embeddings from images
 
-### **Inputs/Outputs**
+### Inputs / outputs
+
+**This component consumes:**
+- images
+  - data: binary
+
+**This component produces:**
+- embeddings
+  - data: list<item: float>
+
+### Arguments
+
+The component takes the following arguments to alter its behavior:
+
+| argument | type | description | default |
+| -------- | ---- | ----------- | ------- |
+| model_id | str | Model id of a CLIP model on the Hugging Face hub | openai/clip-vit-large-patch14 |
+| batch_size | int | Batch size to use when embedding | 8 |
+
+### Usage
+
+You can add this component to your pipeline using the following code:
+
+```python
+from fondant.pipeline import ComponentOp
+
+
+embed_images_op = ComponentOp.from_registry(
+    name="embed_images",
+    arguments={
+        # Add arguments
+        # "model_id": "openai/clip-vit-large-patch14",
+        # "batch_size": 8,
+    }
+)
+pipeline.add_op(embed_images_op, dependencies=[...])  # Add previous component as dependency
+```
 
-See [`fondant_component.yaml`](fondant_component.yaml) for a more detailed description on all the input/output parameters. 
diff --git a/components/embed_images/fondant_component.yaml b/components/embed_images/fondant_component.yaml
@@ -1,5 +1,5 @@
 name: Embed images
-description: Component that embeds images using CLIP
+description: Component that generates CLIP embeddings from images
 image: ghcr.io/ml6team/embed_images:dev
 
 consumes:

diff --git a/components/embedding_based_laion_retrieval/README.md b/components/embedding_based_laion_retrieval/README.md
@@ -0,0 +1,47 @@
+# Embedding based LAION retrieval
+
+### Description
+This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be 
+used to find images similar to the embedded images / captions.
+
+
+### Inputs / outputs
+
+**This component consumes:**
+- embeddings
+  - data: list<item: float>
+
+**This component produces:**
+- images
+  - url: string
+
+### Arguments
+
+The component takes the following arguments to alter its behavior:
+
+| argument | type | description | default |
+| -------- | ---- | ----------- | ------- |
+| num_images | int | Number of images to retrieve for each prompt | / |
+| aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
+| aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |
+
+### Usage
+
+You can add this component to your pipeline using the following code:
+
+```python
+from fondant.pipeline import ComponentOp
+
+
+embedding_based_laion_retrieval_op = ComponentOp.from_registry(
+    name="embedding_based_laion_retrieval",
+    arguments={
+        # Add arguments
+        # "num_images": 0,
+        # "aesthetic_score": 9,
+        # "aesthetic_weight": 0.5,
+    }
+)
+pipeline.add_op(embedding_based_laion_retrieval_op, dependencies=[...])  # Add previous component as dependency
+```
+
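The `aesthetic_score` and `aesthetic_weight` arguments describe an aesthetic embedding that is added to the query embedding with a given weight. How the component or the LAION backend does this is not shown in this diff; the sketch below is one plausible reading, with the re-normalization step being an assumption. `l2_normalize` and `add_aesthetic` are illustrative names only.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def add_aesthetic(
    query: Sequence[float],
    aesthetic: Sequence[float],
    aesthetic_weight: float = 0.5,
) -> List[float]:
    """Shift a query embedding towards an aesthetic embedding.

    Assumption: the weighted sum is re-normalized so it can still be used
    for cosine-similarity search.
    """
    if len(query) != len(aesthetic):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= aesthetic_weight <= 1.0:
        raise ValueError("aesthetic_weight must be between 0 and 1")
    combined = [q + aesthetic_weight * a for q, a in zip(query, aesthetic)]
    return l2_normalize(combined)
```

With a weight of 0 the query is returned unchanged (up to normalization); larger weights pull the results towards images matching the chosen aesthetic score.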
diff --git a/components/embedding_based_laion_retrieval/fondant_component.yaml b/components/embedding_based_laion_retrieval/fondant_component.yaml
@@ -1,5 +1,7 @@
-name: LAION retrieval
-description: A component that retrieves image URLs from LAION-5B based on a set of CLIP embeddings
+name: Embedding based LAION retrieval
+description: |
+  This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be 
+  used to find images similar to the embedded images / captions.
 image: ghcr.io/ml6team/embedding_based_laion_retrieval:dev
 
 consumes:

diff --git a/components/filter_comments/README.md b/components/filter_comments/README.md
@@ -0,0 +1,41 @@
+# Filter comments
+
+### Description
+Component that filters code based on the code to comment ratio
+
+### Inputs / outputs
+
+**This component consumes:**
+- code
+  - content: string
+
+**This component produces no data.**
+
+### Arguments
+
+The component takes the following arguments to alter its behavior:
+
+| argument | type | description | default |
+| -------- | ---- | ----------- | ------- |
+| min_comments_ratio | float | The minimum code to comment ratio | 0.1 |
+| max_comments_ratio | float | The maximum code to comment ratio | 0.9 |
+
+### Usage
+
+You can add this component to your pipeline using the following code:
+
+```python
+from fondant.pipeline import ComponentOp
+
+
+filter_comments_op = ComponentOp.from_registry(
+    name="filter_comments",
+    arguments={
+        # Add arguments
+        # "min_comments_ratio": 0.1,
+        # "max_comments_ratio": 0.9,
+    }
+)
+pipeline.add_op(filter_comments_op, dependencies=[...])  # Add previous component as dependency
+```
+
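The README does not spell out how the ratio is computed. As an illustrative sketch, assume the ratio is the fraction of non-blank lines that are full-line comments starting with `#`; the component's real definition may differ. `comment_ratio` and `filter_by_comment_ratio` are hypothetical names.

```python
from typing import Iterable, List


def comment_ratio(code: str, comment_prefix: str = "#") -> float:
    """Fraction of non-blank lines that are full-line comments (assumed definition)."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comments = sum(line.startswith(comment_prefix) for line in lines)
    return comments / len(lines)


def filter_by_comment_ratio(
    snippets: Iterable[str],
    min_comments_ratio: float = 0.1,
    max_comments_ratio: float = 0.9,
) -> List[str]:
    """Keep only the snippets whose comment ratio lies within the given bounds."""
    return [
        snippet
        for snippet in snippets
        if min_comments_ratio <= comment_ratio(snippet) <= max_comments_ratio
    ]
```

Under this definition the default bounds drop both uncommented code and files that consist almost entirely of comments.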