Add component hub doc page (#487)

This PR adds a script and a template that automatically generate a component hub page for our docs.
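The generation script itself is not part of the hunks shown here, so the following is only a minimal sketch of how such a hub page could be assembled, not the PR's actual implementation. It assumes one `README.md` per component directory. The function name `build_hub_page` and the `HUB_TEMPLATE` text are hypothetical, and `string.Template` stands in for whatever templating the real script uses.

```python
from pathlib import Path
from string import Template

# Hypothetical page template; the real template added by the PR is not shown here.
HUB_TEMPLATE = Template("# Component Hub\n\n$entries\n")


def build_hub_page(components_dir: Path) -> str:
    """Render a hub page that links to every component README under components_dir."""
    entries = []
    # Sort so the generated page is stable across runs.
    for readme in sorted(components_dir.glob("*/README.md")):
        name = readme.parent.name
        # Link relative to the components directory, e.g. caption_images/README.md.
        link = readme.relative_to(components_dir).as_posix()
        entries.append(f"- [{name}]({link})")
    return HUB_TEMPLATE.substitute(entries="\n".join(entries))
```

Calling `build_hub_page(Path("components"))` from the repository root would return the page as a string. Writing it to a docs file is left to the caller, since the output path used by the real script is not visible in this diff.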
ml6team · Oct 5, 2023 · commit 52cdbb2
1 parent aa41774

Showing 28 changed files with 274 additions and 44 deletions.
diff --git a/components/caption_images/README.md b/components/caption_images/README.md
@@ -6,12 +6,14 @@ This component captions images using a BLIP model from the Hugging Face hub
 ### Inputs / outputs
 
 **This component consumes:**
+
 - images
-  - data: binary
+    - data: binary
 
 **This component produces:**
+
 - captions
-  - text: string
+    - text: string
 
 ### Arguments
 

diff --git a/components/download_images/README.md b/components/download_images/README.md
@@ -13,14 +13,16 @@ from the img2dataset library.
 ### Inputs / outputs
 
 **This component consumes:**
+
 - images
-  - url: string
+    - url: string
 
 **This component produces:**
+
 - images
-  - data: binary
-  - width: int32
-  - height: int32
+    - data: binary
+    - width: int32
+    - height: int32
 
 ### Arguments
 

diff --git a/components/embed_images/README.md b/components/embed_images/README.md
@@ -6,12 +6,14 @@ Component that generates CLIP embeddings from images
 ### Inputs / outputs
 
 **This component consumes:**
+
 - images
-  - data: binary
+    - data: binary
 
 **This component produces:**
+
 - embeddings
-  - data: list<item: float>
+    - data: list<item: float>
 
 ### Arguments
 

diff --git a/components/embedding_based_laion_retrieval/README.md b/components/embedding_based_laion_retrieval/README.md
@@ -8,12 +8,14 @@ used to find images similar to the embedded images / captions.
 ### Inputs / outputs
 
 **This component consumes:**
+
 - embeddings
-  - data: list<item: float>
+    - data: list<item: float>
 
 **This component produces:**
+
 - images
-  - url: string
+    - url: string
 
 ### Arguments
 

diff --git a/components/filter_comments/README.md b/components/filter_comments/README.md
@@ -6,8 +6,9 @@ Component that filters code based on the code to comment ratio
 ### Inputs / outputs
 
 **This component consumes:**
+
 - code
-  - content: string
+    - content: string
 
 **This component produces no data.**
 

diff --git a/components/filter_image_resolution/README.md b/components/filter_image_resolution/README.md
@@ -6,9 +6,10 @@ Component that filters images based on minimum size and max aspect ratio
 ### Inputs / outputs
 
 **This component consumes:**
+
 - images
-  - width: int32
-  - height: int32
+    - width: int32
+    - height: int32
 
 **This component produces no data.**
 

diff --git a/components/filter_line_length/README.md b/components/filter_line_length/README.md
@@ -6,10 +6,11 @@ Component that filters code based on line length
 ### Inputs / outputs
 
 **This component consumes:**
+
 - code
-  - avg_line_length: double
-  - max_line_length: int32
-  - alphanum_fraction: double
+    - avg_line_length: double
+    - max_line_length: int32
+    - alphanum_fraction: double
 
 **This component produces no data.**
 

diff --git a/components/image_cropping/README.md b/components/image_cropping/README.md
@@ -21,14 +21,16 @@ right side is border-cropped image.
 ### Inputs / outputs
 
 **This component consumes:**
+
 - images
-  - data: binary
+    - data: binary
 
 **This component produces:**
+
 - images
-  - data: binary
-  - width: int32
-  - height: int32
+    - data: binary
+    - width: int32
+    - height: int32
 
 ### Arguments
 

diff --git a/components/image_resolution_extraction/README.md b/components/image_resolution_extraction/README.md
@@ -6,14 +6,16 @@ Component that extracts image resolution data from the images
 ### Inputs / outputs
 
 **This component consumes:**
+
 - images
-  - data: binary
+    - data: binary
 
 **This component produces:**
+
 - images
-  - data: binary
-  - width: int32
-  - height: int32
+    - data: binary
+    - width: int32
+    - height: int32
 
 ### Arguments
 

diff --git a/components/language_filter/README.md b/components/language_filter/README.md
@@ -6,8 +6,9 @@ A component that filters text based on the provided language.
 ### Inputs / outputs
 
 **This component consumes:**
+
 - text
-  - data: string
+    - data: string
 
 **This component produces no data.**
 

diff --git a/components/load_from_files/README.md b/components/load_from_files/README.md
@@ -10,9 +10,10 @@ location. It supports the following formats: .zip, gzip, tar and tar.gz.
 **This component consumes no data.**
 
 **This component produces:**
+
 - file
-  - filename: string
-  - content: binary
+    - filename: string
+    - content: binary
 
 ### Arguments
 

diff --git a/components/load_from_hf_hub/README.md b/components/load_from_hf_hub/README.md
@@ -8,8 +8,9 @@ Component that loads a dataset from the hub
 **This component consumes no data.**
 
 **This component produces:**
+
 - dummy_variable
-  - data: binary
+    - data: binary
 
 ### Arguments
 

diff --git a/components/load_from_parquet/README.md b/components/load_from_parquet/README.md
@@ -8,8 +8,9 @@ Component that loads a dataset from a parquet uri
 **This component consumes no data.**
 
 **This component produces:**
+
 - dummy_variable
-  - data: binary
+    - data: binary
 
 ### Arguments
 

diff --git a/components/minhash_generator/README.md b/components/minhash_generator/README.md
@@ -6,12 +6,14 @@ A component that generates minhashes of text.
 ### Inputs / outputs
 
 **This component consumes:**
+
 - text
-  - data: string
+    - data: string
 
 **This component produces:**
+
 - text
-  - minhash: list<item: uint64>
+    - minhash: list<item: uint64>
 
 ### Arguments
 

diff --git a/components/pii_redaction/README.md b/components/pii_redaction/README.md
@@ -26,12 +26,14 @@ code.
 ### Inputs / outputs
 
 **This component consumes:**
+
 - code
-  - content: string
+    - content: string
 
 **This component produces:**
+
 - code
-  - content: string
+    - content: string
 
 ### Arguments
 

diff --git a/components/prompt_based_laion_retrieval/README.md b/components/prompt_based_laion_retrieval/README.md
@@ -11,12 +11,14 @@ This component doesn’t return the actual images, only URLs.
 ### Inputs / outputs
 
 **This component consumes:**
+
 - prompts
-  - text: string
+    - text: string
 
 **This component produces:**
+
 - images
-  - url: string
+    - url: string
 
 ### Arguments
 

diff --git a/components/segment_images/README.md b/components/segment_images/README.md
@@ -6,12 +6,14 @@ Component that creates segmentation masks for images using a model from the Hugg
 ### Inputs / outputs
 
 **This component consumes:**
+
 - images
-  - data: binary
+    - data: binary
 
 **This component produces:**
+
 - segmentations
-  - data: binary
+    - data: binary
 
 ### Arguments
 

diff --git a/components/text_length_filter/README.md b/components/text_length_filter/README.md
@@ -6,8 +6,9 @@ A component that filters out text based on their length
 ### Inputs / outputs
 
 **This component consumes:**
+
 - text
-  - data: string
+    - data: string
 
 **This component produces no data.**
 

diff --git a/components/text_normalization/README.md b/components/text_normalization/README.md
@@ -18,8 +18,9 @@ the training of large language models.
 ### Inputs / outputs
 
 **This component consumes:**
+
 - text
-  - data: string
+    - data: string
 
 **This component produces no data.**
 

diff --git a/components/write_to_hf_hub/README.md b/components/write_to_hf_hub/README.md
@@ -6,8 +6,9 @@ Component that writes a dataset to the hub
 ### Inputs / outputs
 
 **This component consumes:**
+
 - dummy_variable
-  - data: binary
+    - data: binary
 
 **This component produces no data.**
 

diff --git a/docs/.readthedocs.yaml b/docs/.readthedocs.yaml
@@ -21,4 +21,7 @@ build:
       - poetry config virtualenvs.create false
     post_install:
       # Install dependencies with 'docs' dependency group
-      - poetry install --with docs
+      - poetry install --with docs
+    pre_build:
+      # Generate hub documentation
+      - python scripts/component_readme/generate_hub.py
diff --git a/docs/components/components.md b/docs/components/components.md
@@ -2,7 +2,7 @@
 
 Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
 provides a lot of components out of the box
-([overview](https://github.com/ml6team/fondant/tree/main/components)), but you can also define your
+([overview](hub.md)), but you can also define your
 own custom components.
 
 ## The anatomy of a component