Update component readmes #538

Merged: 2 commits, Oct 19, 2023
Changes from 1 commit
16 changes: 16 additions & 0 deletions components/caption_images/README.md
@@ -21,6 +21,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| model_id | str | Id of the BLIP model on the Hugging Face hub | Salesforce/blip-image-captioning-base |
| batch_size | int | Batch size to use for inference | 8 |
| max_new_tokens | int | Maximum token length of each caption | 50 |
@@ -37,6 +45,14 @@ caption_images_op = ComponentOp.from_registry(
name="caption_images",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "model_id": "Salesforce/blip-image-captioning-base",
# "batch_size": 8,
# "max_new_tokens": 50,
16 changes: 16 additions & 0 deletions components/chunk_text/README.md
@@ -26,6 +26,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| chunk_size | int | Maximum size of chunks to return | / |
| chunk_overlap | int | Overlap in characters between chunks | / |
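
As a rough illustration of how `chunk_size` and `chunk_overlap` interact, here is a minimal character-based sketch. It is not the component's implementation (which this diff does not show); the function name `chunk_text` and the stepping logic are assumptions made for the example only.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters. Hypothetical helper,
    not part of the component.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With a 10-character input, a chunk size of 4 and an overlap of 1, each chunk starts 3 characters after the previous one, so the last character of one chunk is the first of the next.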

@@ -41,6 +49,14 @@ chunk_text_op = ComponentOp.from_registry(
name="chunk_text",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "chunk_size": 0,
# "chunk_overlap": 0,
}
24 changes: 20 additions & 4 deletions components/download_images/README.md
@@ -30,14 +30,22 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| timeout | int | Maximum time (in seconds) to wait when trying to download an image, | 10 |
| retries | int | Number of times to retry downloading an image if it fails. | / |
| n_connections | int | Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput. | 100 |
| image_size | int | Size of the images after resizing. | 256 |
| resize_mode | str | Resize mode to use. One of "no", "keep_ratio", "center_crop", "border". | border |
- | resize_only_if_bigger | bool | If True, resize only if image is bigger than image_size. | False |
+ | resize_only_if_bigger | bool | If True, resize only if image is bigger than image_size. | / |
| min_image_size | int | Minimum size of the images. | / |
- | max_aspect_ratio | float | Maximum aspect ratio of the images. | inf |
+ | max_aspect_ratio | float | Maximum aspect ratio of the images. | 99.9 |

### Usage

@@ -51,14 +59,22 @@ download_images_op = ComponentOp.from_registry(
name="download_images",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "timeout": 10,
# "retries": 0,
# "n_connections": 100,
# "image_size": 256,
# "resize_mode": "border",
- # "resize_only_if_bigger": "False",
+ # "resize_only_if_bigger": False,
# "min_image_size": 0,
- # "max_aspect_ratio": "inf",
+ # "max_aspect_ratio": 99.9,
}
)
pipeline.add_op(download_images_op, dependencies=[...]) #Add previous component as dependency
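
The change from `"False"` to `False` and from `"inf"` to a float in this snippet matters because quoted values are strings, and a non-empty string such as `"False"` is truthy in Python. The sketch below shows one way to catch such mistakes before passing `arguments` to the component. The helper `check_download_images_args` is hypothetical and not part of the library; only the argument names, types and allowed `resize_mode` values come from the table above.

```python
# Allowed values taken from the resize_mode row of the arguments table.
RESIZE_MODES = {"no", "keep_ratio", "center_crop", "border"}


def check_download_images_args(arguments: dict) -> list[str]:
    """Return a list of problems found in an arguments dict (hypothetical helper)."""
    problems = []

    if "resize_mode" in arguments and arguments["resize_mode"] not in RESIZE_MODES:
        problems.append(
            f"resize_mode must be one of {sorted(RESIZE_MODES)}, "
            f"got {arguments['resize_mode']!r}"
        )

    if "resize_only_if_bigger" in arguments:
        value = arguments["resize_only_if_bigger"]
        if not isinstance(value, bool):
            problems.append(f"resize_only_if_bigger must be a bool, got {value!r}")

    if "max_aspect_ratio" in arguments:
        value = arguments["max_aspect_ratio"]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"max_aspect_ratio must be a number, got {value!r}")

    return problems


# The quoted values are both flagged; note that bool("False") is True.
print(check_download_images_args(
    {"resize_only_if_bigger": "False", "max_aspect_ratio": "inf"}
))
```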
16 changes: 16 additions & 0 deletions components/embed_images/README.md
@@ -21,6 +21,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| model_id | str | Model id of a CLIP model on the Hugging Face hub | openai/clip-vit-large-patch14 |
| batch_size | int | Batch size to use when embedding | 8 |

@@ -36,6 +44,14 @@ embed_images_op = ComponentOp.from_registry(
name="embed_images",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "model_id": "openai/clip-vit-large-patch14",
# "batch_size": 8,
}
16 changes: 16 additions & 0 deletions components/embedding_based_laion_retrieval/README.md
@@ -23,6 +23,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| num_images | int | Number of images to retrieve for each prompt | / |
| aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
| aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |
@@ -39,6 +47,14 @@ embedding_based_laion_retrieval_op = ComponentOp.from_registry(
name="embedding_based_laion_retrieval",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "num_images": 0,
# "aesthetic_score": 9,
# "aesthetic_weight": 0.5,
16 changes: 16 additions & 0 deletions components/filter_comments/README.md
@@ -18,6 +18,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| min_comments_ratio | float | The minimum code to comment ratio | 0.1 |
| max_comments_ratio | float | The maximum code to comment ratio | 0.9 |
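
To make the two ratio bounds concrete, the sketch below keeps a source file only when its share of comment lines falls between `min_comments_ratio` and `max_comments_ratio`. How the component actually measures the ratio is not shown in this diff; counting `#`-prefixed lines over non-empty lines, and the function names, are assumptions made for illustration.

```python
def comment_ratio(source: str) -> float:
    """Fraction of non-empty lines that are '#' comments (assumed definition)."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comment_lines = sum(1 for line in lines if line.startswith("#"))
    return comment_lines / len(lines)


def keep_file(source: str,
              min_comments_ratio: float = 0.1,
              max_comments_ratio: float = 0.9) -> bool:
    """Keep the file only if its comment ratio lies within both bounds."""
    return min_comments_ratio <= comment_ratio(source) <= max_comments_ratio


print(keep_file("# add one\nx = x + 1\n"))  # half of the lines are comments
print(keep_file("x = 1\ny = 2\n"))          # no comments at all
```

Under these assumptions, the defaults of 0.1 and 0.9 drop files that are almost entirely uncommented as well as files that are almost entirely comments.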

@@ -33,6 +41,14 @@ filter_comments_op = ComponentOp.from_registry(
name="filter_comments",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "min_comments_ratio": 0.1,
# "max_comments_ratio": 0.9,
}
16 changes: 16 additions & 0 deletions components/filter_image_resolution/README.md
@@ -19,6 +19,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| min_image_dim | int | Minimum image dimension | / |
| max_aspect_ratio | float | Maximum aspect ratio | / |
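
One plausible reading of these two arguments is sketched below: an image passes when its shorter side is at least `min_image_dim` and the ratio of its longer side to its shorter side is at most `max_aspect_ratio`. The function name and this exact definition of aspect ratio are assumptions; the diff does not show the component's code.

```python
def passes_resolution_filter(width: int, height: int,
                             min_image_dim: int,
                             max_aspect_ratio: float) -> bool:
    """Check an image's size against both thresholds (assumed semantics)."""
    if width <= 0 or height <= 0:
        return False
    shorter, longer = min(width, height), max(width, height)
    aspect_ratio = longer / shorter  # always >= 1 under this definition
    return shorter >= min_image_dim and aspect_ratio <= max_aspect_ratio


print(passes_resolution_filter(600, 300, min_image_dim=256, max_aspect_ratio=2.0))
print(passes_resolution_filter(1024, 128, min_image_dim=256, max_aspect_ratio=2.0))
```

The first image is accepted (shorter side 300, ratio 2.0); the second is rejected because its shorter side is below the minimum and its ratio is 8.0.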

@@ -34,6 +42,14 @@ filter_image_resolution_op = ComponentOp.from_registry(
name="filter_image_resolution",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "min_image_dim": 0,
# "max_aspect_ratio": 0.0,
}
16 changes: 16 additions & 0 deletions components/filter_line_length/README.md
@@ -20,6 +20,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| avg_line_length_threshold | int | Threshold for average line length to filter on | / |
| max_line_length_threshold | int | Threshold for maximum line length to filter on | / |
| alphanum_fraction_threshold | float | Alphanum fraction to filter on | / |
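
The three thresholds can be pictured with the sketch below, which computes the average line length, the maximum line length and the alphanumeric fraction of a document. The direction of each comparison (upper bounds for the two lengths, lower bound for the alphanumeric fraction) is an assumption, as are the function names; the table only states that these are thresholds to filter on.

```python
def line_stats(text: str) -> tuple[float, int, float]:
    """Return (average line length, maximum line length, alphanumeric fraction)."""
    lines = text.splitlines()
    if not lines:
        return 0.0, 0, 0.0
    lengths = [len(line) for line in lines]
    avg_len = sum(lengths) / len(lengths)
    # Fraction of all characters (including newlines) that are alphanumeric.
    alnum_fraction = sum(ch.isalnum() for ch in text) / len(text)
    return avg_len, max(lengths), alnum_fraction


def keep_document(text: str,
                  avg_line_length_threshold: int,
                  max_line_length_threshold: int,
                  alphanum_fraction_threshold: float) -> bool:
    """Keep a document only if it satisfies all three (assumed) bounds."""
    avg_len, max_len, alnum_fraction = line_stats(text)
    return (
        avg_len <= avg_line_length_threshold
        and max_len <= max_line_length_threshold
        and alnum_fraction >= alphanum_fraction_threshold
    )


print(line_stats("abc\nde"))
print(keep_document("-----\n=====", 10, 100, 0.25))
```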
@@ -36,6 +44,14 @@ filter_line_length_op = ComponentOp.from_registry(
name="filter_line_length",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "avg_line_length_threshold": 0,
# "max_line_length_threshold": 0,
# "alphanum_fraction_threshold": 0.0,
16 changes: 16 additions & 0 deletions components/image_cropping/README.md
@@ -38,6 +38,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| cropping_threshold | int | Threshold parameter used for detecting borders. A lower (negative) parameter results in a more performant border detection, but can cause overcropping. Default is -30 | -30 |
| padding | int | Padding for the image cropping. The padding is added to all borders of the image. | 10 |

@@ -53,6 +61,14 @@ image_cropping_op = ComponentOp.from_registry(
name="image_cropping",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "cropping_threshold": -30,
# "padding": 10,
}
21 changes: 20 additions & 1 deletion components/image_resolution_extraction/README.md
@@ -19,7 +19,18 @@ Component that extracts image resolution data from the images

### Arguments

- This component takes no arguments.
+ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |

### Usage

@@ -33,6 +44,14 @@ image_resolution_extraction_op = ComponentOp.from_registry(
name="image_resolution_extraction",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
}
)
pipeline.add_op(image_resolution_extraction_op, dependencies=[...]) #Add previous component as dependency
16 changes: 16 additions & 0 deletions components/language_filter/README.md
@@ -18,6 +18,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| language | str | A valid language code or identifier (e.g., "en", "fr", "de"). | en |

### Usage
@@ -32,6 +40,14 @@ language_filter_op = ComponentOp.from_registry(
name="language_filter",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "language": "en",
}
)
16 changes: 16 additions & 0 deletions components/load_from_files/README.md
@@ -21,6 +21,14 @@ The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| input_manifest_path | str | Path to the input manifest | / |
| component_spec | dict | The component specification as a dictionary | / |
| input_partition_rows | int | The number of rows to load per partition. Set to override the automatic partitioning | / |
| cache | bool | Set to False to disable caching, True by default. | True |
| cluster_type | str | The cluster type to use for the execution | default |
| client_kwargs | dict | Keyword arguments to pass to the Dask client | / |
| metadata | str | Metadata arguments containing the run id and base path | / |
| output_manifest_path | str | Path to the output manifest | / |
| directory_uri | str | Local or remote path to the directory containing the files | / |

### Usage
@@ -35,6 +43,14 @@ load_from_files_op = ComponentOp.from_registry(
name="load_from_files",
arguments={
# Add arguments
# "input_manifest_path": ,
# "component_spec": {},
# "input_partition_rows": 0,
# "cache": True,
# "cluster_type": "default",
# "client_kwargs": {},
# "metadata": ,
# "output_manifest_path": ,
# "directory_uri": ,
}
)