Use cleaner field names in reusable components #679

Merged
merged 3 commits into from
Nov 28, 2023
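The renames in the diff below all drop the subset-style prefix or suffix (`images_data` becomes `image`, `captions_text` becomes `caption`, and so on). For downstream pandas code that still holds dataframes with the old column names, a minimal migration sketch might look like the following. The `OLD_TO_NEW` mapping and the `migrate_columns` helper are hypothetical names introduced here for illustration; the mapping lists only renames visible in this diff and is not part of the PR.

```python
import pandas as pd

# Old -> new field names, taken from the renames visible in this diff.
# Hypothetical helper data: not part of the PR itself.
OLD_TO_NEW = {
    "images_data": "image",
    "images_width": "image_width",
    "images_height": "image_height",
    "images_url": "image_url",
    "captions_text": "caption",
    "embeddings_data": "embedding",
    "text_data": "text",
    "text_original_document_id": "original_document_id",
}


def migrate_columns(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with old column names replaced by the new ones."""
    # Columns that are not keys of the mapping are left untouched by rename().
    return dataframe.rename(columns=OLD_TO_NEW)


old = pd.DataFrame({"images_url": ["https://example.com/a.png"], "other": [1]})
new = migrate_columns(old)
```

Because `rename` returns a new dataframe, the original `old` frame keeps its column names, which makes the helper safe to apply at the boundary between old and new components.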
4 changes: 1 addition & 3 deletions components/caption_images/Dockerfile
@@ -17,12 +17,10 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team
# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/
ENV PYTHONPATH "${PYTHONPATH}:./src"

FROM base as test
COPY test_requirements.txt .
RUN pip3 install --no-cache-dir -r test_requirements.txt
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
4 changes: 2 additions & 2 deletions components/caption_images/README.md
@@ -7,11 +7,11 @@ This component captions images using a BLIP model from the Hugging Face hub

**This component consumes:**

- images_data: binary
- image: binary

**This component produces:**

- captions_text: string
- caption: string

### Arguments

4 changes: 2 additions & 2 deletions components/caption_images/fondant_component.yaml
@@ -5,11 +5,11 @@ tags:
- Image processing

consumes:
images_data:
image:
type: binary

produces:
captions_text:
caption:
type: utf8

args:
4 changes: 2 additions & 2 deletions components/caption_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
self.max_new_tokens = max_new_tokens

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
images = dataframe["images_data"]
images = dataframe["image"]

results: t.List[pd.Series] = []
for batch in np.split(
@@ -112,4 +112,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
).T
results.append(captions)

return pd.concat(results).to_frame(name=("captions_text"))
return pd.concat(results).to_frame(name="caption")
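As a rough illustration of the return statement above: `pd.concat` stitches the per-batch series back together, and `to_frame(name=...)` turns the result into a single-column dataframe, here named to match the `caption` field in `fondant_component.yaml`. The batch contents and index labels below are made up for the example.

```python
import pandas as pd

# Two made-up per-batch results; the index labels are illustrative only.
batch_1 = pd.Series(["a motorcycle"], index=["id_0"])
batch_2 = pd.Series(["a beetle"], index=["id_1"])

# Concatenate the batches, then name the single output column "caption".
output = pd.concat([batch_1, batch_2]).to_frame(name="caption")
```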
4 changes: 2 additions & 2 deletions components/caption_images/tests/test_caption_images.py
@@ -10,11 +10,11 @@ def test_image_caption_component():
"https://cdn.pixabay.com/photo/2023/07/19/18/56/japanese-beetle-8137606_1280.png",
]
input_dataframe = pd.DataFrame(
{"images": {"data": [requests.get(url).content for url in image_urls]}},
{"image": [requests.get(url).content for url in image_urls]},
)

expected_output_dataframe = pd.DataFrame(
data={("captions", "text"): {0: "a motorcycle", 1: "a beetle"}},
data={"caption": {0: "a motorcycle", 1: "a beetle"}},
)

component = CaptionImagesComponent(
6 changes: 2 additions & 4 deletions components/chunk_text/Dockerfile
@@ -17,14 +17,12 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team
# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/
ENV PYTHONPATH "${PYTHONPATH}:./src"

FROM base as test
COPY test_requirements.txt .
RUN pip3 install --no-cache-dir -r test_requirements.txt
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
ENTRYPOINT ["fondant", "execute", "main"]
6 changes: 3 additions & 3 deletions components/chunk_text/README.md
@@ -11,12 +11,12 @@ consists of the id of the original document followed by the chunk index.

**This component consumes:**

- text_data: string
- text: string

**This component produces:**

- text_data: string
- text_original_document_id: string
- text: string
- original_document_id: string

### Arguments

6 changes: 3 additions & 3 deletions components/chunk_text/fondant_component.yaml
@@ -10,13 +10,13 @@ tags:
- Text processing

consumes:
text_data:
text:
type: string

produces:
text_data:
text:
type: string
text_original_document_id:
original_document_id:
type: string

args:
4 changes: 2 additions & 2 deletions components/chunk_text/src/main.py
@@ -38,7 +38,7 @@ def __init__(
def chunk_text(self, row) -> t.List[t.Tuple]:
# Multi-index df has id under the name attribute
doc_id = row.name
text_data = row[("text_data")]
text_data = row["text"]
docs = self.text_splitter.create_documents([text_data])
return [
(doc_id, f"{doc_id}_{chunk_id}", chunk.page_content)
@@ -59,7 +59,7 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
# Turn into dataframes
results_df = pd.DataFrame(
results,
columns=["text_original_document_id", "id", "text_data"],
columns=["original_document_id", "id", "text"],
)
results_df = results_df.set_index("id")
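To make the resulting shape concrete, here is a minimal sketch of the same tuple layout with the new column names. It replaces the component's text splitter with a naive fixed-width split (`split_fixed`, a hypothetical stand-in), so only the id scheme and the column layout mirror the diff.

```python
import pandas as pd


def split_fixed(text: str, size: int) -> list:
    """Hypothetical stand-in for the text splitter: fixed-width chunks."""
    return [text[i : i + size] for i in range(0, len(text), size)]


documents = {"a": "Lorem ipsum dolor", "b": "sit amet"}

# One (original_document_id, id, text) tuple per chunk; the chunk id is
# the document id followed by the chunk index, as in chunk_text().
results = [
    (doc_id, f"{doc_id}_{chunk_id}", chunk)
    for doc_id, text in documents.items()
    for chunk_id, chunk in enumerate(split_fixed(text, size=10))
]

results_df = pd.DataFrame(
    results,
    columns=["original_document_id", "id", "text"],
).set_index("id")
```

After `set_index("id")`, the chunk id becomes the index, leaving `original_document_id` and `text` as the two produced fields.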

6 changes: 3 additions & 3 deletions components/chunk_text/tests/chunk_text_test.py
@@ -7,7 +7,7 @@ def test_transform():
"""Test chunk component method."""
input_dataframe = pd.DataFrame(
{
("text_data"): [
"text": [
"Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
"ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis",
"parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec,",
@@ -25,8 +25,8 @@ def test_transform():

expected_output_dataframe = pd.DataFrame(
{
("text_original_document_id"): ["a", "a", "a", "b", "b", "c", "c"],
("text_data"): [
"original_document_id": ["a", "a", "a", "b", "b", "c", "c"],
"text": [
"Lorem ipsum dolor sit amet, consectetuer",
"amet, consectetuer adipiscing elit. Aenean",
"elit. Aenean commodo",
File renamed without changes.
@@ -26,9 +26,9 @@ right side is border-cropped image.

**This component produces:**

- images_data: binary
- images_width: int32
- images_height: int32
- image: binary
- image_width: int32
- image_height: int32

### Arguments

@@ -47,14 +47,14 @@ You can add this component to your pipeline using the following code:
from fondant.pipeline import ComponentOp


image_cropping_op = ComponentOp.from_registry(
name="image_cropping",
crop_images_op = ComponentOp.from_registry(
name="crop_images",
arguments={
# Add arguments
# "cropping_threshold": -30,
# "padding": 10,
}
)
pipeline.add_op(image_cropping_op, dependencies=[...]) #Add previous component as dependency
pipeline.add_op(crop_images_op, dependencies=[...]) #Add previous component as dependency
```

@@ -24,11 +24,11 @@ consumes:
type: binary

produces:
images_data:
image:
type: binary
images_width:
image_width:
type: int32
images_height:
image_height:
type: int32

args:
@@ -46,12 +46,12 @@ def __init__(

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
# crop images
dataframe["images_data"] = dataframe["images_data"].apply(
dataframe["image"] = dataframe["image"].apply(
lambda image: remove_borders(image, self.cropping_threshold, self.padding),
)

# extract width and height
dataframe["images_width", "images_height"] = dataframe["images_data"].apply(
dataframe["image_width", "image_height"] = dataframe["image"].apply(
extract_dimensions,
axis=1,
result_type="expand",
6 changes: 2 additions & 4 deletions components/download_images/Dockerfile
@@ -17,14 +17,12 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team
# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/
ENV PYTHONPATH "${PYTHONPATH}:./src"

FROM base as test
COPY test_requirements.txt .
RUN pip3 install --no-cache-dir -r test_requirements.txt
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
ENTRYPOINT ["fondant", "execute", "main"]
8 changes: 4 additions & 4 deletions components/download_images/README.md
@@ -14,13 +14,13 @@ from the img2dataset library.

**This component consumes:**

- images_url: string
- image_url: string

**This component produces:**

- images_data: binary
- images_width: int32
- images_height: int32
- image: binary
- image_width: int32
- image_height: int32

### Arguments

8 changes: 4 additions & 4 deletions components/download_images/fondant_component.yaml
@@ -13,15 +13,15 @@ tags:
- Image processing

consumes:
images_url:
image_url:
type: string

produces:
images_data:
image:
type: binary
images_width:
image_width:
type: int32
images_height:
image_height:
type: int32

args:
4 changes: 2 additions & 2 deletions components/download_images/src/main.py
@@ -119,14 +119,14 @@ async def download_dataframe() -> None:
images = await asyncio.gather(
*[
self.download_and_resize_image(id_, url, semaphore=semaphore)
for id_, url in zip(dataframe.index, dataframe["images_url"])
for id_, url in zip(dataframe.index, dataframe["image_url"])
],
)
results.extend(images)

asyncio.run(download_dataframe())

columns = ["id", "data", "width", "height"]
columns = ["id", "image", "image_width", "image_height"]
if results:
results_df = pd.DataFrame(results, columns=columns)
else:
2 changes: 2 additions & 0 deletions components/download_images/tests/pytest.ini
@@ -0,0 +1,2 @@
[pytest]
pythonpath = ../src
2 changes: 2 additions & 0 deletions components/download_images/tests/requirements.txt
@@ -0,0 +1,2 @@
pytest==7.4.0
respx==0.20.2
8 changes: 4 additions & 4 deletions components/download_images/tests/test_component.py
@@ -45,7 +45,7 @@ def test_transform(respx_mock):

input_dataframe = pd.DataFrame(
{
"images_url": urls,
"image_url": urls,
},
index=pd.Index(ids, name="id"),
)
@@ -55,9 +55,9 @@ def test_transform(respx_mock):
resized_images = [component.resizer(io.BytesIO(image))[0] for image in images]
expected_dataframe = pd.DataFrame(
{
"images_data": resized_images,
"images_width": [image_size] * len(ids),
"images_height": [image_size] * len(ids),
"image": resized_images,
"image_width": [image_size] * len(ids),
"image_height": [image_size] * len(ids),
},
index=pd.Index(ids, name="id"),
)
4 changes: 2 additions & 2 deletions components/embed_images/README.md
@@ -7,11 +7,11 @@ Component that generates CLIP embeddings from images

**This component consumes:**

- images_data: binary
- image: binary

**This component produces:**

- embeddings_data: list<item: float>
- embedding: list<item: float>

### Arguments

4 changes: 2 additions & 2 deletions components/embed_images/fondant_component.yaml
@@ -5,11 +5,11 @@ tags:
- Image processing

consumes:
images_data:
image:
type: binary

produces:
embeddings_data:
embedding:
type: array
items:
type: float32
4 changes: 2 additions & 2 deletions components/embed_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
self.batch_size = batch_size

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
images = dataframe["images_data"]
images = dataframe["image"]

results: t.List[pd.Series] = []
for batch in np.split(
@@ -110,4 +110,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
).T
results.append(embeddings)

return pd.concat(results).to_frame(name=("embeddings_data"))
return pd.concat(results).to_frame(name="embedding")
6 changes: 2 additions & 4 deletions components/embed_text/Dockerfile
@@ -17,14 +17,12 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team
# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/
ENV PYTHONPATH "${PYTHONPATH}:./src"

FROM base as test
COPY test_requirements.txt .
RUN pip3 install --no-cache-dir -r test_requirements.txt
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
ENTRYPOINT ["fondant", "execute", "main"]
5 changes: 2 additions & 3 deletions components/embed_text/README.md
@@ -7,12 +7,11 @@ Component that generates embeddings of text passages.

**This component consumes:**

- text_data: string
- text: string

**This component produces:**

- text_data: string
- text_embedding: list<item: float>
- embedding: list<item: float>

### Arguments
