Use new data format (#667)
This PR applies the new data format:

- fixes all tests
- updates component specifications and component code (see the sketch below)
- removes subset field usage in `pipeline.py`
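
For readers unfamiliar with the change, the diffs below consistently replace the old subset-based columns (a two-level `(subset, field)` MultiIndex such as `("images", "data")`) with flat `<subset>_<field>` names such as `images_data`. A minimal, hypothetical pandas sketch of what that means for component code — not the exact Fondant API:

```python
import pandas as pd

images = [b"\x89PNG...", b"\x89PNG..."]  # dummy binary payloads, purely illustrative

# Old format: subset + field exposed as a two-level column MultiIndex.
old_df = pd.DataFrame(
    {("images", "data"): images},
    index=pd.Index(["a", "b"], name="id"),
)
old_column = old_df["images"]["data"]  # select subset, then field

# New format: a single flat "<subset>_<field>" column name.
new_df = pd.DataFrame(
    {"images_data": images},
    index=pd.Index(["a", "b"], name="id"),
)
new_column = new_df["images_data"]  # one lookup, no MultiIndex

assert old_column.tolist() == new_column.tolist()
```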

---------

Co-authored-by: Robbe Sneyders <[email protected]>
mrchtr and RobbeSneyders committed Nov 27, 2023
1 parent 30dafb3 commit 1c6cb6d
Showing 124 changed files with 420 additions and 1,059 deletions.
6 changes: 2 additions & 4 deletions components/caption_images/README.md
@@ -7,13 +7,11 @@ This component captions images using a BLIP model from the Hugging Face hub

**This component consumes:**

- images
- data: binary
- images_data: binary

**This component produces:**

- captions
- text: string
- captions_text: string

### Arguments

12 changes: 4 additions & 8 deletions components/caption_images/fondant_component.yaml
@@ -5,16 +5,12 @@ tags:
- Image processing

consumes:
images:
fields:
data:
type: binary
images_data:
type: binary

produces:
captions:
fields:
text:
type: utf8
captions_text:
type: utf8

args:
model_id:
4 changes: 2 additions & 2 deletions components/caption_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
self.max_new_tokens = max_new_tokens

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
images = dataframe["images"]["data"]
images = dataframe["images_data"]

results: t.List[pd.Series] = []
for batch in np.split(
@@ -112,4 +112,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
).T
results.append(captions)

return pd.concat(results).to_frame(name=("captions", "text"))
return pd.concat(results).to_frame(name=("captions_text"))
8 changes: 3 additions & 5 deletions components/chunk_text/README.md
@@ -11,14 +11,12 @@ consists of the id of the original document followed by the chunk index.

**This component consumes:**

- text
- data: string
- text_data: string

**This component produces:**

- text
- data: string
- original_document_id: string
- text_data: string
- text_original_document_id: string

### Arguments

16 changes: 6 additions & 10 deletions components/chunk_text/fondant_component.yaml
@@ -10,18 +10,14 @@ tags:
- Text processing

consumes:
text:
fields:
data:
type: string
text_data:
type: string

produces:
text:
fields:
data:
type: string
original_document_id:
type: string
text_data:
type: string
text_original_document_id:
type: string

args:
chunk_size:
7 changes: 1 addition & 6 deletions components/chunk_text/src/main.py
@@ -38,7 +38,7 @@ def __init__(
def chunk_text(self, row) -> t.List[t.Tuple]:
# Multi-index df has id under the name attribute
doc_id = row.name
text_data = row[("text", "data")]
text_data = row[("text_data")]
docs = self.text_splitter.create_documents([text_data])
return [
(doc_id, f"{doc_id}_{chunk_id}", chunk.page_content)
@@ -63,9 +63,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
)
results_df = results_df.set_index("id")

# Set multi-index column for the expected subset and field
results_df.columns = pd.MultiIndex.from_product(
[["text"], results_df.columns],
)

return results_df
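
With the MultiIndex construction above removed, the chunking output is expected to carry flat column names directly, with chunk ids built as `<original_document_id>_<chunk_index>` (per the README excerpt above). A rough, illustrative sketch of the resulting frame shape — not the component's actual internals:

```python
import pandas as pd

# One source document "a" split into two chunks; ids follow "<doc_id>_<chunk_index>".
results_df = pd.DataFrame(
    {
        "text_original_document_id": ["a", "a"],
        "text_data": ["Lorem ipsum dolor sit amet,", "consectetuer adipiscing elit."],
    },
    index=pd.Index(["a_0", "a_1"], name="id"),
)

# No pd.MultiIndex.from_product step is needed anymore: the flat names
# already encode what used to be the (subset, field) pair.
assert list(results_df.columns) == ["text_original_document_id", "text_data"]
```
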
6 changes: 3 additions & 3 deletions components/chunk_text/tests/chunk_text_test.py
@@ -7,7 +7,7 @@ def test_transform():
"""Test chunk component method."""
input_dataframe = pd.DataFrame(
{
("text", "data"): [
("text_data"): [
"Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
"ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis",
"parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec,",
@@ -25,8 +25,8 @@

expected_output_dataframe = pd.DataFrame(
{
("text", "original_document_id"): ["a", "a", "a", "b", "b", "c", "c"],
("text", "data"): [
("text_original_document_id"): ["a", "a", "a", "b", "b", "c", "c"],
("text_data"): [
"Lorem ipsum dolor sit amet, consectetuer",
"amet, consectetuer adipiscing elit. Aenean",
"elit. Aenean commodo",
10 changes: 4 additions & 6 deletions components/download_images/README.md
@@ -14,15 +14,13 @@ from the img2dataset library.

**This component consumes:**

- images
- url: string
- images_url: string

**This component produces:**

- images
- data: binary
- width: int32
- height: int32
- images_data: binary
- images_width: int32
- images_height: int32

### Arguments

24 changes: 10 additions & 14 deletions components/download_images/fondant_component.yaml
@@ -13,21 +13,17 @@ tags:
- Image processing

consumes:
images:
fields:
url:
type: string
images_url:
type: string

produces:
images:
fields:
data:
type: binary
width:
type: int32
height:
type: int32
additionalFields: false
images_data:
type: binary
images_width:
type: int32
images_height:
type: int32
# additionalFields: false

args:
timeout:
@@ -53,7 +49,7 @@ args:
description: Resize mode to use. One of "no", "keep_ratio", "center_crop", "border".
type: str
default: 'border'
resize_only_if_bigger:
resize_only_if_bigger:
description: If True, resize only if image is bigger than image_size.
type: bool
default: False
5 changes: 1 addition & 4 deletions components/download_images/src/main.py
@@ -119,7 +119,7 @@ async def download_dataframe() -> None:
images = await asyncio.gather(
*[
self.download_and_resize_image(id_, url, semaphore=semaphore)
for id_, url in zip(dataframe.index, dataframe["images"]["url"])
for id_, url in zip(dataframe.index, dataframe["images_url"])
],
)
results.extend(images)
@@ -134,8 +134,5 @@ async def download_dataframe() -> None:

results_df = results_df.dropna()
results_df = results_df.set_index("id", drop=True)
results_df.columns = pd.MultiIndex.from_product(
[["images"], results_df.columns],
)

return results_df
8 changes: 4 additions & 4 deletions components/download_images/tests/test_component.py
@@ -45,7 +45,7 @@ def test_transform(respx_mock):

input_dataframe = pd.DataFrame(
{
("images", "url"): urls,
"images_url": urls,
},
index=pd.Index(ids, name="id"),
)
@@ -55,9 +55,9 @@ def test_transform(respx_mock):
resized_images = [component.resizer(io.BytesIO(image))[0] for image in images]
expected_dataframe = pd.DataFrame(
{
("images", "data"): resized_images,
("images", "width"): [image_size] * len(ids),
("images", "height"): [image_size] * len(ids),
"images_data": resized_images,
"images_width": [image_size] * len(ids),
"images_height": [image_size] * len(ids),
},
index=pd.Index(ids, name="id"),
)
6 changes: 2 additions & 4 deletions components/embed_images/README.md
@@ -7,13 +7,11 @@ Component that generates CLIP embeddings from images

**This component consumes:**

- images
- data: binary
- images_data: binary

**This component produces:**

- embeddings
- data: list<item: float>
- embeddings_data: list<item: float>

### Arguments

18 changes: 7 additions & 11 deletions components/embed_images/fondant_component.yaml
@@ -2,21 +2,17 @@ name: Embed images
description: Component that generates CLIP embeddings from images
image: fndnt/embed_images:dev
tags:
- Image processing
- Image processing

consumes:
images:
fields:
data:
type: binary
images_data:
type: binary

produces:
embeddings:
fields:
data:
type: array
items:
type: float32
embeddings_data:
type: array
items:
type: float32

args:
model_id:
4 changes: 2 additions & 2 deletions components/embed_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
self.batch_size = batch_size

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
images = dataframe["images"]["data"]
images = dataframe["images_data"]

results: t.List[pd.Series] = []
for batch in np.split(
@@ -110,4 +110,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
).T
results.append(embeddings)

return pd.concat(results).to_frame(name=("embeddings", "data"))
return pd.concat(results).to_frame(name=("embeddings_data"))
8 changes: 3 additions & 5 deletions components/embed_text/README.md
@@ -7,14 +7,12 @@ Component that generates embeddings of text passages.

**This component consumes:**

- text
- data: string
- text_data: string

**This component produces:**

- text
- data: string
- embedding: list<item: float>
- text_data: string
- text_embedding: list<item: float>

### Arguments

22 changes: 9 additions & 13 deletions components/embed_text/fondant_component.yaml
@@ -5,21 +5,17 @@ tags:
- Text processing

consumes:
text:
fields:
data:
type: string
text_data:
type: string

produces:
text:
fields:
data:
type: string
embedding:
type: array
items:
type: float32

text_data:
type: string
text_embedding:
type: array
items:
type: float32

args:
model_provider:
description: |
4 changes: 2 additions & 2 deletions components/embed_text/src/main.py
@@ -65,7 +65,7 @@ def get_embeddings_vectors(self, texts):
return self.embedding_model.embed_documents(texts.tolist())

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
dataframe[("text", "embedding")] = self.get_embeddings_vectors(
dataframe[("text", "data")],
dataframe["text_embedding"] = self.get_embeddings_vectors(
dataframe["text_data"],
)
return dataframe
6 changes: 2 additions & 4 deletions components/embedding_based_laion_retrieval/README.md
@@ -9,13 +9,11 @@ used to find images similar to the embedded images / captions.

**This component consumes:**

- embeddings
- data: list<item: float>
- embeddings_data: list<item: float>

**This component produces:**

- images
- url: string
- images_url: string

### Arguments

18 changes: 7 additions & 11 deletions components/embedding_based_laion_retrieval/fondant_component.yaml
@@ -7,19 +7,15 @@ tags:
- Data retrieval

consumes:
embeddings:
fields:
data:
type: array
items:
type: float32
embeddings_data:
type: array
items:
type: float32

produces:
images:
fields:
url:
type: string
additionalSubsets: false
images_url:
type: string
# additionalFields: false

args:
num_images: