Redesign dataset format #672

Merged · 4 commits · Nov 27, 2023
6 changes: 2 additions & 4 deletions components/caption_images/README.md
@@ -7,13 +7,11 @@ This component captions images using a BLIP model from the Hugging Face hub

 **This component consumes:**

-- images
-  - data: binary
+- images_data: binary

 **This component produces:**

-- captions
-  - text: string
+- captions_text: string

 ### Arguments
12 changes: 4 additions & 8 deletions components/caption_images/fondant_component.yaml
@@ -5,16 +5,12 @@ tags:
 - Image processing

 consumes:
-  images:
-    fields:
-      data:
-        type: binary
+  images_data:
+    type: binary

 produces:
-  captions:
-    fields:
-      text:
-        type: utf8
+  captions_text:
+    type: utf8

 args:
   model_id:
4 changes: 2 additions & 2 deletions components/caption_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
         self.max_new_tokens = max_new_tokens

     def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
-        images = dataframe["images"]["data"]
+        images = dataframe["images_data"]

         results: t.List[pd.Series] = []
         for batch in np.split(
@@ -112,4 +112,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
             ).T
             results.append(captions)

-        return pd.concat(results).to_frame(name=("captions", "text"))
+        return pd.concat(results).to_frame(name=("captions_text"))
8 changes: 3 additions & 5 deletions components/chunk_text/README.md
@@ -11,14 +11,12 @@ consists of the id of the original document followed by the chunk index.

 **This component consumes:**

-- text
-  - data: string
+- text_data: string

 **This component produces:**

-- text
-  - data: string
-  - original_document_id: string
+- text_data: string
+- text_original_document_id: string

 ### Arguments
16 changes: 6 additions & 10 deletions components/chunk_text/fondant_component.yaml
@@ -10,18 +10,14 @@ tags:
 - Text processing

 consumes:
-  text:
-    fields:
-      data:
-        type: string
+  text_data:
+    type: string

 produces:
-  text:
-    fields:
-      data:
-        type: string
-      original_document_id:
-        type: string
+  text_data:
+    type: string
+  text_original_document_id:
+    type: string

 args:
   chunk_size:
7 changes: 1 addition & 6 deletions components/chunk_text/src/main.py
@@ -38,7 +38,7 @@ def __init__(
     def chunk_text(self, row) -> t.List[t.Tuple]:
         # Multi-index df has id under the name attribute
         doc_id = row.name
-        text_data = row[("text", "data")]
+        text_data = row[("text_data")]
         docs = self.text_splitter.create_documents([text_data])
         return [
             (doc_id, f"{doc_id}_{chunk_id}", chunk.page_content)
@@ -63,9 +63,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
         )
         results_df = results_df.set_index("id")

-        # Set multi-index column for the expected subset and field
-        results_df.columns = pd.MultiIndex.from_product(
-            [["text"], results_df.columns],
-        )
-
         return results_df
6 changes: 3 additions & 3 deletions components/chunk_text/tests/chunk_text_test.py
@@ -7,7 +7,7 @@ def test_transform():
     """Test chunk component method."""
     input_dataframe = pd.DataFrame(
         {
-            ("text", "data"): [
+            ("text_data"): [
                 "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
                 "ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis",
                 "parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec,",
@@ -25,8 +25,8 @@

     expected_output_dataframe = pd.DataFrame(
         {
-            ("text", "original_document_id"): ["a", "a", "a", "b", "b", "c", "c"],
-            ("text", "data"): [
+            ("text_original_document_id"): ["a", "a", "a", "b", "b", "c", "c"],
+            ("text_data"): [
                 "Lorem ipsum dolor sit amet, consectetuer",
                 "amet, consectetuer adipiscing elit. Aenean",
                 "elit. Aenean commodo",
10 changes: 4 additions & 6 deletions components/download_images/README.md
@@ -14,15 +14,13 @@ from the img2dataset library.

 **This component consumes:**

-- images
-  - url: string
+- images_url: string

 **This component produces:**

-- images
-  - data: binary
-  - width: int32
-  - height: int32
+- images_data: binary
+- images_width: int32
+- images_height: int32

 ### Arguments
23 changes: 9 additions & 14 deletions components/download_images/fondant_component.yaml
@@ -13,21 +13,16 @@ tags:
 - Image processing

 consumes:
-  images:
-    fields:
-      url:
-        type: string
+  images_url:
+    type: string

 produces:
-  images:
-    fields:
-      data:
-        type: binary
-      width:
-        type: int32
-      height:
-        type: int32
-    additionalFields: false
+  images_data:
+    type: binary
+  images_width:
+    type: int32
+  images_height:
+    type: int32

 args:
   timeout:
@@ -53,7 +48,7 @@ args:
     description: Resize mode to use. One of "no", "keep_ratio", "center_crop", "border".
     type: str
     default: 'border'
-  resize_only_if_bigger:
+  resize_only_if_bigger:
     description: If True, resize only if image is bigger than image_size.
     type: bool
     default: False
5 changes: 1 addition & 4 deletions components/download_images/src/main.py
@@ -119,7 +119,7 @@ async def download_dataframe() -> None:
             images = await asyncio.gather(
                 *[
                     self.download_and_resize_image(id_, url, semaphore=semaphore)
-                    for id_, url in zip(dataframe.index, dataframe["images"]["url"])
+                    for id_, url in zip(dataframe.index, dataframe["images_url"])
                 ],
             )
             results.extend(images)
@@ -134,8 +134,5 @@ async def download_dataframe() -> None:

         results_df = results_df.dropna()
         results_df = results_df.set_index("id", drop=True)
-        results_df.columns = pd.MultiIndex.from_product(
-            [["images"], results_df.columns],
-        )

         return results_df
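The MultiIndex wrapping removed here (and in chunk_text above) only existed to group output columns under a subset; with the flat format the column names already carry the prefix. A rough sketch of what that removal amounts to, with made-up values (only the column names come from the diff):

```python
import pandas as pd

# Illustrative output frame for download_images under the new format:
# plain string columns named after the produced fields.
results_df = pd.DataFrame(
    {
        "images_data": [b"<resized image a>", b"<resized image b>"],
        "images_width": [256, 256],
        "images_height": [256, 256],
    },
    index=pd.Index(["a", "b"], name="id"),
)

# Old format: the component still had to regroup columns under the "images"
# subset before returning, e.g.
#   results_df.columns = pd.MultiIndex.from_product(
#       [["images"], results_df.columns],
#   )
# New format: the flat names above already match the `produces` section of
# fondant_component.yaml, so the frame is returned unchanged.
```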
8 changes: 4 additions & 4 deletions components/download_images/tests/test_component.py
@@ -45,7 +45,7 @@ def test_transform(respx_mock):

     input_dataframe = pd.DataFrame(
         {
-            ("images", "url"): urls,
+            "images_url": urls,
         },
         index=pd.Index(ids, name="id"),
     )
@@ -55,9 +55,9 @@
     resized_images = [component.resizer(io.BytesIO(image))[0] for image in images]
     expected_dataframe = pd.DataFrame(
         {
-            ("images", "data"): resized_images,
-            ("images", "width"): [image_size] * len(ids),
-            ("images", "height"): [image_size] * len(ids),
+            "images_data": resized_images,
+            "images_width": [image_size] * len(ids),
+            "images_height": [image_size] * len(ids),
         },
         index=pd.Index(ids, name="id"),
     )
6 changes: 2 additions & 4 deletions components/embed_images/README.md
@@ -7,13 +7,11 @@ Component that generates CLIP embeddings from images

 **This component consumes:**

-- images
-  - data: binary
+- images_data: binary

 **This component produces:**

-- embeddings
-  - data: list<item: float>
+- embeddings_data: list<item: float>

 ### Arguments
18 changes: 7 additions & 11 deletions components/embed_images/fondant_component.yaml
@@ -2,21 +2,17 @@ name: Embed images
 description: Component that generates CLIP embeddings from images
 image: fndnt/embed_images:dev
 tags:
-- Image processing
+- Image processing

 consumes:
-  images:
-    fields:
-      data:
-        type: binary
+  images_data:
+    type: binary

 produces:
-  embeddings:
-    fields:
-      data:
-        type: array
-        items:
-          type: float32
+  embeddings_data:
+    type: array
+    items:
+      type: float32

 args:
   model_id:
4 changes: 2 additions & 2 deletions components/embed_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
         self.batch_size = batch_size

     def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
-        images = dataframe["images"]["data"]
+        images = dataframe["images_data"]

         results: t.List[pd.Series] = []
         for batch in np.split(
@@ -110,4 +110,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
             ).T
             results.append(embeddings)

-        return pd.concat(results).to_frame(name=("embeddings", "data"))
+        return pd.concat(results).to_frame(name=("embeddings_data"))
8 changes: 3 additions & 5 deletions components/embed_text/README.md
@@ -7,14 +7,12 @@ Component that generates embeddings of text passages.

 **This component consumes:**

-- text
-  - data: string
+- text_data: string

 **This component produces:**

-- text
-  - data: string
-  - embedding: list<item: float>
+- text_data: string
+- text_embedding: list<item: float>

 ### Arguments
22 changes: 9 additions & 13 deletions components/embed_text/fondant_component.yaml
@@ -5,21 +5,17 @@ tags:
 - Text processing

 consumes:
-  text:
-    fields:
-      data:
-        type: string
+  text_data:
+    type: string

 produces:
-  text:
-    fields:
-      data:
-        type: string
-      embedding:
-        type: array
-        items:
-          type: float32
-
+  text_data:
+    type: string
+  text_embedding:
+    type: array
+    items:
+      type: float32
+
 args:
   model_provider:
     description: |
4 changes: 2 additions & 2 deletions components/embed_text/src/main.py
@@ -65,7 +65,7 @@ def get_embeddings_vectors(self, texts):
         return self.embedding_model.embed_documents(texts.tolist())

     def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
-        dataframe[("text", "embedding")] = self.get_embeddings_vectors(
-            dataframe[("text", "data")],
+        dataframe["text_embedding"] = self.get_embeddings_vectors(
+            dataframe["text_data"],
         )
         return dataframe
15 changes: 11 additions & 4 deletions components/embedding_based_laion_retrieval/Dockerfile
@@ -1,4 +1,4 @@
-FROM --platform=linux/amd64 python:3.8-slim
+FROM --platform=linux/amd64 python:3.8-slim as base

 # System dependencies
 RUN apt-get update && \
@@ -16,8 +16,15 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team

 # Set the working directory to the component folder
 WORKDIR /component/src
+COPY src/ src/
+ENV PYTHONPATH "${PYTHONPATH}:./src"

-# Copy over src-files
-COPY src/ .
+FROM base as test
+COPY test_requirements.txt .
+RUN pip3 install --no-cache-dir -r test_requirements.txt
+COPY tests/ tests/
+RUN python -m pytest tests

-ENTRYPOINT ["fondant", "execute", "main"]
+FROM base
+WORKDIR /component/src
+ENTRYPOINT ["fondant", "execute", "main"]