Implementation of new pipeline interface #665

Closed
wants to merge 38 commits

Changes from 36 commits

38 commits
35338b6
Update component spec schema validation
mrchtr Nov 16, 2023
a269e3c
Update component spec tests to validate new component spec
mrchtr Nov 16, 2023
ad0dab6
Add additional fields to json schema
mrchtr Nov 16, 2023
7b91535
Update manifest json schema for validation
mrchtr Nov 16, 2023
5d1bf5e
Update manifest creation
mrchtr Nov 17, 2023
d8ecd01
Reduce PR to core module
mrchtr Nov 21, 2023
12c78ca
Addresses comments
mrchtr Nov 21, 2023
c1cad60
Restructure test directory
mrchtr Nov 21, 2023
fd0699c
Remove additional fields in common.json
mrchtr Nov 21, 2023
0f8117f
Test structure
mrchtr Nov 21, 2023
7e8a1d6
Refactor component package
mrchtr Nov 21, 2023
9f67c61
Update src/fondant/core/component_spec.py
mrchtr Nov 21, 2023
40955bf
Update src/fondant/core/manifest.py
mrchtr Nov 21, 2023
6b246a4
Update src/fondant/core/component_spec.py
mrchtr Nov 21, 2023
8ef38d9
Update src/fondant/core/manifest.py
mrchtr Nov 21, 2023
e8c8135
Update src/fondant/core/schema.py
mrchtr Nov 21, 2023
df9a60e
Addresses comments
mrchtr Nov 21, 2023
2256118
Addresses comments
mrchtr Nov 21, 2023
3042fb5
Addresses comments
mrchtr Nov 21, 2023
8fa8be7
Update src/fondant/core/manifest.py
mrchtr Nov 21, 2023
25eb492
Addresses comments
mrchtr Nov 22, 2023
c0fb47a
Merge branch 'feature/implement-new-dataset-format' into feautre/refa…
mrchtr Nov 22, 2023
0701662
Addresses comments
mrchtr Nov 22, 2023
365ca6d
Update test examples
mrchtr Nov 22, 2023
4dc7dc7
Update src/fondant/core/manifest.py
mrchtr Nov 22, 2023
a60ca3e
addresses comments
mrchtr Nov 22, 2023
d2182a0
Merge feature/implement-new-dataset-format into feature/refactore-com…
mrchtr Nov 22, 2023
e141231
Adjust interface for usage of produces and consumes
mrchtr Nov 22, 2023
f3e0a6a
Adjust interface for usage of schema, consumes, and produces
mrchtr Nov 22, 2023
b4fe222
Update core package (#653)
mrchtr Nov 23, 2023
bb3b623
Refactor component package (#654)
mrchtr Nov 23, 2023
e4eadf3
Use new data format (#667)
mrchtr Nov 24, 2023
ae72104
Merge redesign-dataset-format-and-interface into feature/implement-n…
mrchtr Nov 24, 2023
f0344c8
Resolve conflicts
mrchtr Nov 24, 2023
826f061
Addressing comments
mrchtr Nov 24, 2023
4bb35a4
Overwriting consumes and produces of component specification
mrchtr Nov 24, 2023
e7a960f
Consumes and produces renaming
mrchtr Nov 24, 2023
045769f
Merge branch 'main' into feature/implement-new-pipeline-interface
RobbeSneyders Nov 27, 2023
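The commits above flatten the old subset/field hierarchy and let a pipeline overwrite a component's consumes and produces. As a conceptual illustration only (no Fondant imports; all names below are assumptions, not taken from this PR), the override amounts to mapping the spec's flat field names onto the dataset's column names:

# Conceptual sketch of "overwriting consumes and produces of the component
# specification": map the spec's flat fields onto dataset column names.
spec_consumes = {"images_data": "binary"}            # flat field from a component spec
dataset_columns = {"photo_bytes": "binary"}          # hypothetical dataset columns
consumes_override = {"images_data": "photo_bytes"}   # user-supplied mapping

def resolve(spec, override, columns):
    """Check that every spec field maps to an existing dataset column."""
    resolved = {}
    for field in spec:
        column = override.get(field, field)
        if column not in columns:
            raise ValueError(f"no column {column!r} for field {field!r}")
        resolved[field] = column
    return resolved

print(resolve(spec_consumes, consumes_override, dataset_columns))  # {'images_data': 'photo_bytes'}

The file diffs below apply the same flattening throughout the reusable components.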
6 changes: 2 additions & 4 deletions components/caption_images/README.md
@@ -7,13 +7,11 @@ This component captions images using a BLIP model from the Hugging Face hub

**This component consumes:**

- images
- data: binary
- images_data: binary

**This component produces:**

- captions
- text: string
- captions_text: string

### Arguments

12 changes: 4 additions & 8 deletions components/caption_images/fondant_component.yaml
@@ -5,16 +5,12 @@ tags:
- Image processing

consumes:
images:
fields:
data:
type: binary
images_data:
type: binary

produces:
captions:
fields:
text:
type: utf8
captions_text:
type: utf8

args:
model_id:
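The spec change above collapses the old subset/field nesting into single flat keys. A minimal sketch of the renaming rule implied by the diff (the helper below is illustrative, not part of Fondant):

# Old nested layout -> new flat layout: "<subset>_<field>" keys.
old_consumes = {"images": {"fields": {"data": {"type": "binary"}}}}

def flatten(spec):
    """Collapse {subset: {fields: {field: meta}}} into {"subset_field": meta}."""
    return {
        f"{subset}_{field}": meta
        for subset, body in spec.items()
        for field, meta in body["fields"].items()
    }

assert flatten(old_consumes) == {"images_data": {"type": "binary"}}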
4 changes: 2 additions & 2 deletions components/caption_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
self.max_new_tokens = max_new_tokens

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
images = dataframe["images"]["data"]
images = dataframe["images_data"]

results: t.List[pd.Series] = []
for batch in np.split(
@@ -112,4 +112,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
).T
results.append(captions)

return pd.concat(results).to_frame(name=("captions", "text"))
return pd.concat(results).to_frame(name=("captions_text"))
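A self-contained sketch (stand-in logic, not the BLIP component) of the transform contract after this change: flat column names in, a flat column name out, where the code previously indexed ("images", "data") and returned ("captions", "text"):

import pandas as pd

def transform(dataframe: pd.DataFrame) -> pd.DataFrame:
    images = dataframe["images_data"]                    # was dataframe["images"]["data"]
    captions = images.map(lambda b: f"{len(b)} bytes")   # placeholder for the BLIP model
    return captions.to_frame(name="captions_text")       # was name=("captions", "text")

df = pd.DataFrame({"images_data": [b"\x00\x01", b"\x02"]}, index=pd.Index([0, 1], name="id"))
print(transform(df))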
8 changes: 3 additions & 5 deletions components/chunk_text/README.md
@@ -11,14 +11,12 @@ consists of the id of the original document followed by the chunk index.

**This component consumes:**

- text
- data: string
- text_data: string

**This component produces:**

- text
- data: string
- original_document_id: string
- text_data: string
- text_original_document_id: string

### Arguments

16 changes: 6 additions & 10 deletions components/chunk_text/fondant_component.yaml
@@ -10,18 +10,14 @@ tags:
- Text processing

consumes:
text:
fields:
data:
type: string
text_data:
type: string

produces:
text:
fields:
data:
type: string
original_document_id:
type: string
text_data:
type: string
text_original_document_id:
type: string

args:
chunk_size:
7 changes: 1 addition & 6 deletions components/chunk_text/src/main.py
@@ -38,7 +38,7 @@ def __init__(
def chunk_text(self, row) -> t.List[t.Tuple]:
# Multi-index df has id under the name attribute
doc_id = row.name
text_data = row[("text", "data")]
text_data = row[("text_data")]
docs = self.text_splitter.create_documents([text_data])
return [
(doc_id, f"{doc_id}_{chunk_id}", chunk.page_content)
@@ -63,9 +63,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
)
results_df = results_df.set_index("id")

# Set multi-index column for the expected subset and field
results_df.columns = pd.MultiIndex.from_product(
[["text"], results_df.columns],
)

return results_df
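With flat column names, the MultiIndex wrapping deleted above is no longer needed: the chunked rows come back with plain string columns. An illustrative sketch with made-up chunk data (column names taken from the diff, everything else assumed):

import pandas as pd

# Chunk rows as (original_document_id, id, text).
rows = [("a", "a_0", "Lorem ipsum"), ("a", "a_1", "dolor sit amet")]
results_df = pd.DataFrame(rows, columns=["text_original_document_id", "id", "text_data"])
results_df = results_df.set_index("id")

# No pd.MultiIndex.from_product(...) step anymore; flat columns are returned as-is.
print(results_df.columns.tolist())  # ['text_original_document_id', 'text_data']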
6 changes: 3 additions & 3 deletions components/chunk_text/tests/chunk_text_test.py
@@ -7,7 +7,7 @@ def test_transform():
"""Test chunk component method."""
input_dataframe = pd.DataFrame(
{
("text", "data"): [
("text_data"): [
"Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
"ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis",
"parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec,",
@@ -25,8 +25,8 @@

expected_output_dataframe = pd.DataFrame(
{
("text", "original_document_id"): ["a", "a", "a", "b", "b", "c", "c"],
("text", "data"): [
("text_original_document_id"): ["a", "a", "a", "b", "b", "c", "c"],
("text_data"): [
"Lorem ipsum dolor sit amet, consectetuer",
"amet, consectetuer adipiscing elit. Aenean",
"elit. Aenean commodo",
10 changes: 4 additions & 6 deletions components/download_images/README.md
@@ -14,15 +14,13 @@ from the img2dataset library.

**This component consumes:**

- images
- url: string
- images_url: string

**This component produces:**

- images
- data: binary
- width: int32
- height: int32
- images_data: binary
- images_width: int32
- images_height: int32

### Arguments

24 changes: 10 additions & 14 deletions components/download_images/fondant_component.yaml
@@ -13,21 +13,17 @@ tags:
- Image processing

consumes:
images:
fields:
url:
type: string
images_url:
type: string

produces:
images:
fields:
data:
type: binary
width:
type: int32
height:
type: int32
additionalFields: false
images_data:
type: binary
images_width:
type: int32
images_height:
type: int32
# additionalFields: false

args:
timeout:
@@ -53,7 +49,7 @@ args:
description: Resize mode to use. One of "no", "keep_ratio", "center_crop", "border".
type: str
default: 'border'
resize_only_if_bigger:
resize_only_if_bigger:
description: If True, resize only if image is bigger than image_size.
type: bool
default: False
5 changes: 1 addition & 4 deletions components/download_images/src/main.py
@@ -119,7 +119,7 @@ async def download_dataframe() -> None:
images = await asyncio.gather(
*[
self.download_and_resize_image(id_, url, semaphore=semaphore)
for id_, url in zip(dataframe.index, dataframe["images"]["url"])
for id_, url in zip(dataframe.index, dataframe["images_url"])
],
)
results.extend(images)
@@ -134,8 +134,5 @@ async def download_dataframe() -> None:

results_df = results_df.dropna()
results_df = results_df.set_index("id", drop=True)
results_df.columns = pd.MultiIndex.from_product(
[["images"], results_df.columns],
)

return results_df
8 changes: 4 additions & 4 deletions components/download_images/tests/test_component.py
@@ -45,7 +45,7 @@ def test_transform(respx_mock):

input_dataframe = pd.DataFrame(
{
("images", "url"): urls,
"images_url": urls,
},
index=pd.Index(ids, name="id"),
)
@@ -55,9 +55,9 @@
resized_images = [component.resizer(io.BytesIO(image))[0] for image in images]
expected_dataframe = pd.DataFrame(
{
("images", "data"): resized_images,
("images", "width"): [image_size] * len(ids),
("images", "height"): [image_size] * len(ids),
"images_data": resized_images,
"images_width": [image_size] * len(ids),
"images_height": [image_size] * len(ids),
},
index=pd.Index(ids, name="id"),
)
6 changes: 2 additions & 4 deletions components/embed_images/README.md
@@ -7,13 +7,11 @@ Component that generates CLIP embeddings from images

**This component consumes:**

- images
- data: binary
- images_data: binary

**This component produces:**

- embeddings
- data: list<item: float>
- embeddings_data: list<item: float>

### Arguments

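The "list<item: float>" notation in the README is pyarrow's string form of a list type. A small check, assuming the pyarrow mapping (which this diff does not show explicitly):

import pyarrow as pa

embedding_type = pa.list_(pa.float32())
print(embedding_type)  # list<item: float>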
18 changes: 7 additions & 11 deletions components/embed_images/fondant_component.yaml
@@ -2,21 +2,17 @@ name: Embed images
description: Component that generates CLIP embeddings from images
image: fndnt/embed_images:dev
tags:
- Image processing
- Image processing

consumes:
images:
fields:
data:
type: binary
images_data:
type: binary

produces:
embeddings:
fields:
data:
type: array
items:
type: float32
embeddings_data:
type: array
items:
type: float32

args:
model_id:
4 changes: 2 additions & 2 deletions components/embed_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
self.batch_size = batch_size

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
images = dataframe["images"]["data"]
images = dataframe["images_data"]

results: t.List[pd.Series] = []
for batch in np.split(
@@ -110,4 +110,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
).T
results.append(embeddings)

return pd.concat(results).to_frame(name=("embeddings", "data"))
return pd.concat(results).to_frame(name=("embeddings_data"))
8 changes: 3 additions & 5 deletions components/embed_text/README.md
@@ -7,14 +7,12 @@ Component that generates embeddings of text passages.

**This component consumes:**

- text
- data: string
- text_data: string

**This component produces:**

- text
- data: string
- embedding: list<item: float>
- text_data: string
- text_embedding: list<item: float>

### Arguments

22 changes: 9 additions & 13 deletions components/embed_text/fondant_component.yaml
@@ -5,21 +5,17 @@ tags:
- Text processing

consumes:
text:
fields:
data:
type: string
text_data:
type: string

produces:
text:
fields:
data:
type: string
embedding:
type: array
items:
type: float32

text_data:
type: string
text_embedding:
type: array
items:
type: float32

args:
model_provider:
description: |
4 changes: 2 additions & 2 deletions components/embed_text/src/main.py
@@ -65,7 +65,7 @@ def get_embeddings_vectors(self, texts):
return self.embedding_model.embed_documents(texts.tolist())

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
dataframe[("text", "embedding")] = self.get_embeddings_vectors(
dataframe[("text", "data")],
dataframe["text_embedding"] = self.get_embeddings_vectors(
dataframe["text_data"],
)
return dataframe
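Here the produced field is assigned as a new flat column next to the consumed one, where the key used to be the tuple ("text", "embedding"). A minimal stand-in with fake vectors instead of a model call:

import pandas as pd

dataframe = pd.DataFrame({"text_data": ["hello", "world"]})
dataframe["text_embedding"] = [[0.1, 0.2], [0.3, 0.4]]   # placeholder embeddings
print(dataframe)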
6 changes: 2 additions & 4 deletions components/embedding_based_laion_retrieval/README.md
@@ -9,13 +9,11 @@ used to find images similar to the embedded images / captions.

**This component consumes:**

- embeddings
- data: list<item: float>
- embeddings_data: list<item: float>

**This component produces:**

- images
- url: string
- images_url: string

### Arguments

Expand Down
18 changes: 7 additions & 11 deletions components/embedding_based_laion_retrieval/fondant_component.yaml
@@ -7,19 +7,15 @@ tags:
- Data retrieval

consumes:
embeddings:
fields:
data:
type: array
items:
type: float32
embeddings_data:
type: array
items:
type: float32

produces:
images:
fields:
url:
type: string
additionalSubsets: false
images_url:
type: string
# additionalFields: false

args:
num_images: