[DataComp] Run pipeline at scale #337

Closed
wants to merge 65 commits
Commits
7dcd117
More fixes
NielsRogge Jul 25, 2023
05c7156
More improvements
NielsRogge Jul 25, 2023
aff3fee
More improvements
NielsRogge Jul 25, 2023
710685b
Add logging
NielsRogge Jul 25, 2023
c070bab
Update dockerfile
NielsRogge Jul 25, 2023
698b92c
Fix dtype
NielsRogge Jul 25, 2023
6ed5384
Update Dockerfile
NielsRogge Jul 25, 2023
f253d9c
More updates
NielsRogge Jul 26, 2023
2c990c4
Update logging
NielsRogge Jul 26, 2023
87df957
More improvements
NielsRogge Jul 26, 2023
f702be1
Update specs
NielsRogge Jul 26, 2023
7c19cc7
Improve load_from_hf_hub component
NielsRogge Jul 26, 2023
08bc45a
Update specs
NielsRogge Jul 26, 2023
0912acc
Add task graph
NielsRogge Jul 26, 2023
9d5208d
Add graphviz to the dependencies
NielsRogge Jul 26, 2023
0d42734
Update Dockerfile
NielsRogge Jul 26, 2023
11a4ec7
Add more
NielsRogge Jul 26, 2023
86a1336
Add visualize
NielsRogge Jul 26, 2023
eefd05a
More improvements
NielsRogge Jul 26, 2023
88dec53
Fix visualization
NielsRogge Jul 26, 2023
a1bcb50
Remove line
NielsRogge Jul 26, 2023
4ddca85
More improvements
NielsRogge Jul 26, 2023
30fcfe3
Add print statements
NielsRogge Jul 26, 2023
887f48f
More improvements
NielsRogge Jul 27, 2023
ab17641
More improvements
NielsRogge Jul 27, 2023
12ba25f
Comment out code
NielsRogge Jul 27, 2023
6e1d318
More improvements
NielsRogge Jul 27, 2023
ef5c323
Remove print statements
NielsRogge Jul 27, 2023
cfba11a
Fix repartioning
NielsRogge Jul 28, 2023
014543b
More improvements
NielsRogge Jul 28, 2023
b63c5cb
More improvements
NielsRogge Jul 28, 2023
ce67179
Add download images component
NielsRogge Aug 1, 2023
3d9f119
Update script
NielsRogge Aug 1, 2023
e288749
Remove graphviz
NielsRogge Aug 1, 2023
6e6bd6a
More improvements
NielsRogge Aug 1, 2023
3a97346
Debug
NielsRogge Aug 1, 2023
d1882ec
Use Pandas component
NielsRogge Aug 1, 2023
f81f122
Run on 1000 images
NielsRogge Aug 1, 2023
0c71130
Use map_partitions
NielsRogge Aug 2, 2023
dd9d06d
Add logging
NielsRogge Aug 2, 2023
a940050
More improvements
NielsRogge Aug 2, 2023
3ae578a
More improvements
NielsRogge Aug 2, 2023
e490a2e
Fix rebase
NielsRogge Aug 2, 2023
edc65f5
More improvements
NielsRogge Aug 3, 2023
7acbeba
Fix rebase
NielsRogge Aug 3, 2023
f225ef9
Include uids
NielsRogge Aug 3, 2023
1bfe259
More improvements
NielsRogge Aug 3, 2023
025399a
More improvements
NielsRogge Aug 3, 2023
0c9ac7f
More improvements
NielsRogge Aug 3, 2023
c179edf
More improvements
NielsRogge Aug 3, 2023
7866025
More improvements
NielsRogge Aug 3, 2023
22de5f6
Use cpu for now
NielsRogge Aug 3, 2023
dcc714e
More improvements
NielsRogge Aug 3, 2023
fa341a5
Run text detection on 1000 images
NielsRogge Aug 3, 2023
dada010
Remove print statement
NielsRogge Aug 3, 2023
7ec067b
More improvements
NielsRogge Aug 3, 2023
ac2a130
More improvements
NielsRogge Aug 4, 2023
3d433d9
More improvements
NielsRogge Aug 4, 2023
cc418db
Simplify requirements
NielsRogge Aug 4, 2023
fff8ca0
More improvements
NielsRogge Aug 6, 2023
be3e4b8
Add print statement
NielsRogge Aug 6, 2023
b40c7d9
Add more print statements
NielsRogge Aug 6, 2023
2cee54b
More improvements
NielsRogge Aug 6, 2023
39c5643
Remove dummy op
NielsRogge Aug 8, 2023
8ce9b37
Update dockerfile
NielsRogge Aug 8, 2023
2 changes: 1 addition & 1 deletion components/filter_image_resolution/Dockerfile
@@ -11,7 +11,7 @@ RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
ARG FONDANT_VERSION=f3f3925b8e8f634e2978e5c7fcefa72c53baba7c
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
6 changes: 3 additions & 3 deletions components/filter_image_resolution/fondant_component.yaml
@@ -1,14 +1,14 @@
name: Filter image resolution
description: Component that filters images based on minimum size and max aspect ratio
image: ghcr.io/ml6team/filter_image_resolution:latest
image: ghcr.io/ml6team/filter_image_resolution:f3f3925b8e8f634e2978e5c7fcefa72c53baba7c

consumes:
image:
fields:
width:
type: int16
type: int64
height:
type: int16
type: int64

args:
min_image_dim:
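For reference, the ranges of the two integer types swapped in the spec change above — a quick check, not part of the PR, and the diff itself does not state the motivation for widening the type:

```python
import numpy as np

# Ranges of the two dtypes involved in the width/height type change (illustrative only).
print(np.iinfo(np.int16).max)  # 32767
print(np.iinfo(np.int64).max)  # 9223372036854775807
```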
2 changes: 1 addition & 1 deletion components/load_from_hf_hub/Dockerfile
@@ -11,7 +11,7 @@ RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
ARG FONDANT_VERSION=edc65f516067937401e325650d5b5409f071bc39
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
9 changes: 7 additions & 2 deletions components/load_from_hf_hub/fondant_component.yaml
@@ -1,6 +1,6 @@
name: Load from hub
description: Component that loads a dataset from the hub
image: ghcr.io/ml6team/load_from_hf_hub:dev
image: ghcr.io/ml6team/load_from_hf_hub:edc65f516067937401e325650d5b5409f071bc39

produces:
dummy_variable: #TODO: fill in here
@@ -23,4 +23,9 @@ args:
n_rows_to_load:
description: Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale
type: int
default: None
default: None
dataset_length:
description: Optional argument that defines the length of the dataset. Required in case `n_rows_to_load` is specified.
type: int
default: None

26 changes: 23 additions & 3 deletions components/load_from_hf_hub/src/main.py
@@ -16,6 +16,7 @@ def __init__(self, *_,
column_name_mapping: dict,
image_column_names: t.Optional[list],
n_rows_to_load: t.Optional[int],
dataset_length: int,
) -> None:
"""
Args:
@@ -25,11 +26,14 @@ format the image from HF hub format to a byte string
format the image from HF hub format to a byte string
n_rows_to_load: optional argument that defines the number of rows to load. Useful for
testing pipeline runs on a small scale.
dataset_length: optional argument that specifies the length of the entire dataset. Only
required in case n_rows_to_load is specified.
"""
self.dataset_name = dataset_name
self.column_name_mapping = column_name_mapping
self.image_column_names = image_column_names
self.n_rows_to_load = n_rows_to_load
self.dataset_length = dataset_length

def load(self) -> dd.DataFrame:
# 1) Load data, read as Dask dataframe
@@ -44,12 +48,28 @@ def load(self) -> dd.DataFrame:
)

# 3) Rename columns
logger.info("Renaming columns...")
dask_df = dask_df.rename(columns=self.column_name_mapping)

# 4) Optional: only return specific amount of rows
if self.n_rows_to_load:
dask_df = dask_df.head(self.n_rows_to_load)
dask_df = dd.from_pandas(dask_df, npartitions=1)
if self.n_rows_to_load is not None:
if self.dataset_length is None:
raise ValueError("""Make sure to also specify the length of the entire
dataset. This is required as otherwise only the first
partition can be loaded""")
logger.info(f"""Loading approximately {self.n_rows_to_load} rows...
at least one partition""")
partition_length = self.dataset_length // dask_df.npartitions
npartitions = max(self.n_rows_to_load // partition_length, 1)
dask_df = dask_df.head(self.n_rows_to_load, npartitions=npartitions)
dask_df = dd.from_pandas(dask_df, npartitions=npartitions)
# .reset_index(drop=True) # will reset it from 0 for every partition

# Set monotonically increasing index
logger.info("Setting the index...")
dask_df["id"] = 1
dask_df["id"] = dask_df.id.cumsum()
dask_df = dask_df.set_index("id", sort=True)

return dask_df

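For context, a minimal sketch (not part of the PR) of what the new partial-loading logic above does: a lazy Dask dataframe does not know its own length, so the caller supplies `dataset_length`, the component estimates rows per partition, and `head()` is told how many partitions to scan before the subset is turned back into a Dask dataframe with a monotonically increasing index. The toy dataframe and the numbers below are assumptions for illustration.

```python
import dask.dataframe as dd
import pandas as pd

# Toy source data: 10 partitions of ~1_000 rows each.
pdf = pd.DataFrame({"x": range(10_000)})
dask_df = dd.from_pandas(pdf, npartitions=10)

dataset_length = 10_000   # passed in because Dask cannot know this lazily
n_rows_to_load = 3_000    # rows we actually want

# Same arithmetic as the diff above.
partition_length = dataset_length // dask_df.npartitions        # 1_000
npartitions = max(n_rows_to_load // partition_length, 1)        # 3
subset = dask_df.head(n_rows_to_load, npartitions=npartitions)  # pandas DataFrame, 3_000 rows
dask_df = dd.from_pandas(subset, npartitions=npartitions)

# Monotonically increasing index across partitions, as in the diff above.
dask_df["id"] = 1
dask_df["id"] = dask_df.id.cumsum()
dask_df = dask_df.set_index("id", sort=True)

print(len(dask_df))                    # 3000
print(dask_df.head(3).index.tolist())  # [1, 2, 3]
```

Because of the floor division, the number of rows actually scanned is approximate when `n_rows_to_load` does not align with partition boundaries, which matches the "Loading approximately ..." log message in the diff.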
2 changes: 1 addition & 1 deletion examples/pipelines/datacomp/components/cluster_image_embeddings/Dockerfile
@@ -11,7 +11,7 @@ RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
ARG FONDANT_VERSION=79df895e9d62d2010ccb8d40ee7e4fd4c68f117d
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
2 changes: 1 addition & 1 deletion examples/pipelines/datacomp/components/cluster_image_embeddings/src/main.py
@@ -16,7 +16,7 @@
class ClusterImageEmbeddingsComponent(DaskTransformComponent):
"""Component that clusters images based on embeddings."""

def __init__(self, sample_ratio: float, num_clusters: int) -> None:
def __init__(self, *_, sample_ratio: float, num_clusters: int) -> None:
self.sample_ratio = sample_ratio
self.num_clusters = num_clusters

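A short illustration (not from the PR) of what the added `*_` in the signature above changes: any extra positional arguments passed to the constructor are collected and ignored, and `sample_ratio` / `num_clusters` become keyword-only. That the component runner passes extra positionals is an assumption made for this example; the keyword-only mechanics are standard Python.

```python
class ClusterStub:
    """Stand-in for the component class, showing only the signature change."""

    def __init__(self, *_, sample_ratio: float, num_clusters: int) -> None:
        # Positional arguments land in "_" and are discarded;
        # sample_ratio and num_clusters can only be passed by keyword.
        self.sample_ratio = sample_ratio
        self.num_clusters = num_clusters


c = ClusterStub("ignored-positional", sample_ratio=0.5, num_clusters=10)  # OK
# ClusterStub(0.5, 10)  # TypeError: missing required keyword-only arguments
```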
24 changes: 24 additions & 0 deletions examples/pipelines/datacomp/components/detect_text/Dockerfile
@@ -0,0 +1,24 @@
FROM --platform=linux/amd64 python:3.8-slim

# System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git -y

# Install requirements
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=39c56436e20fb920a50c26a4d0753251993f3251
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["python", "main.py"]
21 changes: 21 additions & 0 deletions examples/pipelines/datacomp/components/detect_text/fondant_component.yaml
@@ -0,0 +1,21 @@
name: Detect text
description: Component that detects text in images
image: ghcr.io/ml6team/detect_text:39c56436e20fb920a50c26a4d0753251993f3251

consumes:
image:
fields:
data:
type: binary

produces:
image:
fields:
data:
type: binary
boxes:
type: array
items:
type: array
items:
type: int64
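To make the nested `boxes` type above concrete, a small sketch (not part of the PR) of what the produced column could hold: one entry per image, each entry a list of boxes, each box a list of int64 coordinates. The per-box layout used here (four corner points as flattened x/y pairs) is an assumption; the spec only fixes the array-of-array-of-int64 type.

```python
import pandas as pd

# Hypothetical "image.boxes" values for three images.
boxes_column = pd.Series(
    [
        [[10, 10, 120, 10, 120, 40, 10, 40]],   # one detected text box
        [],                                     # no text detected
        [[5, 5, 60, 5, 60, 25, 5, 25],
         [70, 80, 200, 80, 200, 120, 70, 120]], # two detected boxes
    ],
    name="image_boxes",
)
print(boxes_column.map(len))  # number of detected boxes per image
```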
4 changes: 4 additions & 0 deletions examples/pipelines/datacomp/components/detect_text/requirements.txt
@@ -0,0 +1,4 @@
huggingface-hub==0.16.4
onnxruntime==1.15.1
opencv-python-headless
scipy