[DataComp] Run pipeline at scale #337

NielsRogge · 2023-08-08T08:25:22Z

This PR supersedes #319 and includes all changes made to run the DataComp pipeline at scale. It's not meant to be merged because it's too big (see below), but it makes it possible to reproduce the UnicodeDecodeError issue.

It includes:

adding int64 to the Fondant schema
repartitioning the index dataframe to have the same number of partitions as the subset dataframe before merging
two new components: download_images and detect_text.

This branch also includes 2 variations of the detect_text component, namely:

detect_text_gpu: this one replaces onnxruntime with onnxruntime-gpu in requirements.txt so that it can use a GPU
detect_text_torch_gpu: this one leverages plain PyTorch instead of ONNX to run inference.

At the moment, both GPU variants hit the following error:

Traceback (most recent call last):
  File "/component/src/main.py", line 142, in <module>
    executor.execute(DetextTextComponent)
  File "/opt/conda/lib/python3.10/site-packages/fondant/executor.py", line 203, in execute
    self._write_data(dataframe=output_df, manifest=output_manifest)
  File "/opt/conda/lib/python3.10/site-packages/fondant/executor.py", line 188, in _write_data
    data_writer.write_dataframe(dataframe)
  File "/opt/conda/lib/python3.10/site-packages/fondant/data_io.py", line 232, in write_dataframe
    dd.compute(*write_tasks)
  File "/opt/conda/lib/python3.10/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/opt/conda/lib/python3.10/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/opt/conda/lib/python3.10/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/opt/conda/lib/python3.10/site-packages/dask/dataframe/_pyarrow.py", line 82, in _to_string_dtype
    df = df.astype(dtypes, copy=False)
  File "pandas/_libs/lib.pyx", line 712, in pandas._libs.lib.ensure_string_array
  File "pandas/_libs/lib.pyx", line 781, in pandas._libs.lib.ensure_string_array
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

This error occurs when writing image data to the cloud. Oddly, writing works for the CPU-only detect_text component.
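The error message is consistent with raw image bytes being decoded as text. JPEG data starts with the bytes 0xFF 0xD8, and 0xFF is never a valid first byte in UTF-8. The sketch below reproduces the same message in plain Python. Whether the failing column actually holds JPEG bytes is an assumption, since the traceback doesn't confirm it, and `try_decode` is a hypothetical helper written for this illustration.

```python
# First bytes of a typical JPEG file: SOI marker (0xFF 0xD8) followed by an APP0 marker.
# Assumption: the image column in the failing dataframe holds bytes like these.
jpeg_header = b"\xff\xd8\xff\xe0"


def try_decode(data: bytes) -> str:
    """Return "ok" if data is valid UTF-8, otherwise the decode error message."""
    try:
        data.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as err:
        return str(err)


message = try_decode(jpeg_header)
print(message)
```

If this is indeed the cause, a likely direction is to keep the image column typed as binary instead of letting it be converted to a string dtype when writing.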

As this branch is too large to be merged, I'll break it down into smaller parts:

support int64 as dtype in the schema: Add int64 dtype #338
add dataset_length argument and set_index to load_from_hf_hub: [load_from_hf_hub] Add dataset_length, set_index #339
update pipeline name, remove DockerCompiler: [DataComp] Update pipeline name, remove DockerCompiler #340
add download_images component to the datacomp pipeline: [DataComp] Add download images component #348
add detect_text component to the datacomp pipeline: [DataComp] Add T-MARS #374

NielsRogge added 30 commits August 3, 2023 08:49

More fixes

7dcd117

More improvements

05c7156

More improvements

aff3fee

Add logging

710685b

Update dockerfile

c070bab

Fix dtype

698b92c

Update Dockerfile

6ed5384

More updates

f253d9c

Update logging

2c990c4

More improvements

87df957

Update specs

f702be1

Improve load_from_hf_hub component

7c19cc7

Update specs

08bc45a

Add task graph

0912acc

Add graphviz to the dependencies

9d5208d

Update Dockerfile

0d42734

Add more

11a4ec7

Add visualize

86a1336

More improvements

eefd05a

Fix visualization

88dec53

Remove line

a1bcb50

More improvements

4ddca85

Add print statements

30fcfe3

More improvements

887f48f

More improvements

ab17641

Comment out code

12ba25f

More improvements

6e1d318

Remove print statements

ef5c323

Fix repartioning

cfba11a

More improvements

014543b

NielsRogge added 26 commits August 3, 2023 08:49

Use map_partitions

0c71130

Add logging

dd9d06d

More improvements

a940050

More improvements

3ae578a

Fix rebase

e490a2e

More improvements

edc65f5

Fix rebase

7acbeba

Include uids

f225ef9

More improvements

1bfe259

More improvements

025399a

More improvements

0c9ac7f

More improvements

c179edf

More improvements

7866025

Use cpu for now

22de5f6

More improvements

dcc714e

Run text detection on 1000 images

fa341a5

Remove print statement

dada010

More improvements

7ec067b

More improvements

ac2a130

More improvements

3d433d9

Simplify requirements

cc418db

More improvements

fff8ca0

Add print statement

be3e4b8

Add more print statements

b40c7d9

More improvements

2cee54b

Remove dummy op

39c5643

NielsRogge marked this pull request as draft August 8, 2023 08:26

Update dockerfile

8ce9b37

NielsRogge closed this Sep 12, 2023

janvanlooyml6 deleted the debug branch January 9, 2024 13:54
