[DataComp] Run pipeline at scale #337

Closed
wants to merge 65 commits into from

@NielsRogge NielsRogge commented Aug 8, 2023

This PR supersedes #319 and includes all changes made to run the DataComp pipeline at scale. It's not meant to be merged, as it's too big (see below), but it makes it possible to reproduce the UnicodeDecodeError issue.

It includes:

  • adding int64 to the Fondant schema
  • repartitioning the index dataframe so that it has the same number of partitions as the subset dataframe before merging
  • two new components: download_images and detect_text.

This branch also includes two variations of the detect_text component:

  • detect_text_gpu: replaces onnxruntime with onnxruntime-gpu in requirements.txt so that it can use a GPU
  • detect_text_torch_gpu: uses plain PyTorch instead of ONNX to run inference.

At the moment, both GPU variants hit the following error:

Traceback (most recent call last):
  File "/component/src/main.py", line 142, in <module>
    executor.execute(DetextTextComponent)
  File "/opt/conda/lib/python3.10/site-packages/fondant/executor.py", line 203, in execute
    self._write_data(dataframe=output_df, manifest=output_manifest)
  File "/opt/conda/lib/python3.10/site-packages/fondant/executor.py", line 188, in _write_data
    data_writer.write_dataframe(dataframe)
  File "/opt/conda/lib/python3.10/site-packages/fondant/data_io.py", line 232, in write_dataframe
    dd.compute(*write_tasks)
  File "/opt/conda/lib/python3.10/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/opt/conda/lib/python3.10/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/opt/conda/lib/python3.10/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/opt/conda/lib/python3.10/site-packages/dask/dataframe/_pyarrow.py", line 82, in _to_string_dtype
    df = df.astype(dtypes, copy=False)
  File "pandas/_libs/lib.pyx", line 712, in pandas._libs.lib.ensure_string_array
  File "pandas/_libs/lib.pyx", line 781, in pandas._libs.lib.ensure_string_array
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The error occurs when writing image data to the cloud. Oddly, the same write works for the CPU-only detect_text component.
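One possible reading of the error, not confirmed in this PR: the traceback ends in Dask's `_to_string_dtype`, which casts a column to a string dtype, and byte 0xff at position 0 is what a JPEG file starts with (its start-of-image marker is FF D8). If a column holding raw image bytes were treated as text, decoding would fail in exactly this way. The standard-library sketch below reproduces only the decoding failure; `JPEG_HEADER` and `try_decode` are names made up for the example.

```python
# First four bytes of a typical JPEG/JFIF file:
# start-of-image marker (FF D8) followed by the APP0 marker (FF E0).
JPEG_HEADER = b"\xff\xd8\xff\xe0"


def try_decode(raw: bytes) -> str:
    """Decode bytes as UTF-8, returning the error text instead of raising."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"UnicodeDecodeError: {err}"


# Ordinary text bytes decode fine; JPEG bytes do not.
print(try_decode(b"caption"))
message = try_decode(JPEG_HEADER)
print(message)
```

The second message has the same wording as the last line of the traceback, which is why the binary image column is a reasonable first place to look.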

As this branch is too large to be merged, I'll break it down into smaller parts:

@NielsRogge NielsRogge marked this pull request as draft August 8, 2023 08:26
@NielsRogge NielsRogge closed this Sep 12, 2023
@janvanlooyml6 janvanlooyml6 deleted the debug branch January 9, 2024 13:54