Add load from pdf component #765

PhilippeMoussalli · 2024-01-09T08:07:25Z

Fixes https://github.com/ml6team/fondant-use-cases/issues/54

This PR adds a component that loads PDF documents from local and remote storage.

The implementation differs from the solution suggested in #54 for two reasons:

- Accumulating different loaders and loading each document individually seems inefficient, since it would require initializing a client, temp storage, ... on every invocation.
- The langchain cloud loaders don't have a unified interface:
  - Each requires specific arguments to be passed (in contrast, fsspec is much simpler).
  - Only the Google loader allows defining a custom loader class; the rest use the Unstructured loader, which needs many system and CUDA dependencies to install (a lot of overhead for just loading PDFs).

The current implementation copies the PDFs to temporary local storage and then loads them lazily with the PyPDFDirectoryLoader. The assumption for now is that the loaded docs won't exceed the device's storage, which should hold for most use cases. Later on, we can think about how to optimize this further.
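To make the copy-then-load-lazily flow concrete, here is a minimal sketch using only the standard library. It is not the component's code: `shutil` stands in for the fsspec copy, returning raw bytes stands in for PDF parsing with PyPDFDirectoryLoader, and the function names `copy_pdfs_to_temp`, `lazy_load`, and `load_pdfs` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def copy_pdfs_to_temp(source_dir: str, temp_dir: str) -> int:
    """Copy every .pdf found under source_dir into temp_dir and return the count.

    Assumption: file names are unique; files with the same name in different
    subfolders would overwrite each other in this simplified version.
    """
    copied = 0
    for path in sorted(Path(source_dir).rglob("*.pdf")):
        shutil.copy(path, Path(temp_dir) / path.name)
        copied += 1
    return copied


def lazy_load(temp_dir: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (file name, raw bytes) one file at a time instead of all at once."""
    for path in sorted(Path(temp_dir).glob("*.pdf")):
        # Placeholder for real PDF parsing: only the raw bytes are read here.
        yield path.name, path.read_bytes()


def load_pdfs(source_dir: str) -> Iterator[Tuple[str, bytes]]:
    """Copy the PDFs to a temporary directory, then stream them from there."""
    # The temporary directory stays alive while the generator is being consumed
    # and is removed once iteration finishes.
    with tempfile.TemporaryDirectory() as temp_dir:
        copy_pdfs_to_temp(source_dir, temp_dir)
        yield from lazy_load(temp_dir)
```

Because every file is copied before the first document is yielded, the sketch shares the limitation stated above: the full set of PDFs has to fit on local storage.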

RobbeSneyders

Thanks @PhilippeMoussalli!

Can you move the test_file and test_folder into the tests directory?
As discussed, this currently doesn't work with larger-than-memory data. For that, we probably first need to fetch all the paths, partition them, and apply a transform that loads them.
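The approach described above (fetch all the paths, partition them, apply a transform that loads them) can be sketched in plain Python. This is only an illustration under assumptions: the helpers `partition_paths` and `load_partitioned` are hypothetical, `load_one` is whatever function turns a path into a document, and a real implementation would more likely hand the partitions to Dask than loop over them.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def partition_paths(paths: Sequence[str], n_partitions: int) -> List[List[str]]:
    """Split paths into at most n_partitions contiguous, near-equal chunks."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    # Never create more partitions than there are paths (but always at least one).
    n = min(n_partitions, len(paths)) or 1
    size, extra = divmod(len(paths), n)
    chunks: List[List[str]] = []
    start = 0
    for i in range(n):
        # The first `extra` chunks take one additional path each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(paths[start:end]))
        start = end
    return chunks


def load_partitioned(
    paths: Sequence[str], n_partitions: int, load_one: Callable[[str], T]
) -> Iterator[List[T]]:
    """Yield loaded documents one partition at a time.

    Only the documents of the current partition are held in memory.
    """
    for chunk in partition_paths(paths, n_partitions):
        yield [load_one(path) for path in chunk]
```

Only the list of paths has to fit in memory up front; the documents themselves are materialized one partition at a time.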

PhilippeMoussalli · 2024-01-09T11:09:37Z

> Thanks @PhilippeMoussalli!
>
> Can you move the test_file and test_folder into the tests directory?
>
> As discussed, this currently doesn't work with larger-than-memory data. For that, we probably first need to fetch all the paths, partition them, and apply a transform that loads them.

Thanks for the suggestions. I made the necessary changes.

RobbeSneyders

Thanks @PhilippeMoussalli. One optional comment. Up to you if you think it's worth the effort.

RobbeSneyders · 2024-01-11T09:15:19Z

components/load_from_pdf/src/main.py

+
+        dask_df = dd.from_pandas(
+            pd.DataFrame({"pdf_path": file_paths}),
+            npartitions=os.cpu_count(),

This could probably be a parameter so it can be upped for larger datasets.
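One way to expose the partition count as an optional argument is sketched below. The name `n_partitions` and the helper `resolve_n_partitions` are hypothetical, since the merged change is not shown in this thread; the fallback mirrors the `os.cpu_count()` default in the snippet above.

```python
import os
from typing import Optional


def resolve_n_partitions(n_partitions: Optional[int] = None) -> int:
    """Return the partition count to use, falling back to the CPU count."""
    if n_partitions is None:
        # os.cpu_count() can return None on some platforms, so fall back to 1.
        return os.cpu_count() or 1
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    return n_partitions
```

The resolved value could then be passed as `npartitions=resolve_n_partitions(n_partitions)`, so larger datasets can ask for more partitions than there are cores.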

add load from pdf component

5040cae

PhilippeMoussalli requested review from mrchtr and RobbeSneyders January 9, 2024 08:07

Merge branch 'main' into add-load-from-pdf-component

9def335

RobbeSneyders reviewed Jan 9, 2024

address PR feedback

20641b0

RobbeSneyders approved these changes Jan 11, 2024

make partitions an optional argument

a9d1b45

PhilippeMoussalli merged commit b422fc3 into main Jan 11, 2024
6 checks passed

PhilippeMoussalli deleted the add-load-from-pdf-component branch January 11, 2024 09:47
