Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add load from pdf component #765

Merged
merged 4 commits into from
Jan 11, 2024
Merged

Conversation

PhilippeMoussalli
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli commented Jan 9, 2024

Fixes https://github.com/ml6team/fondant-use-cases/issues/54

PR that adds the functionality to load pdf documents from different local and remote storage.

The implementation differs from the suggested solution at #54 since:

  • Accumulating different loaders and loading each document individually seems to be inefficient since it would require the initialization of a client, temp storage, ... on every invocation link
  • The langchain cloud loaders don't have a unified interface
    • Each would requires specific arguments to be passed (in contrast fsspec is much simpler)
    • Only the google loader enables defining a custom loader class, the rest uses the Unstructured loader which requires a lot of system and cuda dependencies to have it installed (a lot of overhead for just loading pdfs)

The current implementation relies on copying the pdfs to a temporary local storage and loading them using the PyPDFDirectoryLoader, they are then loaded lazily. The assumption for now is that the loaded docs won't exceed the storage of the device which should be valid for most use cases. Later on, we can think on how to optimize this further.

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PhilippeMoussalli!

  • Can you move the test_file and test_folder into the tests directory?
  • As discussed, this currently doesn't work with larger-than-memory data. For that, we probably first need to fetch all the paths, partition them, and apply a transform that loads them.

@PhilippeMoussalli
Copy link
Contributor Author

Thanks @PhilippeMoussalli!

  • Can you move the test_file and test_folder into the tests directory?
  • As discussed, this currently doesn't work with larger-than-memory data. For that, we probably first need to fetch all the paths, partition them, and apply a transform that loads them.

Thanks for the suggestions, made the necessary changes

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PhilippeMoussalli. One optional comment. Up to you if you think it's worth the effort.


dask_df = dd.from_pandas(
pd.DataFrame({"pdf_path": file_paths}),
npartitions=os.cpu_count(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably be a parameter so it can be upped for larger datasets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added :)

@PhilippeMoussalli PhilippeMoussalli merged commit b422fc3 into main Jan 11, 2024
6 checks passed
@PhilippeMoussalli PhilippeMoussalli deleted the add-load-from-pdf-component branch January 11, 2024 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants