Add load from pdf component #765
Thanks @PhilippeMoussalli!
- Can you move the `test_file` and `test_folder` into the `tests` directory?
- As discussed, this currently doesn't work with larger-than-memory data. For that, we probably first need to fetch all the paths, partition them, and apply a transform that loads them.
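The second point (fetch the paths, partition them, then load per partition) can be sketched without any framework as follows. This is only an illustration under assumptions: `partition_paths`, `load_partition`, and the `loader` callback are hypothetical names that do not appear in the PR. In the component itself, the same idea would presumably map onto a Dask dataframe of paths with a per-partition transform.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def partition_paths(paths: Sequence[str], n_partitions: int) -> List[List[str]]:
    """Split paths into at most n_partitions contiguous, near-equal chunks."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be >= 1")
    # Never create more chunks than there are paths (but always at least one).
    n = min(n_partitions, len(paths)) or 1
    size, extra = divmod(len(paths), n)
    chunks: List[List[str]] = []
    start = 0
    for i in range(n):
        # The first `extra` chunks get one additional path each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(paths[start:end]))
        start = end
    return chunks


def load_partition(paths: Sequence[str], loader: Callable[[str], T]) -> Iterator[T]:
    """Lazily apply a (hypothetical) loader to each path of one partition."""
    for path in paths:
        # Nothing is read until the caller iterates, so only one
        # document per partition needs to be in memory at a time.
        yield loader(path)
```

Only the list of paths is materialized up front; the documents themselves are read one at a time as each partition is consumed.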
Thanks for the suggestions, made the necessary changes.
Thanks @PhilippeMoussalli. One optional comment. Up to you if you think it's worth the effort.
components/load_from_pdf/src/main.py (outdated)

    dask_df = dd.from_pandas(
        pd.DataFrame({"pdf_path": file_paths}),
        npartitions=os.cpu_count(),
This could probably be a parameter so it can be upped for larger datasets.
Added :)
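For illustration, one way such a parameter could behave is sketched below. The name `n_partitions` and the helper `resolve_n_partitions` are assumptions made for this example, not necessarily what was added in the PR. The sketch keeps the CPU count as the default and guards against `os.cpu_count()` returning `None`, which it may do when the count cannot be determined.

```python
import os
from typing import Optional


def resolve_n_partitions(n_partitions: Optional[int] = None) -> int:
    """Return the number of partitions to use (hypothetical helper).

    Falls back to the number of CPUs when no value is given.
    """
    if n_partitions is None:
        # os.cpu_count() can return None, so fall back to a single partition.
        return os.cpu_count() or 1
    if n_partitions < 1:
        raise ValueError("n_partitions must be a positive integer")
    return n_partitions
```

The resolved value could then be passed as `npartitions` to `dd.from_pandas`, so larger datasets can be split into more, smaller partitions.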
Fixes https://github.com/ml6team/fondant-use-cases/issues/54
This PR adds functionality to load PDF documents from local and remote storage.
The implementation differs from the solution suggested in #54 since:
- The `Unstructured` loader requires a lot of system and CUDA dependencies to install (a lot of overhead for just loading PDFs).

The current implementation relies on copying the PDFs to temporary local storage and loading them with the `PyPDFDirectoryLoader`; they are then loaded lazily. The assumption for now is that the loaded docs won't exceed the storage of the device, which should be valid for most use cases. Later on, we can think about how to optimize this further.
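The staging step described above can be sketched with the standard library alone. This is a simplified illustration, not the component's code: `stage_pdfs_locally` is a hypothetical helper, it only handles a local source directory (fetching from remote storage is not shown), and the actual loading step is left out.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Optional


def stage_pdfs_locally(source_dir: str, staging_dir: Optional[str] = None) -> List[Path]:
    """Copy every *.pdf under source_dir (recursively) into a staging directory.

    If no staging directory is given, a fresh temporary one is created.
    Returns the paths of the copied files.
    """
    target = Path(staging_dir) if staging_dir else Path(tempfile.mkdtemp(prefix="pdfs_"))
    target.mkdir(parents=True, exist_ok=True)
    copied: List[Path] = []
    for src in sorted(Path(source_dir).rglob("*.pdf")):
        # Files are flattened into one directory; files with the same
        # name in different subfolders would overwrite each other.
        dst = target / src.name
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

A directory loader such as `PyPDFDirectoryLoader` would then be pointed at the staging directory. Since everything is copied first, the staged files must fit on local disk, which is the assumption stated above.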