Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Non-api version of Unstrctured file converter #258

Closed
wants to merge 2 commits into from

Conversation

sjrl
Copy link
Contributor

@sjrl sjrl commented Jan 24, 2024

I added a version of the Unstructured File Converter that runs without the docker image.

I believe it is nice to provide a choice to the user if they would like to use a hosted version of Unstructured versus non-hosted in case their environment does not allow them to start an additional hosted service through Docker.

The change to make this work is very minimal but does lead to some questions for me about best practices with integrations.

Questions:

  1. The new version requires additional python dependencies. So we went from pip install unstructured to pip install unstructured[pdf]. Is it all right to require this new dependency or would it be better to follow something like Haystack's dependency management where the [pdf] part is optional using something like LazyImport?
  2. Would it make sense to rename the original UnstructuredFileConverter to UnstructuredAPIFileConverter to make it clearer that it uses the hosted version of Unstructured? Then the new Local version could be named UnstructuredFileConverter.

Additional Comments

  • If we are okay with keeping these in the same package (refer to question 1.), I'll spend the time to refactor the testing code to parametrize it to reduce the duplicate code.

@sjrl sjrl requested a review from a team as a code owner January 24, 2024 10:31
@sjrl sjrl requested review from julian-risch and removed request for a team January 24, 2024 10:31
@anakin87 anakin87 self-requested a review January 24, 2024 10:34
@anakin87
Copy link
Member

Hey!
TBH, I initially discarded this option (local installation) because it can require several tricky (python and system) dependencies to make it work with different file types.

If we want to support this option, I'm not sure about shipping it in the same package... 🤔

@sjrl
Copy link
Contributor Author

sjrl commented Jan 24, 2024

Hey @anakin87 thanks for the feedback!

I agree that the full installation of Unstructured looks tricky, but I think it would still be worthwhile to support it for a lesser set of file types or let users optionally install the different dependencies locally. It's just in some settings (like in dC) it's more difficult to spin up an additional (large) docker image than having a local install.

@julian-risch
Copy link
Member

UnstructuredLocalFileConverter and UnstructuredAPIFileConverter sound intuitive to me as component names. Even only the additional dependencies needed for the UnstructuredLocalFileConverter with pip install unstructured[pdf] would justify having a separate integration in my opinion too. I suggest one integration tile to describe the two options but two separate packages. @anakin87 I agree with you on that and would leave the review to you.

@julian-risch julian-risch removed their request for review January 25, 2024 16:41
@sjrl
Copy link
Contributor Author

sjrl commented Jan 26, 2024

would justify having a separate integration in my opinion too

Okay this sounds reasonable to me. Would I need to create a new folder called something like integrations/unstructured-local or is there a way to create a separate package within the existing integrations/unstructured folder?

@anakin87
Copy link
Member

@sjrl I'll let you know! Sorry for the delay...

@anakin87
Copy link
Member

Hey, @sjrl!
I discussed the best way to support this addition with @masci.

  • Let's keep the two components in this integration. (Each can live in a separate file).
  • (I would avoid inheritance, the common logic/methods can live in a utils file)

Dependencies

  • To keep things simple, the only default dependency will be unstructured
  • If the user wants to use the UnstructuredLocalFileConverter, they should manually install and deal with the additional dependencies (we can document that)
  • If we only have unstructured installed (the default), running partition(filename="my_pdf_doc.pdf")
    fails with ImportError: partition_pdf is not available. Install the pdf dependencies with pip install "unstructured[pdf]"
    which is expressive and useful, so we can catch and reraise this specific exception (we can also add a reference to unstructured installation docs).

Please let me know if you have any doubts or need clarification!

@sjrl sjrl marked this pull request as draft February 15, 2024 11:30
@sjrl
Copy link
Contributor Author

sjrl commented Mar 18, 2024

Closing for now, may revisit in the future when I have more time.

@sjrl sjrl closed this Mar 18, 2024
@masci masci deleted the unstructured-local branch April 4, 2024 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants