Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unstructured: fix metadata order mixed up #336

Merged
merged 22 commits into from
Feb 5, 2024

Conversation

lambda-science
Copy link
Contributor

Fix #331

Because unstructured integration code use python sets, file orders are not preserved leading to metadata attribution to wrong filepaths. Using lists fix this as list are ordered.

Also C/C from the issue, I added:

And actually I'm not sure why we need a set logic here to make filepath unique. I feel like it's up to the user to provide unique paths ? For example what happens if a user provide 10 path and 10 metadata but then some filepath are duplicated so then we have 8 filepath and 10 metadata ? It will raise error from normalize_metadata I guess

What I think is that:

We can support directories as path BUT then metadata should max be of a length of 1 (same metadata for all files in directory). Because I'm not sure it's clear how path.glob() orders files (leading to metadata attribution confusion)
If direct paths to files are provided: don't make them unique with sets.

lambda-science and others added 18 commits January 18, 2024 13:26
…/converters/unstructured/converter.py

Co-authored-by: Stefano Fiorucci <[email protected]>
…/converters/unstructured/converter.py

Co-authored-by: Stefano Fiorucci <[email protected]>
…/converters/unstructured/converter.py

Co-authored-by: Stefano Fiorucci <[email protected]>
…rs. Raise error if glob and metadata list because unsafe
@lambda-science lambda-science requested a review from a team as a code owner February 4, 2024 11:30
@lambda-science lambda-science requested review from shadeMe and removed request for a team February 4, 2024 11:30
@anakin87 anakin87 self-requested a review February 4, 2024 11:32
@lambda-science
Copy link
Contributor Author

lambda-science commented Feb 4, 2024

I'm sorry for the long commit list it's weird. It took all my previous history also from #242
Maybe I should have created a new branch :/

Also: tests are failing because no hard drive space :p

@masci masci removed the request for review from shadeMe February 5, 2024 08:03
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey... Thanks for the fix!

I generally agree with your approach.

Let's also update the docstrings about meta to reflect the changes...

…/converters/unstructured/converter.py

Co-authored-by: Stefano Fiorucci <[email protected]>
@lambda-science
Copy link
Contributor Author

Sounds good to me ! :)

@anakin87 anakin87 self-requested a review February 5, 2024 09:41
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@anakin87 anakin87 merged commit 96f6ade into deepset-ai:main Feb 5, 2024
7 checks passed
@lambda-science lambda-science deleted the feat/unstructured_meta_field branch February 5, 2024 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unstructured: metadata get mixed up
2 participants