Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add DocumentLanguageClassifier 2.0 #6037

Merged
merged 18 commits into from
Oct 31, 2023
Merged

Conversation

julian-risch
Copy link
Member

@julian-risch julian-risch commented Oct 12, 2023

Related Issues

Proposed Changes:

  • Add new DocumentLanguageClassifier component that takes a list of Document as input and routes them to different output edges depending on the content language
  • Add unit tests
  • Add one end-to-end test with an indexing pipeline. blocked by PR feat: Add DocumentCleaner 2.0 #5976.

How did you test it?

new unit tests. e2e test still pending.

Notes for the reviewer

Storing the detected language in the Document's metadata could still be done. Not convinced that we need it at this point. Happy to discuss.
Regarding the e2e test, I am working on getting the other PR #5976 merged first. Then I'll add more assertions.
The component is in the preprocessing module right now but could also fit in the routers. I'd keep it in preprocessors because it not only routes but also detects the language.

Checklist

@julian-risch julian-risch requested a review from a team as a code owner October 12, 2023 11:45
@julian-risch julian-risch requested review from masci and removed request for a team October 12, 2023 11:45
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Oct 12, 2023
@julian-risch julian-risch requested a review from a team as a code owner October 12, 2023 11:46
@julian-risch julian-risch requested review from dfokina and removed request for a team October 12, 2023 11:46
@coveralls
Copy link
Collaborator

coveralls commented Oct 13, 2023

Pull Request Test Coverage Report for Build 6508357776

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.09%) to 50.92%

Totals Coverage Status
Change from base Build 6507212414: 0.09%
Covered Lines: 12982
Relevant Lines: 25495

💛 - Coveralls

@julian-risch
Copy link
Member Author

In the end to end test I cam across a problem with the connection named "text/plain" and opened an issue in canals: deepset-ai/canals#130

```python
document_store = MemoryDocumentStore()
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly related to this PR, but why inverting the order of the original signature:

def add_component(self, name: str, instance: Component) -> None:

if passing the instance first is more intuitive we're on time to change it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more intuitive, yes. I created an issue here: https://github.com/deepset-ai/canals/issues/137

@anakin87 anakin87 merged commit 29b1fef into main Oct 31, 2023
22 checks passed
@anakin87 anakin87 deleted the document-language-classifier branch October 31, 2023 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LanguageClassifier
4 participants