feat: Add DocumentCleaner 2.0 #5976

Merged
merged 11 commits into from
Oct 13, 2023

Conversation

@julian-risch julian-risch commented Oct 5, 2023

Related Issues

Proposed Changes:

Add a new DocumentCleaner component with the following options:

  • remove_empty_lines
  • remove_extra_whitespaces
  • remove_repeated_substrings
  • remove_substrings
  • remove_regex
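
To make the simpler options concrete, here is a rough sketch of what four of them could do to a plain string. It is not the component's code: `clean_text` is a hypothetical helper, and the defaults and the order of the steps are assumptions made only for illustration.

```python
import re
from typing import List, Optional


def clean_text(
    text: str,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,
) -> str:
    """Hypothetical helper; parameter names mirror the options listed above."""
    # Delete every literal occurrence of each given substring.
    if remove_substrings:
        for substring in remove_substrings:
            text = text.replace(substring, "")
    # Delete every match of the given regular expression.
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    # Collapse runs of spaces and tabs and trim each line.
    # Working line by line keeps the newlines intact.
    if remove_extra_whitespaces:
        text = "\n".join(
            re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")
        )
    # Drop lines that contain nothing but whitespace.
    if remove_empty_lines:
        text = "\n".join(line for line in text.split("\n") if line.strip())
    return text
```

In this sketch the substring and regex removal run first, so any gaps they leave behind are tidied up by the whitespace and empty-line steps.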

Also added new unit tests for this component.
The code for removing repeated substrings (footers, headers) was mostly copied over from 1.x.
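
For readers unfamiliar with the idea behind removing repeated headers and footers, the following is a deliberately simplified stand-in, not the logic copied over from 1.x. It assumes pages are separated by a form feed character and only looks at the first and last line of each page; `strip_repeated_edges` and its `page_separator` parameter are hypothetical names.

```python
from typing import List


def strip_repeated_edges(text: str, page_separator: str = "\f") -> str:
    """Hypothetical helper: drop a first or last line shared by every page."""
    pages = text.split(page_separator)
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return text

    page_lines: List[List[str]] = [page.split("\n") for page in pages]
    first_lines = {lines[0] for lines in page_lines}
    last_lines = {lines[-1] for lines in page_lines}

    # Treat a line as a header or footer only if it is identical on
    # every page and is not blank.
    drop_header = len(first_lines) == 1 and next(iter(first_lines)).strip() != ""
    drop_footer = len(last_lines) == 1 and next(iter(last_lines)).strip() != ""

    cleaned = []
    for lines in page_lines:
        start = 1 if drop_header else 0
        end = len(lines) - 1 if drop_footer else len(lines)
        cleaned.append("\n".join(lines[start:end]))
    return page_separator.join(cleaned)
```

A real implementation would need to cope with headers that vary slightly from page to page (page numbers, for example), which this sketch does not attempt.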

How did you test it?

I added new unit tests. We should later add an end-to-end test with an indexing pipeline that contains a file converter, a text document cleaner, and a text document splitter. That end-to-end test will be added in the PR that introduces the last component needed: https://github.com/deepset-ai/haystack/pull/6037/files#diff-963c94f5742eb94f8771a87759aa17307f9cc868fa6ecbe80a431b4dcf14cf28

Notes for the reviewer

The issue mentions updating the "structure dictionary properly if it’s present", but I haven't addressed that in this PR. It's not 100% clear to me what this should look like, and it's probably not needed for 2.0. It could be added later.

Checklist

@julian-risch julian-risch changed the title from "remove whitespaces, substrings, regex, empty lines" to "feat: Add TextDocumentCleaner 2.0" Oct 5, 2023
@github-actions github-actions bot added the type:documentation label Oct 5, 2023
@julian-risch julian-risch marked this pull request as ready for review October 10, 2023 20:23
@julian-risch julian-risch requested review from a team as code owners October 10, 2023 20:23
@julian-risch julian-risch requested review from dfokina and anakin87 and removed request for a team October 10, 2023 20:23

@anakin87 anakin87 left a comment

Hey, @julian-risch... Good work!

I found some opportunities for improvement
(and several occasions for me to better understand).

@julian-risch julian-risch self-assigned this Oct 13, 2023
@julian-risch julian-risch changed the title from "feat: Add TextDocumentCleaner 2.0" to "feat: Add DocumentCleaner 2.0" Oct 13, 2023
@julian-risch julian-risch requested a review from anakin87 October 13, 2023 07:01
@julian-risch julian-risch removed their assignment Oct 13, 2023

@anakin87 anakin87 left a comment

LGTM!

(Only a small comment about the docstring)

@julian-risch julian-risch merged commit aaee03a into main Oct 13, 2023
20 checks passed
@julian-risch julian-risch deleted the text-document-cleaner branch October 13, 2023 10:39
Labels
topic:tests type:documentation Improvements on the docs
Development

Successfully merging this pull request may close these issues.

TextDocumentCleaner
2 participants