TextDocumentCleaner #5676

Closed
Tracked by #5581
ZanSara opened this issue Aug 29, 2023 · 0 comments · Fixed by #5976

ZanSara commented Aug 29, 2023

Cleans text to make it more readable, both by humans and by LLMs. Its task is to find and replace (or remove) certain strings, either user-given or predefined.

Ideally it would be able to replace/remove:

  • Whitespace, empty lines and other control characters
  • Headers and footers (auto-detected repeated strings at the start and at the end of every page)
  • User-specified strings
  • User-specified regexes
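The four kinds of cleaning listed above could be sketched roughly as follows, using only the standard library. This is an illustration, not the component's implementation: the function names, the `clean_regexes` parameter, the default values, and the use of a form feed (`\f`) as page separator are all assumptions. The header/footer step is a deliberately naive check that only compares the first and last line of each page, so a footer containing a changing page number would not be detected.

```python
import re
from typing import List, Optional


def clean_text(
    text: str,
    clean_whitespace: bool = True,
    clean_empty_lines: bool = True,
    clean_substrings: Optional[List[str]] = None,
    clean_regexes: Optional[List[str]] = None,  # hypothetical parameter name
) -> str:
    # Remove user-specified literal strings, then user-specified regex matches.
    for substring in clean_substrings or []:
        text = text.replace(substring, "")
    for pattern in clean_regexes or []:
        text = re.sub(pattern, "", text)

    if clean_whitespace:
        # Drop control characters, keeping tab, newline, form feed and carriage return.
        text = re.sub(r"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", text)
        # Collapse runs of spaces/tabs to one space and trim both ends of each line.
        lines = [re.sub(r"[ \t]+", " ", line).strip(" \t") for line in text.split("\n")]
        text = "\n".join(lines)

    if clean_empty_lines:
        # Collapse consecutive newlines so that no empty lines remain in between.
        text = re.sub(r"\n{2,}", "\n", text)
    return text


def strip_repeated_header_footer(text: str, page_separator: str = "\f") -> str:
    # Assumption: pages are delimited by a form feed character.
    pages = text.split(page_separator)
    if len(pages) < 2:
        return text  # a single page gives nothing to compare against

    page_lines = [page.strip("\n").split("\n") for page in pages]
    first_lines = {lines[0] for lines in page_lines}
    last_lines = {lines[-1] for lines in page_lines}
    # A line counts as header/footer only if it is identical on every page.
    drop_header = len(first_lines) == 1 and first_lines != {""}
    drop_footer = len(last_lines) == 1 and last_lines != {""}

    cleaned = []
    for lines in page_lines:
        if drop_header:
            lines = lines[1:]
        if drop_footer and lines:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return page_separator.join(cleaned)
```

A real implementation would likely need a more robust header/footer heuristic, for example comparing longer shared prefixes and suffixes across pages rather than single lines.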

TextDocumentCleaner should also update the structure dictionary properly if it is present: it should retain only the page numbers and headings that are still present after the cleaning, and update their character positions accordingly.
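To make the position bookkeeping concrete, here is a minimal sketch of how offsets might be updated when spans of text are removed. The issue does not define the layout of the structure dictionary, so the shape used here (a list of entries, each with a `"start"` character offset) and the `remove_spans` helper are purely hypothetical. The sketch assumes the removed spans do not overlap.

```python
from typing import Dict, List, Tuple


def remove_spans(
    text: str,
    spans: List[Tuple[int, int]],
    entries: List[Dict],
) -> Tuple[str, List[Dict]]:
    """Delete the (start, end) character ranges from `text` and update `entries`.

    `entries` is a hypothetical stand-in for headings or page markers:
    each entry is a dict with a "start" character offset into `text`.
    """
    spans = sorted(spans)

    # Rebuild the text from the pieces between the removed spans.
    pieces, cursor = [], 0
    for start, end in spans:
        pieces.append(text[cursor:start])
        cursor = end
    pieces.append(text[cursor:])

    updated = []
    for entry in entries:
        pos = entry["start"]
        # Drop entries whose anchor position falls inside a removed span.
        if any(start <= pos < end for start, end in spans):
            continue
        # Shift the position left by the total length removed before it.
        shift = sum(end - start for start, end in spans if end <= pos)
        updated.append({**entry, "start": pos - shift})
    return "".join(pieces), updated
```

For example, removing a seven-character header line from the start of a document drops any entry anchored in that header and moves every later entry seven characters to the left.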

Draft I/O

@component
class TextDocumentCleaner:

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document], ... clean_empty_lines, clean_whitespace, clean_substrings, etc ...):
        # cleans the documents
        return {"documents": documents}

ZanSara added the "2.x" label (Related to Haystack v2.0) on Aug 29, 2023
ZanSara changed the title from TextCleaner to TextDocumentCleaner on Aug 29, 2023
julian-risch self-assigned this on Sep 28, 2023
ZanSara added this to the 2.0-beta milestone on Sep 28, 2023
Timoeller modified the milestone: 2.0-beta on Oct 9, 2023