feat: add RecursiveSplitter component for Document preprocessing #8605

Open
wants to merge 57 commits into
base: main

Conversation

davidsbatista
Contributor

Related Issues

Proposed Changes:

  • Adding a RecursiveSplitter, using a set of predefined separators to split text recursively - see issue for more details
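For readers unfamiliar with the idea, recursive splitting with an ordered list of separators can be sketched roughly as below. This is a minimal illustration and not the code in this PR: the function name `recursive_split`, the `max_len` parameter, and the character-based length limit are assumptions made for the example.

```python
from typing import List


def recursive_split(text: str, separators: List[str], max_len: int) -> List[str]:
    """Hypothetical helper: split `text` into chunks of at most `max_len` characters.

    Separators are tried in order. Any piece that is still too long is split
    again with the remaining separators; fixed-size character cuts are the
    last resort once no separators are left.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-size character windows.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, max_len)

    # Keep each separator attached to the piece before it, so that joining
    # the chunks reproduces the original text exactly.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(current) + len(piece) <= max_len:
            # The piece still fits into the chunk being built.
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, rest, max_len))
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("aaa bbb ccc", [" "], max_len=4)
print(chunks)  # ['aaa ', 'bbb ', 'ccc']
```

Ordering the separators from coarse to fine (for example paragraph break, sentence end, space) means a chunk is only cut at a finer boundary when the coarser one cannot satisfy the length limit.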

How did you test it?

  • local unit tests and integration tests plus CI tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@github-actions github-actions bot added the topic:tests and type:documentation (Improvements on the docs) labels Dec 4, 2024
@coveralls
Collaborator

coveralls commented Dec 4, 2024

Pull Request Test Coverage Report for Build 12357154921

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 90.601%

Files with Coverage Reduction | New Missed Lines | %
components/preprocessors/sentence_tokenizer.py | 1 | 94.12%

Totals:
Change from base Build 12353137414: 0.1%
Covered Lines: 8222
Relevant Lines: 9075

💛 - Coveralls

@davidsbatista davidsbatista marked this pull request as ready for review December 4, 2024 17:23
@davidsbatista davidsbatista requested review from a team as code owners December 4, 2024 17:23
@davidsbatista davidsbatista requested review from dfokina and julian-risch and removed request for a team December 4, 2024 17:23
@davidsbatista davidsbatista changed the title from "feat:: add recursive chunking strategy" to "feat: add recursive chunking strategy" Dec 4, 2024
@davidsbatista davidsbatista requested a review from sjrl December 4, 2024 17:24
@davidsbatista
Contributor Author

@bglearning - mentioning you since I believe you were the one with most interest in this feature

@sjrl
Contributor

sjrl commented Dec 13, 2024

Thanks for the continued work on this @davidsbatista! I have a few more comments:

  • It doesn't seem like calculating the page_number has made it in yet, is that right? Update: Sorry, I was looking at an older commit.

  • Also, my impression is that there is a lot of similar functionality between the DocumentSplitter and the RecursiveDocumentSplitter, and I wonder if it would make sense to refactor one or both so they could share it. Otherwise it seems like we have a lot of similar-ish functions doing the same thing, which also requires double the tests. What do you think?

Hey @davidsbatista just pinging to make sure you saw this.

@davidsbatista
Contributor Author

Thanks for the continued work on this @davidsbatista! I have a few more comments:
  • It doesn't seem like calculating the page_number has made it in yet, is that right? Update: Sorry, I was looking at an older commit.

  • Also, my impression is that there is a lot of similar functionality between the DocumentSplitter and the RecursiveDocumentSplitter, and I wonder if it would make sense to refactor one or both so they could share it. Otherwise it seems like we have a lot of similar-ish functions doing the same thing, which also requires double the tests. What do you think?

Hey @davidsbatista just pinging to make sure you saw this.

I saw it, no worries. I still need to make the edge cases work and add the new feature, and then I'll look into this. The changes and needed features started to pile up, so I need more time, but I haven't forgotten.

@davidsbatista
Contributor Author

Also, my impression is that there is a lot of similar functionality between the DocumentSplitter and the RecursiveDocumentSplitter, and I wonder if it would make sense to refactor one or both so they could share it. Otherwise it seems like we have a lot of similar-ish functions doing the same thing, which also requires double the tests. What do you think?

So, I agree and we should definitely do it, but it is out of scope for this PR/issue. We should finalise this one first, merge it, and then see how to extract the common functionality from both into a shared utils file/module.

@sjrl
Contributor

sjrl commented Dec 16, 2024

Also, my impression is that there is a lot of similar functionality between the DocumentSplitter and the RecursiveDocumentSplitter, and I wonder if it would make sense to refactor one or both so they could share it. Otherwise it seems like we have a lot of similar-ish functions doing the same thing, which also requires double the tests. What do you think?

So, I agree and we should definitely do it, but it is out of scope for this PR/issue. We should finalise this one first, merge it, and then see how to extract the common functionality from both into a shared utils file/module.

Okay, sounds good, let's do that. Could you open an issue to track this?

@davidsbatista
Contributor Author

Okay, sounds good, let's do that. Could you open an issue to track this?

Issue opened: #8645

@davidsbatista davidsbatista changed the title from "feat: add recursive chunking strategy" to "feat: add RecursiveSplitter component for Document preprocessing" Dec 16, 2024
Labels
topic:tests, type:documentation (Improvements on the docs), type:feature (New feature or request)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a Recursive Chunking strategy
5 participants