Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat : DocumentSplitter, adding the option to split_by function #8336

Merged
merged 8 commits into from
Sep 12, 2024

Conversation

GivAlz
Copy link
Contributor

@GivAlz GivAlz commented Sep 5, 2024

Proposed Changes:

Adding the possibility to pass a function and personalise the way in which DocumentSplitter defines a unit.

This means a user can, for example, use the following to split and define units:

splitter_function = lambda text: re.split('[\n]{2,}, text)

(or use spacy or anything else).

How did you test it?

Added two tests with two "mock" splitter functions.

Notes for the reviewer

There are some issues related to document splitting #5922 . Given the fact that the current methods are very basic and the issues have been open for moths, I think it would make sense to let the user define how text is split.

@GivAlz GivAlz requested a review from a team as a code owner September 5, 2024 20:21
@GivAlz GivAlz requested review from vblagoje and removed request for a team September 5, 2024 20:21
@CLAassistant
Copy link

CLAassistant commented Sep 5, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Sep 5, 2024
@vblagoje
Copy link
Member

Hey @GivAlz this is an excellent idea, thank you for opening this PR. Would you please add a reno release note to this PRs branch so we can generate a nice release note about this feature in the upcoming release. See https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes for more details on how to create reno release note 🙏

@coveralls
Copy link
Collaborator

coveralls commented Sep 10, 2024

Pull Request Test Coverage Report for Build 10831052207

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.01%) to 90.319%

Files with Coverage Reduction New Missed Lines %
components/preprocessors/document_splitter.py 2 98.26%
Totals Coverage Status
Change from base Build 10827600883: 0.01%
Covered Lines: 7202
Relevant Lines: 7974

💛 - Coveralls

@GivAlz GivAlz requested a review from a team as a code owner September 10, 2024 14:21
@GivAlz GivAlz requested review from dfokina and removed request for a team September 10, 2024 14:21
@GivAlz
Copy link
Contributor Author

GivAlz commented Sep 10, 2024

I've added a release note. Please let me know if I need to modify its name or text content.

Thank you!!

@vblagoje
Copy link
Member

@GivAlz thanks for a quick turnaround. To stay consistent let's use all small caps in reno release note name (with - between words). And let's remove highlights as that entry is reserved only for major features we want to highlight to users. Although cool perhaps this feature doesn't cross the highlights threshold this time :-)

@vblagoje
Copy link
Member

Looks much better now @GivAlz - to integrate you need to sign the contribution agreement- pretty much standard procedure in most bigger open source projects 🙏

Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last tiny change and let's update https://docs.haystack.deepset.ai/docs/documentsplitter - would you please do that @dfokina

"""

self.split_by = split_by
if split_by not in ["word", "sentence", "page", "passage"]:
if split_by not in ["function", "page", "passage", "sentence", "word"]:
raise ValueError("split_by must be one of 'word', 'sentence', 'page' or 'passage'.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, tiny nitpick here @GivAlz we need to add 'function' in this list here. Let's fix this and integrate. Don't forget to pull branch first as I merged main into it 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh...good catch! Just merged, fixed & pushed

@vblagoje vblagoje self-requested a review September 12, 2024 09:59
Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @GivAlz and thanks for your first contribution. Many to come 🚀
@dfokina please don't forget this tiny change in https://docs.haystack.deepset.ai/docs/documentsplitter

@vblagoje
Copy link
Member

The change will be reflected in docs in the upcoming 2.6 version.

@vblagoje
Copy link
Member

@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch

@GivAlz
Copy link
Contributor Author

GivAlz commented Sep 12, 2024

@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch

Sorry I forgot about that; I guess it could be useful to note this in the doc string for the function.

@vblagoje
Copy link
Member

vblagoje commented Sep 12, 2024

@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch

Sorry I forgot about that; I guess it could be useful to note this in the doc string for the function.

No worries, I've been doing this for over a year and I forget all the time as well. Now I have pre-commit check notes :-) Please review 6a59250 and say if there is something off

@GivAlz
Copy link
Contributor Author

GivAlz commented Sep 12, 2024

@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch

Sorry I forgot about that; I guess it could be useful to note this in the doc string for the function.

No worries, I've been doing this for over a year and I forget all the time as well. Now I have pre-commit check notes :-) Please review 6a59250 and say if there is something off

LGTM! Just wondering if it makes sense to add a note on the fact that, if the method to_dict is used, the function must be serialisable, but it should be obvious and I think that the error thrown would be pretty clear...if you don't think it is necessary (or maybe a note could be added in the documentation), then I guess it's good to merge!

@vblagoje
Copy link
Member

I think we treat it as serializable due to its simple string interface. We do this quite often throughout the codebase - it is ok most likely. I think we can merge this now 🚀

@vblagoje vblagoje merged commit 4106e7e into deepset-ai:main Sep 12, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants