-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat : DocumentSplitter, adding the option to split_by function #8336
Conversation
Hey @GivAlz this is an excellent idea, thank you for opening this PR. Would you please add a reno release note to this PRs branch so we can generate a nice release note about this feature in the upcoming release. See https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes for more details on how to create reno release note 🙏 |
Pull Request Test Coverage Report for Build 10831052207Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
2ac806a
to
387330e
Compare
I've added a release note. Please let me know if I need to modify its name or text content. Thank you!! |
@GivAlz thanks for a quick turnaround. To stay consistent let's use all small caps in reno release note name (with - between words). And let's remove highlights as that entry is reserved only for major features we want to highlight to users. Although cool perhaps this feature doesn't cross the highlights threshold this time :-) |
Looks much better now @GivAlz - to integrate you need to sign the contribution agreement- pretty much standard procedure in most bigger open source projects 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last tiny change and let's update https://docs.haystack.deepset.ai/docs/documentsplitter - would you please do that @dfokina
""" | ||
|
||
self.split_by = split_by | ||
if split_by not in ["word", "sentence", "page", "passage"]: | ||
if split_by not in ["function", "page", "passage", "sentence", "word"]: | ||
raise ValueError("split_by must be one of 'word', 'sentence', 'page' or 'passage'.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, tiny nitpick here @GivAlz we need to add 'function' in this list here. Let's fix this and integrate. Don't forget to pull branch first as I merged main into it 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh...good catch! Just merged, fixed & pushed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @GivAlz and thanks for your first contribution. Many to come 🚀
@dfokina please don't forget this tiny change in https://docs.haystack.deepset.ai/docs/documentsplitter
The change will be reflected in docs in the upcoming 2.6 version. |
@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch |
Sorry I forgot about that; I guess it could be useful to note this in the doc string for the function. |
No worries, I've been doing this for over a year and I forget all the time as well. Now I have pre-commit check notes :-) Please review 6a59250 and say if there is something off |
LGTM! Just wondering if it makes sense to add a note on the fact that, if the method to_dict is used, the function must be serialisable, but it should be obvious and I think that the error thrown would be pretty clear...if you don't think it is necessary (or maybe a note could be added in the documentation), then I guess it's good to merge! |
I think we treat it as serializable due to its simple string interface. We do this quite often throughout the codebase - it is ok most likely. I think we can merge this now 🚀 |
Proposed Changes:
Adding the possibility to pass a function and personalise the way in which DocumentSplitter defines a unit.
This means a user can, for example, use the following to split and define units:
splitter_function = lambda text: re.split('[\n]{2,}, text)
(or use spacy or anything else).
How did you test it?
Added two tests with two "mock" splitter functions.
Notes for the reviewer
There are some issues related to document splitting #5922 . Given the fact that the current methods are very basic and the issues have been open for moths, I think it would make sense to let the user define how text is split.