Add regex control over separators in character text splitter #7933
…iveCharacterTextSplitter
…add_regex_control_over_seperators_in_character_text_splitter
-    def __init__(self, separator: str = "\n\n", **kwargs: Any) -> None:
+    def __init__(
+        self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs: Any
+    ) -> None:
         """Create a new TextSplitter."""
         super().__init__(**kwargs)
         self._separator = separator
should we just perform escaping here?
@baskaryan this will affect the output when keep_separator=True, because it uses self._separator to add the separator back and would add the escaped separator instead.
Another approach would be to save the escaped separator alongside the original separator. WDYT?
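The concern above can be illustrated with a minimal sketch. It is not the PR's code: `split_keep_separator` is a hypothetical helper, and it assumes a regex separator contains no capture groups of its own. The escaped form is used only for matching, so the pieces handed back contain the text that actually matched rather than the escaped pattern.

```python
import re
from typing import List


def split_keep_separator(
    text: str, separator: str, is_separator_regex: bool = False
) -> List[str]:
    """Hypothetical helper: split text and keep each separator on the following piece."""
    # Escape only the pattern used for matching. The original separator
    # string is left untouched, so nothing escaped leaks into the output.
    pattern = separator if is_separator_regex else re.escape(separator)

    # With one capturing group, re.split returns
    # [text, sep, text, sep, ..., text], where sep is the matched text.
    parts = re.split(f"({pattern})", text)

    pieces = [parts[0]]
    for i in range(1, len(parts), 2):
        pieces.append(parts[i] + parts[i + 1])

    # Drop empty pieces produced by leading or adjacent separators.
    return [piece for piece in pieces if piece]


print(split_keep_separator("a.b.c", "."))
```

Because the matched text comes from the input itself, a literal separator such as "." reappears as "." and not as "\.".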
@@ -261,15 +261,21 @@ async def atransform_documents(
 class CharacterTextSplitter(TextSplitter):
     """Implementation of splitting text that looks at characters."""

-    def __init__(self, separator: str = "\n\n", **kwargs: Any) -> None:
+    def __init__(
+        self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs: Any
hmm should we default to True so this doesn't change default behavior / is backwards compatible?
Up until langchain=0.0.226, the default behaviour was the way I implemented it. This PR was actually started because the behaviour was changed to using /.
I think most use cases will not want regex in separators, which is why I chose this implementation. WDYT?
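To make the difference between the two interpretations concrete, here is a small sketch using only the standard `re` module. The sample strings are made up for illustration and are not taken from the PR or its tests.

```python
import re

text = "1.5 + 2.5 + 3"

# Literal interpretation: the separator is escaped before matching,
# so " + " is treated as three plain characters.
literal = re.split(re.escape(" + "), text)

# Regex interpretation of ".": it matches any character, so every
# character of the input is consumed as a separator.
dots_as_regex = re.split(".", "a.b")

# Literal interpretation of ".": only the actual dot is a separator.
dots_as_literal = re.split(re.escape("."), "a.b")

# Some plain-looking separators are not even valid patterns.
try:
    re.compile("+")
    plus_is_valid_regex = True
except re.error:
    plus_is_valid_regex = False

print(literal, dots_as_regex, dots_as_literal, plus_is_valid_regex)
```

A caller who passes "." or "+" as a plain separator gets surprising results, or an error, when the string is treated as a pattern, which is the argument for making the literal interpretation the default.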
i think this makes sense to me
#7854
Added the ability to use the separator as a regex or a simple character.
Fixed a bug where start_index was incorrectly counting from -1.
Who can review?
@eyurtsev
@hwchase17
@mmz-001