Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex control over separators in character text splitter #7933

Conversation

IlyaMichlin
Copy link
Contributor

@IlyaMichlin IlyaMichlin commented Jul 19, 2023

#7854

Added the ability to use the separator ase a regex or a simple character.
Fixed a bug where start_index was incorrectly counting from -1.

Who can review?
@eyurtsev
@hwchase17
@mmz-001

@vercel
Copy link

vercel bot commented Jul 19, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Aug 3, 2023 6:24am

@dosubot dosubot bot added the 🤖:improvement Medium size change to existing code to handle new use-cases label Jul 19, 2023
…add_regex_control_over_seperators_in_character_text_splitter
def __init__(self, separator: str = "\n\n", **kwargs: Any) -> None:
def __init__(
self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs: Any
) -> None:
"""Create a new TextSplitter."""
super().__init__(**kwargs)
self._separator = separator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just perform escaping here?

Copy link
Contributor Author

@IlyaMichlin IlyaMichlin Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@baskaryan this will affect the output when keep_separator=True because it is using self._separator to add the separator back and will add the escaped separator.
Another approach can be to save the escaped separator as well as the original separator. WDYT?

@@ -261,15 +261,21 @@ async def atransform_documents(
class CharacterTextSplitter(TextSplitter):
"""Implementation of splitting text that looks at characters."""

def __init__(self, separator: str = "\n\n", **kwargs: Any) -> None:
def __init__(
self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs: Any
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm should we default to True so this doesn't change default behavior / is backwards compatible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up until langchain=0.0.226 the default behaviour was the way I implementaed. This PR is actually started because the behaviour was changed to using /.
I think that most use cases will not want to use regex in separators that is why I chose this implementation. WDYT?

…add_regex_control_over_seperators_in_character_text_splitter
…add_regex_control_over_seperators_in_character_text_splitter
…add_regex_control_over_seperators_in_character_text_splitter
Copy link
Contributor

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this makes sense to me

@hwchase17 hwchase17 merged commit 6f0bccf into langchain-ai:master Aug 4, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:improvement Medium size change to existing code to handle new use-cases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants