Add regex control over separators in character text splitter #7933

IlyaMichlin · 2023-07-19T10:24:01Z

Added the ability to use the separator as a regex or a simple character.
Fixed a bug where start_index was incorrectly counting from -1.
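
For readers skimming the thread, the literal-versus-regex switch can be sketched roughly as follows. This is a minimal standalone illustration, not the code from this PR: the helper name `split_on_separator` is hypothetical, and only the `is_separator_regex` parameter name is taken from the diff.

```python
import re
from typing import List


def split_on_separator(
    text: str, separator: str, is_separator_regex: bool = False
) -> List[str]:
    """Split text on a separator that is either a literal string or a regex.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    # Escape the separator unless the caller explicitly opts into regex matching.
    pattern = separator if is_separator_regex else re.escape(separator)
    # Drop empty pieces so adjacent separators do not yield blank chunks.
    return [piece for piece in re.split(pattern, text) if piece]


# "+" is a regex metacharacter, but here it is treated as a plain character.
literal = split_on_separator("1+1=2", "+")

# With the flag set, the separator is compiled as a regular expression.
as_regex = split_on_separator("a1b22c", r"\d+", is_separator_regex=True)
```

With the flag left at its default, a separator such as "+" splits on the literal character instead of raising a regex error.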

Who can review?
@eyurtsev
@hwchase17
@mmz-001

baskaryan · 2023-07-25T00:08:15Z

libs/langchain/langchain/text_splitter.py

-    def __init__(self, separator: str = "\n\n", **kwargs: Any) -> None:
+    def __init__(
+        self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs: Any
+    ) -> None:
        """Create a new TextSplitter."""
        super().__init__(**kwargs)
        self._separator = separator


should we just perform escaping here?

IlyaMichlin · 2023-07-25

@baskaryan Escaping here would change the output when keep_separator=True: the splitter uses self._separator to add the separator back, so it would add the escaped separator instead of the original one.
Another approach would be to store the escaped separator alongside the original separator. WDYT?
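
The concern above can be illustrated with a small standalone sketch. It assumes only Python's standard `re` module and does not mirror the splitter's actual code; the variable names are hypothetical.

```python
import re

separator = "|"                 # literal separator chosen by the user
escaped = re.escape(separator)  # pattern that is safe to pass to re.split

# Matching must use the escaped form, otherwise "|" is parsed as regex alternation.
pieces = re.split(escaped, "a|b|c")

# Re-adding the separator must use the original form.
joined_with_original = separator.join(pieces)

# If only the escaped form had been stored, the backslash would leak into the output.
joined_with_escaped = escaped.join(pieces)
```

Keeping both forms, one for matching and one for re-insertion, avoids the leaked backslash.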

baskaryan · 2023-07-25T00:08:35Z

libs/langchain/langchain/text_splitter.py

@@ -261,15 +261,21 @@ async def atransform_documents(
 class CharacterTextSplitter(TextSplitter):
    """Implementation of splitting text that looks at characters."""

-    def __init__(self, separator: str = "\n\n", **kwargs: Any) -> None:
+    def __init__(
+        self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs: Any


hmm should we default to True so this doesn't change default behavior / is backwards compatible?

IlyaMichlin · 2023-07-25

Up until langchain=0.0.226 the default behaviour was the way I implemented it. This PR was actually started because the behaviour was changed to using /.
I think most use cases will not want regex in separators, which is why I chose this implementation. WDYT?
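
As a rough illustration of why a regex-by-default separator can surprise users, consider a separator that happens to be a regex metacharacter. The snippet below is a sketch using only the standard `re` module; the sample text is made up.

```python
import re

text = "v1.2.3"
separator = "."

# Interpreted as a regex, "." matches every character,
# so no non-empty text is left between the matches.
as_regex = [p for p in re.split(separator, text) if p]

# Interpreted literally (escaped), the text splits at the dots only.
as_literal = [p for p in re.split(re.escape(separator), text) if p]
```

Here the regex interpretation yields no chunks at all, while the literal interpretation yields the three expected pieces.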

hwchase17

i think this makes sense to me

IlyaMichlin added 12 commits June 2, 2023 20:54

Fixed markdown text splitter horizontal lines

1d3a9ae

Fixed rst text splitter section lines

597d451

fix markdown regex

384512c

test markdown split

a2af1b0

fix markdown and rst split regex

fca3aa7

test markdown and rst split

c76bba8

test markdown and rst split

a1ad1a3

Merge branch 'master' of https://github.com/hwchase17/langchain

7d79b12

Merge branch 'master' of https://github.com/hwchase17/langchain

6990ab3

Merge branch 'master' of https://github.com/IlyaMichlin/langchain

998fad4

Merge branch 'master' of https://github.com/hwchase17/langchain

612383e

Add regex control over separators in CharacterTextSplitter and Recurs…

4e5b931

…iveCharacterTextSplitter

dosubot bot added the 🤖:improvement Medium size change to existing code to handle new use-cases label Jul 19, 2023

IlyaMichlin added 3 commits July 19, 2023 13:29

black

0624210

black

533e70b

fix tests

a71b928

IlyaMichlin mentioned this pull request Jul 20, 2023

Strange chunks coming out of CharacterTextSplitter starting in version 0.0.226 #7854

Closed

14 tasks

IlyaMichlin added 2 commits July 21, 2023 17:53

docs

3a80903

Merge branch 'master' of https://github.com/hwchase17/langchain into …

9fd29ef

…add_regex_control_over_seperators_in_character_text_splitter

baskaryan reviewed Jul 25, 2023

Merge branch 'master' of https://github.com/hwchase17/langchain into …

49cd2ec

…add_regex_control_over_seperators_in_character_text_splitter

vercel bot deployed to Preview – langchain July 25, 2023 14:39 View deployment

IlyaMichlin requested a review from baskaryan July 25, 2023 14:39

Merge branch 'master' of https://github.com/hwchase17/langchain into …

bbaa53e

…add_regex_control_over_seperators_in_character_text_splitter

vercel bot deployed to Preview – langchain July 26, 2023 05:18 View deployment

Merge branch 'master' of https://github.com/hwchase17/langchain into …

b278f6a

…add_regex_control_over_seperators_in_character_text_splitter

vercel bot deployed to Preview – langchain August 3, 2023 06:24 View deployment

hwchase17 approved these changes Aug 4, 2023

hwchase17 merged commit 6f0bccf into langchain-ai:master Aug 4, 2023
22 checks passed
