Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behavior of CharacterTextSplitter when changing keep_separator for special regex characters #7262

Closed
14 tasks
mmz-001 opened this issue Jul 6, 2023 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder

Comments

@mmz-001
Copy link
Contributor

mmz-001 commented Jul 6, 2023

System Info

LangChain v0.0.225, Windows, Python 3.10

Who can help?

@hwchase17

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

The behavior for CharacterTextSplitter when changing keep_separator when using normal characters is like this:

text_splitter = CharacterTextSplitter(
    chunk_size=4,
    chunk_overlap=0,
    separator="_",
    keep_separator=False,
)

text_splitter.split_text("foo_bar_baz_123")
# ['foo', 'bar', 'baz', '123']
text_splitter = CharacterTextSplitter(
    chunk_size=4,
    chunk_overlap=0,
    separator="_",
    keep_separator=True,
)

text_splitter.split_text("foo_bar_baz_123")
# ['foo', '_bar', '_baz', '_123']

However, when using special regex characters like . or * the splitter breaks when keep_separator is False.

text_splitter = CharacterTextSplitter(
    chunk_size=4,
    chunk_overlap=0,
    separator=r"\.",
    keep_separator=False,
)

text_splitter.split_text("foo.bar.baz.123")
# ['foo.bar.baz.123']

The special characters should be escaped, otherwise it raises an error. For example, the following code raises an error.

text_splitter = CharacterTextSplitter(
    chunk_size=4,
    chunk_overlap=0,
    separator=r"?",
    keep_separator=True,
)

text_splitter.split_text("foo?bar?baz?123")

I'll make a PR to fix this.

Also, the documentation never mentions that the separator should be a regex, I only found out the hard way after getting regex errors on one of the RecursiveTextSplitter splitters after updating LangChain. I think we should add a note about this in the documentation or the code.

Expected behavior

text_splitter = CharacterTextSplitter(
   chunk_size=4,
   chunk_overlap=0,
   separator=r"\.",
   keep_separator=False,
)

text_splitter.split_text("foo.bar.baz.123")
# ['foo', 'bar', 'baz', '123']
@dosubot dosubot bot added 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder labels Jul 6, 2023
@dosubot
Copy link

dosubot bot commented Jul 6, 2023

Answer generated by a 🤖

Answer

Thank you for bringing this issue to our attention and for proposing a solution! It's great to see your willingness to contribute to LangChain.

Your proposed fix to handle special regex characters in the CharacterTextSplitter is definitely needed. It will ensure consistent behavior when changing the keep_separator parameter.

Additionally, your suggestion to add a note in the documentation or code about the separator being a regex is valuable. This will help users avoid confusion and potential errors.

Please go ahead and create a pull request with your proposed changes. The maintainers will review it and decide if it should be merged into the codebase.

Thank you for your contribution to LangChain!


This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

hwchase17 pushed a commit that referenced this issue Jul 6, 2023
…eep_separator` (#7263)

- Description:
- When `keep_separator` is `True` the `_split_text_with_regex()` method
in `text_splitter` uses regex to split, but when `keep_separator` is
`False` it uses `str.split()`. This causes problems when the separator
is a special regex character like `.` or `*`. This PR fixes that by
using `re.split()` in both cases.
- Issue: #7262 
- Tag maintainer: @baskaryan
@mmz-001 mmz-001 closed this as completed Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder
Projects
None yet
Development

No branches or pull requests

1 participant