Labels:
- 🤖:bug: Related to a bug, vulnerability, unexpected error with an existing feature
- 🤖:docs: Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder
dosubot (bot) added the 🤖:bug and 🤖:docs labels on Jul 6, 2023
Thank you for bringing this issue to our attention and for proposing a solution! It's great to see your willingness to contribute to LangChain.
Your proposed fix to handle special regex characters in the CharacterTextSplitter is definitely needed. It will ensure consistent behavior when changing the keep_separator parameter.
Additionally, your suggestion to add a note in the documentation or code about the separator being a regex is valuable. This will help users avoid confusion and potential errors.
Please go ahead and create a pull request with your proposed changes. The maintainers will review it and decide if it should be merged into the codebase.
Thank you for your contribution to LangChain!
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
…eep_separator` (#7263)
- Description:
- When `keep_separator` is `True` the `_split_text_with_regex()` method
in `text_splitter` uses regex to split, but when `keep_separator` is
`False` it uses `str.split()`. This causes problems when the separator
is a special regex character like `.` or `*`. This PR fixes that by
using `re.split()` in both cases.
- Issue: #7262
- Tag maintainer: @baskaryan
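A minimal sketch of the idea described above, assuming a standalone helper rather than the actual LangChain code: `split_with_regex` is a hypothetical name, and the use of `re.escape()` by the caller is an illustration of the escaping point raised in the issue, not something the PR description states. Both branches go through `re.split()`, so a given separator is interpreted the same way regardless of `keep_separator`.

```python
import re
from typing import List


def split_with_regex(text: str, separator: str, keep_separator: bool) -> List[str]:
    """Hypothetical helper: treat `separator` as a regex in both branches."""
    if not separator:
        # No separator: fall back to splitting into characters.
        return list(text)
    if keep_separator:
        # The capture group makes re.split() return the separators too:
        # [chunk0, sep, chunk1, sep, chunk2, ...] (always an odd length).
        parts = re.split(f"({separator})", text)
        # Attach each separator to the chunk that follows it.
        splits = [parts[0]] + [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
    else:
        splits = re.split(separator, text)
    # Drop empty strings produced by leading, trailing, or adjacent separators.
    return [s for s in splits if s != ""]


# A caller who wants a literal "." escapes it first (an assumption of this sketch).
literal_dot = re.escape(".")
without_sep = split_with_regex("a.b.c", literal_dot, keep_separator=False)
with_sep = split_with_regex("a.b.c", literal_dot, keep_separator=True)

# An unescaped "." matches every character, so nothing is left after splitting.
unescaped = split_with_regex("a.b.c", ".", keep_separator=False)
```

With the escaped separator, both calls split on the literal dot and differ only in whether the dot is kept. The unescaped call shows why the documentation note requested in the issue matters.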
System Info
LangChain v0.0.225, Windows, Python 3.10
Who can help?
@hwchase17
Information
Related Components
Reproduction
The behavior for `CharacterTextSplitter` when changing `keep_separator` when using normal characters is like this:
However, when using special regex characters like `.` or `*` the splitter breaks when `keep_separator` is `False`.
The special characters should be escaped, otherwise it raises an error. For example, the following code raises an error.
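A minimal sketch of the underlying problem, using only Python's standard `re` module rather than LangChain itself (so it is an illustration, not the reproduction from this issue): `str.split()` treats `*` literally, `re.split()` treats it as a quantifier and fails, and `re.escape()` makes the two agree. The variable names are made up for the example.

```python
import re

text = "a*b*c"

# str.split() treats the separator literally, so "*" is just a character.
literal = text.split("*")

# re.split() treats the separator as a pattern; a bare "*" is a quantifier
# with nothing to repeat, so compiling it raises re.error.
try:
    re.split("*", text)
    raw_regex_error = None
except re.error as exc:
    raw_regex_error = exc

# Escaping the separator first gives the same result as the literal split.
escaped = re.split(re.escape("*"), text)
```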
I'll make a PR to fix this.
Also, the documentation never mentions that the separator should be a regex, I only found out the hard way after getting regex errors on one of the `RecursiveTextSplitter` splitters after updating LangChain. I think we should add a note about this in the documentation or the code.
Expected behavior