Strange chunks coming out of CharacterTextSplitter starting in version 0.0.226 #7854
Comments
@Karin-Basis You're right that the bug is related to #7263. A workaround is to escape the period in your separator: `text_splitter = CharacterTextSplitter(separator="\. ", chunk_size=30, chunk_overlap=0)`
Thanks for making an issue. I'm sure others will run into this behavior change as well.
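For readers wondering why escaping the period helps, here is a minimal sketch using only Python's standard `re` module. It assumes (as the workaround suggests, but not verified here against the LangChain source) that the separator string ends up being used as a regular expression. The sample text is made up for illustration.

```python
import re

# Made-up sample text, not taken from the original report.
text = "Hello there. This is fine."

# Unescaped: "." is a regex wildcard, so ". " matches ANY character
# followed by a space, and the last letter of each word gets eaten.
unescaped = re.split(". ", text)

# Escaped: only a literal period followed by a space is a split point.
escaped = re.split(re.escape(". "), text)

print(unescaped)  # ['Hell', 'there', 'Thi', 'i', 'fine.']
print(escaped)    # ['Hello there', 'This is fine.']
```

Using `re.escape` instead of hand-writing `"\. "` also avoids the invalid-escape warning that newer Python versions emit for a non-raw string containing `\.`.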
@devstein Thanks so much for the comment. We've escaped the period in the separator and the chunks are better now, but we do still observe that the metadata `start_index` is -1 in a lot of cases (and 0 in a few). Is that an expected change in behavior that we need to account for, or a bug? I can file a separate issue if needed.
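One possible way a `start_index` of -1 can arise, sketched with plain Python only. This is an assumption for illustration, not something confirmed in this thread: it supposes that chunks are rebuilt by joining the split pieces with the raw separator string, and that the start index is then looked up with `str.find`, which returns -1 when the chunk does not occur verbatim in the source text.

```python
import re

# Made-up sample text, not taken from the original report.
text = "First sentence. Second one. Third."
separator = r"\. "  # the escaped separator from the workaround

pieces = re.split(separator, text)

# Assumption: a chunk is rebuilt by joining pieces with the raw separator
# string, so it contains a literal backslash the source text never had.
chunk = separator.join(pieces[:2])

print(repr(chunk))
print(text.find(chunk))      # -1: the rebuilt chunk is not a substring of text
print(text.find(pieces[0]))  # 0: a single, unjoined piece is still found
```

Under that assumption, chunks made of a single piece would still resolve to a real offset, while multi-piece chunks would not.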
I think it is a bug, as it is not backward compatible. I have created a PR that adds the ability to control whether the separator is used as a regex or as a simple character string. @Karin-Basis I cannot reproduce the issue. For the code:
I got the output:
@IlyaMichlin this code should do it:
Sample output:
I see it now, @Karin-Basis. This bug is also fixed in my PR. I tried running this with the changes:
And got the output:
#7854 Added the ability to use the `separator` as a regex or a simple character. Fixed a bug where `start_index` was incorrectly counting from -1. Who can review? @eyurtsev @hwchase17 @mmz-001
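A rough sketch of what "regex or simple character" could look like in practice. The function name `split_on_separator` and the `is_regex` flag are hypothetical, chosen for illustration only; this is not the PR's actual code or the library's API.

```python
import re
from typing import List


def split_on_separator(text: str, separator: str, is_regex: bool = False) -> List[str]:
    """Split text on separator, treated literally unless is_regex is True.

    Hypothetical helper for illustration; not LangChain's implementation.
    """
    # Escaping turns regex metacharacters such as "." into literal characters.
    pattern = separator if is_regex else re.escape(separator)
    # Drop empty strings produced by adjacent or trailing separators.
    return [piece for piece in re.split(pattern, text) if piece]


literal = split_on_separator("a.b c. d", ". ")
as_regex = split_on_separator("a.b c. d", ". ", is_regex=True)

print(literal)   # ['a.b c', 'd']
print(as_regex)  # ['a.', 'c', 'd']
```

Defaulting to the literal interpretation keeps a plain separator like `". "` behaving the way it did before the regression, while callers who want a pattern opt in explicitly.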
Hi, @Karin-Basis! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale. From what I understand, you reported an issue with the CharacterTextSplitter in version 0.0.226 of LangChain. Devstein suggested escaping the period in the separator as a workaround, which improved the chunks. However, you mentioned another issue with the metadata `start_index` being -1 in some cases. IlyaMichlin believed this to be a bug and created a PR to add the ability to control the separator. After you provided code to reproduce the issue, IlyaMichlin confirmed that the bug was fixed in their PR. Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your contribution to LangChain! Let us know if you have any further questions or concerns.
System Info
Langchain version 0.0.226
M1 Mac
Python 3.11.3
Who can help?
@hwchase17 @mmz-001
Information
Related Components
Reproduction
Output in version 0.0.225:
Output in version 0.0.226:
Expected behavior
The output seen in version 0.0.225 should be the same in version 0.0.226.
I suspect that the bug is related to the fix for this issue #7263. We have also noticed that in recent versions, the metadata start_index is always -1 when using create_documents(). Please let me know if I should file a separate issue for this.