Strange chunks coming out of CharacterTextSplitter starting in version 0.0.226 #7854

Closed
Karin-Basis opened this issue Jul 17, 2023 · 6 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@Karin-Basis

Karin-Basis commented Jul 17, 2023

System Info

Langchain version 0.0.226
M1 Mac
Python 3.11.3

Who can help?

@hwchase17 @mmz-001

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

from langchain.text_splitter import CharacterTextSplitter

def main():
    sample_text = "This is a series of short sentences. I want them to be separated at the periods.  Three sentences should be enough."
    text_splitter = CharacterTextSplitter(separator=". ", chunk_size=30, chunk_overlap=0)
    chunks = text_splitter.split_text(sample_text)
    for chunk in chunks:
        print("CHUNK:", chunk)

if __name__ == "__main__":
    main()

Output in version 0.0.225:

CHUNK: This is a series of short sentences
CHUNK: I want them to be separated at the periods
CHUNK: Three sentences should be enough.

Output in version 0.0.226:

CHUNK: Thi. i. serie. o. shor
CHUNK: sentences. wan. the. t. b
CHUNK: separate. a. th. periods
CHUNK: Thre. sentence. shoul. b
CHUNK: enough.

Expected behavior

The output seen in version 0.0.225 should be the same in version 0.0.226.

I suspect that the bug is related to the fix for this issue #7263. We have also noticed that in recent versions, the metadata start_index is always -1 when using create_documents(). Please let me know if I should file a separate issue for this.

dosubot bot added the 🤖:bug label Jul 17, 2023
@devstein

devstein commented Jul 17, 2023

@Karin-Basis You're right that the bug is related to #7263 -- the new re.split is treating the period separator . as a regular expression.

A workaround is to escape the period in your separator

    text_splitter = CharacterTextSplitter(separator=r"\. ", chunk_size=30, chunk_overlap=0)
>>> import re
>>> re.split(". ", sample_text)
['Thi', 'i', '', 'serie', 'o', 'shor', 'sentences', '', 'wan', 'the', 't', 'b', 'separate', 'a', 'th', 'periods', ' Thre', 'sentence', 'shoul', 'b', 'enough.']
>>> re.split(r"\. ", sample_text)
['This is a series of short sentences', 'I want them to be separated at the periods', ' Three sentences should be enough.']
>>>
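
If you prefer not to hand-write the backslash, the standard library's re.escape can build the escaped pattern for you; a minimal sketch (standard-library behavior, not LangChain-specific):

import re

sample_text = "This is a series of short sentences. I want them to be separated at the periods.  Three sentences should be enough."

# re.escape(". ") yields the pattern "\. ", so the period is matched literally.
escaped = re.escape(". ")
print(re.split(escaped, sample_text))
# ['This is a series of short sentences', 'I want them to be separated at the periods', ' Three sentences should be enough.']

# The same escaped string can be passed as the splitter's separator, e.g.:
# CharacterTextSplitter(separator=re.escape(". "), chunk_size=30, chunk_overlap=0)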

Thanks for making an issue. I'm sure others will run into this behavior change as well.

@Karin-Basis
Author

@devstein Thanks so much for the comment. We've escaped the period in the separator and the chunks are better now, but we do still observe that the metadata start_index is -1 in a lot of cases (and 0 in a few). Is that an expected change in behavior that we need to account for, or a bug? I can file a separate issue if needed.

@IlyaMichlin
Contributor

I think it is a bug, as the change is not backward compatible. I have created a PR to add the ability to control whether the separator is used as a regex or as a plain string.

@Karin-Basis I cannot reproduce the start_index issue. How can I reproduce it?

For the code:

from langchain.text_splitter import CharacterTextSplitter


sample_text = "This is a series of short sentences. I want them to be separated at the periods.  Three sentences should be enough."
text_splitter = CharacterTextSplitter(separator=r"\. ", is_separator_regex=True, chunk_size=30, chunk_overlap=0, add_start_index=True)
chunks = text_splitter.split_text(sample_text)
for chunk in chunks:
    print("CHUNK:", chunk)

for doc in text_splitter.create_documents([sample_text]):
    print("DOC:", doc)

I got the output:

CHUNK: This is a series of short sentences
CHUNK: I want them to be separated at the periods
CHUNK: Three sentences should be enough.
DOC: page_content='This is a series of short sentences' metadata={'start_index': 0}
DOC: page_content='I want them to be separated at the periods' metadata={'start_index': 37}
DOC: page_content='Three sentences should be enough.' metadata={'start_index': 82}

@Karin-Basis
Author

@IlyaMichlin this code should do it:

from langchain.text_splitter import CharacterTextSplitter
import lorem  # third-party package: pip install lorem

lorem_sample = lorem.text()
print(lorem_sample)
text_splitter = CharacterTextSplitter(separator=r"\. ", chunk_size=100, chunk_overlap=0, add_start_index=True)
for doc in text_splitter.create_documents([lorem_sample]):
    print("DOC:", doc)

Sample output:

DOC: page_content='Non amet ipsum dolor\\. Quiquia eius sit magnam' metadata={'start_index': -1}
DOC: page_content='Consectetur etincidunt amet velit voluptatem numquam etincidunt est' metadata={'start_index': 47}
DOC: page_content='Dolorem non dolor aliquam ut magnam\\. Velit ut ut dolore non quisquam quisquam etincidunt' metadata={'start_index': -1}
DOC: page_content='Velit dolorem porro ipsum eius consectetur labore\\. Neque est sit modi ut' metadata={'start_index': -1}
DOC: page_content='Numquam ut sed labore dolor dolorem' metadata={'start_index': 280}
...
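
The -1 values look consistent with start_index being derived from a plain substring search: the regex separator "\. " appears to be re-joined into the chunk verbatim (note the literal \. in the page_content above), so the chunk no longer occurs in the original text and str.find returns -1. A minimal sketch of that suspected mechanism, not LangChain's actual code:

source = "Non amet ipsum dolor. Quiquia eius sit magnam. Consectetur etincidunt amet velit voluptatem."

# Chunk re-joined with the regex separator, as in the output above.
chunk_regex_sep = "Non amet ipsum dolor\\. Quiquia eius sit magnam"
# Chunk re-joined with the plain separator.
chunk_plain_sep = "Non amet ipsum dolor. Quiquia eius sit magnam"

print(source.find(chunk_regex_sep))  # -1: the literal backslash never occurs in the source
print(source.find(chunk_plain_sep))  # 0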

@IlyaMichlin
Contributor

IlyaMichlin commented Jul 20, 2023

I see it now, @Karin-Basis.

This bug is also fixed in my PR. I tried running this with the changes:

from langchain.text_splitter import CharacterTextSplitter


lorem_sample = "Non amet ipsum dolor. Quiquia eius sit magnam. Consectetur etincidunt amet velit voluptatem numquam etincidunt est. Dolorem non dolor aliquam ut magnam. Velit ut ut dolore non quisquam quisquam etincidunt. Velit dolorem porro ipsum eius consectetur labore. Neque est sit modi ut. Numquam ut sed labore dolor dolorem"
text_splitter = CharacterTextSplitter(separator=". ", chunk_size=100, chunk_overlap=0, add_start_index=True)
for doc in text_splitter.create_documents([lorem_sample]):
    print("DOC:", doc)

And got the output:

DOC: page_content='Non amet ipsum dolor. Quiquia eius sit magnam' metadata={'start_index': 0}
DOC: page_content='Consectetur etincidunt amet velit voluptatem numquam etincidunt est' metadata={'start_index': 47}
DOC: page_content='Dolorem non dolor aliquam ut magnam. Velit ut ut dolore non quisquam quisquam etincidunt' metadata={'start_index': 116}
DOC: page_content='Velit dolorem porro ipsum eius consectetur labore. Neque est sit modi ut' metadata={'start_index': 206}
DOC: page_content='Numquam ut sed labore dolor dolorem' metadata={'start_index': 280}

hwchase17 pushed a commit that referenced this issue Aug 4, 2023
#7854

Added the ability to use the `separator` as a regex or a simple character.
Fixed a bug where `start_index` was incorrectly reported as -1.

Who can review?
@eyurtsev
@hwchase17 
@mmz-001
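
For reference, a short sketch of how the two modes could be used after this change, assuming the is_separator_regex parameter shown earlier in the thread and that the default treats the separator as a plain string (restoring the 0.0.225 behavior):

from langchain.text_splitter import CharacterTextSplitter

sample_text = "This is a series of short sentences. I want them to be separated at the periods.  Three sentences should be enough."

# Plain-string separator: no escaping needed.
plain_splitter = CharacterTextSplitter(separator=". ", chunk_size=30, chunk_overlap=0)

# Regex separator: opt in explicitly and escape the period.
regex_splitter = CharacterTextSplitter(
    separator=r"\. ", is_separator_regex=True, chunk_size=30, chunk_overlap=0
)

print(plain_splitter.split_text(sample_text))
print(regex_splitter.split_text(sample_text))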
@dosubot

dosubot bot commented Oct 19, 2023

Hi, @Karin-Basis! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you reported an issue with the CharacterTextSplitter in version 0.0.226 of Langchain. Devstein suggested escaping the period in the separator as a workaround, which improved the chunks. However, you mentioned another issue with the metadata start_index being -1 in some cases. IlyaMichlin believed this to be a bug and created a PR to add the ability to control the separator. After you provided code to reproduce the issue, IlyaMichlin confirmed that the bug was fixed in their PR.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to LangChain! Let us know if you have any further questions or concerns.

dosubot bot added the stale label Oct 19, 2023
dosubot bot closed this as not planned Oct 26, 2023
dosubot bot removed the stale label Oct 26, 2023