Strange chunks coming out of CharacterTextSplitter starting in version 0.0.226 #7854

Closed
Karin-Basis opened this issue Jul 17, 2023 · 6 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@Karin-Basis

Karin-Basis commented Jul 17, 2023

System Info

Langchain version 0.0.226
M1 Mac
Python 3.11.3

Who can help?

@hwchase17 @mmz-001

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

from langchain.text_splitter import CharacterTextSplitter

def main():
    sample_text = "This is a series of short sentences. I want them to be separated at the periods.  Three sentences should be enough."
    text_splitter = CharacterTextSplitter(separator=". ", chunk_size=30, chunk_overlap=0)
    chunks = text_splitter.split_text(sample_text)
    for chunk in chunks:
        print("CHUNK:", chunk)

if __name__ == "__main__":
    main()

Output in version 0.0.225:

CHUNK: This is a series of short sentences
CHUNK: I want them to be separated at the periods
CHUNK: Three sentences should be enough.

Output in version 0.0.226:

CHUNK: Thi. i. serie. o. shor
CHUNK: sentences. wan. the. t. b
CHUNK: separate. a. th. periods
CHUNK: Thre. sentence. shoul. b
CHUNK: enough.

Expected behavior

The output seen in version 0.0.225 should be the same in version 0.0.226.

I suspect that the bug is related to the fix for this issue #7263. We have also noticed that in recent versions, the metadata start_index is always -1 when using create_documents(). Please let me know if I should file a separate issue for this.

dosubot bot added the 🤖:bug label Jul 17, 2023
@devstein

devstein commented Jul 17, 2023

@Karin-Basis You're right that the bug is related to #7263 -- the new re.split is treating the period separator . as a regular expression.

A workaround is to escape the period in your separator

    text_splitter = CharacterTextSplitter(separator=r"\. ", chunk_size=30, chunk_overlap=0)
>>> import re
>>> re.split(". ", sample_text)
['Thi', 'i', '', 'serie', 'o', 'shor', 'sentences', '', 'wan', 'the', 't', 'b', 'separate', 'a', 'th', 'periods', ' Thre', 'sentence', 'shoul', 'b', 'enough.']
>>> re.split(r"\. ", sample_text)
['This is a series of short sentences', 'I want them to be separated at the periods', ' Three sentences should be enough.']
>>>
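
If you prefer not to hand-write the backslash, the standard library's re.escape can build the escaped pattern for you; a minimal sketch (standard-library behavior, not LangChain-specific):

import re

sample_text = "This is a series of short sentences. I want them to be separated at the periods.  Three sentences should be enough."

# re.escape(". ") yields the pattern "\. ", so the period is matched literally.
escaped = re.escape(". ")
print(re.split(escaped, sample_text))
# ['This is a series of short sentences', 'I want them to be separated at the periods', ' Three sentences should be enough.']

# The same escaped string can be passed as the splitter's separator, e.g.:
# CharacterTextSplitter(separator=re.escape(". "), chunk_size=30, chunk_overlap=0)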

Thanks for making an issue. I'm sure others will run into this behavior change as well.

@Karin-Basis
Author

@devstein Thanks so much for the comment. We've escaped the period in the separator and the chunks are better now, but we do still observe that the metadata start_index is -1 in a lot of cases (and 0 in a few). Is that an expected change in behavior that we need to account for, or a bug? I can file a separate issue if needed.

@IlyaMichlin
Contributor

I think it is a bug, as the change is not backward compatible. I have created a PR to add the ability to control whether the separator is used as a regex or as a plain string.

@Karin-Basis I cannot reproduce the start_index issue. How can I reproduce it?

For the code:

from langchain.text_splitter import CharacterTextSplitter


sample_text = "This is a series of short sentences. I want them to be separated at the periods.  Three sentences should be enough."
text_splitter = CharacterTextSplitter(separator=r"\. ", is_separator_regex=True, chunk_size=30, chunk_overlap=0, add_start_index=True)
chunks = text_splitter.split_text(sample_text)
for chunk in chunks:
    print("CHUNK:", chunk)

for doc in text_splitter.create_documents([sample_text]):
    print("DOC:", doc)

I got the output:

CHUNK: This is a series of short sentences
CHUNK: I want them to be separated at the periods
CHUNK: Three sentences should be enough.
DOC: page_content='This is a series of short sentences' metadata={'start_index': 0}
DOC: page_content='I want them to be separated at the periods' metadata={'start_index': 37}
DOC: page_content='Three sentences should be enough.' metadata={'start_index': 82}

@Karin-Basis
Author

@IlyaMichlin this code should do it:

from langchain.text_splitter import CharacterTextSplitter
import lorem  # third-party package: pip install lorem

lorem_sample = lorem.text()
print(lorem_sample)
text_splitter = CharacterTextSplitter(separator=r"\. ", chunk_size=100, chunk_overlap=0, add_start_index=True)
for doc in text_splitter.create_documents([lorem_sample]):
    print("DOC:", doc)

Sample output:

DOC: page_content='Non amet ipsum dolor\\. Quiquia eius sit magnam' metadata={'start_index': -1}
DOC: page_content='Consectetur etincidunt amet velit voluptatem numquam etincidunt est' metadata={'start_index': 47}
DOC: page_content='Dolorem non dolor aliquam ut magnam\\. Velit ut ut dolore non quisquam quisquam etincidunt' metadata={'start_index': -1}
DOC: page_content='Velit dolorem porro ipsum eius consectetur labore\\. Neque est sit modi ut' metadata={'start_index': -1}
DOC: page_content='Numquam ut sed labore dolor dolorem' metadata={'start_index': 280}
...
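
The -1 values look consistent with start_index being derived from a plain substring search: the regex separator "\. " appears to be re-joined into the chunk verbatim (note the literal \. in the page_content above), so the chunk no longer occurs in the original text and str.find returns -1. A minimal sketch of that suspected mechanism, not LangChain's actual code:

source = "Non amet ipsum dolor. Quiquia eius sit magnam. Consectetur etincidunt amet velit voluptatem."

# Chunk re-joined with the regex separator, as in the output above.
chunk_regex_sep = "Non amet ipsum dolor\\. Quiquia eius sit magnam"
# Chunk re-joined with the plain separator.
chunk_plain_sep = "Non amet ipsum dolor. Quiquia eius sit magnam"

print(source.find(chunk_regex_sep))  # -1: the literal backslash never occurs in the source
print(source.find(chunk_plain_sep))  # 0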

@IlyaMichlin
Contributor

IlyaMichlin commented Jul 20, 2023

I see it now, @Karin-Basis.

This bug is also fixed in my PR. I tried running this with the changes:

from langchain.text_splitter import CharacterTextSplitter


lorem_sample = "Non amet ipsum dolor. Quiquia eius sit magnam. Consectetur etincidunt amet velit voluptatem numquam etincidunt est. Dolorem non dolor aliquam ut magnam. Velit ut ut dolore non quisquam quisquam etincidunt. Velit dolorem porro ipsum eius consectetur labore. Neque est sit modi ut. Numquam ut sed labore dolor dolorem"
text_splitter = CharacterTextSplitter(separator=". ", chunk_size=100, chunk_overlap=0, add_start_index=True)
for doc in text_splitter.create_documents([lorem_sample]):
    print("DOC:", doc)

And got the output:

DOC: page_content='Non amet ipsum dolor. Quiquia eius sit magnam' metadata={'start_index': 0}
DOC: page_content='Consectetur etincidunt amet velit voluptatem numquam etincidunt est' metadata={'start_index': 47}
DOC: page_content='Dolorem non dolor aliquam ut magnam. Velit ut ut dolore non quisquam quisquam etincidunt' metadata={'start_index': 116}
DOC: page_content='Velit dolorem porro ipsum eius consectetur labore. Neque est sit modi ut' metadata={'start_index': 206}
DOC: page_content='Numquam ut sed labore dolor dolorem' metadata={'start_index': 280}

hwchase17 pushed a commit that referenced this issue Aug 4, 2023
#7854

Added the ability to use the `separator` as a regex or a simple character.
Fixed a bug where `start_index` was incorrectly reported as -1.

Who can review?
@eyurtsev
@hwchase17 
@mmz-001
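
For reference, a short sketch of how the two modes could be used after this change, assuming the is_separator_regex parameter shown earlier in the thread and that the default treats the separator as a plain string (restoring the 0.0.225 behavior):

from langchain.text_splitter import CharacterTextSplitter

sample_text = "This is a series of short sentences. I want them to be separated at the periods.  Three sentences should be enough."

# Plain-string separator: no escaping needed.
plain_splitter = CharacterTextSplitter(separator=". ", chunk_size=30, chunk_overlap=0)

# Regex separator: opt in explicitly and escape the period.
regex_splitter = CharacterTextSplitter(
    separator=r"\. ", is_separator_regex=True, chunk_size=30, chunk_overlap=0
)

print(plain_splitter.split_text(sample_text))
print(regex_splitter.split_text(sample_text))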
@dosubot

dosubot bot commented Oct 19, 2023

Hi, @Karin-Basis! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you reported an issue with the CharacterTextSplitter in version 0.0.226 of Langchain. Devstein suggested escaping the period in the separator as a workaround, which improved the chunks. However, you mentioned another issue with the metadata start_index being -1 in some cases. IlyaMichlin believed this to be a bug and created a PR to add the ability to control the separator. After you provided code to reproduce the issue, IlyaMichlin confirmed that the bug was fixed in their PR.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to LangChain! Let us know if you have any further questions or concerns.

dosubot bot added the stale label Oct 19, 2023
dosubot bot closed this as not planned Oct 26, 2023
dosubot bot removed the stale label Oct 26, 2023