Documents are not saved in Opensearch when using textsplitters #2036

nirajmchauhan · 2023-07-21T12:37:34Z

I am using CheerioWebBaseLoader to load the documents and save them in the AWS Opensearch. What I realised is that the documents are not getting saved and there is no error as well. But when I remove the TextSplitter then the document is getting saved. Here is my code:

const loader = new CheerioWebBaseLoader(link);

const rawDocs = (await loader.load()).map((d) => {
  return new Document({
    metadata: { originalSource: link },
    pageContent: d.pageContent,
  });
});

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 100,
  chunkOverlap: 10,
});

const docs = await textSplitter.splitDocuments(rawDocs);

await OpenSearchVectorStore.fromDocuments(
  docs,
  new OpenAIEmbeddings(),
  {
    client,
    indexName: "documents",
  }
);

I cannot go ahead without using TextSplitter as if the document is huge, then OpenAI gives 400 errors. Is there a solution to this?

The text was updated successfully, but these errors were encountered:

dosubot · 2023-07-21T14:37:35Z

Answer generated by a 🤖

Answer

I understand that you're having trouble saving documents in AWS OpenSearch when using the TextSplitter in the LangChain framework. Specifically, you're using the CheerioWebBaseLoader to load documents and the RecursiveCharacterTextSplitter to split them. However, when the TextSplitter is used, the documents are not being saved, even though no error is thrown.

The issue might be related to the size of the chunks that your documents are being split into. The RecursiveCharacterTextSplitter is currently set to split your documents into chunks of 100 characters with an overlap of 10 characters. If these chunks are too small, they might not meet the minimum size requirement for indexing in OpenSearch, which could be why the documents are not being saved.

As a potential solution, I suggest increasing the chunkSize and chunkOverlap values when you instantiate the RecursiveCharacterTextSplitter. Here's how you can adjust your code:

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // Increase chunkSize
  chunkOverlap: 200, // Increase chunkOverlap
});

This modification will split your documents into larger chunks, which should meet the indexing requirements of OpenSearch and allow your documents to be saved.

Please note that this is just one potential solution based on the information provided. If this doesn't resolve your issue, or if your experience contradicts this solution, I recommend checking if you're using the latest version of the LangChain framework, as updates often include bug fixes and improvements.

If you find that this solution is correct and you believe this is a bug that could impact other users, we encourage you to make a pull request. Your contribution would be greatly appreciated by the LangChain community.

Sources

langchain/src/text_splitter.ts

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

nirajmchauhan · 2023-07-21T14:41:58Z

Hey 🤖
I changed the chunk & overlap size as you suggested, still the documents are not getting saved.

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

nirajmchauhan · 2023-07-21T15:33:00Z

Ok, after some debugging, I came across this error.

{
  type: "mapper_parsing_exception",
  reason: "failed to parse field [metadata.loc] of type [keyword] in document with id 'bd2f243c-3a51-40a1-8887-d39b327c127d'. Preview of field's value: '{lines={from=1, to=173}}'",
  caused_by: {
    type: "illegal_state_exception",
    reason: "Can't get text on a START_OBJECT at 1:19523",
  },
}

dosubot · 2023-07-21T15:52:03Z

Answer generated by a 🤖

Answer

I found related issues that might be helpful. I did my best to summarize the solutions, but I recommend looking at the related issues yourself.

Open Issues

Opensearch store does not support object in metadata and fail silently

The suggested solution is:

Do not support object in metadata, similar to the approach taken in the Pinecone vector store. This can be seen in the code here: https://github.com/hwchase17/langchainjs/blob/61e89dbaec07b4b50a55efe0f74faec5b2e701fd/langchain/src/vectorstores/pinecone.ts#L59
Throw an appropriate error so that the client knows what's wrong. This can be implemented following the guide here: https://github.com/opensearch-project/opensearch-js/blob/main/guides/bulk.md#handling-errors

As a workaround until the solution is implemented, you can delete the object in metadata using the following code:

  docs.forEach((d) => {
    d.metadata.source = basename(d.metadata.source);
    delete d.metadata.pdf;
    delete d.metadata.loc;
  });

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

jacoblee93 · 2023-07-21T21:40:38Z

The issue seems to be that OpenSearch can't handle certain types of metadata?

I think the bot's response sounds reasonable here - adding a filter step to the vector store itself would be nice too if you feel up for opening a PR?

nirajmchauhan · 2023-07-22T04:07:33Z

I ended up fixing this by adding indexing for {metadata:{loc:{lines:{start: Int, end: Int}}}}

kiraraty · 2023-08-08T02:43:05Z

I ended up fixing this by adding indexing for {metadata:{loc:{lines:{start: Int, end: Int}}}}

What's Int? please

nirajmchauhan · 2023-08-13T08:58:39Z

Its Integer

zhjch05 · 2023-10-08T11:19:30Z

@nirajmchauhan Would you mind share how did you add indexing? Appreciate the help! Facing same issue. Is there a way to add a general dynamic mapping for metadata that automatically handles this?

rushikeshRL · 2023-11-09T00:09:22Z

@zhjch05
You can update the mapping from Opensearch dashboard:

zhjch05 · 2024-02-06T09:12:08Z

@zhjch05 You can update the mapping from Opensearch dashboard:

Thank you!

dosubot bot added the auto:bug Related to a bug, vulnerability, unexpected error with an existing feature label Jul 21, 2023

nirajmchauhan closed this as completed Jul 22, 2023

mattraibert mentioned this issue Dec 12, 2023

community[patch]: Update ElasticSearch mappings to successfully add documents from TextSplitter #3629

Merged

huuyafwww mentioned this issue Feb 1, 2024

community[patch]: Support OpenSearch Serverless #4229

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documents are not saved in Opensearch when using textsplitters #2036

Documents are not saved in Opensearch when using textsplitters #2036

nirajmchauhan commented Jul 21, 2023

dosubot bot commented Jul 21, 2023

nirajmchauhan commented Jul 21, 2023

nirajmchauhan commented Jul 21, 2023

dosubot bot commented Jul 21, 2023

jacoblee93 commented Jul 21, 2023

nirajmchauhan commented Jul 22, 2023

kiraraty commented Aug 8, 2023

nirajmchauhan commented Aug 13, 2023

zhjch05 commented Oct 8, 2023

rushikeshRL commented Nov 9, 2023

zhjch05 commented Feb 6, 2024

Documents are not saved in Opensearch when using textsplitters #2036

Documents are not saved in Opensearch when using textsplitters #2036

Comments

nirajmchauhan commented Jul 21, 2023

dosubot bot commented Jul 21, 2023

Answer

Sources

nirajmchauhan commented Jul 21, 2023

nirajmchauhan commented Jul 21, 2023

dosubot bot commented Jul 21, 2023

Answer

Open Issues

Opensearch store does not support object in metadata and fail silently

jacoblee93 commented Jul 21, 2023

nirajmchauhan commented Jul 22, 2023

kiraraty commented Aug 8, 2023

nirajmchauhan commented Aug 13, 2023

zhjch05 commented Oct 8, 2023

rushikeshRL commented Nov 9, 2023

zhjch05 commented Feb 6, 2024