Use any document as context #112
Replies: 1 comment 2 replies
-
It's an old question, but I'm facing the same problem. Here is an approach based on the functions example.

With this prompt: "Search on document: What did Biden say about Ketanji Brown Jackson in the State of the Union address?", the result will be something like: "Biden said in the State of the Union address that he nominated Judge Ketanji Brown Jackson to serve on the United States Supreme Court. He highlighted her as one of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence."

Below is a (not optimized) example:

```typescript
import * as fs from 'fs'
import { FaissStore } from '@langchain/community/vectorstores/faiss'
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'
import { HuggingFaceTransformersEmbeddings } from '@langchain/community/embeddings/hf_transformers'

// Embed the chunks locally with a small sentence-embedding model
const embeddings = new HuggingFaceTransformersEmbeddings({
  modelName: 'Xenova/all-MiniLM-L6-v2'
})

// Load the document, split it into chunks, and index them in a vector store
const text = fs.readFileSync('./documents/state_of_the_union.txt', 'utf8')
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 512 })
const docs = await textSplitter.createDocuments([text])
const vectorStore = await FaissStore.fromDocuments(docs, embeddings)

// Function definition exposed to the model
export default {
  description: 'Call this if the user asks to retrieve the most recent information about the State of the Union address',
  params: {
    type: 'object',
    properties: {
      input: {
        description: 'what you want to retrieve from this document',
        type: 'string'
      }
    }
  },
  async handler({ input }) {
    console.log('call getDocumentContent')
    // Return only the single most similar chunk
    const result = await vectorStore.similaritySearch(input, 1)
    console.log(result[0].pageContent)
    return result[0].pageContent // just for testing, to refine
  }
}
```

My 2 cents: I've tried to do everything in langchain. For example, I've created a class that uses node-llama-cpp (although in beta) as an LLM with langchain agents, etc., because I find them very interesting. But IMHO they are too focused on OpenAI, so every example that works with OpenAI is likely to fail with the rest, at least for now. The beta 3 of node-llama-cpp is awesome; it works well, and the functions are a big step forward. Currently it only supports a few models, but I expect a lot from this side. So, by using langchain's retrievers together with node-llama-cpp's functions and chat history, it's possible to achieve good results.
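For intuition, here is a dependency-free sketch of what a vector store's `similaritySearch` does under the hood: rank the embedded chunks by cosine similarity to the embedded query and return the top matches. The `cosineSimilarity` and `topK` names are illustrative helpers, not library APIs, and real stores like Faiss use approximate-nearest-neighbor indexes rather than a full sort:

```typescript
interface EmbeddedChunk { text: string; vector: number[] }

// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Return the k chunks most similar to the query vector
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return [...chunks]
    .sort((x, y) => cosineSimilarity(query, y.vector) - cosineSimilarity(query, x.vector))
    .slice(0, k)
}
```

The function handler above is just this with `k = 1`: the model's `input` string is embedded and the closest chunk is returned as context.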
-
My question would be: what is the best way to use the model to analyze a large document? From what I noticed, just passing it in as plain text does not work. I would like an example of how to split the text and provide it as context, as is done with langchain.
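Until a full example is posted, here is a minimal, dependency-free sketch of the splitting step: fixed-size character chunks with overlap, similar in spirit to what langchain's `RecursiveCharacterTextSplitter` produces (the `chunkSize` and `overlap` defaults here are illustrative). Each chunk can then be embedded and indexed so that only the relevant pieces, not the whole document, are passed to the model as context:

```typescript
// Split text into chunks of at most chunkSize characters,
// with `overlap` characters shared between consecutive chunks
// so that sentences cut at a boundary still appear intact somewhere.
function splitText(text: string, chunkSize = 512, overlap = 64): string[] {
  const chunks: string[] = []
  let start = 0
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length) break
    start += chunkSize - overlap
  }
  return chunks
}
```

Note that langchain's splitter is smarter than this: it tries to break on paragraph and sentence boundaries before falling back to raw character positions.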