
Add Document To Vector Store #838

Merged

Conversation

khoangothe
Contributor

Documents, crawled URLs, and websites will be chunked and loaded into the provided vector store if vector_store is not None. Although adding the data in Document would be more efficient, I think this solution keeps the code decoupled and easier to maintain.

By default these changes won't add any new features, but new applications can be built on top of the vector store (like chatting with the sources).
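The chunk-and-load behavior described above can be sketched in plain Python. This is a minimal illustration, not the PR's actual implementation: chunk_text and load_into_store are hypothetical helper names, and the list-based store stands in for a real vector store's add_texts call.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping fixed-size chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks


def load_into_store(vector_store, documents: list[str]) -> int:
    """Chunk each document and add it to the store; no-op if no store is given."""
    if vector_store is None:
        return 0
    chunks = [c for doc in documents for c in chunk_text(doc)]
    vector_store.extend(chunks)  # stand-in for vector_store.add_texts(chunks)
    return len(chunks)
```

The `if vector_store is None` guard mirrors the PR's design: when no store is passed, research behaves exactly as before.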

@khoangothe
Contributor Author

It would be cool to have this idea reviewed! I'll add a test script if I'm allowed to proceed.

@assafelovic
Owner

Hey @khoangothe this is a great direction! Can you share an example of how it can be used?

@khoangothe
Contributor Author

khoangothe commented Sep 13, 2024

Hi @assafelovic, thanks for the review! I just added a commit documenting how it should be used. Here's the code I used to test locally. Basically, the scraped data is stored in the vector store whenever one is defined and report_source is not langchain_vectorstore (in that case the vector_store is used as a knowledge source instead of being written to).

Will add a test script soon.

import asyncio

from gpt_researcher import GPTResearcher

from langchain_community.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

from dotenv import load_dotenv

load_dotenv()

async def main():
    vector_store = InMemoryVectorStore(embedding=OpenAIEmbeddings())

    query = "Which one is the best LLM"

    # Create an instance of GPTResearcher
    researcher = GPTResearcher(
        query=query,
        report_type="research_report",
        report_source="web",
        vector_store=vector_store, 
    )

    # Conduct research and write the report
    await researcher.conduct_research()

    # Check if the vector_store contains information from the sources
    related_contexts = await vector_store.asimilarity_search("GPT-4", k=5)
    print(related_contexts)
    print(len(related_contexts))


asyncio.run(main())

@assafelovic
Owner

Thanks @khoangothe, excuse me if I might be missing something, but how is this different from this? https://docs.gptr.dev/docs/gpt-researcher/context/vector-stores

@khoangothe
Contributor Author

khoangothe commented Sep 14, 2024

@assafelovic Sorry if my examples weren't clear enough. The feature you linked lets you talk to your vector store, so the store must already contain information for gpt-researcher to research on (when report_source is set to langchain_vectorstore). My changes allow GPT-Researcher to add new content to the vector store: the scraped websites and documents get written into it, so you can later reuse the store for other purposes, like RAG.

In the example, we start with an empty InMemoryVectorStore; right after await researcher.conduct_research(), the vector store has everything stored, and related_contexts contains the scraped information most similar to the query.
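The distinction between the two modes can be summed up in a small decision sketch. This is illustrative only: the function name and the "unused"/"read"/"write" labels are my assumptions, not identifiers from the PR.

```python
def vector_store_mode(vector_store, report_source: str) -> str:
    """Decide how a research run uses the vector store."""
    if vector_store is None:
        # No store supplied: behavior is unchanged from before the PR.
        return "unused"
    if report_source == "langchain_vectorstore":
        # Existing feature: the store already holds the knowledge
        # and is only read from during research.
        return "read"
    # New behavior from this PR: scraped sources are chunked and
    # written into the store for later reuse (e.g. RAG).
    return "write"
```

So the linked docs cover the "read" path, while this PR adds the "write" path.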

@hslee16
Contributor

hslee16 commented Sep 14, 2024

Looks good 👍🏼

@ElishaKay
Collaborator

ElishaKay commented Sep 22, 2024

@assafelovic

This path of persistent, reusable vector storage that can be leveraged across reports and follow-up questions is very interesting to me.

I've merged this branch into #819 and am planning to test extensively with PGVector storage.

@ElishaKay ElishaKay mentioned this pull request Sep 22, 2024
@khoangothe
Contributor Author

khoangothe commented Oct 5, 2024

@ElishaKay @assafelovic Hi guys, I was able to resolve the merge conflict and provided test cases for the scenarios I implemented. For each type of knowledge source (URLs, hybrid, local, web, LangChain documents), data will be ingested into the vector_store the user provided; usage is shown in the tests (I added a PDF to test the local and hybrid functionality). I also raised an issue on Discord, so hopefully you can check it out.
It would be great to have this PR reviewed, tested, and hopefully merged! I also want to implement chatting with the data source, but that depends on this PR. Thanks for your help!

Owner

@assafelovic assafelovic left a comment


This is awesome @khoangothe kudos for the hard work and implementation. Looking forward to the next PRs that can empower this

@assafelovic assafelovic merged commit 9ed35db into assafelovic:master Oct 6, 2024
@danieldekay
Contributor

Love the idea, as you can build out a knowledge base through various queries this way. It adds a bit more human-in-the-loop for complex topics.

One use case could be if you already have a corpus of literature, but want to add more recent content via GPT-R's searches.
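That use case can be sketched as one persistent store accumulating chunks across several research runs. A minimal illustration, assuming a list-based stand-in store and a hypothetical run_query helper in place of a real GPTResearcher call:

```python
def run_query(store: list[str], query: str, scraped: list[str]) -> None:
    """Pretend research run: tag scraped snippets with the query and persist them."""
    store.extend(f"[{query}] {s}" for s in scraped)


# One knowledge base grows across multiple queries over time.
knowledge_base: list[str] = []
run_query(knowledge_base, "best LLM", ["GPT-4 ranks highly on benchmarks"])
run_query(knowledge_base, "open models", ["Llama 3 is a strong open model"])
# The store now holds context from both runs and can seed follow-up
# questions or a langchain_vectorstore-sourced report.
```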
