Web research retriever #8102
Conversation
os.environ["GOOGLE_API_KEY"] = self.GOOGLE_API_KEY | ||
search = GoogleSearchAPIWrapper() | ||
except Exception as e: | ||
print(f"Error: {str(e)}") |
Stray print. Don't we want to raise here?
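A minimal sketch of raising instead of printing (the try/except shape comes from the diff above; chaining with "from e" is an assumption):

try:
    os.environ["GOOGLE_API_KEY"] = self.GOOGLE_API_KEY
    search = GoogleSearchAPIWrapper()
except Exception as e:
    # Surface the failure to the caller instead of swallowing it
    raise Exception(f"Error: {str(e)}") from e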
try:
    os.environ["GOOGLE_CSE_ID"] = self.GOOGLE_CSE_ID
    os.environ["GOOGLE_API_KEY"] = self.GOOGLE_API_KEY
    search = GoogleSearchAPIWrapper()
This should be passed in, so users can configure it outside.
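For example, a sketch of caller-side configuration (the retriever constructor call is hypothetical at this point in the review):

from langchain.utilities import GoogleSearchAPIWrapper

# Caller builds and configures the wrapper; the retriever just receives it
search = GoogleSearchAPIWrapper(
    google_api_key="...",  # supplied by the caller, not read from self
    google_cse_id="...",
)
retriever = WebResearchRetriever(search=search)  # hypothetical; other required fields omitted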
# Get search questions
logger.info("Generating questions for Google Search ...")
llm_chain = LLMChain(
The llm_chain should be an attribute on this class; it can be constructed from a class method.
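A sketch of that shape, with assumed names (retrieval methods omitted):

from langchain.chains import LLMChain
from langchain.llms.base import BaseLLM
from langchain.prompts import PromptTemplate
from langchain.schema import BaseRetriever

class WebResearchRetriever(BaseRetriever):
    llm_chain: LLMChain  # stored on the instance instead of built per call

    @classmethod
    def from_llm(cls, llm: BaseLLM, prompt: PromptTemplate, **kwargs):
        # Build the chain once from an LLM and prompt
        return cls(llm_chain=LLMChain(llm=llm, prompt=prompt), **kwargs)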
# This can use rate limit w/ embedding
logger.info("Grabbing most relevant splits from urls ...")
filtered_splits = []
text_splitter = RecursiveCharacterTextSplitter(
This should be an argument on the class, so the user can configure it.
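Concretely, as a class-level field with a default (this matches the later revision quoted further down):

text_splitter: RecursiveCharacterTextSplitter = Field(
    RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50),
    description="Text splitter for splitting web pages into chunks",
)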
try:
    text = await response.text()
except UnicodeDecodeError:
    print(f"Failed to decode content from {url}")
Use logger, not print.
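E.g., with a module-level logger (the enclosing async helper is assumed for context):

import logging

logger = logging.getLogger(__name__)

async def _read_text(response, url):
    try:
        return await response.text()
    except UnicodeDecodeError:
        # Log at warning level instead of printing to stdout
        logger.warning("Failed to decode content from %s", url)
        return None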
)
llm_chain: LLMChain
search: GoogleSearchAPIWrapper = Field(..., description="Google Search API Wrapper")
search_prompt: PromptTemplate = Field(
I don't think this is needed on the base class anymore.
)
DEFAULT_SEARCH_PROMPT = PromptTemplate( |
This is very llama2-specific, right? Could we use a prompt selector for this?
It appears we can do it like this:

PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[(isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)],
)
Edit: actually, the above errors out; you can't pass a conditional isinstance(llm, LlamaCpp). The examples use the imported is_chat_model, but that doesn't work for this case.
But of course this also works:

if isinstance(llm, LlamaCpp):
    prompt = DEFAULT_LLAMA_SEARCH_PROMPT
    print("Using LlamaCpp")
else:
    prompt = DEFAULT_SEARCH_PROMPT
More docs / details on what using a ConditionalPromptSelector buys would be helpful.
Could you do:

PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[(lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)],
)
From the code, looks like the first element of the conditional has to be a callable taking in a language model and returning a bool. https://github.com/langchain-ai/langchain/blob/00de334f81abddf4ce6e46a931a505fa21cf7d98/libs/langchain/langchain/chains/prompt_selector.py#L26C10-L26C10
Ya, from discussion w/ Harrison, usage appears to be:

QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[
        (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)
    ],
)
prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)
    WebResearchRetriever
"""

if isinstance(llm, LlamaCpp):
We still want to let the user pass in a prompt. This should be

prompt: Optional[PromptTemplate] = None

and then you should be able to do the prompt selector as @efriis coded.
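Putting the two together, a sketch of the classmethod (signature and defaults are assumptions based on this thread, not necessarily the merged code):

@classmethod
def from_llm(
    cls,
    vectorstore: VectorStore,
    llm: BaseLLM,
    search: GoogleSearchAPIWrapper,
    prompt: Optional[PromptTemplate] = None,
    **kwargs: Any,
) -> "WebResearchRetriever":
    if prompt is None:
        # Fall back to the selector when no prompt is passed in
        QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
            default_prompt=DEFAULT_SEARCH_PROMPT,
            conditionals=[
                (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)
            ],
        )
        prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)
    llm_chain = LLMChain(llm=llm, prompt=prompt)
    return cls(vectorstore=vectorstore, llm_chain=llm_chain, search=search, **kwargs)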
    RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50),
    description="Text splitter for splitting web pages into chunks",
)
urls: List[str] = Field(
remove
"""Returns num_serch_results pages per Google search.""" | ||
try: | ||
result = self.search.results(query, num_search_results) | ||
except Exception as e: |
Doesn't seem necessary.
    raise Exception(f"Error: {str(e)}")
return result

def get_urls(self) -> List[str]:
Delete.
text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
    doc_splits = text_splitter.split_documents([doc])
    # Proect against very large documents
Typo in "protect".
logger.info("Grabbing most relevant splits from urls ...")
filtered_splits = []
text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
Let's remove.
logger.info("Grabbing most relevant splits from urls ...")
_splits = []
text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
This can just be:

docs = loader.load()
docs = html2text.transform_documents(docs)
docs = self.text_splitter.split_documents(docs)
self.vectorstore.add_documents(_splits)
self.url_database.extend(new_urls)

# Search for relevant splits
Can you add a TODO to make this async?
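For instance (the loop over generated questions is assumed from the surrounding code):

# Search for relevant splits
# TODO: make this async
docs = []
for query in questions:
    docs.extend(self.vectorstore.similarity_search(query))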
Given a user question, this will:

* Use LLM to generate a set of queries.
* Query for each.
* The URLs from search results are stored in self.urls.
* A check is performed for any new URLs that haven't been processed yet (not in self.url_database).
* Only these new URLs are loaded, transformed, and added to the vectorstore.
* The vectorstore is queried for relevant documents based on the questions generated by the LLM.
* Only unique documents are returned as the final result.

This code will avoid reprocessing of URLs across multiple runs of similar queries, which should improve the performance of the retriever. It also keeps track of all URLs that have been processed, which could be useful for debugging or understanding the retriever's behavior.

Co-authored-by: Harrison Chase <[email protected]>
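A sketch of end-to-end usage, assuming the retriever lands under langchain.retrievers (import path, from_llm signature, and defaults are assumptions at this stage of review):

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.web_research import WebResearchRetriever
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.vectorstores import Chroma

# The vectorstore caches splits; url_database keeps URLs from being reprocessed
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="./chroma_db")
llm = ChatOpenAI(temperature=0)
search = GoogleSearchAPIWrapper()  # reads GOOGLE_API_KEY / GOOGLE_CSE_ID

retriever = WebResearchRetriever.from_llm(vectorstore=vectorstore, llm=llm, search=search)
docs = retriever.get_relevant_documents("How do LLM-powered autonomous agents work?")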