OpenAI embeddings API extension, embeddings are unusable? #4197

Closed
1 task done
gestalt73 opened this issue Oct 6, 2023 · 1 comment
Labels: bug (Something isn't working), stale

Comments

@gestalt73

Describe the bug

Apologies for not figuring out a better title. I've been banging my head against the code for the better part of a day trying to figure out what's going on.

LangChain with a Chroma vector store is unable to retrieve relevant chunks when the embeddings come from the openai extension's embeddings API.

This is after applying the proposed pull request: Pull Request 4147

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

I'm attaching a simple Python script to demonstrate, and a profanity-free version of chapter 2 of a free online book.
Commenting or uncommenting the openai_api_base lines routes the requests to either OpenAI or your local embeddings API.
chapter2.txt

import langchain
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader

from PyPDF2 import PdfReader

langchain.debug=True
langchain.verbose=True

myembeddings = OpenAIEmbeddings(
    openai_api_key="sk-yourownpersonalopenapikey"
    #,openai_api_base="http://192.168.50.108:5007/v1"
    
)

myllm = OpenAI(
    openai_api_key="sk-yourownpersonalopenapikey"
    #,openai_api_base="http://192.168.50.108:5007/v1"
    
)

loader = TextLoader("chapter2.txt")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, 
    chunk_overlap=50, 
    separators=[" ", ",", "\n"]
)

texts = text_splitter.split_documents(documents)

docsearch = Chroma.from_documents(
    texts,
    myembeddings
)

results = docsearch.search("Who is Caroline?", search_type="similarity")

print(results)

qa = RetrievalQA.from_chain_type(llm=myllm, chain_type="stuff", retriever=docsearch.as_retriever())

query = "Who is Caroline?"
qa.run(query)
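
To take LangChain out of the picture, something like the following could send the same text straight to each backend's /embeddings endpoint so the raw vectors can be compared. This is only a sketch: the helper names are my own, the dummy key and default model name are placeholders, and it assumes the local extension accepts the same request shape as OpenAI's embeddings endpoint.

```python
import json
import urllib.request


def build_embeddings_request(base_url, texts, api_key="sk-dummy",
                             model="text-embedding-ada-002"):
    """Build a POST request for an OpenAI-style /embeddings endpoint.

    base_url is expected to end in /v1, like the openai_api_base above.
    """
    payload = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embeddings",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def fetch_embeddings(base_url, texts, **kwargs):
    """Send the request and return one vector per input text."""
    request = build_embeddings_request(base_url, texts, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    # OpenAI-style responses carry the vectors under data[i].embedding.
    return [item["embedding"] for item in body["data"]]
```

Calling fetch_embeddings("http://192.168.50.108:5007/v1", ["Who is Caroline?"]) would hit the local server; the same call with the OpenAI base URL and a real key would give the reference vectors.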

When the chunks and the query are embedded through the official OpenAI API, the search returns results like the following.
These retrievals are Caroline-related, as you'd expect from a simple vector DB search.

[
Document(page_content='the myriad threads of causality to find out which of the billions of chemicals, which errant cell, was responsible for this person\'s physiological collapse? One thing Prime Intellect knew: It had to figure it out.\nIt could not, through inaction, allow Caroline to die.\n"She\'s still in trouble. Look at her pupils."\n"It\'s the morphine."\nEveryone looked at the older nurse, whose name was Jill. "The chart must be wrong," she said. "I gave her what it said."\n"She has a tolerance," AnneMarie said, and', metadata={'source': 'chapter2.txt'}), 

Document(page_content='The drops of residual solution within them were remarkably pure, and Prime Intellect easily singled out the large organic molecule they carried. Then it created an automatic process to scan Caroline\'s body molecule by molecule, eliminating each and every molecule of morphine that it found. This took three minutes, and created a faintly visible blue glow.\n\nThis was the human onlookers\' first clue, other than Caroline\'s miraculously restarted heart, as to what was happening.\n"What in tarnation!,"', metadata={'source': 'chapter2.txt'}),

Document(page_content="the pain, which had subsided for real for the first time in years, returned. Caroline moaned. But Prime Intellect didn't know about that part of it, not yet.\nThere was still a whole constellation of stuff wrong with Caroline Hubert's body, and emboldened by its success it set about correcting what it could. It found long chain molecules, which it would later learn were called collagens, cross-linked. It un-cross-linked them. It found damaged DNA, which it fixed. It found whole masses of cells", metadata={'source': 'chapter2.txt'}), 

Document(page_content='she hoped that the impossibility would go away if she challenged it.\n"I need a drink," said the doctor who had come with the machine to re-start Caroline\'s heart.\nPrime Intellect stopped working. There were still huge differences between Caroline and the others. Prime Intellect did not yet realize the differences were due to Caroline\'s age. It needed more information, and it needed finer control to analyse the situation. But it was at a bottleneck; it could not stop monitoring Caroline, whose', metadata={'source': 'chapter2.txt'})
]

When the chunks and the query are embedded through the local openai extension API, the results do not relate to the search term at all.

[
Document(page_content='found Lawrence sitting on one of ChipTecs\' park benches, watching some pigeons play. He wished very much that he could have fed the pigeons, but he had no food for them. They strutted up to him and cooed, not comprehending that a human could lack for something.\nThe pigeons scattered as the nation\'s designated military representatives marched up.\n"You have to turn it off," Blake said directly. His tone made it clear that he expected obedience.\n"Circuit breakers are in the basement," Lawrence', metadata={'source': 'chapter2.txt'}),

Document(page_content="most likely pulled the plug on this awesome new technology, a technology which might just vindicate Dr. Lawrence's nonviolent approach. Blake had stopped short, but only just short, of threatening to call the Strategic Air Command and have the building nuked. Privately, he still held that out as an option if Prime Intellect wasn't somehow neutralized. It would take some doing, but Blake was one of the few people in the country who could demand an air strike against Silicon Valley and, just", metadata={'source': 'chapter2.txt'}),

Document(page_content='was a tricky business; the words Lawrence used only had meaning through other associations within the GAT, and those meanings weren\'t always what Lawrence thought they were. But now he would try to plug the drain for good.\n"Force Association: Use of any technology to manipulate the environment of a human being without its permission shall be a violation of the First Law of severity two."\nThere was no immediate response.\nThen:\n\n*\tASSOCIATION REJECTED BY FIRST LAW ARBITRATOR DUE TO AN EXISTING', metadata={'source': 'chapter2.txt'}),

Document(page_content='from the TV, and words began to scroll across the screen:\n\n*\tJOHN TAYLOR IS IN THE ROOM WITH HIM. HE IS DIRECTING STEBBINS.\n\nLawrence read this as he talked. "Jail for what? I just borrowed the papers to see if Prime Intellect could expand on them."\nAnother pause. "What? It didn\'t come up with anything, did it?"\n"Well, it\'s..." (Why do you care if you\'ve just been fired? Lawrence wondered.)\n\n*\tSTEBBINS IS LYING. HE WENT TO TAYLOR AS SOON YOU LEFT AND TOLD HIM THAT YOU BROUGHT THEM TO', metadata={'source': 'chapter2.txt'})
]
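
To check whether the vectors themselves are at fault rather than Chroma, the chunks could be ranked by hand once the vectors are available. The sketch below assumes the query vector and chunk vectors have already been fetched from one backend; the function names are hypothetical, and cosine similarity is my own choice here, which may differ from the distance Chroma uses by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors, chunk_texts, top_k=4):
    """Return the top_k (score, text) pairs, highest similarity first."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for vector, text in zip(chunk_vectors, chunk_texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

If the hand-ranked top results from the local vectors are just as unrelated as the list above, that would point at the embeddings rather than at the vector store.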

Screenshot

No response

Logs

(textgen) alansrobotlab@goliath:~/Documents/brayden/langchain$ /home/alansrobotlab/anaconda3/envs/textgen/bin/python /home/alansrobotlab/Documents/brayden/langchain/brokenembeddings.py
[Document(page_content='found Lawrence sitting on one of ChipTecs\' park benches, watching some pigeons play. He wished very much that he could have fed the pigeons, but he had no food for them. They strutted up to him and cooed, not comprehending that a human could lack for something.\nThe pigeons scattered as the nation\'s designated military representatives marched up.\n"You have to turn it off," Blake said directly. His tone made it clear that he expected obedience.\n"Circuit breakers are in the basement," Lawrence', metadata={'source': 'chapter2.txt'}), Document(page_content="most likely pulled the plug on this awesome new technology, a technology which might just vindicate Dr. Lawrence's nonviolent approach. Blake had stopped short, but only just short, of threatening to call the Strategic Air Command and have the building nuked. Privately, he still held that out as an option if Prime Intellect wasn't somehow neutralized. It would take some doing, but Blake was one of the few people in the country who could demand an air strike against Silicon Valley and, just", metadata={'source': 'chapter2.txt'}), Document(page_content='was a tricky business; the words Lawrence used only had meaning through other associations within the GAT, and those meanings weren\'t always what Lawrence thought they were. But now he would try to plug the drain for good.\n"Force Association: Use of any technology to manipulate the environment of a human being without its permission shall be a violation of the First Law of severity two."\nThere was no immediate response.\nThen:\n\n*\tASSOCIATION REJECTED BY FIRST LAW ARBITRATOR DUE TO AN EXISTING', metadata={'source': 'chapter2.txt'}), Document(page_content='from the TV, and words began to scroll across the screen:\n\n*\tJOHN TAYLOR IS IN THE ROOM WITH HIM. HE IS DIRECTING STEBBINS.\n\nLawrence read this as he talked. "Jail for what? I just borrowed the papers to see if Prime Intellect could expand on them."\nAnother pause. "What? It didn\'t come up with anything, did it?"\n"Well, it\'s..." (Why do you care if you\'ve just been fired? Lawrence wondered.)\n\n*\tSTEBBINS IS LYING. HE WENT TO TAYLOR AS SOON YOU LEFT AND TOLD HIM THAT YOU BROUGHT THEM TO', metadata={'source': 'chapter2.txt'})]
[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "Who is Caroline?"
}
[chain/start] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
{
  "question": "Who is Caroline?",
  "context": "found Lawrence sitting on one of ChipTecs' park benches, watching some pigeons play. He wished very much that he could have fed the pigeons, but he had no food for them. They strutted up to him and cooed, not comprehending that a human could lack for something.\nThe pigeons scattered as the nation's designated military representatives marched up.\n\"You have to turn it off,\" Blake said directly. His tone made it clear that he expected obedience.\n\"Circuit breakers are in the basement,\" Lawrence\n\nmost likely pulled the plug on this awesome new technology, a technology which might just vindicate Dr. Lawrence's nonviolent approach. Blake had stopped short, but only just short, of threatening to call the Strategic Air Command and have the building nuked. Privately, he still held that out as an option if Prime Intellect wasn't somehow neutralized. It would take some doing, but Blake was one of the few people in the country who could demand an air strike against Silicon Valley and, just\n\nwas a tricky business; the words Lawrence used only had meaning through other associations within the GAT, and those meanings weren't always what Lawrence thought they were. But now he would try to plug the drain for good.\n\"Force Association: Use of any technology to manipulate the environment of a human being without its permission shall be a violation of the First Law of severity two.\"\nThere was no immediate response.\nThen:\n\n*\tASSOCIATION REJECTED BY FIRST LAW ARBITRATOR DUE TO AN EXISTING\n\nfrom the TV, and words began to scroll across the screen:\n\n*\tJOHN TAYLOR IS IN THE ROOM WITH HIM. HE IS DIRECTING STEBBINS.\n\nLawrence read this as he talked. \"Jail for what? I just borrowed the papers to see if Prime Intellect could expand on them.\"\nAnother pause. \"What? It didn't come up with anything, did it?\"\n\"Well, it's...\" (Why do you care if you've just been fired? Lawrence wondered.)\n\n*\tSTEBBINS IS LYING. HE WENT TO TAYLOR AS SOON YOU LEFT AND TOLD HIM THAT YOU BROUGHT THEM TO"
}
[llm/start] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain > 5:llm:OpenAI] Entering LLM run with input:
{
  "prompts": [
    "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nfound Lawrence sitting on one of ChipTecs' park benches, watching some pigeons play. He wished very much that he could have fed the pigeons, but he had no food for them. They strutted up to him and cooed, not comprehending that a human could lack for something.\nThe pigeons scattered as the nation's designated military representatives marched up.\n\"You have to turn it off,\" Blake said directly. His tone made it clear that he expected obedience.\n\"Circuit breakers are in the basement,\" Lawrence\n\nmost likely pulled the plug on this awesome new technology, a technology which might just vindicate Dr. Lawrence's nonviolent approach. Blake had stopped short, but only just short, of threatening to call the Strategic Air Command and have the building nuked. Privately, he still held that out as an option if Prime Intellect wasn't somehow neutralized. It would take some doing, but Blake was one of the few people in the country who could demand an air strike against Silicon Valley and, just\n\nwas a tricky business; the words Lawrence used only had meaning through other associations within the GAT, and those meanings weren't always what Lawrence thought they were. But now he would try to plug the drain for good.\n\"Force Association: Use of any technology to manipulate the environment of a human being without its permission shall be a violation of the First Law of severity two.\"\nThere was no immediate response.\nThen:\n\n*\tASSOCIATION REJECTED BY FIRST LAW ARBITRATOR DUE TO AN EXISTING\n\nfrom the TV, and words began to scroll across the screen:\n\n*\tJOHN TAYLOR IS IN THE ROOM WITH HIM. HE IS DIRECTING STEBBINS.\n\nLawrence read this as he talked. \"Jail for what? I just borrowed the papers to see if Prime Intellect could expand on them.\"\nAnother pause. \"What? It didn't come up with anything, did it?\"\n\"Well, it's...\" (Why do you care if you've just been fired? Lawrence wondered.)\n\n*\tSTEBBINS IS LYING. HE WENT TO TAYLOR AS SOON YOU LEFT AND TOLD HIM THAT YOU BROUGHT THEM TO\n\nQuestion: Who is Caroline?\nHelpful Answer:"
  ]
}
[llm/end] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain > 5:llm:OpenAI] [551ms] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "She is a woman who has worked closely with John Taylor since she first joined the company. She is his right-hand person, and her job is to help him keep track of all the details of the project.\n",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "total_tokens": 639,
      "prompt_tokens": 594,
      "completion_tokens": 45
    },
    "model_name": "text-davinci-003"
  },
  "run": null
}
[chain/end] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] [552ms] Exiting Chain run with output:
{
  "text": "She is a woman who has worked closely with John Taylor since she first joined the company. She is his right-hand person, and her job is to help him keep track of all the details of the project.\n"
}
[chain/end] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] [552ms] Exiting Chain run with output:
{
  "output_text": "She is a woman who has worked closely with John Taylor since she first joined the company. She is his right-hand person, and her job is to help him keep track of all the details of the project.\n"
}
[chain/end] [1:chain:RetrievalQA] [565ms] Exiting Chain run with output:
{
  "result": "She is a woman who has worked closely with John Taylor since she first joined the company. She is his right-hand person, and her job is to help him keep track of all the details of the project.\n"
}
(textgen) alansrobotlab@goliath:~/Documents/brayden/langchain$

System Info

Ubuntu 22.04
anaconda
python 3.10
rtx 3090

The server was launched with the following bash script and the settings YAML it references.

langchain7.sh
CUDA_VISIBLE_DEVICES=1 \
	python server.py \
	--listen \
	--listen-port 7807 \
	--verbose \
	--loader exllamav2 \
	--model 'TheBloke_CodeLlama-7B-Instruct-GPTQ' \
	--max_seq_len 8192 \
	--settings langchain7.yaml \
	--extensions openai

langchain7.yaml
openai-port: 5007
openai-embedding_device: cuda
embedding_model: text-embedding-ada-002
openai-sd_webui_url: http://192.168.50.108:7807
openai-debug: 1
truncation_length: 8192
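
The yaml above names text-embedding-ada-002 as embedding_model, but I have not confirmed which model the extension actually loads for that name. OpenAI's own text-embedding-ada-002 returns 1536-dimensional vectors, so a quick summary of whatever the local endpoint returns might show a mismatch or a degenerate result. The helper below is a hypothetical sketch for that check, not part of the extension or of LangChain.

```python
import math


def summarize_embeddings(vectors):
    """Summarize a list of embedding vectors for a quick sanity check."""
    dimensions = sorted({len(v) for v in vectors})
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    has_bad_values = any(
        math.isnan(x) or math.isinf(x) for v in vectors for x in v
    )
    return {
        "count": len(vectors),
        # More than one entry here means the vectors disagree in length.
        "dimensions": dimensions,
        "min_norm": min(norms) if norms else None,
        "max_norm": max(norms) if norms else None,
        "has_nan_or_inf": has_bad_values,
        # Identical vectors for different texts would make ranking meaningless.
        "all_identical": len(vectors) > 1
        and all(v == vectors[0] for v in vectors[1:]),
    }
```

Running this on the vectors for a handful of different chunks from each backend would show whether the local ones differ in dimension, contain invalid values, or are all the same.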
gestalt73 added the bug (Something isn't working) label on Oct 6, 2023

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
