Capitalization in query instruction prompt? #4

Closed
HanClinto opened this issue Nov 9, 2023 · 8 comments

@HanClinto
Contributor

HanClinto commented Nov 9, 2023

Is there any particular reason why the word "Religious" is capitalized in the query instruction prompt?

query_instruction="Represent the Religious Bible verse text for semantic search:",

It's odd -- I changed it locally to be "religious" instead of "Religious", regenerated the database, and then I get slightly different results.

Example semantic search query: "What are the angriest verses in the Bible?"

Results w/ capitalized "Religious":

DEU 32

For a fire is kindled in my anger, that burns to the lowest Sheol, devours the earth with its increase, and sets the foundations of the mountains on fire. “I will heap evils on them. I will spend my arrows on them. They shall be wasted with hunger, and devoured with burning heat and bitter destruction. I will send the teeth of animals on them, with the venom of vipers that glide in the dust. Outside the sword will bereave, and in the rooms, terror on both young man and virgin, the nursing infant with the gray-haired man. I said that I would scatter them afar. I would make their memory to cease from among men; were it not that I feared the provocation of the enemy, lest their adversaries should judge wrongly, lest they should say, ‘Our hand is exalted; Yahweh has not done all this.’” For they are a nation void of counsel. There is no understanding in them. Oh that they were wise, that they understood this, that they would consider their latter end!

Similarity Score: 0.8714663982391357

JER 44

Therefore my wrath and my anger was poured out, and was kindled in the cities of Judah and in the streets of Jerusalem; and they are wasted and desolate, as it is today.’ “Therefore now Yahweh, the God of Armies, the God of Israel, says: ‘Why do you commit great evil against your own souls, to cut off from yourselves man and woman, infant and nursing child out of the middle of Judah, to leave yourselves no one remaining, in that you provoke me to anger with the works of your hands, burning incense to other gods in the land of Egypt where you have gone to live, that you may be cut off, and that you may be a curse and a reproach among all the nations of the earth? Have you forgotten the wickedness of your fathers, the wickedness of the kings of Judah, the wickedness of their wives, your own wickedness, and the wickedness of your wives which they committed in the land of Judah and in the streets of Jerusalem?

Similarity Score: 0.8698485493659973

JON 4

But it displeased Jonah exceedingly, and he was angry. He prayed to Yahweh, and said, “Please, Yahweh, wasn’t this what I said when I was still in my own country? Therefore I hurried to flee to Tarshish, for I knew that you are a gracious God and merciful, slow to anger, and abundant in loving kindness, and you relent of doing harm. Therefore now, Yahweh, take, I beg you, my life from me, for it is better for me to die than to live.” Yahweh said, “Is it right for you to be angry?” Then Jonah went out of the city and sat on the east side of the city, and there made himself a booth and sat under it in the shade, until he might see what would become of the city. Yahweh God prepared a vine and made it to come up over Jonah, that it might be a shade over his head to deliver him from his discomfort. So Jonah was exceedingly glad because of the vine. But God prepared a worm at dawn the next day, and it chewed on the vine so that it withered.

Similarity Score: 0.859514057636261

EZK 5

Therefore as I live,’ says the Lord Yahweh, ‘surely, because you have defiled my sanctuary with all your detestable things, and with all your abominations, therefore I will also diminish you. My eye won’t spare, and I will have no pity. A third part of you will die with the pestilence, and they will be consumed with famine within you. A third part will fall by the sword around you. A third part I will scatter to all the winds, and will draw out a sword after them. “‘Thus my anger will be accomplished, and I will cause my wrath toward them to rest, and I will be comforted. They will know that I, Yahweh, have spoken in my zeal, when I have accomplished my wrath on them. “‘Moreover I will make you a desolation and a reproach among the nations that are around you, in the sight of all that pass by.

Similarity Score: 0.8590936660766602

Results w/ uncapitalized "religious":

LUK 11

As he said these things to them, the scribes and the Pharisees began to be terribly angry, and to draw many things out of him, lying in wait for him, and seeking to catch him in something he might say, that they might accuse him.

Similarity Score: 0.8797159790992737

DEU 32

For a fire is kindled in my anger, that burns to the lowest Sheol, devours the earth with its increase, and sets the foundations of the mountains on fire. “I will heap evils on them. I will spend my arrows on them. They shall be wasted with hunger, and devoured with burning heat and bitter destruction. I will send the teeth of animals on them, with the venom of vipers that glide in the dust. Outside the sword will bereave, and in the rooms, terror on both young man and virgin, the nursing infant with the gray-haired man. I said that I would scatter them afar. I would make their memory to cease from among men; were it not that I feared the provocation of the enemy, lest their adversaries should judge wrongly, lest they should say, ‘Our hand is exalted; Yahweh has not done all this.’” For they are a nation void of counsel. There is no understanding in them. Oh that they were wise, that they understood this, that they would consider their latter end!

Similarity Score: 0.8709396719932556

JER 44

Therefore my wrath and my anger was poured out, and was kindled in the cities of Judah and in the streets of Jerusalem; and they are wasted and desolate, as it is today.’ “Therefore now Yahweh, the God of Armies, the God of Israel, says: ‘Why do you commit great evil against your own souls, to cut off from yourselves man and woman, infant and nursing child out of the middle of Judah, to leave yourselves no one remaining, in that you provoke me to anger with the works of your hands, burning incense to other gods in the land of Egypt where you have gone to live, that you may be cut off, and that you may be a curse and a reproach among all the nations of the earth? Have you forgotten the wickedness of your fathers, the wickedness of the kings of Judah, the wickedness of their wives, your own wickedness, and the wickedness of your wives which they committed in the land of Judah and in the streets of Jerusalem?

Similarity Score: 0.8692954182624817

LAM 3

“You have covered us with anger and pursued us. You have killed. You have not pitied. You have covered yourself with a cloud, so that no prayer can pass through. You have made us an off-scouring and refuse in the middle of the peoples. “All our enemies have opened their mouth wide against us. Terror and the pit have come on us, devastation and destruction.” My eye runs down with streams of water, for the destruction of the daughter of my people. My eye pours down and doesn’t cease, without any intermission, until Yahweh looks down, and sees from heaven. My eye affects my soul, because of all the daughters of my city. They have chased me relentlessly like a bird, those who are my enemies without cause. They have cut off my life in the dungeon, and have cast a stone on me. Waters flowed over my head. I said, “I am cut off.” I called on your name, Yahweh, out of the lowest dungeon. You heard my voice: “Don’t hide your ear from my sighing, and my cry.”

Similarity Score: 0.8587437868118286

In the capitalized version, the passage from Luke 11 has a similarity score of 0.8799077868461609.

Anyways, I thought it was very odd that such a simple capitalization change could affect the embeddings this much. I originally started looking at this because I've heard it said that the quality of one's results with LLMs relates very closely to the quality of one's prompt, and I assume that this can even extend to things such as grammar, spelling, and capitalization.

I wouldn't have thought that such a simple change would make such a significant difference, and especially for only one verse. I don't know what to make of this, but it was an interesting result and I felt I should share it with you. :)

@HanClinto
Contributor Author

Oh weird. I just realized that the score of 0.879 should put the Luke 11 passage as the number 1 verse in both contexts, yet it doesn't show up at all in the results for the capitalized version.

I wonder if it's something about the ANN search in Chroma that is skipping over this verse somehow in this particular use-case. Perhaps we need to configure Chroma to search wider so that it returns better results?

@HanClinto
Contributor Author

HanClinto commented Nov 9, 2023

Yeah, you can easily see this in your own application.

default_query = "What are the angriest verses in the Bible?"

...

search_results = db.similarity_search_with_relevance_scores(
    search_query,
    k=4,
    score_function="cosine",
    filter={"book": "LUK"},
)

This will limit the search results to only be within Luke. You will see the top result be the passage from Luke 11 with:

Similarity Score: 0.8799077868461609

If you then comment out the line with the filter parameter, you will see the top result be the Deu 32 passage with:

Similarity Score: 0.8714663982391357

By my view, that's less than the Luke 11 passage, and so Deu 32 should show up in second place.

So why is Luke 11 hidden here, despite having a higher score? I'm guessing it's a limitation of Chroma where it is skipping over this as part of the "approximate" part of approximate-nearest-neighbor search. I'm looking through the Chroma docs and not seeing options to set the number of search trees (like I'm used to seeing with other tools like Annoy) to help alleviate these misses, but I intend to keep looking.
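One way to confirm that this is an ANN miss rather than a scoring bug is to compare the index's answer against an exhaustive scan. Below is a minimal sketch in plain Python, assuming the embeddings are available as lists of floats. The toy 2-D vectors and the `ann_ids` list are made up for illustration and are not taken from this project's database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, docs, k):
    """Score every document (no index involved) and return the k best ids."""
    scored = sorted(
        ((cosine_similarity(query, vec), doc_id) for doc_id, vec in docs.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]


def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact top-k that the ANN search also returned."""
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)


# Toy embeddings, invented for this sketch.
docs = {
    "LUK 11": [0.9, 0.1],
    "DEU 32": [0.8, 0.3],
    "JER 44": [0.7, 0.4],
    "JON 4": [0.1, 0.9],
}
query = [1.0, 0.0]

exact_ids = exact_top_k(query, docs, k=2)
# Hypothetical ANN output that skipped the true best match.
ann_ids = ["DEU 32", "JER 44"]
missed = [doc_id for doc_id in exact_ids if doc_id not in ann_ids]
```

If `missed` is non-empty on real data, the index is dropping true neighbors, and the recall figure gives a number to track while tuning the search settings.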

@dssjon
Owner

dssjon commented Nov 9, 2023

Good observations, Clint. I've reached out to the instructor engineers for their feedback and insights. I will reply with their feedback (when / if provided). Thanks

@dssjon
Owner

dssjon commented Nov 13, 2023

Answer received regarding capitalization:

As the INSTRUCTOR was trained to have domain name capitalized, use “Religion” instead of “religion”. In general, we may expect better performance if the instruction formats were aligned with those in the training.

@HanClinto
Contributor Author

That makes a lot of sense. I started reading the INSTRUCTOR paper last night and I think my changes to the prompt were misguided.

I think there is room for improvement on the prompt, but not in the way I originally tried.

Modeling this prompt to be closer to the prompts used in the original paper is a good idea, and I want to also review the prompt used to generate the embedding for the user's search query, as I think that may need to be asymmetric / differently-worded from the document prompt.
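As a sketch of what aligning with the paper's format could look like: as I understand it, the INSTRUCTOR paper describes instructions of the form "Represent the (domain) (text type) for (task objective):". The helper below is hypothetical, and the specific domain, text type, and task wording are untested assumptions rather than prompts verified against this project.

```python
def build_instruction(domain, text_type, task_objective):
    """Format an instruction as 'Represent the <Domain> <text type> for <task>:'.

    Only the first letter of the domain is upper-cased, following the advice
    in this thread that the domain name should be capitalized.
    """
    domain = domain[:1].upper() + domain[1:]
    return f"Represent the {domain} {text_type} for {task_objective}:"


# Asymmetric pair: one instruction for stored passages, another for queries.
# The wording is an assumption for illustration only.
document_instruction = build_instruction("religion", "Bible verse text", "retrieval")
query_instruction = build_instruction(
    "religion", "question", "retrieving relevant Bible verse text"
)
```

Keeping the two instructions in one place like this makes it easier to regenerate the database and rerun the same queries whenever the wording changes.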

But again, having tests to quantify the quality of search results is going to be valuable in this.
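Such a test could be as small as checking that known-relevant passages appear in the top k. Here is a rough sketch, where `search` is a hypothetical callable returning passage references in ranked order. `fake_search` stands in for the real database lookup, and the expected references are placeholders rather than a vetted benchmark.

```python
def hit_rate(search, cases, k=4):
    """Share of (query, expected reference) cases found in the top-k results."""
    hits = 0
    for query, expected_ref in cases:
        if expected_ref in search(query)[:k]:
            hits += 1
    return hits / len(cases)


# Stand-in for the real lookup so the sketch runs on its own. The canned
# ranking mirrors the capitalized-prompt results posted earlier in this issue.
def fake_search(query):
    canned = {
        "What are the angriest verses in the Bible?": [
            "DEU 32",
            "JER 44",
            "JON 4",
            "EZK 5",
        ],
    }
    return canned.get(query, [])


cases = [
    ("What are the angriest verses in the Bible?", "LUK 11"),
    ("What are the angriest verses in the Bible?", "DEU 32"),
]
score = hit_rate(fake_search, cases)
```

Swapping `fake_search` for a thin wrapper around the real similarity search would let the same cases be rerun after each prompt or index change.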

@dssjon
Owner

dssjon commented Nov 13, 2023

Scaffolded a basic test here with some TODO ideas: https://github.com/dssjon/biblos/pull/11/files

@HanClinto
Contributor Author

I believe I found out why Luke 11 is not being returned in both cases -- it has to do with Chroma's search parameters, which make it very easy to skip over the top results:

chroma-core/chroma#1205

langchain-ai/langchain#1946

I'm not sure whether adjusting the construction_ef or search_ef parameters requires recreating the index -- I sure hope not -- but I'm going to play around with adjusting those to see if that increases the quality of results.
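For reference, a sketch of how these settings might be passed. As far as I can tell, Chroma versions from around the time of this thread read HNSW settings from collection metadata using `hnsw:`-prefixed keys. The numeric values below are arbitrary starting points, the collection name is hypothetical, and whether an existing index picks up changes without being rebuilt remains the open question.

```python
# HNSW settings expressed as Chroma collection metadata.
# The numeric values are illustrative guesses, not recommendations.
hnsw_metadata = {
    "hnsw:space": "cosine",       # distance metric
    "hnsw:construction_ef": 200,  # candidate list size while building the index
    "hnsw:search_ef": 100,        # candidate list size while querying
}


def create_collection(path, name="bible_verses"):
    """Create a new persistent collection with the settings above.

    Requires the `chromadb` package, imported lazily so the rest of the
    sketch can be run without it. The collection name is a placeholder.
    """
    import chromadb

    client = chromadb.PersistentClient(path=path)
    return client.create_collection(name=name, metadata=hnsw_metadata)
```

A larger `hnsw:search_ef` generally trades query speed for fewer missed neighbors, so comparing results at a couple of values against an exhaustive scan is one way to pick a setting.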

@dssjon
Owner

dssjon commented Nov 14, 2023

Interesting! LUK 11 is the top result on my local laptop when I search for "What are the angriest verses in the Bible?".

@dssjon dssjon closed this as completed Nov 21, 2023