[Bug]: Doesn't return results with shortest distances unless n_results is sufficiently large #1205

zephyrprime · 2023-10-05T19:25:04Z

What happened?

Chromadb will fail to return the embeddings with the closest results unless I set n_results to a sufficiently large number.

I am using version 0.4.13 but this problem has happened with every version I've used. I find that basic querying of the db is buggy and does not return the highest scoring results if you don't have n_results set to a sufficiently big number. For example, if I perform this search with n_results=27, it will fail to find the correct highest scoring embeddings. I have ~44000 pieces of text in my DB.

collection.query(
    query_texts="A description of 'visitor q'",
    n_results=27
)

results:
{'ids': [['53593',
'397',
'54214',
...]],
'distances': [[0.832878589630127,
0.857905924320221,
0.8623576164245605,
...]],
'metadatas': [[{'release date': '2005-11-18', 'title': 'Unveiled'},
{'release date': '1935-03-08', 'title': 'Naughty Marietta'},
{'release date': '1997-05-09', 'title': 'Welcome To Sarajevo'},
...

However, if I search with n_results=28 or greater, it will return the correct results:

collection.query(
    query_texts="A description of 'visitor q'",
    n_results=28
)

Results:
{'ids': [['11917',
'11918',
'11919',
...]],
'distances': [[0.6459631323814392,
0.6608277559280396,
0.665003776550293,
...]],
'metadatas': [[{'release date': '2001-03-17', 'title': 'Visitor Q'},
{'release date': '2001-03-17', 'title': 'Visitor Q'},
{'release date': '2001-03-17', 'title': 'Visitor Q'},
{'title': 'The Visitors'},
...

I am using "BAAI/bge-large-en-v1.5" and sentence transformers. However, this happens with the Instructor models too.
I am using the default distance function and not setting a custom value for that.

This doesn't happen with all queries. Only some queries have this problem. This is a pretty big problem for me since it's giving me wrong results with some queries. Seems like a bug to me.

Versions

0.4.13, python 3.11.3, windows 11. I also had this happen on a debian 11 server I used.

Relevant log output

No response

HammadB · 2023-10-05T23:16:44Z

Have you tried altering the HNSW parameters of your index? They control the quality of the search. Setting n_results to a large value implicitly increases the search_ef parameter, which makes your search more exhaustive. You can set hnsw:M, hnsw:search_ef, and hnsw:construction_ef in the collection metadata when you create the collection.
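
As a stopgap based on the observation above (a larger n_results widens the search), one can over-fetch and trim on the client side. This is only a minimal sketch: it assumes a Chroma collection whose query() returns the nested lists shown earlier in this issue and that the collection holds at least min_fetch items. The names query_top_k and min_fetch are hypothetical and not part of Chroma.

```python
from typing import Any, Dict, List


def query_top_k(collection: Any, text: str, k: int, min_fetch: int = 100) -> Dict[str, List[Any]]:
    """Ask for at least `min_fetch` neighbours, then keep only the best `k`.

    Hypothetical helper: `collection` is assumed to expose
    query(query_texts=..., n_results=...) like a Chroma collection.
    """
    # Requesting more results makes the index examine more candidates.
    raw = collection.query(query_texts=[text], n_results=max(k, min_fetch))

    trimmed: Dict[str, List[Any]] = {}
    for key in ("ids", "distances", "metadatas", "documents"):
        rows = raw.get(key)
        if rows:
            # Each field holds one list per query text; ours is at index 0.
            trimmed[key] = rows[0][:k]
    return trimmed
```

Over-fetching costs extra query time and does not guarantee exact results; raising hnsw:search_ef at collection creation addresses the same problem more directly.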

zephyrprime · 2023-10-06T03:11:43Z

I am very surprised that the search isn't always exhaustive already. I have now read that there is a performance problem that inhibits exhaustive search. How can I set the hnsw parameters? What are the default parameters being used? I cannot find any documentation in chromadb except for the hnsw:space parameter which doesn't seem to be the issue.

HammadB · 2023-10-06T04:22:30Z

Why would you expect it to be exhaustive? Chroma uses an approximate nearest neighbors index which will prune the candidates it searches.
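
To see how much an approximate index misses for a given query, its result can be compared against a brute-force scan. The following is a rough sketch in plain Python, not Chroma code: it uses squared Euclidean distance as a stand-in metric, and exact_top_k and recall_at_k are hypothetical helper names.

```python
from typing import Dict, List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Exhaustive search: score every stored vector, return the k closest ids."""
    ranked = sorted(vectors, key=lambda vid: squared_l2(query, vectors[vid]))
    return ranked[:k]


def recall_at_k(approx_ids: Sequence[str], exact_ids: Sequence[str]) -> float:
    """Fraction of the true nearest neighbours that the approximate search found."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    stored = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
    truth = exact_top_k([0.1, 0.0], stored, k=2)
    # Pretend an approximate index returned "a" and "c" instead.
    print(truth, recall_at_k(["a", "c"], truth))
```

A recall well below 1.0 on real queries suggests the search parameters are too low for the data set.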

chroma/chromadb/test/test_api.py

Line 1060 in fc4c8b5

collection = api.create_collection(

This is an example of setting the params.

Sorry, the documentation here is sparse. We want to make the custom index parameter definition better by strongly typing it, but we should add docs in the interim!
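
Since the linked test line is truncated above, here is a hedged sketch of what passing the parameters at creation time might look like in Python. It relies only on create_collection(name=..., metadata=...) and the hnsw:* keys named in this thread; the helper names and the example values (search_ef=100, construction_ef=200) are illustrative assumptions, not recommended settings.

```python
from typing import Any, Dict


def hnsw_metadata(search_ef: int = 100, construction_ef: int = 200,
                  m: int = 16, space: str = "l2") -> Dict[str, Any]:
    """Build a collection metadata dict using the hnsw:* keys from this thread.

    The default values here are illustrative only.
    """
    return {
        "hnsw:space": space,
        "hnsw:search_ef": search_ef,
        "hnsw:construction_ef": construction_ef,
        "hnsw:M": m,
    }


def create_tuned_collection(client: Any, name: str, **overrides: Any) -> Any:
    """Create a collection with explicit HNSW settings (hypothetical helper)."""
    return client.create_collection(name=name, metadata=hnsw_metadata(**overrides))


# Usage, assuming chromadb is installed:
#   import chromadb
#   client = chromadb.Client()
#   collection = create_tuned_collection(client, "movies", search_ef=200)
```

The client is passed in rather than created inside the helper so the same code works with whichever Chroma client type is in use.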

pmeier · 2023-11-03T09:55:08Z

We were bitten by this as well. Documentation on the hnsw parameters would be much appreciated.

pmeier · 2023-11-03T10:31:09Z

In the meantime, here is a list of all available parameters:

chroma/chromadb/segment/impl/vector/hnsw_params.py

Lines 10 to 23 in cdcafc8

    
param_validators: Dict[str, Validator] = {
    "hnsw:space": lambda p: bool(re.match(r"^(l2|cosine|ip)$", str(p))),
    "hnsw:construction_ef": lambda p: isinstance(p, int),
    "hnsw:search_ef": lambda p: isinstance(p, int),
    "hnsw:M": lambda p: isinstance(p, int),
    "hnsw:num_threads": lambda p: isinstance(p, int),
    "hnsw:resize_factor": lambda p: isinstance(p, (int, float)),
}

# Extra params used for persistent hnsw
persistent_param_validators: Dict[str, Validator] = {
    "hnsw:batch_size": lambda p: isinstance(p, int) and p > 2,
    "hnsw:sync_threshold": lambda p: isinstance(p, int) and p > 2,
}

and the corresponding defaults

chroma/chromadb/segment/impl/vector/hnsw_params.py

Lines 55 to 63 in cdcafc8

    
metadata = metadata or {}
self.space = str(metadata.get("hnsw:space", "l2"))
self.construction_ef = int(metadata.get("hnsw:construction_ef", 100))
self.search_ef = int(metadata.get("hnsw:search_ef", 10))
self.M = int(metadata.get("hnsw:M", 16))
self.num_threads = int(
    metadata.get("hnsw:num_threads", multiprocessing.cpu_count())
)
self.resize_factor = float(metadata.get("hnsw:resize_factor", 1.2))

chroma/chromadb/segment/impl/vector/hnsw_params.py

Lines 79 to 80 in cdcafc8

    
self.batch_size = int(metadata.get("hnsw:batch_size", 100))
self.sync_threshold = int(metadata.get("hnsw:sync_threshold", 1000))
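
To check which values are actually in effect for a given collection metadata dict, the defaults quoted above can be mirrored in a small standalone helper. This is a re-implementation for illustration only, not Chroma's code: the default values are copied from the snippets at commit cdcafc8 and may differ in other versions, and effective_hnsw_params is a hypothetical name.

```python
import multiprocessing
from typing import Any, Dict, Optional


def effective_hnsw_params(metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user metadata with the defaults quoted in this thread."""
    # Defaults as quoted from hnsw_params.py at commit cdcafc8.
    defaults: Dict[str, Any] = {
        "hnsw:space": "l2",
        "hnsw:construction_ef": 100,
        "hnsw:search_ef": 10,
        "hnsw:M": 16,
        "hnsw:num_threads": multiprocessing.cpu_count(),
        "hnsw:resize_factor": 1.2,
        "hnsw:batch_size": 100,
        "hnsw:sync_threshold": 1000,
    }
    metadata = metadata or {}

    # Flag misspelled hnsw:* keys instead of silently ignoring them.
    unknown = [k for k in metadata if k.startswith("hnsw:") and k not in defaults]
    if unknown:
        raise ValueError(f"Unknown HNSW parameters: {unknown}")

    # Non-HNSW metadata keys are left out of the result.
    return {key: metadata.get(key, default) for key, default in defaults.items()}


if __name__ == "__main__":
    print(effective_hnsw_params({"hnsw:search_ef": 200}))
```

With no overrides the result shows a search_ef of 10, which is consistent with the low recall reported at small n_results in this issue.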

ha-sante · 2023-11-03T14:25:49Z

Damn, how we all come to meet here 😂. Same issue as well. @pmeier, thank you for the code guides.

falk0n · 2023-11-08T06:41:36Z

How do the hnsw parameters relate to the parameters described in https://arxiv.org/abs/1603.09320 ?

Vermeille · 2024-03-01T09:59:48Z

> Why would you expect it to be exhaustive?

Yes, right? Why would you expect a search function to search correctly, indeed? Seriously, this answer would be extremely funny if it weren't so infuriating.

pilotofbalance · 2024-03-09T12:43:47Z

Hey guys, if someone has found a good configuration for "semantic search", please post your metadata index configuration here.
I hope Chroma will have better documentation in the future.

wilsonweb · 2024-04-18T21:14:14Z

And for those of us that speak JavaScript, I set these parameters to values similar to hnswlib's:

collection = await client.createCollection({
    name: "items",
    embeddingFunction: embedder,
    metadata: {
        "hnsw:space": "l2",
        "hnsw:M": 16,
        "hnsw:construction_ef": 200,
    },
});

zephyrprime added the "bug" (Something isn't working) label Oct 5, 2023

pmeier mentioned this issue Nov 3, 2023: Up the number of queried documents by Chroma, Quansight/ragna#160 (Merged)

pmeier mentioned this issue Nov 3, 2023: Configure the HNSW parameters in Chroma, Quansight/ragna#161 (Open)

HanClinto mentioned this issue Nov 14, 2023: Capitalization in query instruction prompt?, dssjon/biblos#4 (Closed)

guillaumecherel mentioned this issue Jul 12, 2024: Make HNSW algorithm parameters configurable with ChromaDocumentStore, deepset-ai/haystack-core-integrations#891 (Closed)
