Query based on unique metadata. #29

Mhsh · 2024-09-27T11:04:25Z

Here is where I need to query on unique metadata.

I am ingesting large document texts as embeddings into chromadb. Because of the embedding model's token limit, I split each text into chunks of 512 tokens.
I then generate an embedding for each chunk, so several chunks belong to the same document, which is identified by doc_id.
When I query and any chunk of a document matches, I do not want any other chunk from that same document in the results. Once one chunk of a document has matched, further chunks from it add nothing for me.
I am planning to store the doc_id as metadata on all chunks.
So I need a distinct query on the doc_id metadata. Currently I filter manually by keeping the doc_ids I have seen in a set and checking each result against it, which is inefficient.
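
For readers following along, the manual filtering described above might look roughly like the sketch below. It assumes a Chroma-style query result (a dict whose `ids`, `distances`, and `metadatas` values are lists with one inner list per query) and assumes the hits are already ordered by increasing distance. The helper name `dedupe_by_doc_id` is hypothetical, not part of Chroma.

```python
def dedupe_by_doc_id(result, k, key="doc_id"):
    """Keep at most k hits, one per document, from a Chroma-style query result.

    Assumes a single query was issued (index 0) and that hits are ordered by
    increasing distance, so the first chunk seen for a document is its closest.
    """
    ids = result["ids"][0]
    distances = result["distances"][0]
    metadatas = result["metadatas"][0]

    seen = set()   # doc_ids that already contributed a chunk
    kept = []      # (chunk_id, doc_id, distance) tuples
    for chunk_id, dist, meta in zip(ids, distances, metadatas):
        doc_id = meta[key]
        if doc_id in seen:
            continue  # a closer chunk of this document was already kept
        seen.add(doc_id)
        kept.append((chunk_id, doc_id, dist))
        if len(kept) == k:
            break
    return kept


# Hand-written sample in the assumed result shape (not real query output).
sample = {
    "ids": [["a-0", "a-1", "b-0", "c-2", "b-3"]],
    "distances": [[0.10, 0.12, 0.20, 0.31, 0.40]],
    "metadatas": [[
        {"doc_id": "a"}, {"doc_id": "a"}, {"doc_id": "b"},
        {"doc_id": "c"}, {"doc_id": "b"},
    ]],
}
top_two = dedupe_by_doc_id(sample, k=2)
```

The set lookup itself is cheap. The cost is that the query has to request more than k results, because an unknown number of them will be dropped as duplicates.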

tazarov · 2024-10-10T09:40:15Z

Hey @Mhsh, you are not alone in wanting queries to work this way, i.e. avoiding multiple chunks from the same document. The reasoning: if a document already contributes its most relevant chunk, further paragraphs from it are less relevant (at a greater distance from the query) and only crowd the results.

I had some work done on this; let me try to dig it out. This will require a PR on core Chroma, as metadata filtering by itself won't help unless you can afford multiple queries.
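
As a rough illustration of the multiple-queries route mentioned above, the sketch below re-queries while excluding documents that have already matched. It assumes a collection object whose `query` method accepts `query_embeddings`, `n_results`, and a `where` filter with the `$nin` operator, and returns the nested-list result dict. The function name `query_distinct_docs` and the `batch` and `max_rounds` parameters are hypothetical choices made for this example.

```python
def query_distinct_docs(collection, query_embedding, k,
                        key="doc_id", batch=10, max_rounds=5):
    """Collect up to k hits from distinct documents by repeated querying.

    Each round excludes the doc_ids found so far via a `$nin` metadata filter,
    so later rounds can only return chunks from documents not yet seen.
    """
    seen = []   # doc_ids already represented, in order of discovery
    hits = []   # (chunk_id, doc_id, distance) tuples
    for _ in range(max_rounds):
        kwargs = {"query_embeddings": [query_embedding], "n_results": batch}
        if seen:
            # Only add the filter once there is something to exclude.
            kwargs["where"] = {key: {"$nin": list(seen)}}
        res = collection.query(**kwargs)

        ids = res["ids"][0]
        if not ids:
            break  # nothing left outside the excluded documents
        for chunk_id, dist, meta in zip(ids, res["distances"][0],
                                        res["metadatas"][0]):
            doc_id = meta[key]
            if doc_id in seen:
                continue  # duplicate within the same round
            seen.append(doc_id)
            hits.append((chunk_id, doc_id, dist))
            if len(hits) == k:
                return hits
    return hits
```

Every round is a separate round trip and the exclusion list grows with each one, which is why this only works if multiple queries are affordable.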

Mhsh · 2024-10-16T06:28:38Z

Thanks @tazarov
