Query based on unique metadata. #29

Open
Mhsh opened this issue Sep 27, 2024 · 2 comments

Comments

Mhsh commented Sep 27, 2024

I can explain where I need to query on unique metadata.

  • I am ingesting large document texts into chromadb as embeddings. Because the embedding model has a token limit, I split each text into chunks of 512 tokens.
  • I generate an embedding for each chunk. All chunks of the same document share one identifier, which I refer to as doc_id.
  • When I query and one chunk of a document matches, I do not want any other chunk from the same document in the results. One matching chunk per document is enough, since the remaining chunks belong to the same document.
  • I am planning to store the doc_id as metadata on every chunk.
  • So I need a distinct query on the doc_id metadata. Currently I filter manually by keeping the doc_ids in a set and checking whether each result's doc_id is already there, which is inefficient.
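
A minimal sketch of the manual filtering described above, done in one pass over a single query result. It assumes the result has the nested-list shape that `collection.query` returns (`ids`, `metadatas`, `distances`, one inner list per query) and that hits arrive ordered nearest first. The helper name `dedupe_by_doc_id` and the output dict layout are hypothetical, chosen only for illustration.

```python
from typing import Any


def dedupe_by_doc_id(
    result: dict[str, Any], k: int, key: str = "doc_id"
) -> list[dict[str, Any]]:
    """Keep only the first (closest) chunk per document, up to k documents.

    Assumes `result` holds the hits for exactly one query, sorted by
    ascending distance, and that every chunk's metadata contains `key`.
    """
    ids = result["ids"][0]
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]

    seen: set[str] = set()
    hits: list[dict[str, Any]] = []
    for chunk_id, meta, dist in zip(ids, metadatas, distances):
        doc_id = meta[key]
        if doc_id in seen:
            # A closer chunk of this document was already kept.
            continue
        seen.add(doc_id)
        hits.append({"id": chunk_id, "doc_id": doc_id, "distance": dist})
        if len(hits) == k:
            break
    return hits
```

To end up with k distinct documents this way, the query has to over-fetch (for example, request several times k results), and the right factor depends on how many chunks a typical document has. If one document dominates the neighbourhood, fewer than k documents may still come back.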
Contributor

tazarov commented Oct 10, 2024

Hey @Mhsh, you are not alone in wanting queries to work this way, i.e., avoiding multiple chunks from the same document. The reasoning: if a document contains relevant info, I don't want more of its paragraphs that are less relevant (at a greater distance from the query).

I did some work on this; let me try to dig it out. This will require a PR on core Chroma, as filtering alone won't help unless you can afford multiple queries.
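
For anyone who can afford multiple queries, here is a rough sketch of what that could look like: query, keep the closest chunk per new document, then query again with the already-seen doc_ids excluded through a `$nin` metadata filter. The function name `query_distinct_docs`, the `max_rounds` cap, and the output layout are assumptions for illustration, not part of Chroma. `collection` is assumed to be an object whose `query` method accepts `query_embeddings`, `n_results`, `include`, and `where` and returns the usual nested-list result.

```python
from typing import Any


def query_distinct_docs(
    collection: Any,
    query_embedding: list[float],
    k: int,
    max_rounds: int = 5,
    key: str = "doc_id",
) -> list[dict[str, Any]]:
    """Collect the closest chunk from up to k distinct documents.

    Each round excludes documents that already produced a hit, so the
    cost is one extra query per round (at most max_rounds queries).
    """
    seen: list[str] = []
    hits: list[dict[str, Any]] = []

    for _ in range(max_rounds):
        kwargs: dict[str, Any] = {
            "query_embeddings": [query_embedding],
            "n_results": k,
            "include": ["metadatas", "distances"],
        }
        if seen:
            # Only add the filter once there is something to exclude;
            # an empty $nin list is avoided on purpose.
            kwargs["where"] = {key: {"$nin": list(seen)}}

        result = collection.query(**kwargs)
        ids = result["ids"][0]
        if not ids:
            break  # nothing left outside the excluded documents

        for chunk_id, meta, dist in zip(
            ids, result["metadatas"][0], result["distances"][0]
        ):
            doc_id = meta[key]
            if doc_id in seen:
                continue  # same document appeared twice in this round
            seen.append(doc_id)
            hits.append({"id": chunk_id, "doc_id": doc_id, "distance": dist})
            if len(hits) == k:
                return hits

    return hits
```

The trade-off is latency: in the worst case each round adds only one new document, which is why a distinct-by-metadata option inside the query itself would be the cleaner fix.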

Author

Mhsh commented Oct 16, 2024

Thanks @tazarov
