Is it possible to retrieve a list of indexed documents (paths and/or title/metadata), and the vector store itself, from a GPTVectorStoreIndex? #3255
I see that the (0.6.0) index persisted on disk contains docstore.json, index_store.json, and vector_store.json, but they don't seem to contain file paths or title metadata from the original documents, so maybe that's not captured and stored?
I was able to get the vectors from my index like this:
but it seems a bit non-portable: is there a more direct method?
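For what it's worth, here is a sketch of pulling the vectors straight out of the persisted vector_store.json with plain `json`, instead of reaching into object internals. The file layout used here (an `embedding_dict` keyed by node id, plus a `text_id_to_doc_id` map) is an assumption based on what the 0.6 `SimpleVectorStore` appears to write, and the data is made up, so treat this as illustrative only:

```python
import json
import os
import tempfile

# Stand-in for a persisted index directory; a real one would come from
# index.storage_context.persist(). The file layout below is an assumption.
persist_dir = tempfile.mkdtemp()
fake_store = {
    "embedding_dict": {
        "node-1": [0.1, 0.2, 0.3],
        "node-2": [0.4, 0.5, 0.6],
    },
    "text_id_to_doc_id": {"node-1": "doc-a", "node-2": "doc-a"},
}
with open(os.path.join(persist_dir, "vector_store.json"), "w") as f:
    json.dump(fake_store, f)

# Reading the vectors back requires no llama_index internals at all.
with open(os.path.join(persist_dir, "vector_store.json")) as f:
    store = json.load(f)

vectors = store["embedding_dict"]
print(sorted(vectors))    # node ids: ['node-1', 'node-2']
print(vectors["node-1"])  # one embedding: [0.1, 0.2, 0.3]
```

Going through the file on disk at least decouples the script from the in-memory attribute names, which are private and may change between releases.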
The closest I could get to a list of the indexed documents is:
which is a list of ids only. I don't know if I can use those ids to look up paths or document metadata, but that would be great (paths, at least). Any ideas?
The vectors themselves are not exactly easily exposed. As you saw there, you can get the doc_ids. Then, using those ids, I think you can fetch the nodes from the docstore. Can I ask why you need this information though? Just for debugging?

Another neat feature is that you can do something like this, which will only retrieve the nodes that would have been sent to the LLM (but doesn't actually call the LLM). The source nodes will also include the similarity score as computed against your query string.

```python
query_engine = index.as_query_engine(response_mode='no_text')
response = query_engine.query("my query")
print(response.source_nodes)
```
Thanks! I'm trying to cluster (all) the vectors, then generate a description (label) for each cluster by sending (just) the vectors in each cluster to GPT to summarize, then associate the vectors with the original documents and classify each document by applying a sort of weighted sum of its cluster-labeled snippets. Not sure how useful that will be, but I want to try! I've got the vectors now (although I'm a bit worried that the nested structure I'm getting them from might change without warning in the future!), and I'm able to cluster them, but I don't know how to associate the vectors (via their nodes) back to the original documents yet...
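For illustration, the clustering step could be sketched like this, using stand-in 2-d vectors and a minimal hand-rolled k-means. Every id and vector here is made up (real embeddings would be high-dimensional and keyed by the node ids pulled from the index):

```python
import numpy as np

# Toy stand-in for the embeddings extracted from the index: node id -> vector.
embeddings = {
    "node-1": [0.0, 0.1],
    "node-2": [0.1, 0.0],
    "node-3": [5.0, 5.1],
    "node-4": [5.1, 5.0],
}

ids = list(embeddings)
X = np.array([embeddings[i] for i in ids])

def kmeans(X, k, iters=10):
    """Minimal k-means with deterministic init (k evenly spaced points)."""
    idx = np.round(np.linspace(0, len(X) - 1, k)).astype(int)
    centroids = X[idx].astype(float).copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X, k=2)
clusters = {}
for node_id, label in zip(ids, labels):
    clusters.setdefault(int(label), []).append(node_id)
print(clusters)  # {0: ['node-1', 'node-2'], 1: ['node-3', 'node-4']}
```

In practice something like scikit-learn's `KMeans` would be the sensible choice; the point is only that clustering operates on the vectors while the node ids ride along, so each cluster ends up as a list of node ids that can later be mapped back to documents.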
Should be able to associate them using the doc ids 🤔 `vector_store._data.text_id_to_doc_id` will help you map text ids back to doc ids.
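A toy sketch of that mapping step, with made-up ids and vectors shaped like the two structures named in this thread (`embedding_dict` and `text_id_to_doc_id`):

```python
from collections import defaultdict

# Hypothetical contents of vector_store._data: embeddings keyed by
# text (node) id, plus a text-id -> doc-id map. All values are made up.
embedding_dict = {
    "node-1": [0.1, 0.2],
    "node-2": [0.3, 0.4],
    "node-3": [0.5, 0.6],
}
text_id_to_doc_id = {
    "node-1": "doc-a",
    "node-2": "doc-a",
    "node-3": "doc-b",
}

# Group each document's snippet vectors together, so cluster labels
# assigned to snippets can be rolled back up to the source document.
doc_to_vectors = defaultdict(list)
for text_id, vector in embedding_dict.items():
    doc_to_vectors[text_id_to_doc_id[text_id]].append(vector)

print(dict(doc_to_vectors))  # doc-a has two snippet vectors, doc-b has one
```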
Thx! I didn't see any document titles or file names/paths in the `embedding_dict` though...
Ah yea, then there's that issue. Tbh for this task, you might be better off generating the embeddings yourself outside of llama index 😅
OK, thanks: it's not a must-have at this point.
Related to this, is it possible to get a list of nodes from an existing index? I am trying to load saved vector index state into another vector index.
Hi: I mistakenly asked #3249, but have now realised the correct question is: is there a way to query my GPTVectorStoreIndex in order to retrieve the list of documents that were originally indexed? And also their text snippets? And also to fetch the vectors that their snippets were mapped to?