Is it possible to retrieve a list of indexed documents (paths and/or title/metadata), and the vector store itself, from a GPTVectorStoreIndex? #3255

Closed
maspotts opened this issue May 11, 2023 · 10 comments

@maspotts

Hi: I mistakenly asked #3249, but I've now realised the correct question is: is there a way to query my GPTVectorStoreIndex to retrieve the list of documents that were originally indexed? Can I also retrieve their text snippets, and the vectors those snippets were mapped to?

@maspotts
Author

I see that the (0.6.0) index persisted on disk contains: docstore.json, index_store.json and vector_store.json, but they don't seem to contain file paths or title metadata from the original documents, so maybe that's not captured and stored?

@maspotts
Author

I was able to get the vectors from my index like this:

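        import numpy

        # Relies on the serialized layout of the vector store (observed in 0.6.0).
        vector_store = index.storage_context.vector_store
        vector_store_dict = vector_store.to_dict()
        embedding_dict = vector_store_dict['embedding_dict']
        # One row per embedded snippet, in dict order.
        vectors = numpy.array(list(embedding_dict.values()))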

but it seems a bit non-portable: is there a more direct method?
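
A slightly more defensive variant of the snippet above, as a sketch: it keeps the ids alongside the vectors so each row can be traced back later. It assumes the dict returned by `to_dict()` has an `embedding_dict` key mapping ids to lists of floats, as observed in 0.6.0. The helper name `split_embeddings` and the sample data are made up for illustration.

```python
from typing import Dict, List, Tuple


def split_embeddings(
    vector_store_dict: Dict[str, dict],
) -> Tuple[List[str], List[List[float]]]:
    """Return (ids, vectors) in matching order.

    Assumes the layout observed in 0.6.0: an 'embedding_dict' key
    mapping each id to a list of floats. This is not a stable API.
    """
    embedding_dict = vector_store_dict.get("embedding_dict")
    if embedding_dict is None:
        raise KeyError("no 'embedding_dict' key; the layout may have changed")
    ids = sorted(embedding_dict)  # fixed order so rows are reproducible
    vectors = [list(embedding_dict[i]) for i in ids]
    return ids, vectors


# Stand-in for vector_store.to_dict(); real ids and vectors will differ.
sample = {"embedding_dict": {"node-b": [0.0, 1.0], "node-a": [1.0, 0.0]}}
ids, vectors = split_embeddings(sample)
```

Passing `vectors` to `numpy.array` then gives a matrix in which `ids[k]` names row `k`.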

@maspotts
Author

maspotts commented May 12, 2023

The closest I could get to a list of the indexed documents is:

        index_store = index.storage_context.index_store
        index_store_dict = index_store.to_dict()
        doc_ids = list(index_store_dict['index_store/data'].values())[0]['__data__']['doc_id_dict']

which is a list of ids only. I don't know if I can use those ids to look up paths or document metadata, but that would be great (the path at least)! Any ideas?
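
The same lookup written so that it does not depend on there being exactly one index in the store, as a sketch. The nested key names (`index_store/data`, `__data__`, `doc_id_dict`) are taken from the snippet above and are assumed to be an internal layout of 0.6.0. The helper `collect_doc_id_dicts` is hypothetical, and the shape of the sample `doc_id_dict` value is a guess.

```python
from typing import Any, Dict


def collect_doc_id_dicts(index_store_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Map each index id to its 'doc_id_dict' entry, skipping entries without one."""
    found = {}
    for index_id, entry in index_store_dict.get("index_store/data", {}).items():
        data = entry.get("__data__", {})
        if isinstance(data, dict) and "doc_id_dict" in data:
            found[index_id] = data["doc_id_dict"]
    return found


# Stand-in for index_store.to_dict(); the real contents will differ,
# and the shape of the 'doc_id_dict' value here is assumed.
sample = {
    "index_store/data": {
        "index-1": {"__data__": {"doc_id_dict": {"doc-1": ["node-a", "node-b"]}}},
        "index-2": {"__data__": {}},
    }
}
doc_id_dicts = collect_doc_id_dicts(sample)
```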

@logan-markewich
Collaborator

The vectors themselves are not easily exposed. As you saw, you can get the doc_ids. Using those ids, I think you can fetch the nodes from the docstore.

Can I ask why you need this information though? Just for debugging?

Another neat feature is you can do something like this, which will only retrieve the nodes that would have been sent to the LLM (but doesn't actually call the LLM). The source nodes will also include the similarity score as computed against your query string.

query_engine = index.as_query_engine(response_mode='no_text')
response = query_engine.query("my query")
print(response.source_nodes)

@maspotts
Author

Thanks! I'm trying to cluster all the vectors, then generate a description (label) for each cluster by sending just the vectors in each cluster to GPT to summarize, then associate the vectors with the original documents and classify each document by applying a sort of weighted sum of its cluster-labeled snippets. Not sure how useful that will be, but I want to try! I've got the vectors now (although I'm a bit worried that the nested structure I'm getting them from might change without warning in the future), and I'm able to cluster them, but I don't yet know how to associate the vectors (via their nodes) back to the original documents.
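
The last step of that plan, classifying each document from the cluster labels of its snippets, might look roughly like this. It is only a sketch: the function name, the uniform default weights, and the sample ids and labels are all assumptions, and the clustering and labeling steps are taken as already done.

```python
from collections import defaultdict
from typing import Dict, Optional


def classify_documents(
    node_to_doc: Dict[str, str],
    node_to_label: Dict[str, str],
    node_weight: Optional[Dict[str, float]] = None,
) -> Dict[str, Dict[str, float]]:
    """For each document, return the normalized weight of each cluster label."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for node_id, doc_id in node_to_doc.items():
        label = node_to_label.get(node_id)
        if label is None:
            continue  # snippet was not clustered
        weight = 1.0 if node_weight is None else node_weight.get(node_id, 1.0)
        totals[doc_id][label] += weight
    scores = {}
    for doc_id, by_label in totals.items():
        norm = sum(by_label.values())
        if norm > 0:  # skip documents whose snippets all have zero weight
            scores[doc_id] = {label: w / norm for label, w in by_label.items()}
    return scores


# Made-up ids and labels, purely for illustration.
node_to_doc = {"n1": "doc-1", "n2": "doc-1", "n3": "doc-1", "n4": "doc-2"}
node_to_label = {"n1": "billing", "n2": "billing", "n3": "shipping", "n4": "shipping"}
scores = classify_documents(node_to_doc, node_to_label)
best = {doc: max(s, key=s.get) for doc, s in scores.items()}
```

Passing a `node_weight` dict (for example, snippet length) would turn the plain count into a weighted sum.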

@logan-markewich
Collaborator

Should be able to associate them using the doc ids 🤔

vector_store._data.text_id_to_doc_id -> maps each text id to its doc id
vector_store._data.embedding_dict -> maps each text id to its embedding
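
Joining those two dicts could look like the sketch below. It assumes `text_id_to_doc_id` maps each text (node) id to a doc id and `embedding_dict` maps each text id to its vector. Both are private attributes, so the layout may change. The helper name and the sample data are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_vectors_by_doc(
    text_id_to_doc_id: Dict[str, str],
    embedding_dict: Dict[str, List[float]],
) -> Dict[str, List[Tuple[str, List[float]]]]:
    """Return {doc_id: [(text_id, vector), ...]} for every embedded text id."""
    grouped = defaultdict(list)
    for text_id, vector in embedding_dict.items():
        doc_id = text_id_to_doc_id.get(text_id)
        if doc_id is None:
            continue  # no known document for this text id
        grouped[doc_id].append((text_id, vector))
    return dict(grouped)


# Stand-ins for vector_store._data.text_id_to_doc_id and
# vector_store._data.embedding_dict; real values will differ.
text_id_to_doc_id = {"n1": "doc-1", "n2": "doc-1", "n3": "doc-2"}
embedding_dict = {"n1": [0.1, 0.2], "n2": [0.3, 0.4], "n3": [0.5, 0.6]}
by_doc = group_vectors_by_doc(text_id_to_doc_id, embedding_dict)
```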

@maspotts
Author

Thx! I didn't see any document titles or file names/paths in the embedding_dict though...

@logan-markewich
Collaborator

Ah yea, then there's that issue.

Tbh for this task, you might be better off generating the embeddings yourself outside of llama index 😅
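
If you go that route, the main thing is to store the path next to each vector from the start. A rough sketch, where `embed` stands for whatever embedding call you end up using (the toy version here only measures the text so that the example runs), and `Snippet`, `build_records`, and the sample paths are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Snippet:
    path: str            # source file the text came from
    text: str            # the chunk that was embedded
    vector: List[float]  # its embedding


def toy_embed(text: str) -> List[float]:
    """Placeholder only: NOT a real embedding model."""
    return [float(len(text)), float(text.count(" "))]


def build_records(
    docs: Dict[str, List[str]],
    embed: Callable[[str], List[float]],
) -> List[Snippet]:
    """Embed every chunk and keep its source path alongside the vector."""
    return [
        Snippet(path=path, text=chunk, vector=embed(chunk))
        for path, chunks in docs.items()
        for chunk in chunks
    ]


# Made-up paths and chunks, purely for illustration.
docs = {"notes/a.txt": ["hello world", "second chunk"], "notes/b.txt": ["other"]}
records = build_records(docs, toy_embed)
paths = [r.path for r in records]
```

Each record then carries everything needed to go from a clustered vector back to its file.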

@maspotts
Author

OK, thanks: it's not a must-have at this point.

@Disiok closed this as completed Jun 11, 2023

@sarahwooders

Related to this, is it possible to get a list of nodes from an existing index? I am trying to load saved vector index state into another vector index.


4 participants