Is it possible to retrieve a list of indexed documents (paths and/or title/metadata), and the vector store itself, from a GPTVectorStoreIndex? #3255

maspotts · 2023-05-11T23:38:40Z

Hi: I mistakenly asked #3249 but have now realised the correct question is: is there a way to query my GPTVectorStoreIndex in order to retrieve the list of documents that were originally indexed? And also their text snippets? And also to fetch the vectors their snippets mapped to?

maspotts · 2023-05-11T23:48:29Z

I see that the (0.6.0) index persisted on disk contains: docstore.json, index_store.json and vector_store.json, but they don't seem to contain file paths or title metadata from the original documents, so maybe that's not captured and stored?

maspotts · 2023-05-12T00:28:35Z

I was able to get the vectors from my index like this:

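        import numpy

        vector_store = index.storage_context.vector_store
        vector_store_dict = vector_store.to_dict()
        # embedding_dict maps each text (node) id to its embedding vector
        embedding_dict = vector_store_dict['embedding_dict']
        # one row per embedded snippet
        vectors = numpy.array(list(embedding_dict.values()))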

but it seems a bit non-portable: is there a more direct method?
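One way to make this a little less fragile is to keep the ids and the vectors aligned when flattening, and to fail loudly if the layout changes. A minimal sketch, assuming the dict returned by `to_dict()` has an `embedding_dict` key mapping text ids to lists of floats, as in the snippet above. The helper `extract_vectors` and the sample data are hypothetical and not part of llama_index:

```python
from typing import Dict, List, Tuple


def extract_vectors(store_dict: Dict) -> Tuple[List[str], List[List[float]]]:
    """Return (text_ids, vectors) in matching order.

    Assumes store_dict looks like {"embedding_dict": {text_id: [floats]}}.
    """
    if "embedding_dict" not in store_dict:
        # Guard against the nested structure changing in a later version
        raise KeyError("expected an 'embedding_dict' key; the layout may have changed")
    embedding_dict = store_dict["embedding_dict"]
    text_ids = sorted(embedding_dict)  # fixed order so rows are reproducible
    vectors = [list(embedding_dict[text_id]) for text_id in text_ids]
    return text_ids, vectors


# Hypothetical stand-in for vector_store.to_dict()
sample = {"embedding_dict": {"node-b": [0.0, 1.0], "node-a": [1.0, 0.0]}}
text_ids, vectors = extract_vectors(sample)
```

Passing `vectors` to `numpy.array` then gives a matrix whose row `i` belongs to `text_ids[i]`, which is what makes it possible to trace a clustered vector back to its snippet.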

maspotts · 2023-05-12T00:43:06Z

The closest I could get to a list of the indexed documents is:

        index_store = index.storage_context.index_store
        index_store_dict = index_store.to_dict()
        doc_ids = list(index_store_dict['index_store/data'].values())[0]['__data__']['doc_id_dict']

which is a list of ids only. I don't know if I can use those ids to look up paths or document metadata, but that would be great (the path at least). Any ideas?

logan-markewich · 2023-05-12T01:01:42Z

The vectors themselves are not exactly easily exposed. As you saw there, you can get the doc_ids. Then, using those ids, I think you can fetch the nodes from the docstore.

Can I ask why you need this information though? Just for debugging?

Another neat feature is you can do something like this, which will only retrieve the nodes that would have been sent to the LLM (but doesn't actually call the LLM). The source nodes will also include the similarity score as computed against your query string.

        query_engine = index.as_query_engine(response_mode='no_text')
        response = query_engine.query("my query")
        print(response.source_nodes)

maspotts · 2023-05-12T01:08:19Z

Thanks! I'm trying to cluster (all) the vectors, then generate a description (label) for each cluster by sending (just) the vectors in each cluster to GPT to summarize, then associate the vectors with the original documents and classify each document by applying a sort of weighted sum of its cluster-labeled snippets. Not sure how useful that will be, but I want to try! I've got the vectors now (although I'm a bit worried that the nested structure I'm getting them from might change without warning in the future!), and I'm able to cluster them, but I don't know how to associate the vectors (via their nodes) back to the original documents yet...
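The "weighted sum" step described above could be sketched roughly as follows. This assumes each snippet already has a cluster label and a known source document, and that every snippet counts equally; the function name, the ids and the labels are all hypothetical and only illustrate the idea:

```python
from collections import Counter, defaultdict
from typing import Dict


def label_weights_per_doc(
    snippet_to_doc: Dict[str, str],
    snippet_to_label: Dict[str, str],
) -> Dict[str, Dict[str, float]]:
    """For each document, return the fraction of its snippets in each cluster label."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for snippet_id, doc_id in snippet_to_doc.items():
        label = snippet_to_label.get(snippet_id)
        if label is not None:  # ignore snippets that were never clustered
            counts[doc_id][label] += 1
    weights: Dict[str, Dict[str, float]] = {}
    for doc_id, counter in counts.items():
        total = sum(counter.values())
        weights[doc_id] = {label: n / total for label, n in counter.items()}
    return weights


# Hypothetical data: three snippets from docA, one from docB
snippet_to_doc = {"s1": "docA", "s2": "docA", "s3": "docA", "s4": "docB"}
snippet_to_label = {"s1": "billing", "s2": "billing", "s3": "legal", "s4": "legal"}
weights = label_weights_per_doc(snippet_to_doc, snippet_to_label)
```

A document could then be assigned its highest-weight label, or the full distribution could be kept if documents span several clusters.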

logan-markewich · 2023-05-12T02:57:05Z

Should be able to associate them using the doc ids 🤔

vector_store._data.text_id_to_doc_id -> this maps each text id to the doc id it came from
vector_store._data.embedding_dict -> this maps each text id to its embedding
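Joining the two mappings could look something like the sketch below. It works on plain dicts shaped like `text_id_to_doc_id` and `embedding_dict` as described above; since `_data` is a private attribute, that shape is an assumption that may not hold across versions, and `group_text_ids_by_doc` plus the sample ids are hypothetical:

```python
from collections import defaultdict
from typing import Dict, List


def group_text_ids_by_doc(
    text_id_to_doc_id: Dict[str, str],
    embedding_dict: Dict[str, List[float]],
) -> Dict[str, List[str]]:
    """Map each doc id to the text ids (snippets) that have an embedding."""
    doc_to_text_ids: Dict[str, List[str]] = defaultdict(list)
    for text_id in embedding_dict:
        doc_id = text_id_to_doc_id.get(text_id)
        if doc_id is not None:  # skip snippets with no recorded source document
            doc_to_text_ids[doc_id].append(text_id)
    return dict(doc_to_text_ids)


# Hypothetical stand-ins for the two private mappings
text_id_to_doc_id = {"t1": "doc-1", "t2": "doc-1", "t3": "doc-2"}
embedding_dict = {"t1": [0.1, 0.2], "t2": [0.3, 0.4], "t3": [0.5, 0.6]}
doc_to_text_ids = group_text_ids_by_doc(text_id_to_doc_id, embedding_dict)
```

Each text id in the result can be used to look up its vector in `embedding_dict`, so a cluster assignment per vector carries over to the document it belongs to.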

maspotts · 2023-05-12T03:19:41Z

Thx! I didn't see any document titles or file names/paths in the embedding_dict though...

logan-markewich · 2023-05-12T03:29:40Z

Ah yea, then there's that issue.

Tbh for this task, you might be better off generating the embeddings yourself outside of llama index 😅

maspotts · 2023-05-12T03:30:57Z

OK, thanks: it's not a must-have at this point.

sarahwooders · 2023-11-01T18:18:09Z

Related to this, is it possible to get a list of nodes from an existing index? I am trying to load saved vector index state into another vector index.

logan-markewich added the discord label May 12, 2023

Disiok closed this as completed Jun 11, 2023

dosubot bot mentioned this issue Nov 29, 2023: [Question]: Get all nodes on an index(VectorStoreIndex) #9206 (closed)
