
PoC: Added initial Knowledge Graph support #1801

Open
wants to merge 12 commits into main

Conversation

@jaluma (Collaborator) commented Mar 27, 2024

Knowledge Graph

This PR introduces knowledge graph capabilities.

What is a knowledge graph?

A knowledge graph is a collection of nodes and edges that represent entities or concepts and the relationships between them, such as facts, properties, or categories.
It can be used to query or infer factual information about entities or concepts based on their node and edge attributes.
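
As a rough illustration (not code from this PR), such a graph boils down to subject-predicate-object triples, where subjects and objects are the nodes and predicates are the labelled edges:

# Illustrative only: a tiny knowledge graph as subject-predicate-object triples.
triples = [
    ("Alan Turing", "born_in", "London"),
    ("Alan Turing", "field", "computer science"),
    ("London", "capital_of", "United Kingdom"),
]

# "What do we know about Alan Turing?" is a simple edge lookup.
facts = [(p, o) for s, p, o in triples if s == "Alan Turing"]
print(facts)  # [('born_in', 'London'), ('field', 'computer science')]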

Changes Made

  1. Knowledge Graph Support:

    • Added support for integrating a knowledge graph into the project. This feature allows the knowledge graph to be combined with the vector store to leverage different contextual sources.
  2. Neo4j Graph Store Provider:

    • Integrated a Neo4j Graph Store provider. A graph database like Neo4j is instrumental in managing complex relationships between data entities. By representing data as nodes and relationships, it enables efficient querying and traversal of interconnected data, making it an ideal choice for implementing a knowledge graph. Additionally, it offers powerful querying capabilities such as pattern matching, making it easier to extract insights from interconnected data.
    • During development, issues related to lowercasing and string formatting were encountered; they have been addressed in this PR.
  3. RDF File Support (Turtle Syntax):

    • Implemented support for ingesting RDF files in Turtle syntax into the graph. RDF files represent data in a graph-like structure using subject-predicate-object triples, which allows structured data to be incorporated into the knowledge graph, facilitating richer data representation and enabling advanced querying and analysis. A minimal parsing sketch is included after this list.
    • The main reason for implementing RDF support is to allow any kind of linked data from the web to be processed locally, in line with the principles of the project.
    • To generate a Wikidata RDF file, a sample Jupyter notebook has been provided: here.
  4. Router Retriever Support (Ensemble retrievers):

    • Added support for router retrievers, allowing the simultaneous use of multiple sources with a score ranking mechanism. This enhances the project's ability to retrieve information from diverse sources and prioritize the most relevant results.
    • This feature is limited to a single source in this version; it would be nice to parametrize this in the configuration or to define a better selection strategy :).
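
For reference, here is a minimal sketch of what Turtle parsing looks like. It is illustrative only and uses rdflib; the actual reader shipped in this PR may differ.

# Illustrative sketch only, not the RDF reader included in this PR.
# Assumes rdflib is installed (pip install rdflib).
from rdflib import Graph

turtle_data = """
@prefix ex: <http://example.org/> .
ex:AlanTuring ex:bornIn ex:London .
ex:London ex:capitalOf ex:UnitedKingdom .
"""

graph = Graph()
graph.parse(data=turtle_data, format="turtle")

# Every Turtle statement becomes a (subject, predicate, object) triple,
# which maps naturally onto a knowledge-graph edge.
for subject, predicate, obj in graph:
    print(subject, predicate, obj)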

TODO

  • Ingest files into the Knowledge Graph using ParallelizedIngestComponent, BatchIngestComponent, and PipelineIngestComponent
  • Refactor code to support VectorIndex and KnowledgeGraphIndex
  • Add more graph providers, such as Nebula.
  • Allow specific extensions only when a matching provider is enabled, e.g. RDF files can be ingested when any GraphStore provider is enabled.
  • Refactor methods to better distinguish between vector and graph components.

How to activate it?

To select a graph store, set the graphstore.database property in the settings.yaml file to neo4j. You will also need to install the graph-stores-neo4j extra.

graphstore:
  database: neo4j

To configure the Neo4j connection, set the neo4j object in settings.yaml.

neo4j:
  url: neo4j://localhost:7687
  username: neo4j
  password: password
  database: neo4j
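
As a hedged sketch (the exact component wiring lives in this PR and may differ), these settings roughly correspond to constructing a llama-index Neo4jGraphStore from the graph-stores-neo4j extra:

# Illustrative sketch only: how the neo4j settings above might map onto
# llama-index's Neo4jGraphStore (provided by the graph-stores-neo4j extra).
from llama_index.graph_stores.neo4j import Neo4jGraphStore

graph_store = Neo4jGraphStore(
    username="neo4j",
    password="password",
    url="neo4j://localhost:7687",
    database="neo4j",
)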

Run local Neo4j using Docker

To run Neo4j using Docker, you can use the following command:

docker run \
    --restart always \
    --publish=7474:7474 --publish=7687:7687 \
    --env NEO4J_AUTH=neo4j/password \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4JLABS_PLUGINS='["apoc"]' \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    neo4j:5.18.0

@@ -494,24 +558,28 @@ def get_ingestion_component(
embed_model=embed_model,
transformations=transformations,
count_workers=settings.embedding.count_workers,
llm=kwargs.get("llm"),
Collaborator

this feels error prone, can't you use the type directly?

@@ -48,7 +52,10 @@ def _try_loading_included_file_formats() -> dict[str, type[BaseReader]]:
".mbox": MboxReader,
".ipynb": IPYNBReader,
}
return default_file_reader_cls
optional_file_reader_cls: dict[str, type[BaseReader]] = {
Collaborator

I think you can move it back with the default readers, you are importing it unconditionally anyway

Comment on lines +73 to +75
graph_store=graph_store_component.graph_store
if graph_store_component and graph_store_component.graph_store
else None,
Collaborator

you can simplify this with just graph_store_component.graph_store, the component can't be None. The dependency injector would fail before that.

retrievers = [
r for r in [vector_index_retriever, graph_knowledge_retrevier] if r
]
retriever = RouterRetriever.from_defaults(
Collaborator

past experience with llama-index makes me not trust these from_defaults, can you check the implementation to make sure it's doing sane things only (for example, some defaults try to call OpenAI if you omit one of the parameters)

@@ -389,10 +412,12 @@ class Settings(BaseModel):
ollama: OllamaSettings
azopenai: AzureOpenAISettings
vectorstore: VectorstoreSettings
graphstore: GraphStoreSettings | None = None
Collaborator

use a non-nullable type here and add a enabled property instead, makes it easier to configure through env vars that way

@@ -0,0 +1,92 @@
# mypy: ignore-errors
Collaborator

this is a bit dangerous, what types were giving trouble?

"""Read RDF files.

This module is used to read RDF files.
It was created by llama-hub but it has not been ported
Collaborator

So, it was ported to llama-index 0.1.0 with fixes, right? This sentence is a little bit confusing...

@spsach commented Jun 5, 2024

Is the Knowledge Graph functionality working? Has anyone tried it?

@gkorland

Is the PR still alive? Are you going to make it more generic such that it will be able to support more Graph Databases?

@jaluma changed the title from "Added initial Knowledge Graph support" to "PoC: Added initial Knowledge Graph support" on Sep 9, 2024