Refactored data directory #6

Merged
merged 4 commits into from
Jul 24, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -129,3 +129,4 @@ dmypy.json
.pyre/
src/kg_chat/graph_output/knowledge_graph.html
data/database/*.db
tests/input/database/*.db
27 changes: 20 additions & 7 deletions README.md
@@ -13,7 +13,9 @@ LLM-based chatbot that queries and visualizes [`KGX`](https://github.com/biolink
3. Install the APOC plugin in Neo4j Desktop.
4. Update settings to match [`neo4j_db_settings.conf`](conf_files/neo4j_db_settings.conf).

### General Setup
### General Setup

#### For Developers
1. Clone this repository.
2. Create a virtual environment and install dependencies:
```shell
@@ -23,6 +25,16 @@ LLM-based chatbot that queries and visualizes [`KGX`](https://github.com/biolink
```
3. Replace [`data/nodes.tsv`](data/nodes.tsv) and [`data/edges.tsv`](data/edges.tsv) with desired KGX files if needed.

### For using kg-chat as a dependency

```shell
pip install kg-chat
```
OR
```shell
poetry add kg-chat@latest
```

### Supported Backends
- DuckDB [default]
- Neo4j
@@ -31,27 +43,28 @@ LLM-based chatbot that queries and visualizes [`KGX`](https://github.com/biolink

1. **Import KG**: Load nodes and edges into a database (default: duckdb).
```shell
poetry run kg import
poetry run kg import --data-dir data
```

2. **Test Query**: Run a test query.
2. **Test Query**: Run a test query.
> NOTE: `--data-dir` is a required parameter for all commands. It is the path to the directory that contains the `nodes.tsv` and `edges.tsv` files; the files must have exactly these names.
```shell
poetry run kg test-query
poetry run kg test-query --data-dir data
```

3. **QnA**: Ask questions about the data.
```shell
poetry run kg qna "how many nodes do we have here?"
poetry run kg qna "how many nodes do we have here?" --data-dir data
```

4. **Chat**: Start an interactive chat session.
```shell
poetry run kg chat
poetry run kg chat --data-dir data
```

5. **App**: Deploy a local web application.
```shell
poetry run kg app
poetry run kg app --data-dir data
```
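
The `--data-dir` requirement noted above can be checked up front, before any command is invoked. The following is a minimal sketch and not part of kg-chat: `missing_kgx_files` is a hypothetical helper, and the only facts taken from this README are the two expected filenames, `nodes.tsv` and `edges.tsv`.

```python
from pathlib import Path

# Filenames the README says must be present in the data directory.
REQUIRED_FILES = ("nodes.tsv", "edges.tsv")


def missing_kgx_files(data_dir):
    """Return the required KGX filenames that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "data" mirrors the directory used in the commands above.
    missing = missing_kgx_files("data")
    if missing:
        print("Missing from data/: " + ", ".join(missing))
    else:
        print("data/ contains both KGX files.")
```

An empty list means the directory is ready to be passed as `--data-dir`.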

### Visualization
12 changes: 6 additions & 6 deletions docs/commands.rst
@@ -5,13 +5,13 @@ Commands

.. code-block:: shell

poetry run kg import
poetry run kg import --data-dir data

2. ``test-query``: To test that the above worked, run a built-in test query:

.. code-block:: shell

poetry run kg test-query --database neo4j
poetry run kg test-query --database neo4j --data-dir data

This should return something like (as per KGX data in the repo):

@@ -32,7 +32,7 @@ Commands

.. code-block:: shell

poetry run kg qna "give me the sorted (descending) frequency count nodes with relationships. Give me label and id. I want this as a table "
poetry run kg qna "give me the sorted (descending) frequency count nodes with relationships. Give me label and id. I want this as a table " --data-dir data

This should return

@@ -126,7 +126,7 @@ Commands

.. code-block:: shell

poetry run kg chat --database neo4j
poetry run kg chat --database neo4j --data-dir data

Gives you the following:

@@ -208,7 +208,7 @@ Commands

.. code-block:: shell

kg-chat $ poetry run kg chat
kg-chat $ poetry run kg chat --data-dir data
Ask me about your data! : show me 20 edges with subject prefix = CHEBI


@@ -421,7 +421,7 @@ Commands

.. code-block:: shell

poetry run kg app
poetry run kg app --data-dir data

This will start the app on http://localhost:8050/ which can be accessed in the browser.

23 changes: 23 additions & 0 deletions docs/setup.rst
@@ -26,6 +26,8 @@ Setup

Update the memory heaps as per your preference.

For Developers
---------------
6. Clone this repository locally

7. Create a virtual environment of your choice and ``pip install poetry`` in it.
@@ -38,3 +40,24 @@ Setup
poetry install

9. Replace the ``data/nodes.tsv`` and ``data/edges.tsv`` files in the project with the corresponding KGX files you want to query.



For Users
----------
10. Install the package from PyPI

.. code-block:: shell

pip install kg-chat

or

.. code-block:: shell

pip install poetry
poetry add kg-chat@latest

.. note::

* The KGX files must be named ``nodes.tsv`` and ``edges.tsv``.
* The data directory, which contains the ``nodes.tsv`` and ``edges.tsv`` files, must be provided with every command.
51 changes: 30 additions & 21 deletions src/kg_chat/cli.py
@@ -1,13 +1,14 @@
"""Command line interface for kg-chat."""

import logging
from pathlib import Path
from pprint import pprint
from typing import Union

import click

from kg_chat import __version__
from kg_chat.app import create_app
from kg_chat.constants import DATA_DIR
from kg_chat.implementations.duckdb_implementation import DuckDBImplementation
from kg_chat.implementations.neo4j_implementation import Neo4jImplementation
from kg_chat.main import KnowledgeGraphChat
@@ -28,7 +29,7 @@
"--data-dir",
type=click.Path(exists=True, file_okay=False, dir_okay=True),
help="Directory containing the data.",
default=DATA_DIR,
required=True,
)


@@ -56,30 +57,33 @@ def main(verbose: int, quiet: bool):
@main.command("import")
@database_options
@data_dir_option
def import_kg(database: str = "duckdb", data_dir: str = DATA_DIR):
def import_kg(database: str = "duckdb", data_dir: str = None):
"""Run the kg-chat's demo command."""
if not data_dir:
raise ValueError("Data directory is required. This typically contains the KGX tsv files.")
if database == "neo4j":
impl = Neo4jImplementation()
impl.load_kg(data_dir=data_dir)
impl = Neo4jImplementation(data_dir=data_dir)
impl.load_kg()
elif database == "duckdb":
impl = DuckDBImplementation()
impl.load_kg(data_dir=data_dir)
impl = DuckDBImplementation(data_dir=data_dir)
impl.load_kg()
else:
raise ValueError(f"Database {database} not supported.")


@main.command()
@data_dir_option
@database_options
def test_query(database: str = "duckdb"):
def test_query(data_dir: Union[str, Path], database: str = "duckdb"):
"""Run the kg-chat's chat command."""
if database == "neo4j":
impl = Neo4jImplementation()
impl = Neo4jImplementation(data_dir=data_dir)
query = "MATCH (n) RETURN n LIMIT 10"
result = impl.execute_query(query)
for record in result:
print(record)
elif database == "duckdb":
impl = DuckDBImplementation()
impl = DuckDBImplementation(data_dir=data_dir)
query = "SELECT * FROM nodes LIMIT 10"
result = impl.execute_query(query)
for record in result:
@@ -89,14 +93,15 @@ def test_query(database: str = "duckdb"):


@main.command()
@data_dir_option
@database_options
def show_schema(database: str = "duckdb"):
def show_schema(data_dir: Union[str, Path], database: str = "duckdb"):
"""Run the kg-chat's chat command."""
if database == "neo4j":
impl = Neo4jImplementation()
impl = Neo4jImplementation(data_dir=data_dir)
impl.show_schema()
elif database == "duckdb":
impl = DuckDBImplementation()
impl = DuckDBImplementation(data_dir=data_dir)
impl.show_schema()
else:
raise ValueError(f"Database {database} not supported.")
@@ -105,30 +110,32 @@ def show_schema(database: str = "duckdb"):
@main.command()
@database_options
@click.argument("query", type=str, required=True)
def qna(query: str, database: str = "duckdb"):
@data_dir_option
def qna(query: str, data_dir: Union[str, Path], database: str = "duckdb"):
"""Run the kg-chat's chat command."""
if database == "neo4j":
impl = Neo4jImplementation()
impl = Neo4jImplementation(data_dir=data_dir)
response = impl.get_human_response(query, impl)
pprint(response)
elif database == "duckdb":
impl = DuckDBImplementation()
impl = DuckDBImplementation(data_dir=data_dir)
response = impl.get_human_response(query)
pprint(response)
else:
raise ValueError(f"Database {database} not supported.")


@main.command("chat")
@data_dir_option
@database_options
def run_chat(database: str = "duckdb"):
def run_chat(data_dir: Union[str, Path], database: str = "duckdb"):
"""Run the kg-chat's chat command."""
if database == "neo4j":
impl = Neo4jImplementation()
impl = Neo4jImplementation(data_dir=data_dir)
kgc = KnowledgeGraphChat(impl)
kgc.chat()
elif database == "duckdb":
impl = DuckDBImplementation()
impl = DuckDBImplementation(data_dir=data_dir)
kgc = KnowledgeGraphChat(impl)
kgc.chat()
else:
@@ -137,17 +144,19 @@ def run_chat(database: str = "duckdb"):

@main.command("app")
@click.option("--debug", is_flag=True, help="Run the app in debug mode.")
@data_dir_option
@database_options
def run_app(
data_dir: Union[str, Path],
debug: bool = False,
database: str = "duckdb",
):
"""Run the kg-chat's chat command."""
if database == "neo4j":
impl = Neo4jImplementation()
impl = Neo4jImplementation(data_dir=data_dir)
kgc = KnowledgeGraphChat(impl)
elif database == "duckdb":
impl = DuckDBImplementation()
impl = DuckDBImplementation(data_dir=data_dir)
kgc = KnowledgeGraphChat(impl)
else:
raise ValueError(f"Database {database} not supported.")
4 changes: 0 additions & 4 deletions src/kg_chat/constants.py
@@ -5,7 +5,6 @@

PWD = Path(__file__).parent.resolve()
PROJ_DIR = PWD.parents[1]
DATA_DIR = PROJ_DIR / "data"
GRAPH_OUTPUT_DIR = PWD / "graph_output"
ASSETS_DIR = PWD / "assets"
TEST_DIR = PROJ_DIR / "tests"
@@ -21,6 +20,3 @@
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "password"


DATABASE_DIR = DATA_DIR / "database"
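
With `DATA_DIR` and `DATABASE_DIR` removed from the constants, the DuckDB file location is derived from the caller's data directory: `<data_dir>/database/kg_chat.db`, as in the `DuckDBImplementation.__init__` change in this PR. Below is a standalone sketch of that derivation using only `pathlib`. `resolve_database_path` is a hypothetical name chosen for illustration and does not exist in kg-chat.

```python
from pathlib import Path
from typing import Union


def resolve_database_path(data_dir: Union[str, Path]) -> Path:
    """Return <data_dir>/database/kg_chat.db, creating the parent folder."""
    if not data_dir:
        raise ValueError("Data directory is required.")
    database_path = Path(data_dir) / "database" / "kg_chat.db"
    # mkdir with exist_ok=True is idempotent, so no prior exists() check is needed.
    database_path.parent.mkdir(parents=True, exist_ok=True)
    return database_path
```

Because the path now depends on the data directory, each data directory gets its own database file, which is presumably why `tests/input/database/*.db` was added to `.gitignore`.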
21 changes: 12 additions & 9 deletions src/kg_chat/implementations/duckdb_implementation.py
@@ -13,8 +13,6 @@
from sqlalchemy import create_engine

from kg_chat.constants import (
DATA_DIR,
DATABASE_DIR,
OPEN_AI_MODEL,
OPENAI_KEY,
)
@@ -25,10 +23,15 @@
class DuckDBImplementation(DatabaseInterface):
"""Implementation of the DatabaseInterface for DuckDB."""

def __init__(self):
def __init__(self, data_dir: Union[Path, str]):
"""Initialize the DuckDB database and the Langchain components."""
if not data_dir:
raise ValueError("Data directory is required. This typically contains the KGX tsv files.")
self.safe_mode = True
self.database_path = DATABASE_DIR / "kg_chat.db"
self.data_dir = Path(data_dir)
self.database_path = self.data_dir / "database/kg_chat.db"
if not self.database_path.exists():
self.database_path.parent.mkdir(parents=True, exist_ok=True)
self.conn = duckdb.connect(database=str(self.database_path))
self.llm = ChatOpenAI(model=OPEN_AI_MODEL, temperature=0, api_key=OPENAI_KEY)
self.engine = create_engine(f"duckdb:///{self.database_path}")
@@ -113,7 +116,7 @@ def execute_query_using_langchain(self, prompt: str):
result = self.agent.invoke(prompt)
return result["output"]

def load_kg(self, data_dir: Union[Path, str] = DATA_DIR):
def load_kg(self):
"""Load the Knowledge Graph into the database."""

def _load_kg():
@@ -189,9 +192,9 @@ def _load_kg():

return self.execute_unsafe_operation(_load_kg)

def _import_nodes(self, data_dir: Union[Path, str] = DATA_DIR):
def _import_nodes(self):
columns_of_interest = ["id", "category", "name"]
nodes_filepath = Path(data_dir) / "nodes.tsv"
nodes_filepath = Path(self.data_dir) / "nodes.tsv"

with open(nodes_filepath, "r") as nodes_file:
header_line = nodes_file.readline().strip().split("\t")
@@ -211,9 +214,9 @@ def _import_nodes(self, data_dir: Union[Path, str] = DATA_DIR):
# Load data from temporary file into DuckDB
self.conn.execute(f"COPY nodes FROM '{temp_nodes_file.name}' (DELIMITER '\t', HEADER)")

def _import_edges(self, data_dir: Union[Path, str] = DATA_DIR):
def _import_edges(self):
edge_column_of_interest = ["subject", "predicate", "object"]
edges_filepath = Path(data_dir) / "edges.tsv"
edges_filepath = Path(self.data_dir) / "edges.tsv"
with open(edges_filepath, "r") as edges_file:
header_line = edges_file.readline().strip().split("\t")
column_indexes = {col: idx for idx, col in enumerate(header_line) if col in edge_column_of_interest}
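
The header handling in `_import_nodes` and `_import_edges` maps each column of interest to its position in the TSV header, so that only those columns are kept. A self-contained sketch of that idea follows. `project_columns` is a hypothetical helper, not a function in kg-chat, and it works on any iterable of lines rather than on the repository's files.

```python
from typing import Iterable, List


def project_columns(lines: Iterable[str], wanted: List[str]) -> List[List[str]]:
    """Keep only the `wanted` columns of tab-separated lines (header first)."""
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    # Same mapping as in the diff: column name -> index in the header.
    column_indexes = {col: idx for idx, col in enumerate(header) if col in wanted}
    # Emit columns in the order requested, skipping any the header lacks.
    ordered = [col for col in wanted if col in column_indexes]
    result = [ordered]
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        result.append([fields[column_indexes[col]] for col in ordered])
    return result
```

Passing an open file object as `lines` and `["subject", "predicate", "object"]` as `wanted` would reproduce the edge projection; the sketch assumes every row has as many fields as the header.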