Multi-hop link prediction in knowledge graphs (KGs) is a core challenge in knowledge graph analysis, one that has become increasingly tractable thanks to advances in natural language processing (NLP) and KG embedding techniques. This paper introduces the Knowledge Graph Large Language Model Framework (KG-LLM), a novel methodology that leverages key NLP paradigms. Our method converts structured knowledge graph data into natural language and then uses these natural-language prompts to fine-tune large language models (LLMs) for multi-hop link prediction in KGs. By converting the KG into natural-language prompts, the framework is designed to discern and learn the latent representations of entities and their interrelations. To demonstrate the efficacy of the KG-LLM framework, we fine-tune three leading LLMs within it, evaluating each both with and without in-context learning (ICL). We further explore the framework's potential to endow LLMs with zero-shot capabilities for handling previously unseen prompts. Our experimental findings show that our approach significantly boosts the models' generalization capacity, yielding more precise predictions in unfamiliar scenarios.
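For illustration, the prompt-conversion step can be sketched as below; the helper name `path_to_prompt` and the prompt wording are hypothetical assumptions that only approximate the kind of template the framework uses, not the exact KG-LLM prompt format.

```python
# Hypothetical sketch: converting a multi-hop KG path into a natural-language prompt.
# The function name and prompt wording are illustrative, not the exact KG-LLM template.
def path_to_prompt(path):
    """path: list of (head, relation, tail) triples forming a multi-hop chain."""
    sentences = [f"{head} has relation {rel} with {tail}." for head, rel, tail in path]
    source, target = path[0][0], path[-1][2]
    question = f"Is there a link between {source} and {target}?"
    return " ".join(sentences) + " " + question

example_path = [
    ("entity_A", "hypernym", "entity_B"),
    ("entity_B", "member_meronym", "entity_C"),
]
print(path_to_prompt(example_path))
# -> "entity_A has relation hypernym with entity_B. entity_B has relation
#     member_meronym with entity_C. Is there a link between entity_A and entity_C?"
```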
-
Set up your Python environment
Install the dependencies listed in requirements.txt by running
pip install -r requirements.txt
-
Requesting model access from Meta and Google
Visit this link and request access to the Llama-2 models
Visit this link and request access to the Gemma models
-
Requesting model access from HuggingFace
Once the request is approved, use the same email address to get access to the models on HF: Llama-2 and Gemma.
-
Authorising HF token
Once your HF request to access the models has been approved, create a Hugging Face token here
Run the code below and enter your token; it will authenticate your HF account
>>> huggingface-cli login
or
>>> from huggingface_hub import login
>>> login(YOUR_HF_TOKEN)
We conduct experiments on two real-world datasets, WN18RR and NELL-995, which are constructed and released by the OpenKE library.
Dataset | #Entities | #Triples | #Relations |
---|---|---|---|
WN18RR | 40,943 | 86,835 | 11 |
NELL-995 | 75,492 | 149,678 | 200 |
FB15k-237 | 14,541 | 310,116 | 237 |
YAGO3-10 | 123,182 | 1,179,040 | 37 |
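OpenKE releases each dataset as plain-text ID files (entity2id.txt, relation2id.txt, train2id.txt, and so on), where the first line holds the number of entries and each subsequent line is one record. A minimal loading sketch, assuming that standard OpenKE layout and that entity names contain no whitespace:

```python
# Sketch: reading OpenKE-style dataset files (assumes the standard OpenKE layout:
# first line = entry count, then one record per line).
def load_id_map(path):
    """entity2id.txt / relation2id.txt: '<name>\t<id>' per line after the count line."""
    mapping = {}
    with open(path) as f:
        f.readline()  # first line is the entry count
        for line in f:
            if line.strip():
                name, idx = line.split()[:2]  # assumes names contain no whitespace
                mapping[name] = int(idx)
    return mapping

def load_triples(path):
    """train2id.txt / test2id.txt: '<head_id> <tail_id> <relation_id>' per line."""
    with open(path) as f:
        f.readline()  # first line is the triple count
        return [tuple(map(int, line.split())) for line in f if line.strip()]

# Paths assume the dataset files were placed in the preprocess folder as described below.
entity2id = load_id_map("preprocess/entity2id.txt")
train_triples = load_triples("preprocess/train2id.txt")
```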
-
Clone Repository: Clone this repository to your local machine.
git clone https://github.com/rutgerswiselab/KG-LLM.git
-
Download Data: Download the dataset from the previously mentioned library.
-
Place Files: Put the downloaded dataset in the `preprocess` folder.
- Run Script: Open a terminal or command prompt, navigate to the directory containing the script and files, and run the following command:
python preprocess.py
- Check Output: After running the script, you should find some new CSV files containing the preprocessed data in the same directory.
- Split Data:
python split_data.py
- Check Output: After running the script, you should find the training, validation, and testing data files.
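What split_data.py does internally is repo-specific; as a rough sketch, a random split of the preprocessed CSV into train/validation/test portions (the 80/10/10 ratio and the input filename below are assumptions) could look like:

```python
# Sketch: splitting the preprocessed prompts into train/val/test CSVs.
# Assumed 80/10/10 random split; the repo's split_data.py may use different ratios or logic.
import pandas as pd

df = pd.read_csv("preprocessed_data.csv").sample(frac=1.0, random_state=42)  # shuffle rows
n = len(df)
train_end, val_end = int(0.8 * n), int(0.9 * n)

df.iloc[:train_end].to_csv("train_data.csv", index=False)
df.iloc[train_end:val_end].to_csv("val_data.csv", index=False)
df.iloc[val_end:].to_csv("test_data.csv", index=False)
```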
Three distinct LLMs are utilized: Flan-T5-Large, LLaMa2-7B, and Gemma-7B.
Model | #Parameters | Max Tokens | Fine-tuning Technique |
---|---|---|---|
Flan-T5-Large | 783M | 512 | Global Fine-tuning |
LLaMa2-7B | 7B | 4096 | 4-bit LoRA |
Gemma-7B | 7B | 4096 | 4-bit LoRA |
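The 4-bit LoRA entries refer to parameter-efficient fine-tuning with 4-bit quantization plus LoRA adapters. A minimal sketch of that setup using bitsandbytes and PEFT follows; the LoRA hyperparameters and target modules are illustrative assumptions, not necessarily the repo's exact values.

```python
# Sketch: loading LLaMa2-7B in 4-bit and attaching LoRA adapters.
# Hyperparameters are illustrative assumptions; see the repo's train.py for the actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```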
- Get the Data Ready
Make sure your training, validation, and testing data are ready to use
- Train the model
Open the `train.py` script in the `train` folder and modify the hyperparameters as you like. Here are the parameters you can modify:
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
gradient_accumulation_steps=4,
warmup_steps=2,
weight_decay=0.01,
num_train_epochs=5,
learning_rate=2e-4,
fp16=True,
logging_steps=1,
output_dir="outputs",
optim="paged_adamw_8bit",
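These names match Hugging Face `TrainingArguments` fields; a minimal sketch of how they fit together is below (the surrounding `Trainer` wiring is an assumption about train.py's internals):

```python
# Sketch: the hyperparameters above expressed as Hugging Face TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    weight_decay=0.01,
    num_train_epochs=5,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)
# Typically these arguments are passed to a Trainer, e.g.:
# Trainer(model=model, args=training_args,
#         train_dataset=train_dataset, eval_dataset=val_dataset).train()
```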
When running train.py, it takes several arguments:
"--model_name", type=str, default="flan-t5", help="which model: flan-t5, llama2, gemma"
"--train_file", type=str, default=r"train_data.csv", help="Path to the train CSV file"
"--valid_file", type=str, default=r"val_data.csv", help="Path to the validation CSV file"
"--entity_file", type=str, default=r"entity2id.txt", help="Path to the entity2id.txt file"
"--relation_file", type=str, default=r"relation2id.txt", help="Path to the relation2id.txt file"
Start fine-tuning:
python train.py
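For example, to fine-tune LLaMa2 with explicit file paths (using the argument names listed above):
python train.py --model_name llama2 --train_file train_data.csv --valid_file val_data.csv --entity_file entity2id.txt --relation_file relation2id.txt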
- Model Training Completed
After training, the trained model checkpoint files will be generated in the specified output_dir directory.
- Modify Test File
Open the `test` folder and a test script file, for example `test_link_icl.py`, and modify the following parameters:
test_file: Modify this parameter to specify the path to the test file.
model: Modify this parameter to specify the path to the trained model checkpoint directory.
- Run the Test Script
Run the chosen test script from the command line and wait for the model testing to complete.
Example command:
python test_link_icl.py
- View Accuracy
After testing, the script will output the model's evaluation metrics (AUC and F1 score) on the test dataset.
Example output:
AUC: 0.95
F1: 0.93
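These metrics can be reproduced with scikit-learn from binary link-existence labels and model scores; a minimal sketch is below (the arrays are toy placeholders, not actual results, and the test scripts may compute the metrics differently):

```python
# Sketch: computing AUC and F1 for binary link-prediction outputs with scikit-learn.
from sklearn.metrics import roc_auc_score, f1_score

y_true = [1, 0, 1, 1, 0]             # ground truth: does the multi-hop link exist?
y_prob = [0.9, 0.2, 0.8, 0.6, 0.4]   # model-predicted probabilities or scores
y_pred = [int(p >= 0.5) for p in y_prob]  # thresholded yes/no predictions

print(f"AUC: {roc_auc_score(y_true, y_prob):.2f}")
print(f"F1: {f1_score(y_true, y_pred):.2f}")
```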