Multi-hop link prediction in knowledge graphs (KGs) is a core challenge in knowledge graph analysis, one that has become increasingly tractable thanks to advances in natural language processing (NLP) and KG embedding techniques. This paper introduces the Knowledge Graph Large Language Model Framework (KG-LLM), a novel methodology that leverages key NLP paradigms. Our method converts structured knowledge graph data into natural language and then uses these natural-language prompts to fine-tune large language models (LLMs) for multi-hop link prediction in KGs. By converting the KG into natural-language prompts, the framework is designed to discern and learn the latent representations of entities and their interrelations. To demonstrate the efficacy of the KG-LLM framework, we fine-tune three leading LLMs within it, evaluating each both with and without in-context learning (ICL). We further explore the framework's potential to endow LLMs with zero-shot capabilities for handling previously unseen prompts. Our experimental findings show that our approach significantly boosts the models' generalization capacity, yielding more precise predictions in unfamiliar scenarios.
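For illustration, the prompt-conversion step can be sketched as below; the helper name `path_to_prompt` and the prompt wording are hypothetical assumptions that only approximate the kind of template the framework uses, not the exact KG-LLM prompt format.

```python
# Hypothetical sketch: converting a multi-hop KG path into a natural-language prompt.
# The function name and prompt wording are illustrative, not the exact KG-LLM template.
def path_to_prompt(path):
    """path: list of (head, relation, tail) triples forming a multi-hop chain."""
    sentences = [f"{head} has relation {rel} with {tail}." for head, rel, tail in path]
    source, target = path[0][0], path[-1][2]
    question = f"Is there a link between {source} and {target}?"
    return " ".join(sentences) + " " + question

example_path = [
    ("entity_A", "hypernym", "entity_B"),
    ("entity_B", "member_meronym", "entity_C"),
]
print(path_to_prompt(example_path))
# -> "entity_A has relation hypernym with entity_B. entity_B has relation
#     member_meronym with entity_C. Is there a link between entity_A and entity_C?"
```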
-
Set up your Python environment
Install the dependencies listed in requirements.txt by running
pip install -r requirements.txt
-
Requesting model access from Meta and Google
Visit this link and request access to the Llama-2 models
Visit this link and request access to the Gemma models
-
Requesting model access from HuggingFace
Once the request is approved, use the same email address to get access to the models on HF: Llama-2 and Gemma.
-
Authorising HF token
Once your HF request to access the models has been approved, create a Hugging Face token here
Run the code below and enter your token; it will authenticate your HF account
>>> huggingface-cli login
or
>>> from huggingface_hub import login
>>> login(YOUR_HF_TOKEN)
We conduct experiments on two real-world datasets, WN18RR and NELL-995, which are constructed and released by the OpenKE library.
Dataset | #Entities | #Triples | #Relations |
---|---|---|---|
WN18RR | 40,943 | 86,835 | 11 |
NELL-995 | 75,492 | 149,678 | 200 |
FB15k-237 | 14,541 | 310,116 | 237 |
YAGO3-10 | 123,182 | 1,179,040 | 37 |
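OpenKE releases each dataset as plain-text ID files (entity2id.txt, relation2id.txt, train2id.txt, and so on), where the first line holds the number of entries and each subsequent line is one record. A minimal loading sketch, assuming that standard OpenKE layout and that entity names contain no whitespace:

```python
# Sketch: reading OpenKE-style dataset files (assumes the standard OpenKE layout:
# first line = entry count, then one record per line).
def load_id_map(path):
    """entity2id.txt / relation2id.txt: '<name>\t<id>' per line after the count line."""
    mapping = {}
    with open(path) as f:
        f.readline()  # first line is the entry count
        for line in f:
            if line.strip():
                name, idx = line.split()[:2]  # assumes names contain no whitespace
                mapping[name] = int(idx)
    return mapping

def load_triples(path):
    """train2id.txt / test2id.txt: '<head_id> <tail_id> <relation_id>' per line."""
    with open(path) as f:
        f.readline()  # first line is the triple count
        return [tuple(map(int, line.split())) for line in f if line.strip()]

# Paths assume the dataset files were placed in the preprocess folder as described below.
entity2id = load_id_map("preprocess/entity2id.txt")
train_triples = load_triples("preprocess/train2id.txt")
```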
-
Clone Repository: Clone this repository to your local machine.
git clone https://github.com/rutgerswiselab/KG-LLM.git
-
Download Data: Download the dataset from the previously mentioned library.
-
Place Files: Put the downloaded dataset in the `preprocess` folder.
- Run Script: Open a terminal or command prompt, navigate to the directory containing the script and files, and run the following command:
python preprocess.py
- Check Output: After running the script, you should find some new CSV files containing the preprocessed data in the same directory.
- Split Data:
python split_data.py
- Check Output: After running the script, you should find the training, validation, and testing data files.
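What split_data.py does internally is repo-specific; as a rough sketch, a random split of the preprocessed CSV into train/validation/test portions (the 80/10/10 ratio and the input filename below are assumptions) could look like:

```python
# Sketch: splitting the preprocessed prompts into train/val/test CSVs.
# Assumed 80/10/10 random split; the repo's split_data.py may use different ratios or logic.
import pandas as pd

df = pd.read_csv("preprocessed_data.csv").sample(frac=1.0, random_state=42)  # shuffle rows
n = len(df)
train_end, val_end = int(0.8 * n), int(0.9 * n)

df.iloc[:train_end].to_csv("train_data.csv", index=False)
df.iloc[train_end:val_end].to_csv("val_data.csv", index=False)
df.iloc[val_end:].to_csv("test_data.csv", index=False)
```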
Three distinct LLMs are utilized: Flan-T5-Large, LLaMa2-7B, and Gemma-7B.
Model | #Parameters | Max Tokens | Fine-tuning Technique |
---|---|---|---|
Flan-T5-Large | 783M | 512 | Global Fine-tuning |
LLaMa2-7B | 7B | 4096 | 4-bit LoRA |
Gemma-7B | 7B | 4096 | 4-bit LoRA |
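The 4-bit LoRA entries refer to parameter-efficient fine-tuning with 4-bit quantization plus LoRA adapters. A minimal sketch of that setup using bitsandbytes and PEFT follows; the LoRA hyperparameters and target modules are illustrative assumptions, not necessarily the repo's exact values.

```python
# Sketch: loading LLaMa2-7B in 4-bit and attaching LoRA adapters.
# Hyperparameters are illustrative assumptions; see the repo's train.py for the actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```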
- Get the Data Ready
Make sure your training, validation, and testing data are ready to use
- Train the model
Open the `train.py` script in the `train` folder and modify the hyperparameters as you like. Here are the parameters you can modify:
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
gradient_accumulation_steps=4,
warmup_steps=2,
weight_decay=0.01,
num_train_epochs=5,
learning_rate=2e-4,
fp16=True,
logging_steps=1,
output_dir="outputs",
optim="paged_adamw_8bit",
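These names match Hugging Face `TrainingArguments` fields; a minimal sketch of how they fit together is below (the surrounding `Trainer` wiring is an assumption about train.py's internals):

```python
# Sketch: the hyperparameters above expressed as Hugging Face TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    weight_decay=0.01,
    num_train_epochs=5,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)
# Typically these arguments are passed to a Trainer, e.g.:
# Trainer(model=model, args=training_args,
#         train_dataset=train_dataset, eval_dataset=val_dataset).train()
```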
When running train.py, it takes several arguments:
"--model_name", type=str, default="flan-t5", help="which model: flan-t5, llama2, gemma"
"--train_file", type=str, default=r"train_data.csv", help="Path to the train CSV file"
"--valid_file", type=str, default=r"val_data.csv", help="Path to the validation CSV file"
"--entity_file", type=str, default=r"entity2id.txt", help="Path to the entity2id.txt file"
"--relation_file", type=str, default=r"relation2id.txt", help="Path to the relation2id.txt file"
Start fine-tuning:
python train.py
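For example, to fine-tune LLaMa2 with explicit file paths (using the argument names listed above):
python train.py --model_name llama2 --train_file train_data.csv --valid_file val_data.csv --entity_file entity2id.txt --relation_file relation2id.txt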
- Model Training Completed
After training, the trained model checkpoint files will be generated in the specified output_dir directory.
- Modify Test File
Open the `test` folder and a test script file, for example `test_link_icl.py`, and modify the following parameters:
test_file: Modify this parameter to specify the path to the test file.
model: Modify this parameter to specify the path to the trained model checkpoint directory.
- Run the Test Script
Run the chosen test script from the command line and wait for the model testing to complete.
Example command:
python test_link_icl.py
- View Accuracy
After testing, the script will output the model's evaluation metrics (AUC and F1 score) on the test dataset.
Example output:
AUC: 0.95
F1: 0.93
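These metrics can be reproduced with scikit-learn from binary link-existence labels and model scores; a minimal sketch is below (the arrays are toy placeholders, not actual results, and the test scripts may compute the metrics differently):

```python
# Sketch: computing AUC and F1 for binary link-prediction outputs with scikit-learn.
from sklearn.metrics import roc_auc_score, f1_score

y_true = [1, 0, 1, 1, 0]             # ground truth: does the multi-hop link exist?
y_prob = [0.9, 0.2, 0.8, 0.6, 0.4]   # model-predicted probabilities or scores
y_pred = [int(p >= 0.5) for p in y_prob]  # thresholded yes/no predictions

print(f"AUC: {roc_auc_score(y_true, y_prob):.2f}")
print(f"F1: {f1_score(y_true, y_pred):.2f}")
```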