English | 简体中文
- Requirements
- News
- Dataset
- Models
  - OneKE
  - LLaMA-series
  - ChatGLM
  - MOSS
  - Baichuan
  - CPM-Bee
  - GPT-series
    - Case 1: Information Extraction with LLMs English | Chinese
    - Case 2: Data Augmentation with LLMs English | Chinese
    - Case 3: CCKS2023 Instruction-based KG Construction with LLMs English | Chinese
    - Case 4: Unleash the Power of Large Language Models for Few-shot Relation Extraction English | Chinese
    - Case 5: CodeKGC: Code Language Models for KG Construction English | Chinese
- Methods
- Citation
In the era of large models, DeepKE-LLM uses a completely new set of environment dependencies.
conda create -n deepke-llm python=3.9
conda activate deepke-llm
cd example/llm
pip install -r requirements.txt
Please note that the requirements.txt file is located in the example/llm folder.
- [2024/04] We release a new bilingual (Chinese and English) schema-based information extraction model called OneKE based on Chinese-Alpaca-2-13B.
- [2024/02] We release a large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction dataset named IEPile, along with two models trained on IEPile: baichuan2-13b-iepile-lora and llama2-13b-iepile-lora.
- [2023/11] The weights of knowlm-13b-ie have been updated. This update mainly adjusted the NAN outputs, shortened the inference length, and added support for instructions without a specified schema.
- [2023/10] We released a new bilingual (Chinese and English) theme-based Information Extraction (IE) instruction dataset named InstructIE.
- [2023/08] A specialized version of KnowLM for information extraction (IE), named knowlm-13b-ie, was launched.
- [2023/07] Some of the instruction datasets used for training were released, including knowlm-ke and KnowLM-IE.
- [2023/06] The first version of pre-trained weights, knowlm-13b-base-v1.0, and the first version of zhixi-13b-lora were released.
- [2023/05] We initiated an instruction-based Information Extraction project.
Existing Datasets
Name | Download | Quantity | Description
---|---|---|---
InstructIE | Google Drive, Hugging Face, ModelScope, WiseModel | 300k+ | Bilingual (Chinese and English) topic-based Information Extraction (IE) instruction dataset
IEPile | Google Drive, Hugging Face, WiseModel, ModelScope | 2 million+ | Large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction fine-tuning dataset
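A minimal sketch of pulling one of these datasets from the Hugging Face Hub with the `datasets` library; the repo ID `zjunlp/iepile`, the split name, and the field names are assumptions here, so check the dataset card for the actual values.

```python
# Minimal sketch: download and inspect IEPile with the Hugging Face `datasets` library.
# The repo ID "zjunlp/iepile", the split name, and the field names are assumptions;
# check the dataset card on the Hub for the actual values.
from datasets import load_dataset

dataset = load_dataset("zjunlp/iepile", split="train")  # assumed repo ID and split

sample = dataset[0]
# Each record is expected to carry the four fields described in "Details of IEPile" below.
print(sample["task"], sample["source"])
print(sample["instruction"])
print(sample["output"])
```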
Details of InstructIE
An example of a single data entry
{
"id": "841ef2af4cfe766dd9295fb7daf321c299df0fd0cef14820dfcb421161eed4a1",
"text": "NGC1313 is a galaxy in the constellation of Reticulum. It was discovered by the Australian astronomer James Dunlop on September 27, 1826. It has a prominent uneven shape, and its axis does not completely revolve around its center. Near NGC1313, there is another galaxy, NGC1309.",
"relation": [
{"head": "NGC1313", "head_type": "astronomical object type", "relation": "time of discovery", "tail": "September 27, 1826", "tail_type": "time"},
{"head": "NGC1313", "head_type": "astronomical object type", "relation": "discoverer or inventor", "tail": "James Dunlop", "tail_type": "organization/human"},
{"head": "NGC1313", "head_type": "astronomical object type", "relation": "of", "tail": "Reticulum", "tail_type": "astronomical object type"}
]
}
Field | Description
---|---
id | The unique identifier for each data point.
cate | The category of the text's subject, with a total of 12 different thematic categories.
text | The input text for the model, with the goal of extracting all the involved relationship triples.
relation | Describes the relationship triples contained in the text, i.e., (head, head_type, relation, tail, tail_type).
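A minimal sketch of iterating over InstructIE-style records and collecting their triples, assuming the data is distributed as JSON Lines with the fields above; the file name `train.json` is a placeholder.

```python
# Minimal sketch: read InstructIE-style records and collect (head, relation, tail) triples.
# Assumes a JSON Lines file with the fields listed above; "train.json" is a placeholder path.
import json

def load_triples(path):
    triples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            for rel in record.get("relation", []):
                triples.append((rel["head"], rel["relation"], rel["tail"]))
    return triples

if __name__ == "__main__":
    for head, relation, tail in load_triples("train.json")[:5]:
        print(f"({head}, {relation}, {tail})")
```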
Details of IEPile
Each instance in IEPile contains four fields: task, source, instruction, and output. Below are the explanations for each field:
Field | Description
---|---
task | The task to which the instance belongs, one of the five types (NER, RE, EE, EET, EEA).
source | The dataset to which the instance belongs.
instruction | The instruction for inputting into the model, processed into a JSON string via json.dumps, including three fields: "instruction", "schema", and "input".
output | The output in the format of a dictionary's JSON string, where the key is the schema and the value is the extracted content.
The instruction format of IEPile adopts a JSON-like string structure, which is essentially a dictionary-type string composed of the following three main components:
(1) 'instruction': the task description, which outlines the task to be performed (one of NER, RE, EE, EET, EEA).
(2) 'schema': a list of schemas to be extracted (entity types, relation types, event types).
(3) 'input': the text from which information is to be extracted.
The file instruction.py provides instructions for various tasks.
Below is a data example:
{
"task": "NER",
"source": "CoNLL2003",
"instruction": "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}",
"output": "{\"person\": [\"Robert Allenby\", \"Allenby\", \"Miguel Angel Martin\"], \"organization\": [], \"else\": [], \"location\": [\"Australia\", \"Spain\"]}"
}
The data instance belongs to the NER task and comes from the CoNLL2003 dataset. The schema list to be extracted is ["person", "organization", "else", "location"], and the text to extract from is "284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )". The output is {"person": ["Robert Allenby", "Allenby", "Miguel Angel Martin"], "organization": [], "else": [], "location": ["Australia", "Spain"]}.
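A minimal sketch of how such an instruction string can be assembled and parsed with json.dumps/json.loads; the helper name build_instruction is hypothetical, and the task-description text simply mirrors the NER example above.

```python
# Minimal sketch: assemble and parse an IEPile-style "instruction" field.
# The helper name build_instruction is hypothetical; the task description mirrors the NER example above.
import json

def build_instruction(task_description, schema, text):
    # The "instruction" field is itself a JSON string with three keys.
    return json.dumps(
        {"instruction": task_description, "schema": schema, "input": text},
        ensure_ascii=False,
    )

instruction = build_instruction(
    "You are an expert in named entity recognition. Please extract entities that match the "
    "schema definition from the input. Return an empty list if the entity type does not exist. "
    "Please respond in the format of a JSON string.",
    ["person", "organization", "else", "location"],
    "284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68",
)

# Parsing it back recovers the three components described above.
parsed = json.loads(instruction)
print(parsed["schema"], parsed["input"][:20])
```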
OneKE is a bilingual large language model for information extraction; a Chinese tutorial is available.
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. Based on KnowLM, we also provide a bilingual LLM for knowledge extraction named ZhiXi (智析), whose name means "intelligent analysis of data for knowledge extraction".
ZhiXi follows a two-step approach: (1) it performs further full pre-training of LLaMA (13B) on Chinese/English corpora to enhance the model's Chinese comprehension and knowledge while preserving its English and code capabilities as much as possible; (2) it fine-tunes the model with an instruction dataset to improve the language model's understanding of human instructions. For detailed information about the model, please refer to KnowLM.
Case 1: LoRA Fine-tuning of ChatGLM for CCKS2023 Instruction-based KG Construction English | Chinese
Case 1: OpenDelta Fine-tuning of Moss for CCKS2023 Instruction-based KG Construction English | Chinese
Case 1: OpenDelta Fine-tuning of Baichuan for CCKS2023 Instruction-based KG Construction English | Chinese
Case 1: OpenDelta Fine-tuning of Qwen for CCKS2023 Instruction-based KG Construction English | Chinese
Case 1: OpenDelta Fine-tuning of CPM-Bee for CCKS2023 Instruction-based KG Construction English | Chinese
Case 4: Unleash the Power of Large Language Models for Few-shot Relation Extraction English | Chinese
To better address the Relational Triple Extraction (RTE) task in knowledge graph construction, we have designed code-style prompts that model the structure of relational triples and use Code-LLMs to generate more accurate predictions. The key step of code-style prompt construction is to transform (text, output triples) pairs into semantically equivalent programs written in Python.
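A minimal illustration of the idea, not the exact prompt format used by CodeKGC: the demonstration text and its triples are rewritten as Python class definitions and instantiations, and the Code-LLM is asked to complete the same structure for a new sentence.

```python
# Illustrative code-style prompt in the spirit of CodeKGC (not the project's exact format):
# a (text, triples) pair is rewritten as Python definitions that a Code-LLM can complete.

class Entity:
    def __init__(self, name: str):
        self.name = name

class Triple:
    def __init__(self, head: Entity, relation: str, tail: Entity):
        self.head, self.relation, self.tail = head, relation, tail

# Demonstration: "London is the capital of the United Kingdom."
demo = [
    Triple(Entity("London"), "capital of", Entity("United Kingdom")),
]

# Query: "Paris is the capital of France."
# The Code-LLM is prompted to continue the list of Triple instantiations for the new sentence,
# e.g. Triple(Entity("Paris"), "capital of", Entity("France")).
```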
In-Context Learning is an approach to guide large language models to improve their performance on specific tasks. It conditions the model on a small number of task demonstrations placed directly in the prompt, without updating the model's parameters, so that the model can better understand and address the requirements of a particular domain. Through In-Context Learning, we can enable large language models to perform tasks such as information extraction, data augmentation, and instruction-driven knowledge graph construction.
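A minimal sketch of building a few-shot, in-context prompt for relation extraction; the demonstration and the helper name build_icl_prompt are placeholders and are not tied to any specific API in this repository.

```python
# Minimal sketch: build a few-shot in-context prompt for relation extraction.
# The demonstration is an invented placeholder; how the prompt is sent to a model is left out.
demonstrations = [
    ("London is the capital of the United Kingdom.",
     '[{"head": "London", "relation": "capital of", "tail": "United Kingdom"}]'),
]

def build_icl_prompt(demos, query_text):
    parts = ["Extract relation triples from the text and answer as a JSON list."]
    for text, triples in demos:
        parts.append(f"Text: {text}\nTriples: {triples}")
    parts.append(f"Text: {query_text}\nTriples:")
    return "\n\n".join(parts)

prompt = build_icl_prompt(demonstrations, "Paris is the capital of France.")
print(prompt)  # The completed "Triples:" line is what the LLM is expected to generate.
```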
LoRA (Low-Rank Adaptation of Large Language Models) reduces the number of trainable parameters by learning low-rank decomposition matrices while freezing the original weights. This significantly reduces the storage requirements of large language models for specific tasks and enables efficient task switching during deployment without introducing inference latency. For more details, please refer to the original paper LoRA: Low-Rank Adaptation of Large Language Models.
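A minimal sketch of attaching LoRA adapters to a causal LM with the peft library; the base model name and hyperparameter values are illustrative, not the settings used in this project.

```python
# Minimal sketch: attach LoRA adapters to a causal LM with the peft library.
# The base model name and hyperparameters are illustrative, not this project's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")  # placeholder base model
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank decomposition
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only the low-rank matrices are trainable
```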
The PT (P-Tuning) method, as referred to in the official code of ChatGLM, is a soft-prompt method designed for large models. P-Tuning introduces new trainable parameters only at the embedding layer, while P-Tuning v2 adds new trainable parameters at the embedding layer and in front of every layer of the model.
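A minimal sketch of P-Tuning using peft's PromptEncoderConfig; the model name and hyperparameters are placeholders, and this is not the official ChatGLM training setup.

```python
# Minimal sketch: P-Tuning via peft's PromptEncoderConfig (soft prompts learned by a prompt encoder).
# The model name and hyperparameters are placeholders, not the official ChatGLM setup.
from transformers import AutoModelForCausalLM
from peft import PromptEncoderConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)  # placeholder
pt_config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,        # length of the learned soft prompt
    encoder_hidden_size=128,      # hidden size of the prompt encoder
)
model = get_peft_model(base, pt_config)
model.print_trainable_parameters()  # only the soft-prompt parameters are trainable
```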
If you use this project, please cite the following paper:
@misc{knowlm,
author = {Ningyu Zhang and Jintian Zhang and Xiaohan Wang and Honghao Gui and Kangwei Liu and Yinuo Jiang and Xiang Chen and Shengyu Mao and Shuofei Qiao and Yuqi Zhu and Zhen Bi and Jing Chen and Xiaozhuan Liang and Yixin Ou and Runnan Fang and Zekun Xi and Xin Xu and Lei Li and Peng Wang and Mengru Wang and Yunzhi Yao and Bozhong Tian and Yin Fang and Guozhou Zheng and Huajun Chen},
title = {KnowLM Technical Report},
year = {2023},
url = {http://knowlm.zjukg.cn/},
}