This project provides a Python-based solution for named entity recognition (NER), specifically identifying personal names (ФИО, the Russian abbreviation for surname, first name, and patronymic) in Russian texts. It uses OpenAI's GPT-4 model via the langchain framework to extract names from input texts, retaining their original case and position. The solution consists of three main scripts:

- `model.py`: Defines the language model pipeline for entity extraction.
- `main.py`: Implements a cloud-hosted service for processing text and extracting entities.
- `test.py`: Provides testing and evaluation metrics for the entity extraction performance.

## model.py

- Purpose: Defines the pipeline for extracting personal names from text.
- Key Components:
  - Uses `ChatOpenAI` from `langchain_openai` to initialize the GPT-4 model.
  - Constructs a prompt to identify all names (ФИО) within a given text.
  - Defines an asynchronous function, `generate_answer()`, which processes the text to extract names and their positions.
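
This README does not show how `generate_answer()` derives positions. One plausible approach, sketched here as an assumption rather than the project's actual code, is to have the model return each name verbatim and then locate those strings in the source text. `locate_names` is a hypothetical helper, and the sample sentence is made up for illustration.

```python
def locate_names(text, names):
    """Return (name, start, end) for each name found verbatim in text.

    Offsets are half-open character indices, so text[start:end] == name.
    Hypothetical helper: the real pipeline may compute positions differently.
    """
    spans = []
    resume_at = {}  # name -> index from which to search for its next occurrence
    for name in names:
        start = text.find(name, resume_at.get(name, 0))
        if start == -1:
            # The model returned a string that does not occur verbatim; skip it.
            continue
        end = start + len(name)
        spans.append((name, start, end))
        resume_at[name] = end
    return spans


sample = "Вчера Анна Ахматова встретила Бориса Пастернака. Анна Ахматова улыбнулась."
found = locate_names(sample, ["Анна Ахматова", "Бориса Пастернака", "Анна Ахматова", "Иван"])
for name, start, end in found:
    print(name, start, end)
```

Searching for the exact string preserves the original case, and resuming after the previous match lets a repeated name map to successive occurrences instead of the same one.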

## main.py

- Purpose: Implements the NER service as a cloud-hosted API.
- Key Components:
  - The `SimpleActionExample` class defines the service logic for entity extraction using the model pipeline from `model.py`.
  - Takes input texts, processes them to extract entities, and formats the results according to predefined schemas.
  - Uses `mlp_sdk` to handle API hosting and deployment.
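
The formatting step might look roughly like the sketch below. It uses plain dictionaries as a stand-in for the predefined schemas, whose definitions are not shown in this README. The field names are copied from the output in the Example section; `build_response` and its input shape are assumptions made for illustration.

```python
import json


def build_response(spans_per_text, source_type="SLOVNET"):
    """Build a response dict from one list of (value, start, end) per input text.

    Hypothetical helper: field names follow the README's example output,
    but the real service may rely on schema classes instead of dicts.
    """
    entities_list = []
    for spans in spans_per_text:
        entities = [
            {
                "value": value,
                "entity_type": "PERSON",
                "span": {"start_index": start, "end_index": end},
                "entity": value,
                "source_type": source_type,
            }
            for value, start, end in spans
        ]
        entities_list.append({"entities": entities})
    return {"entities_list": entities_list}


response = build_response([[("Гагарин", 0, 7)]])
# ensure_ascii=False keeps Cyrillic readable instead of \uXXXX escapes.
print(json.dumps(response, ensure_ascii=False, indent=2))
```

Keeping one `entities` list per input text means the response order mirrors the request order, even when a text contains no names.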

## test.py

- Purpose: Tests and evaluates the entity extraction performance.
- Key Components:
  - Downloads a dataset of annotated texts to test the NER functionality.
  - Computes precision, recall, and F1-score metrics for the model output.
  - Supports running tests on a configurable number of files using command-line arguments.

## Installation

- Clone the repository: `git clone <repository-url>`, then `cd <repository-folder>`.
- Install dependencies: `pip install -r requirements.txt`
- Set up environment variables: create a `.env` file in the root directory and add your OpenAI API key as `OPENAI_API_KEY=your_openai_api_key`.
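
A missing key is an easy mistake to make at this step, so a startup check can help. The sketch below assumes the scripts load `.env` into the process environment before calling the model, which this README does not state. `get_api_key` is a hypothetical helper, not part of the project.

```python
import os


def get_api_key(environ=os.environ):
    """Return OPENAI_API_KEY from the given mapping or raise a clear error.

    Hypothetical helper; assumes the .env values are already loaded into
    the environment by the time it is called.
    """
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to .env or export it in the shell"
        )
    return key
```

Passing the mapping as a parameter makes the check easy to exercise with a plain dictionary instead of the real environment.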

## Usage

To start the entity extraction service, run `python main.py`. This hosts the NER service in the cloud using `mlp_sdk`.

To evaluate the performance of the NER model, run `python test.py --count <number_of_files>`, replacing `<number_of_files>` with the number of test files you wish to evaluate.
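
For reference, a `--count` option like the one above could be parsed with the standard `argparse` module. This is a minimal sketch, not the contents of `test.py`; the default value of 10 and the help text are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the evaluation options; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Evaluate NER on annotated test files")
    parser.add_argument(
        "--count",
        type=int,
        default=10,  # assumed default; the real script may require the flag
        help="number of test files to evaluate",
    )
    return parser.parse_args(argv)


args = parse_args(["--count", "5"])
print(args.count)
```

Declaring `type=int` lets `argparse` reject non-numeric input with a usage message before any files are downloaded.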

## Example

To extract entities from the text:

- Input Text: Гагарин полетел на орбиту на ракете Сергея Королёва. ("Gagarin flew into orbit on Sergei Korolev's rocket.")
- Output:

      {
        "entities_list": [
          {
            "entities": [
              {
                "value": "Гагарин",
                "entity_type": "PERSON",
                "span": {
                  "start_index": 0,
                  "end_index": 7
                },
                "entity": "Гагарин",
                "source_type": "SLOVNET"
              },
              {
                "value": "Сергея Королёва",
                "entity_type": "PERSON",
                "span": {
                  "start_index": 36,
                  "end_index": 51
                },
                "entity": "Сергея Королёва",
                "source_type": "SLOVNET"
              }
            ]
          }
        ]
      }
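
A quick way to sanity-check an output like this is to slice the input text with each span and compare the result to `value`. The sketch below assumes half-open character offsets (start inclusive, end exclusive), which is what the first entity suggests: start 0 and end 7 for a 7-character name. `span_is_consistent` is a hypothetical helper.

```python
def span_is_consistent(text, entity):
    """Check that the entity's span slices exactly its value out of text."""
    span = entity["span"]
    return text[span["start_index"]:span["end_index"]] == entity["value"]


text = "Гагарин полетел на орбиту на ракете Сергея Королёва."
gagarin = {"value": "Гагарин", "span": {"start_index": 0, "end_index": 7}}
shifted = {"value": "Гагарин", "span": {"start_index": 1, "end_index": 8}}

print(span_is_consistent(text, gagarin))  # True
print(span_is_consistent(text, shifted))  # False
```

Running this check over every returned entity catches off-by-one offsets before they reach the evaluation step.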

## Evaluation Metrics

- Precision: The fraction of names extracted by the model that are correct.
- Recall: The fraction of annotated names that the model identifies.
- F1 Score: The harmonic mean of precision and recall, providing a balanced evaluation metric.
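
These three metrics can be computed from sets of predicted and annotated spans. The sketch below assumes an exact-match criterion on `(start, end)` pairs; `test.py` may match entities differently, for example by string value or with partial overlap. `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(predicted, gold):
    """Compute exact-match precision, recall, and F1 over two span collections."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative spans only, not results from the project's dataset.
predicted_spans = {(0, 7), (10, 15)}
gold_spans = {(0, 7), (36, 51), (60, 70)}
p, r, f = precision_recall_f1(predicted_spans, gold_spans)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With one correct span out of two predictions and three annotations, precision is 1/2, recall is 1/3, and F1 is 0.4, which shows how F1 sits closer to the lower of the two values.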