suggestion for Chinese general entity extractor. #5400
Labels
area:rasa-oss/ml/nlu-components
Issues focused around rasa's NLU components
area:rasa-oss/ml 👁
All issues related to machine learning
type:discussion 👨👧👦
Early stage of an idea or validation of thoughts. Should NOT be closed by PR.
type:enhancement ✨
Additions of new features or changes to existing ones, should be doable in a single PR
Description of Problem:
Rasa currently has no general-purpose entity extraction component for Chinese. I recently found a good solution.
Overview of the Solution:
Lexical Analysis of Chinese (LAC) is a tool from Baidu. It ships a trained model that can be called easily through paddlehub. The model extracts four main entity types (PER: person name, LOC: place name, TIME: time, ORG: organization name) and can also assign 24 types of part-of-speech tags. For details, see https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis#%E4%BB%BB%E5%8A%A1%E5%AE%9A%E4%B9%89%E4%B8%8E%E5%BB%BA%E6%A8%A1
LAC's precision and recall for entity extraction are both close to 90%, which is good enough. I think it could play the same role for Chinese that DucklingHTTPExtractor plays for other languages.
Given the above, this model can be easily integrated into Rasa and used as a general entity extraction component. Using spacy_entity_extractor as a reference, I wrote chinese_normal_entity-extractor. The detailed code is as follows:
import logging
import time
from typing import Any, Dict, List, Text

from rasa.nlu.extractors import EntityExtractor
from rasa.nlu.training_data import Message

logger = logging.getLogger(__name__)


class ChineseNormalEntityExtractor(EntityExtractor):
    # Declares that this component adds "entities" to each processed message.
    provides = ["entities"]
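The rest of the component is not shown above. As a rough sketch of the conversion step it would need, the helper below turns LAC-style output into entity dicts of the shape Rasa extractors return. It assumes LAC yields two parallel lists (segmented words and their tags) and that entity tags are exactly PER, LOC, ORG and TIME. The name `lac_to_entities` and the constant `ENTITY_TAGS` are hypothetical, not part of the code in this issue.

```python
from typing import Any, Dict, List, Text

# Assumption: these four tags mark entities; all other tags are part-of-speech labels.
ENTITY_TAGS = {"PER", "LOC", "ORG", "TIME"}


def lac_to_entities(
    text: Text, words: List[Text], tags: List[Text]
) -> List[Dict[Text, Any]]:
    """Convert parallel word/tag lists into Rasa-style entity dicts.

    Character offsets are recovered by searching for each word in the
    original text, moving left to right, so whitespace dropped by the
    segmenter does not shift later offsets.
    """
    entities = []
    cursor = 0
    for word, tag in zip(words, tags):
        start = text.find(word, cursor)
        if start < 0:
            # The word could not be located in the text; skip it.
            continue
        end = start + len(word)
        cursor = end
        if tag in ENTITY_TAGS:
            entities.append(
                {"entity": tag, "value": word, "start": start, "end": end}
            )
    return entities


# Hand-written sample in the assumed output format (not real model output).
sample = lac_to_entities(
    "李明在北京工作", ["李明", "在", "北京", "工作"], ["PER", "p", "LOC", "v"]
)
```

Inside the component's `process` method, a list like this would then be attached to the message's "entities" field, presumably the same way spacy_entity_extractor does it.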
Definition of Done:
After the component was written, I integrated it into the project for testing. On an i5-8250U CPU @ 1.60GHz (8-core, 16G memory), a single entity extraction call takes 3 to 5 ms.
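A per-call latency like this could be measured with a small helper along the following lines. This is only a sketch: `average_latency_ms` is a hypothetical name, and `extract` stands in for whatever callable runs the extraction on one piece of text.

```python
import time
from typing import Callable


def average_latency_ms(
    extract: Callable[[str], object], text: str, runs: int = 100
) -> float:
    """Return the mean wall-clock time of extract(text) in milliseconds."""
    # Warm-up call so one-off costs (e.g. model loading) are not counted.
    extract(text)
    started = time.perf_counter()
    for _ in range(runs):
        extract(text)
    elapsed = time.perf_counter() - started
    return elapsed * 1000.0 / runs
```

The warm-up call matters here because the first call may include downloading or loading the model, which would otherwise dominate the average.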
After the first download, the model is stored locally, normally in /home/xx/.paddlehub/modules/lac
If possible, may I open a PR to submit the above code?