suggestion for Chinese general entity extractor. #5400

BobCN2017 · 2020-03-10T14:13:42Z

Description of Problem:
Rasa currently has no general-purpose Chinese entity extraction component. I recently found a good solution.

Overview of the Solution:
Lexical Analysis of Chinese (LAC) is a tool from Baidu. It ships a pretrained model that can be called easily through PaddleHub. The model extracts four main entity types (PER: person name, LOC: place name, TIME: time, ORG: organization name) and also produces 24 types of part-of-speech tags. For details, see https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis#%E4%BB%BB%E5%8A%A1%E5%AE%9A%E4%B9%89%E4%B8%8E%E5%BB%BA%E6%A8%A1

LAC's precision and recall for entity extraction are both close to 90%, which is good enough. I think it could play the same role for Chinese that DucklingHTTPExtractor plays for other languages.
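To make the tag set concrete, here is a minimal sketch of picking the four entity types out of a LAC-style result. It assumes the result shape used by the component in this issue (parallel `word` and `tag` lists). The sample sentence and its tags are hand-written for illustration, not real LAC output, and `pick_entities` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# The four entity tags listed above; every other tag is a part-of-speech tag.
ENTITY_TAGS = {"PER", "LOC", "TIME", "ORG"}


def pick_entities(result: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs whose tag is one of the four entity tags."""
    return [
        (word, tag)
        for word, tag in zip(result["word"], result["tag"])
        if tag in ENTITY_TAGS
    ]


# Hand-written sample in the assumed LAC result shape (not real model output).
sample = {
    "word": ["李明", "明天", "去", "北京"],
    "tag": ["PER", "TIME", "v", "LOC"],
}

print(pick_entities(sample))
```

Tokens tagged with a part-of-speech label (here the verb "去") are dropped, which is the same effect the "dimensions" setting has in the component.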

Given this, the model can be easily integrated into Rasa and used as a general entity extraction component. Using spacy_entity_extractor as a reference, I wrote chinese_normal_entity_extractor. The full code is as follows:

import logging
import time
from typing import Any, Dict, List, Optional, Text

from rasa.nlu.extractors import EntityExtractor
from rasa.nlu.training_data import Message

logger = logging.getLogger(__name__)


class ChineseNormalEntityExtractor(EntityExtractor):
    provides = ["entities"]

    defaults = {
        # by default PER LOC TIME ORG dimensions recognized by LAC are returned
        # dimensions can be configured to contain an array of strings
        # with the names of the dimensions to filter for
        "dimensions": ["PER", "LOC", "TIME", "ORG"],
        # optional mapping from a LAC tag to the entity name to report
        "rename": {},
    }

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        # imported here so paddlehub is only needed when this component is used
        import paddlehub as hub

        self.module = hub.Module(name="lac")

    def process(self, message: Message, **kwargs: Any) -> None:
        start = time.time()
        all_extracted = self.add_extractor_name(self.extract_entities(message.text))
        dimensions = self.component_config["dimensions"]
        extracted = ChineseNormalEntityExtractor.filter_irrelevant_entities(
            all_extracted, dimensions
        )
        message.set(
            "entities", message.get("entities", []) + extracted, add_to_output=True
        )
        logger.debug("LAC cost time:{}".format(time.time() - start))

    def extract_entities(self, text: Text) -> List[Dict[Text, Any]]:
        entities = []
        results = self.module.lexical_analysis(data={"text": [text]})
        if not results:
            return entities
        result = results[0]
        # running character offset of the current word within the text
        position = 0
        for word, tag in zip(result["word"], result["tag"]):
            entity = {
                "entity": self.component_config.get("rename", {}).get(tag, tag),
                "value": word,
                "start": position,
                "confidence": None,
                "end": position + len(word),
            }
            position += len(word)
            entities.append(entity)

        return entities
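The offset bookkeeping in `extract_entities` can be checked without PaddleHub or Rasa installed. The sketch below re-implements the same loop as a plain function and feeds it from a stub. `FakeLac`, `words_to_entities`, `keep_dimensions`, and the sample data are hypothetical stand-ins; the stub only imitates the `lexical_analysis` return shape that the component assumes, and `keep_dimensions` is a simplified stand-in for the dimension filter.

```python
from typing import Any, Dict, List, Optional


class FakeLac:
    """Hypothetical stub imitating the result shape the component expects."""

    def lexical_analysis(self, data: Dict[str, List[str]]) -> List[Dict[str, List[str]]]:
        # Canned segmentation for one sentence; real output comes from the model.
        return [{"word": ["李明", "明天", "去", "北京"], "tag": ["PER", "TIME", "v", "LOC"]}]


def words_to_entities(
    words: List[str], tags: List[str], rename: Optional[Dict[str, str]] = None
) -> List[Dict[str, Any]]:
    """Turn parallel word/tag lists into entity dicts with character offsets."""
    rename = rename or {}
    entities = []
    position = 0
    for word, tag in zip(words, tags):
        entities.append(
            {
                "entity": rename.get(tag, tag),
                "value": word,
                "start": position,
                "end": position + len(word),
            }
        )
        # Assumes the words concatenate back to the original text.
        position += len(word)
    return entities


def keep_dimensions(entities: List[Dict[str, Any]], dimensions: List[str]) -> List[Dict[str, Any]]:
    """Keep only entities whose (possibly renamed) label is requested."""
    return [e for e in entities if e["entity"] in dimensions]


text = "李明明天去北京"
result = FakeLac().lexical_analysis(data={"text": [text]})[0]
entities = words_to_entities(result["word"], result["tag"], rename={"PER": "person"})

# Every span should slice the original text back to the entity value.
for e in entities:
    assert text[e["start"]:e["end"]] == e["value"]

# The filter sees the renamed label, so listing only "PER" would drop the person entity.
print(keep_dimensions(entities, ["person", "LOC"]))
```

Two points are worth noting. The offsets are only valid if the segmented words concatenate back to the input text exactly. And because the rename happens before the filter in this sketch, as it appears to in the component, a renamed label has to be listed under "dimensions" by its new name.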

Definition of Done:
After the component was written, I integrated it into a project for testing. In an i5-8250U CPU @ 1.60GHz, 8-core, 16 GB memory environment, a single entity extraction call takes 3 to 5 ms.
The model is stored locally after the first download, normally in /home/xx/.paddlehub/modules/lac
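A latency figure like the one above can be reproduced with a small timing loop. This is a generic sketch, not the measurement method used for the numbers in this issue; `mean_latency_ms` is a hypothetical helper, and the lambda is a placeholder for a call such as `extractor.extract_entities(text)`.

```python
import time
from typing import Callable


def mean_latency_ms(fn: Callable[[], object], runs: int = 100, warmup: int = 5) -> float:
    """Return the mean wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded so one-off setup cost is not counted
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000 / runs


# Placeholder workload; replace with the real extraction call.
print(f"{mean_latency_ms(lambda: sum(range(1000))):.3f} ms per call")
```

Excluding warm-up calls matters here because the first call may include model loading, which would inflate the per-call average.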

If possible, may I open a PR to submit the above code?


sara-tagger · 2020-03-11T07:00:05Z

Thanks for submitting this feature request 🚀 @JustinaPetr will get back to you about it soon! ✨

koaning · 2021-02-02T13:17:25Z

This might be something that we can easily support in rasa-nlu-examples but in the meantime, it deserves mentioning that I'm working on support for spaCy 3.0 which might also address this issue. You can confirm the docs from spaCy here.

Note that the PR for spaCy 3.0 can be found here.

BobCN2017 added the type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR label Mar 10, 2020

alwx added area:rasa-oss/ml 👁 All issues related to machine learning area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. labels Jan 28, 2021

koaning mentioned this issue Feb 2, 2021

spaCy 3.0 #7869

Merged

koaning closed this as completed in #7869 Mar 23, 2021
