suggestion for Chinese general entity extractor. #5400

Closed
BobCN2017 opened this issue Mar 10, 2020 · 2 comments · Fixed by #7869
Labels
area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components area:rasa-oss/ml 👁 All issues related to machine learning type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR

Comments

@BobCN2017

Description of Problem:
Rasa currently has no general-purpose Chinese entity extraction component. I recently found a good solution.

Overview of the Solution:
Lexical Analysis of Chinese (LAC) is a tool from Baidu. It provides a pre-trained model that can be called easily through paddlehub. The model extracts four main entity types (PER: person name, LOC: place name, TIME: time, ORG: organization name) and also assigns 24 part-of-speech tags. For details, see https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis#%E4%BB%BB%E5%8A%A1%E5%AE%9A%E4%B9%89%E4%B8%8E%E5%BB%BA%E6%A8%A1

LAC's precision and recall for entity extraction are both close to 90%, which I think is good enough. It could play a role for Chinese similar to that of DucklingHTTPExtractor.

Given the above, this model can be easily integrated into Rasa and used as a general entity extraction component. Using spacy_entity_extractor as a reference, I wrote chinese_normal_entity-extractor; the full code is as follows:

import logging
import time
from typing import Any, Dict, List, Optional, Text

from rasa.nlu.extractors import EntityExtractor
from rasa.nlu.training_data import Message

logger = logging.getLogger(__name__)


class ChineseNormalEntityExtractor(EntityExtractor):
    provides = ["entities"]

    defaults = {
        # by default PER LOC TIME ORG dimensions recognized by LAC are returned
        # dimensions can be configured to contain an array of strings
        # with the names of the dimensions to filter for
        "dimensions": ["PER", "LOC", "TIME", "ORG"],
        "rename": {},
    }

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super(ChineseNormalEntityExtractor, self).__init__(component_config)
        # imported here so paddlehub is only needed when this component is used
        import paddlehub as hub
        self.module = hub.Module(name='lac')

    def process(self, message: Message, **kwargs: Any) -> None:
        start = time.time()
        all_extracted = self.add_extractor_name(self.extract_entities(message.text))
        # note: entities are already renamed here, so the filter below
        # compares "dimensions" against the renamed entity names
        dimensions = self.component_config["dimensions"]
        extracted = ChineseNormalEntityExtractor.filter_irrelevant_entities(
            all_extracted, dimensions
        )
        message.set(
            "entities", message.get("entities", []) + extracted, add_to_output=True
        )
        logger.debug("LAC cost time:{}".format(time.time() - start))

    def extract_entities(self, text: Text) -> List[Dict[Text, Any]]:
        entities = []
        results = self.module.lexical_analysis(data={"text": [text]})
        if not results:
            return entities
        result = results[0]
        # running character offset of the current word within the text
        position = 0
        for word, tag in zip(result["word"], result["tag"]):
            entity = {
                "entity": self.component_config.get("rename", {}).get(tag, tag),
                "value": word,
                "start": position,
                "confidence": None,
                "end": position + len(word),
            }
            position += len(word)
            entities.append(entity)

        return entities
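
The offset bookkeeping in extract_entities can be tried without paddlehub or Rasa installed. The following is a minimal sketch, not the component itself: tokens_to_entities is a hypothetical helper name, and the sample words and tags are hand-written for illustration rather than real LAC output. It assumes, as the code above does, that the segmented words concatenate back to the original text. Unlike the component above, which filters after renaming, this sketch filters on the original tags and renames afterwards.

```python
from typing import Any, Dict, List, Optional, Sequence


def tokens_to_entities(
    words: Sequence[str],
    tags: Sequence[str],
    dimensions: Optional[Sequence[str]] = None,
    rename: Optional[Dict[str, str]] = None,
) -> List[Dict[str, Any]]:
    """Turn parallel word/tag lists into entity dicts with character offsets."""
    wanted = set(dimensions or [])
    rename = rename or {}
    entities: List[Dict[str, Any]] = []
    position = 0
    for word, tag in zip(words, tags):
        start, end = position, position + len(word)
        position = end  # advance the offset even for tokens that get filtered out
        if wanted and tag not in wanted:
            continue
        entities.append(
            {
                "entity": rename.get(tag, tag),  # rename only after filtering
                "value": word,
                "start": start,
                "end": end,
                "confidence": None,
            }
        )
    return entities


# Hand-written segmentation of "百度在北京" (illustrative, not real model output).
sample = tokens_to_entities(
    ["百度", "在", "北京"],
    ["ORG", "p", "LOC"],
    dimensions=["ORG", "LOC"],
    rename={"LOC": "location"},
)
for entity in sample:
    print(entity)
```

Filtering on the original tags keeps the "dimensions" setting independent of whatever names appear in "rename"; whether that ordering is preferable to the one in the component is a design choice.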

Definition of Done:
After the component was written, it was integrated into the project for testing. On an i5-8250U CPU @ 1.60GHz (8-core, 16 GB memory), a single entity extraction call takes 3 to 5 ms.
The model is stored locally after the first download, normally in /home/xx/.paddlehub/modules/lac
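
A per-call latency figure like the one above could be reproduced with a small timing harness. This is only a sketch: mean_latency_ms and fake_extract are hypothetical names, and fake_extract is a stand-in that would need to be replaced with the real extraction call to get meaningful numbers.

```python
import time
from typing import Callable


def mean_latency_ms(fn: Callable[[str], object], text: str, runs: int = 100) -> float:
    """Return the average wall-clock time of fn(text) in milliseconds."""
    fn(text)  # warm-up call so one-time setup cost is not counted
    start = time.perf_counter()
    for _ in range(runs):
        fn(text)
    return (time.perf_counter() - start) * 1000.0 / runs


def fake_extract(text: str) -> list:
    # Stand-in for the extractor; swap in the real extraction callable.
    return list(text)


latency = mean_latency_ms(fake_extract, "百度在北京", runs=10)
print("mean latency: {:.4f} ms".format(latency))
```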

If possible, may I open a PR to submit the above code?

@BobCN2017 BobCN2017 added the type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR label Mar 10, 2020
@sara-tagger
Collaborator

Thanks for submitting this feature request 🚀 @JustinaPetr will get back to you about it soon! ✨

@alwx alwx added area:rasa-oss/ml 👁 All issues related to machine learning area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. labels Jan 28, 2021
@koaning
Contributor

koaning commented Feb 2, 2021

This might be something that we can easily support in rasa-nlu-examples, but in the meantime it is worth mentioning that I'm working on support for spaCy 3.0, which might also address this issue. You can check the spaCy docs here.

Note that the PR for spaCy 3.0 can be found here.

@koaning koaning mentioned this issue Feb 2, 2021