Improve handling of multiple entity extractors in config #7685

samsucik · 2021-01-06T13:52:07Z

Description of Problem:
Currently, having multiple entity extractors in the NLU pipeline in the config file can lead to surprising behaviour: an entity being extracted multiple times, e.g. user message I'll travel to Edinburgh can appear in interactive learning as I'll travel to [Edinburgh](city)[Edinburgh](city) (see also #7533 for an example).

Overview of the Solution:
Using multiple extractors can lead to this kind of surprise, but it doesn't have to. For instance, if DucklingHTTPExtractor is used to extract time and date entities, and CRFEntityExtractor is trained on annotated entities city and cuisine, then these extractors should never extract the same thing.

Therefore, we should allow multiple extractors, but we should also warn the user appropriately, in particular when there are multiple extractors being trained on user data (because then these extractors can "clash" at prediction time).
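A rough sketch of what such a config check could look like, assuming the pipeline is available as a list of dicts with a `name` key. The function names and the `TRAINABLE_EXTRACTORS` set are hypothetical and only list the two trainable extractors mentioned in this thread; this is not Rasa's actual validation code.

```python
import warnings
from typing import Any, Dict, List

# Assumption: extractors that learn entities from annotated user data.
# Only the two named in this issue are listed; the real set may differ.
TRAINABLE_EXTRACTORS = {"DIETClassifier", "CRFEntityExtractor"}


def find_trainable_extractors(pipeline: List[Dict[str, Any]]) -> List[str]:
    """Return the names of trainable extractors in pipeline order."""
    return [
        component["name"]
        for component in pipeline
        if component.get("name") in TRAINABLE_EXTRACTORS
    ]


def warn_on_competing_extractors(pipeline: List[Dict[str, Any]]) -> List[str]:
    """Warn if more than one extractor is trained on the same user data."""
    found = find_trainable_extractors(pipeline)
    if len(found) > 1:
        warnings.warn(
            "Multiple entity extractors are trained on the training data: "
            f"{', '.join(found)}. The same entity may be extracted more than once.",
            UserWarning,
        )
    return found
```

With this check, a pipeline that combines DucklingHTTPExtractor and CRFEntityExtractor stays silent, while adding DIETClassifier on top triggers the warning.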

Importantly, we should check that each entity is displayed correctly in interactive learning (and exported into data files) when it's extracted by multiple extractors -- i.e. in the above example, we want to show only I'll travel to [Edinburgh](city).
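To illustrate the desired display behaviour, here is a minimal sketch that drops exact duplicates before rendering the annotated message. It assumes each entity is a dict with `start`, `end` (exclusive) and `entity` keys; `deduplicate_entities` and `annotate` are hypothetical helpers, not the functions interactive learning actually uses.

```python
from typing import Any, Dict, List


def deduplicate_entities(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first entity for each (start, end, entity type) triple."""
    seen = set()
    unique = []
    for entity in entities:
        key = (entity["start"], entity["end"], entity["entity"])
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique


def annotate(text: str, entities: List[Dict[str, Any]]) -> str:
    """Render text with [value](type) markers, once per distinct entity."""
    parts = []
    cursor = 0
    for entity in sorted(deduplicate_entities(entities), key=lambda e: e["start"]):
        if entity["start"] < cursor:
            # Overlapping (non-identical) entities are skipped in this sketch;
            # how to display them is an open question.
            continue
        parts.append(text[cursor:entity["start"]])
        span = text[entity["start"]:entity["end"]]
        parts.append(f"[{span}]({entity['entity']})")
        cursor = entity["end"]
    parts.append(text[cursor:])
    return "".join(parts)
```

Given the same city entity from two extractors, `annotate` would render the span only once.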

A peculiar pathological case which we might want to discuss elsewhere is when 2 extractors extract 2 different things from the same word, but I haven't verified this can actually happen in reality...

Examples (if relevant):
See #7533.

Definition of Done:

The config gets checked for multiple potentially clashing extractors and an appropriate warning is issued
Entities extracted multiple times are displayed correctly
Tests are added
Docs mention what happens when multiple extractors are used
Feature mentioned in the changelog


twerkmeister · 2021-03-17T13:24:37Z

related to #7490

twerkmeister · 2021-03-18T10:59:35Z

I made some comments on this problem in the previously linked issue.

The biggest issue is probably two entity extractors looking for the same type of entities, as you outlined. Here we can warn people if they use multiple extractors that rely only on the training data, such as DIETClassifier and CRFEntityExtractor together. We can also add a warning if someone uses regexes + RegExEntityExtractor for the same types that they use DIET or CRF for.

Thinking about it a bit more, however, even entities like date and meal could overlap as in I'd like to order the monday special where the meal here might be monday special and some date or time entity monday. Or you use duckling for numbers and also have another extractor for addresses. Probably really difficult to know with certainty which entities might clash. Here we can add a runtime warning whenever there are overlapping entities.

I am not sure, however, how to correctly display overlapping entities @samsucik. Obviously, if they are perfectly the same, no issue. But what if

one is a subphrase of the other - as in monday special with meals and time or 77 Boulevard Rd. if you look for numbers and street addresses separately?
The area of extraction is the same, but the entity types don't match
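The cases above could be told apart at runtime with something like the following sketch. It assumes entities are dicts with `start`, `end` (exclusive) and `entity` keys; the function names and the category labels are made up for illustration.

```python
import warnings
from itertools import combinations
from typing import Any, Dict, List, Optional, Tuple


def classify_overlap(a: Dict[str, Any], b: Dict[str, Any]) -> Optional[str]:
    """Return the kind of overlap between two entities, or None if disjoint."""
    if a["end"] <= b["start"] or b["end"] <= a["start"]:
        return None
    if (a["start"], a["end"]) == (b["start"], b["end"]):
        # Same area of extraction: harmless if the types match.
        return "duplicate" if a["entity"] == b["entity"] else "same_span_different_type"
    a_in_b = b["start"] <= a["start"] and a["end"] <= b["end"]
    b_in_a = a["start"] <= b["start"] and b["end"] <= a["end"]
    if a_in_b or b_in_a:
        return "nested"  # one is a subphrase of the other
    return "partial"


def warn_on_overlaps(entities: List[Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Warn once per overlapping pair; return (type_a, type_b, kind) triples."""
    found = []
    for a, b in combinations(entities, 2):
        kind = classify_overlap(a, b)
        if kind is not None:
            found.append((a["entity"], b["entity"], kind))
            warnings.warn(
                f"Entities '{a['entity']}' and '{b['entity']}' overlap ({kind}).",
                UserWarning,
            )
    return found
```

For the monday special example, a meal entity over monday special and a time entity over monday would be reported as nested.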

So for now I would focus on adding the warnings and improving the docs

Add warning when people use multiple extractors for the same entity types
Add a runtime warning when extracted entities overlap
Improve Docs

What do you think @samsucik ?

samsucik · 2021-03-18T14:10:11Z

@twerkmeister thanks! I totally agree that there isn't a magical solution to all the edge cases and we just have to take small steps to get to the ideal state 🙂

The proposed steps make sense to me. For the docs, I think we should make it very clear that double extraction can happen, but we could also say that users can directly influence this (at least for DIET and the CRF extractor) by including the troublesome examples in their training data and annotating them exactly as desired. Removing one extractor is also a good solution, though sometimes undesirable.

twerkmeister · 2021-03-19T09:59:27Z

Sounds good, I am on it!

enriquemaffezzini · 2023-09-22T23:06:48Z

Hello, I have seen in a few places that Rasa can use a specific extractor for specific entities (e.g. link), but the implementation is nowhere to be found. Do you have any idea how this could be done in version 3.0? forum thread

samsucik added type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Jan 6, 2021

alwx added area:rasa-oss/ml 👁 All issues related to machine learning area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components labels Jan 29, 2021

twerkmeister self-assigned this Mar 18, 2021

This was referenced Mar 19, 2021

Double entity extraction #7490

Closed

7685 warn of competing entity extractors #8289

Merged

m-vdb closed this as completed Apr 19, 2021
