Improve handling of multiple entity extractors in config #7685

Closed
5 tasks
samsucik opened this issue Jan 6, 2021 · 5 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components area:rasa-oss/ml 👁 All issues related to machine learning type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR

Comments

@samsucik
Contributor

samsucik commented Jan 6, 2021

Description of Problem:
Currently, having multiple entity extractors in the NLU pipeline in the config file can lead to surprising behaviour: an entity can be extracted multiple times, e.g. the user message `I'll travel to Edinburgh` can appear in interactive learning as `I'll travel to [Edinburgh](city)[Edinburgh](city)` (see also #7533 for an example).

Overview of the Solution:
Using multiple extractors can lead to this kind of a surprise, but it doesn't have to. For instance, if DucklingHTTPExtractor is used to extract time and date entities, and CRFEntityExtractor is trained on annotated entities city and cuisine, then these extractors should never extract the same thing.

Therefore, we should allow multiple extractors, but we should also warn the user appropriately, in particular when there are multiple extractors being trained on user data (because then these extractors can "clash" at prediction time).
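A minimal sketch of such a config check, assuming the pipeline is available as a list of component dicts with a `name` key. The set `TRAINABLE_EXTRACTORS` and both function names are hypothetical and not part of Rasa's API:

```python
import warnings

# Assumption: these are the extractors that learn from annotated user data.
TRAINABLE_EXTRACTORS = {"DIETClassifier", "CRFEntityExtractor"}


def find_trainable_extractors(pipeline):
    """Return the names of pipeline components trained on user data."""
    return [c["name"] for c in pipeline if c.get("name") in TRAINABLE_EXTRACTORS]


def warn_on_clashing_extractors(pipeline):
    """Warn if more than one trainable extractor is configured."""
    trainable = find_trainable_extractors(pipeline)
    if len(trainable) > 1:
        warnings.warn(
            "Multiple extractors are trained on the same data and may "
            f"extract the same entity more than once: {', '.join(trainable)}"
        )
    return trainable


# Duckling handles time and date only, so this pipeline raises no warning.
pipeline = [{"name": "DucklingHTTPExtractor"}, {"name": "CRFEntityExtractor"}]
warn_on_clashing_extractors(pipeline)
```

Adding `{"name": "DIETClassifier"}` to the same pipeline would trigger the warning, because both DIET and CRF would then learn from the same annotations.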

Importantly, we should check that each entity is displayed correctly in interactive learning (and exported into data files) when it's extracted by multiple extractors -- i.e. in the above example, we want to show only I'll travel to [Edinburgh](city).
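One way the display side could work is sketched below. It assumes entities are dicts with `start`, `end`, `entity` and `extractor` keys, and that after removing exact duplicates no two entities overlap. `deduplicate_entities` and `annotate` are illustrative names, not existing Rasa functions:

```python
def deduplicate_entities(entities):
    """Keep the first entity for each (start, end, type) triple."""
    seen = set()
    unique = []
    for entity in entities:
        key = (entity["start"], entity["end"], entity["entity"])
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique


def annotate(text, entities):
    """Render entities in the [value](type) annotation format."""
    parts = []
    cursor = 0
    for entity in sorted(deduplicate_entities(entities), key=lambda e: e["start"]):
        parts.append(text[cursor:entity["start"]])
        value = text[entity["start"]:entity["end"]]
        parts.append(f"[{value}]({entity['entity']})")
        cursor = entity["end"]
    parts.append(text[cursor:])
    return "".join(parts)


text = "I'll travel to Edinburgh"
entities = [
    {"start": 15, "end": 24, "entity": "city", "extractor": "DIETClassifier"},
    {"start": 15, "end": 24, "entity": "city", "extractor": "CRFEntityExtractor"},
]
print(annotate(text, entities))  # I'll travel to [Edinburgh](city)
```

This only covers the case where both extractors agree on span and type. Partially overlapping or differently typed entities would need a separate decision.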

A peculiar pathological case which we might want to discuss elsewhere is when 2 extractors extract 2 different things from the same word, but I haven't verified this can actually happen in reality...

Examples (if relevant):
See #7533.

Definition of Done:

  • The config gets checked for multiple potentially clashing extractors and an appropriate warning is issued
  • Entities extracted multiple times are displayed correctly
  • Tests are added
  • Docs mention what happens when multiple extractors are used
  • Feature mentioned in the changelog
@samsucik samsucik added type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Jan 6, 2021
@alwx alwx added area:rasa-oss/ml 👁 All issues related to machine learning area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components labels Jan 29, 2021
@twerkmeister
Contributor

related to #7490

@twerkmeister twerkmeister self-assigned this Mar 18, 2021
@twerkmeister
Contributor

twerkmeister commented Mar 18, 2021

I made some comments on this problem in the previously linked issue.

The biggest issue is probably two entity extractors looking for the same type of entities, as you outlined. Here we can warn people if they use multiple extractors that rely only on the training data, like using DIETClassifier and CRFEntityExtractor together. We can also add a warning if someone uses regexes + RegexEntityExtractor for the same entity types that they use DIET or CRF for.
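As a rough sketch of the second warning, suppose we already know which entity types each extractor is responsible for (how to derive that mapping from the training data and regex definitions is left open here). The function name and the input format are assumptions for illustration:

```python
from collections import defaultdict


def shared_entity_types(types_by_extractor):
    """Map each entity type to its extractors, keeping only types with 2+."""
    extractors_by_type = defaultdict(set)
    for extractor, entity_types in types_by_extractor.items():
        for entity_type in entity_types:
            extractors_by_type[entity_type].add(extractor)
    return {
        entity_type: sorted(extractors)
        for entity_type, extractors in extractors_by_type.items()
        if len(extractors) > 1
    }


clashes = shared_entity_types(
    {
        "DIETClassifier": {"city", "cuisine"},
        "RegexEntityExtractor": {"city"},
        "DucklingHTTPExtractor": {"time"},
    }
)
print(clashes)  # {'city': ['DIETClassifier', 'RegexEntityExtractor']}
```

Any non-empty result would be the trigger for a warning at config validation or training time.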

Thinking about it a bit more, however, even entities like date and meal could overlap as in I'd like to order the monday special where the meal here might be monday special and some date or time entity monday. Or you use duckling for numbers and also have another extractor for addresses. Probably really difficult to know with certainty which entities might clash. Here we can add a runtime warning whenever there are overlapping entities.

I am not sure, however, how to correctly display overlapping entities @samsucik. Obviously, if they are perfectly the same, no issue. But what if

  • one is a subphrase of the other - as in monday special with meals and time or 77 Boulevard Rd. if you look for numbers and street addresses separately?
  • The area of extraction is the same, but the entity types don't match

So for now I would focus on adding the warnings and improving the docs

  • Add warning when people use multiple extractors for the same entity types
  • Add a runtime warning when extracted entities overlap
  • Improve Docs
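The runtime warning could look roughly like the sketch below. It assumes entities are dicts with character offsets in `start` and `end` (end exclusive) plus an `entity` type. The overlap categories and function names are made up for illustration:

```python
import warnings
from itertools import combinations


def overlap_kind(a, b):
    """Classify how two entity spans overlap, or return None if they don't."""
    if a["end"] <= b["start"] or b["end"] <= a["start"]:
        return None
    if (a["start"], a["end"]) == (b["start"], b["end"]):
        return "duplicate" if a["entity"] == b["entity"] else "type_mismatch"
    a_in_b = b["start"] <= a["start"] and a["end"] <= b["end"]
    b_in_a = a["start"] <= b["start"] and b["end"] <= a["end"]
    return "nested" if a_in_b or b_in_a else "partial"


def warn_on_overlaps(entities):
    """Warn once per overlapping pair and return the findings."""
    found = []
    for a, b in combinations(entities, 2):
        kind = overlap_kind(a, b)
        if kind is not None:
            warnings.warn(
                f"Overlapping entities ({kind}): "
                f"'{a['entity']}' [{a['start']}:{a['end']}] and "
                f"'{b['entity']}' [{b['start']}:{b['end']}]"
            )
            found.append((kind, a["entity"], b["entity"]))
    return found


# "I'd like to order the monday special"
entities = [
    {"start": 22, "end": 36, "entity": "meal"},  # "monday special"
    {"start": 22, "end": 28, "entity": "time"},  # "monday"
]
warn_on_overlaps(entities)
```

Returning the overlap kind keeps the question of how to display each case separate from detecting it.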

What do you think @samsucik ?

@samsucik
Contributor Author

samsucik commented Mar 18, 2021

@twerkmeister thanks! I totally agree that there isn't a magical solution to all the edge cases and we just have to take small steps to get to the ideal state 🙂

The proposed steps make sense to me. For the docs, I think we should make it very clear that double extraction can happen, but we could also say that users can directly influence this (at least for DIET and the CRF extractor) by including the troublesome examples in their training data and annotating them exactly as desired. Removing one extractor is also a good solution, though sometimes undesirable.

@twerkmeister
Contributor

Sounds good, I am on it!

@enriquemaffezzini

enriquemaffezzini commented Sep 22, 2023

Hello, I have seen in a few places (e.g. link) that Rasa can use a specific extractor for specific entities, but I can't find the implementation anywhere. Do you have any idea how this could be done in version 3.0? forum thread
