Improve handling of multiple entity extractors in config #7685
Comments
Related to #7490
I made some comments on this problem in the previously linked issue. The biggest issue is probably two entity extractors looking for the same type of entities as you outlined. Here we can warn people if they use multiple extractors that just relate to the training data, like you using Thinking about it a bit more, however, even entities like I am not sure, however, how to correctly display overlapping entities @samsucik. Obviously, if they are perfectly the same, no issue. But what if
So for now I would focus on adding the warnings and improving the docs.
What do you think, @samsucik?
@twerkmeister thanks! I totally agree that there isn't a magical solution to all the edge cases and we just have to take small steps to get to the ideal state 🙂 The proposed steps make sense to me. For the docs, I think we should make it very clear that the double extraction can happen, but we could also say that users can directly influence this (at least for DIET and CRF Extractor) by including the troublesome examples in their training data and annotating them exactly as desired. Removing one extractor is also a good solution, though sometimes undesirable.
Sounds good, I am on it!
Hello, I have seen in a few places (e.g. link) that Rasa can use a specific extractor for specific entities, but the implementation is nowhere to be found. Do you have any idea how this could be done in version 3.0? forum thread
Description of Problem:
Currently, having multiple entity extractors in the NLU pipeline in the config file can lead to surprising behaviour: an entity can be extracted multiple times. For example, the user message
I'll travel to Edinburgh
can appear in interactive learning as
I'll travel to [Edinburgh](city)[Edinburgh](city)
(see also #7533 for an example).
Overview of the Solution:
Using multiple extractors can lead to this kind of surprise, but it doesn't have to. For instance, if DucklingHTTPExtractor is used to extract time and date entities, and CRFEntityExtractor is trained on annotated entities city and cuisine, then these extractors should never extract the same thing.
Therefore, we should allow multiple extractors, but we should also warn the user appropriately, in particular when multiple extractors are being trained on user data (because then these extractors can "clash" at prediction time).
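To make the proposed warning concrete, here is a minimal sketch of such a check. It is not Rasa's actual implementation: the function names and the TRAINABLE_EXTRACTORS set are hypothetical, and the pipeline is assumed to be a list of component dicts with a "name" key, as it would look after loading the config file.

```python
import warnings
from typing import Dict, List

# Assumption: the extractors that learn entities from annotated user data.
# This set is illustrative only and is not taken from Rasa's code base.
TRAINABLE_EXTRACTORS = {"CRFEntityExtractor", "DIETClassifier"}


def trainable_extractors(pipeline: List[Dict]) -> List[str]:
    """Return the names of pipeline components that are trained on user data."""
    return [
        component["name"]
        for component in pipeline
        if component.get("name") in TRAINABLE_EXTRACTORS
    ]


def warn_on_clashing_extractors(pipeline: List[Dict]) -> List[str]:
    """Warn if more than one trainable extractor is configured."""
    found = trainable_extractors(pipeline)
    if len(found) > 1:
        warnings.warn(
            "Multiple entity extractors are trained on the same data "
            f"({', '.join(found)}); the same entity may be extracted "
            "more than once at prediction time."
        )
    return found
```

With DucklingHTTPExtractor plus CRFEntityExtractor this stays silent, because only one of them is trained on user data; adding DIETClassifier to the same pipeline would trigger the warning.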
Importantly, we should check that each entity is displayed correctly in interactive learning (and exported into data files) when it's extracted by multiple extractors. In the above example, we want to show only
I'll travel to [Edinburgh](city)
A peculiar pathological case which we might want to discuss elsewhere is when 2 extractors extract 2 different things from the same word, but I haven't verified this can actually happen in reality...
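One possible way to get the desired display is to merge entities that agree on span and label before rendering them. The sketch below is an illustration only, not the code used in interactive learning: the function names are hypothetical, and entities are assumed to be dicts with "start", "end", "entity" and "extractor" keys. The pathological case (same word, different labels) is deliberately not resolved; the sketch raises an error instead of guessing.

```python
from typing import Dict, List, Tuple


def merge_duplicate_entities(entities: List[Dict]) -> List[Dict]:
    """Collapse entities with identical span and label into one entry."""
    merged: Dict[Tuple[int, int, str], Dict] = {}
    for ent in entities:
        key = (ent["start"], ent["end"], ent["entity"])
        if key in merged:
            # Same span and label: just remember the additional extractor.
            merged[key]["extractors"].append(ent["extractor"])
        else:
            merged[key] = {
                "start": ent["start"],
                "end": ent["end"],
                "entity": ent["entity"],
                "extractors": [ent["extractor"]],
            }
    return list(merged.values())


def annotate(text: str, entities: List[Dict]) -> str:
    """Render text with [value](label) markup, one annotation per entity."""
    merged = sorted(merge_duplicate_entities(entities), key=lambda e: e["start"])
    for prev, cur in zip(merged, merged[1:]):
        if cur["start"] < prev["end"]:
            # Overlapping spans or conflicting labels are left unresolved here.
            raise ValueError("overlapping entities cannot be displayed unambiguously")
    # Insert markup from the end so earlier offsets stay valid.
    for ent in reversed(merged):
        span = text[ent["start"]:ent["end"]]
        text = text[:ent["start"]] + f"[{span}]({ent['entity']})" + text[ent["end"]:]
    return text
```

For the Edinburgh example, two identical city entities from different extractors collapse into a single annotation, while two different labels on the same word surface as an explicit error that a caller would have to handle.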
Examples (if relevant):
See #7533.
Definition of Done: