Entity Recognition on sub-words #5509
Labels
area:rasa-oss 🎡
Anything related to the open source Rasa framework
type:enhancement ✨
Additions of new features or changes to existing ones, should be doable in a single PR
Description of Problem:
Related to #5475
We found other edge cases that can happen if we are using a tokenizer that splits up words into sub-words. Let's take a look at an example:
Sentence:
Buenos Aires is a city
Tokens:
Buen, os, Ai, res, is, a, city
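To make the example concrete, here is a minimal sketch of how the sub-word tokens above could be mapped back to character offsets and to the word they belong to. The function name `align_sub_words` and the tuple layout are hypothetical and not part of Rasa; the sketch assumes words are separated by whitespace and that the tokens appear in the text in order.

```python
from typing import List, Tuple


def align_sub_words(text: str, tokens: List[str]) -> List[Tuple[str, int, int, int]]:
    """Return (token, start, end, word_index) for each sub-word token.

    Hypothetical helper: assumes tokens occur in order in `text` and that
    words are delimited by whitespace.
    """
    aligned = []
    cursor = 0
    word_index = -1
    for token in tokens:
        # Find the token at or after the end of the previous token.
        start = text.index(token, cursor)
        # A token at the start of the text or preceded by whitespace opens a new word.
        if start == 0 or text[start - 1].isspace():
            word_index += 1
        end = start + len(token)
        aligned.append((token, start, end, word_index))
        cursor = end
    return aligned


tokens = ["Buen", "os", "Ai", "res", "is", "a", "city"]
for row in align_sub_words("Buenos Aires is a city", tokens):
    print(row)
```

With this alignment, "Buen" and "os" share one word index and "Ai" and "res" share the next, which is the information needed to tell whether an entity ends in the middle of a word.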
Scenario 1:
One entity covers multiple words or a single word.
city entity -> Buen, os, Ai, res
type entity -> city
Scenario 2:
An entity covers just a part of a word.
city entity -> Buen
Scenario 3:
An entity covers two words, but at least one of the words only partly.
city entity -> os, Ai
Scenario 4:
The sub-words of one word are annotated with different entities.
city entity -> Ai
state entity -> res
Scenarios 1 and 4 are already handled. We need to take care of Scenarios 2 and 3.
Overview of the Solution:
We should keep the annotated labels where possible and extend each entity so that it covers complete words instead of just parts of words.
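The proposed behavior could be sketched roughly as follows. This is not the Rasa implementation: `word_spans` and `extend_to_word_boundaries` are hypothetical helpers, words are assumed to be whitespace-delimited, and the entity dictionary is assumed to carry `start`, `end`, `value`, and `entity` keys with character offsets.

```python
import re
from typing import Any, Dict, List, Tuple


def word_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-delimited words."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def extend_to_word_boundaries(text: str, entity: Dict[str, Any]) -> Dict[str, Any]:
    """Widen an entity so it covers every word it touches; the label is kept."""
    orig_start, orig_end = entity["start"], entity["end"]
    start, end = orig_start, orig_end
    for w_start, w_end in word_spans(text):
        # A word is affected if its character range intersects the original span.
        if w_start < orig_end and orig_start < w_end:
            start = min(start, w_start)
            end = max(end, w_end)
    return {**entity, "start": start, "end": end, "value": text[start:end]}


text = "Buenos Aires is a city"
# Scenario 2: the entity covers only "Buen".
print(extend_to_word_boundaries(text, {"start": 0, "end": 4, "value": "Buen", "entity": "city"}))
# Scenario 3: the entity covers "os Ai".
print(extend_to_word_boundaries(text, {"start": 4, "end": 9, "value": "os Ai", "entity": "city"}))
```

In this sketch, Scenario 2 is widened to "Buenos" and Scenario 3 to "Buenos Aires", both keeping the city label. Scenario 4, where sub-words of one word carry different labels, is not addressed here, since the issue states it is already handled.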