Allow multiple entities for one token #10394
and removed debug output
…raoulvm/rasa into 2.8.14-test_multi_entities_only
to pass `poetry run flake8 rasa tests --extend-ignore D`
"D415 First line should end with a period, question mark, or exclamation point" With best regards to @erohmensing ...
@ancalita EDIT: this seems to have had to do with my line-length setting. After changing that globally I could run
Hi @samsucik, I was told by @hsm207 that you are the dev on duty today. I do not understand what is happening here. Thanks. EDIT: I was on the wrong line. It gets explicitly passed. The question now changes to: Why should None Extractors not support overlapping entities?
to work with overlapping entities.
@raoulvm I checked your changes now. I'm glad you found a workaround for the failing test. And I'm excited about the proposed improvement, thanks a lot for that! As for going ahead with this, I have to think about it a bit more and discuss with others because it'd likely have a broader impact. Some points that come to my mind are:
Hi @ancalita, you should consider cleaning up the "old stuff" anyhow. If you run the lint against the whole repo (no diff) you get 3000+ doc-string complaints:
(rasa2source) #:~/PyProjects/rasa$ poetry run flake8 rasa tests --select D | wc -l
3720
Hi @samsucik
I just ran a test with the mood bot including new multi-annotated entities in the NLU training data, with DIET as the entity extractor. DIET trains on the last item in the list only.
Good question. It should.
Actually that line is already there. There is intent training data containing entities, and there are Regex Expressions and Lookups, which are separated. If you include other ontological tools like mine it is separated even more, as those data do not fit into the nlu files at all. For the other details, you are perfectly right: I do not speak any non-whitespace language, and I am no fan of regex for literal lookups (regards to @koaning for his splendid flashtext extractor idea which I used as a basis for the hierarchies) so I wasn't very deep into those and did not make test cases for them beyond what is already in the tests.
Thanks for digging into the questions so quickly. As you can see, there are some rough edges to overcome if this were to become part of 2.8 (and later 3.x) and, more importantly, if these changes were to be maintained and further built upon in future iterations. Regarding the line between DIET and RegexExtractor: Indeed, that line already exists (sorry for not being 100% clear in my previous reply). But, in my opinion, we should try to not make these arbitrary lines even worse. All in all, I think it's important that you're exploring how far you can go with 2.8. The other part is then taking a step back and seeing what's the high-level problem that you're trying to solve, as we discussed previously 🙂
As original didn't change the changes from RasaHQ#10394 are applied without change
Test with 3.0: see #10404
Thank you @raoulvm for creating this PR (and the discussion on Friday)! We should use this also to resolve this issue. And we also have to check that this doesn't break Rasa X (especially if we'd allow multiple entities to be annotated in training stories). Since we're overhauling the entity annotation here anyway, it'd be good if the annotation could be extractor-specific. So one could say "this entity is an example for RegexEntityExtractor, but not for DIETClassifier". Right now, because the domain object is not available at the right place in the code, one has to annotate at least one example of an entity that is extracted by RegexEntityExtractor and by doing so DIETClassifier will try to extract it as well. We don't necessarily have to do all these changes together, but we have to decide what the right format would be in the end, so we don't jump back-and-forth too often.
though I do not have changed that code block
…raoulvm/rasa into 2.8.14-test_multi_entities_only
Has been merged as #10773 by @JEM-Mosig
Proposed changes:
Allow multiple entities to be annotated for the same word/tokens.
When using entity extractors that support generating multiple entities from a single expression, the test stories fail, as there is no way to annotate multiple entity_types and entity_values.
New annotation option is
As the extractors, as pipeline components, have access neither to the domain during training nor to the dialog tracker during inference, they can't judge which characteristic (or attribute) of the word "Rasa" is important for answering a question. If the entity extractor could access the domain knowledge, it would probably be able to filter the possible entity types to those defined with `use_entities` in the domain (e.g. employer). If it could access the tracker, and the preceding question was a closed question, the possible answers could also limit the applicable entity types.
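To make the filtering idea above concrete, here is a minimal sketch in plain Python. It is not part of this PR and does not use any Rasa API: the function name `filter_by_allowed_types`, the `allowed_types` set (standing in for whatever `use_entities` would list) and the second entity type `product` are hypothetical. The entity dicts are assumed to carry the usual `start`, `end`, `value` and `entity` keys.

```python
from typing import Any, Dict, List, Set

Entity = Dict[str, Any]


def filter_by_allowed_types(entities: List[Entity], allowed_types: Set[str]) -> List[Entity]:
    """Keep only the entities whose type is in the allowed set.

    `allowed_types` is a hypothetical stand-in for the entity types a
    domain would list under `use_entities`.
    """
    return [e for e in entities if e.get("entity") in allowed_types]


# Two candidate entities for the same word "Rasa" (same start/end positions).
# The type "product" is made up for illustration.
candidates = [
    {"start": 10, "end": 14, "value": "Rasa", "entity": "employer"},
    {"start": 10, "end": 14, "value": "Rasa", "entity": "product"},
]

kept = filter_by_allowed_types(candidates, {"employer"})
```

With the domain restricted to `employer`, only the first candidate would survive; without that knowledge the extractor has to emit both, which is why the annotation format needs to allow several entities per token.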
In this PR I have changed the training data parser to be able to cope with the format shown above, and also changed the readerwriter.py to be able to generate such training data messages if the entities found have exactly the same start and end positions.
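The "exactly the same start and end positions" condition can be illustrated with a small sketch. This is not the code from readerwriter.py, only an assumption of how such a check could look: the helper names `group_by_span` and `multi_entity_spans` are hypothetical, the entity type `product` is made up, and the entity dicts are assumed to have `start`, `end`, `value` and `entity` keys.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Entity = Dict[str, Any]
Span = Tuple[int, int]


def group_by_span(entities: List[Entity]) -> Dict[Span, List[Entity]]:
    """Group entities by their exact (start, end) character span."""
    groups: Dict[Span, List[Entity]] = defaultdict(list)
    for entity in entities:
        groups[(entity["start"], entity["end"])].append(entity)
    return dict(groups)


def multi_entity_spans(entities: List[Entity]) -> Dict[Span, List[Entity]]:
    """Return only the spans that carry more than one entity."""
    return {
        span: ents for span, ents in group_by_span(entities).items() if len(ents) > 1
    }


# "I work at Rasa in Berlin": "Rasa" is annotated twice, "Berlin" once.
found = [
    {"start": 10, "end": 14, "value": "Rasa", "entity": "employer"},
    {"start": 10, "end": 14, "value": "Rasa", "entity": "product"},
    {"start": 18, "end": 24, "value": "Berlin", "entity": "city"},
]

shared = multi_entity_spans(found)
```

Only spans that appear in `shared` would need the multi-entity annotation; all other entities can be written in the existing single-entity form. Partially overlapping spans are deliberately not grouped here, since the PR text only mentions identical positions.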
Status (please check what you already did):
black (please check Readme for instructions)