Allow multiple entities for one token #10394

raoulvm · 2021-11-25T16:10:45Z

Proposed changes:
Allow multiple entities to be annotated for the same word/tokens.

When using entity extractors that support generating multiple entities from a single expression, the test stories fail as there is no way to annotate multiple entity_types and entity_values

New annotation option is

stories:
  - story: Some story
    steps:
      - user: |
          I would like work for [Rasa][{"entity":"chatbot_software", "value":"rasa"},{"entity":"software_company", "value":"rasa"},{"entity":"employer", "value":"rasa"}]
        intent: apply_job

As the extractors as pipeline components neither have access to domain during training, nor to the dialog tracker during inference, they can't judge which characteristic (or attribute) of the word "Rasa" is important for answering a question. If the entity extractor could access the domain knowledge, it would probably be able to filter the possible entity types to those defined with use_entities in the domain (e.g. employer).

If it could access the tracker, and the preceding question was a closed question, the possible answers could also limit the applicable entity types.

In this PR I have changed the training data parser to be able to cope with the format shown above, and also changed the readerwriter.py to be able to generate such training data messages if the entities found have exact same start and end positions.

Status (please check what you already did):

added some tests for the functionality
updated the documentation
updated the changelog (please check changelog for instructions)
reformat files using black (please check Readme for instructions)

and removed debug output

…raoulvm/rasa into 2.8.14-test_multi_entities_only

to pass `poetry run flake8 rasa tests --extend-ignore D`

@erohmensing

"D415 First line should end with a period, question mark, or exclamation point" With best regards to @erohmensing ...

raoulvm · 2021-11-25T17:20:28Z

@ancalita ~~Do you know why I get linting errors in "old" code? And if I change that, some other old lines are complained about.~~
Thanks
Raoul

EDIT seems to had to do with my line-length setting. After changing that globally I could run poetry run flake8 rasa test --extend-ignore D without getting old code complained about.

raoulvm · 2021-11-26T08:13:54Z

Hi @samsucik I was told by @hsm207 that you are the dev on duty today.
Could you explain why extractors is None (and thus it returns True here) if it was [CRFEntityExtractor.name] provided here?

I do not understand what is happening here.

Thanks
Raoul

EDIT --------------------------

I was in the wrong line. It gets explicitly passed None in the failing test, which is this one:
https://github.com/RasaHQ/rasa/blob/2.8.x/tests/nlu/test_evaluation.py#L210

Question now changes to: Why should None Extractors not support overlapping entities?

to work with overlapping entities.

ancalita · 2021-11-26T09:16:25Z

After changing that globally I could run poetry run flake8 rasa test --extend-ignore D without getting old code complained about.

@raoulvm Great! In addition, you can also use make lint command in your Terminal for a more comprehensive check.

samsucik · 2021-11-26T09:34:17Z

@raoulvm I checked your changes now. I'm glad you found a workaround for the failing test. And 'm excited about the proposed improvement, thanks a lot for that!

As for going ahead with this, I have to think about it a bit more and discuss with others because it'd likely have a broader impact. Some points that come to my mind are:

When an entity can be "multi-annotated", the validity of this annotation is strongly coupled with the models used. I.e. it'd be invalid if we only have DIET in the pipeline (or a different extractor that assigns only a single entity label to a token), but would be valid if we've got a RegexExtractor.
Once an entity can be "multi-annotated" in test stories, a natural follow-up is "why not in training stories?". And that leads us to the thick line drawn between extractors like DIET vs ones like RegexExtractor -- namely, the latter one not requiring annotations in training data. Additionally, we'd be also drawing another line, this time between training and test story format.
The exact annotation format needs to work with all special cases, i.e. we'd need to triple-check the regexes (one immediate candidate edge case to check is when a language doesn't have words separated by spaces and you've got 2 entities coming one right after another). But this is ultimately just a technicality.
If we decide to go with the change, then it'll have to become part of 3.x sooner or later, which means we want to be certain about it.

raoulvm · 2021-11-26T09:35:49Z

HI @ancalita , you should consider cleaning "old stuff" anyhow. If you run the lint against the whole repo (no diff) you get 3000+ doc-string complaints.

(rasa2source) #:~/PyProjects/rasa$ poetry run flake8 rasa tests --select D | wc -l
3720

raoulvm · 2021-11-26T11:52:19Z

Hi @samsucik

it'd be invalid if we only have DIET in the pipeline.

I just ran a test with the mood bot including new multi-annotated entities in the NLU Training Data , with DIET as the entity extractor. DIET is training on the last item in the list only.

Once an entity can be "multi-annotated" in test stories, a natural follow-up is "why not in training stories?"

Good question. It should.

And that leads us to the thick line drawn between extractors like DIET vs ones like RegexExtractor -- namely, the latter one not requiring annotations in training data.

Actually that line is already there. There are intent training data containing entities, and there are Regex Expressions and Lookups, which are separated. If you include other ontological tools like mine it is separated even more, as those data do not fit into the nlu files at all.

For the other details, you are perfectly right: I do not speak any non-whitespace language, and I am no fan of regex for literal lookups (regards to @koaning for his splendid flashtext extractor idea which I used as a basis for the hierarchies) so I wasn't very deep into those and did not make test cases for them beyond what is already in the tests.

samsucik · 2021-11-26T12:09:42Z

Thanks for digging into the questions so quickly. As you can see, there are some rough edges to overcome if this was to become part of 2.8 (and later 3.x) and, more importantly, if these changes were to be maintained and further built upon in future iterations.

Regarding the line between DIET and RegexExtractor: Indeed, that line already exists (sorry for not being 100% clear in my previous reply). But, in my opinion, we should try to not make these arbitrary lines even worse.

All in all, I think it's important that you're exploring how far you can go with 2.8. The other part is then taking a step back and seeing what's the high-level problem that you're trying to solve, as we discussed previously 🙂

As original didn't change the changes from RasaHQ#10394 are applied without change

raoulvm · 2021-11-26T14:39:32Z

Test with 3.0 see here #10404

redoing canceled test at https://github.com/RasaHQ/rasa/runs/4332261715?check_suite_focus=true

JEM-Mosig · 2021-12-06T17:20:16Z

Thank you @raoulvm for creating this PR (and the discussion on Friday)!

We should use this also to resolve this issue. And we also have to check that this doesn't break Rasa X (especially if we'd allow multiple entities to be annotated in training stories). Since we're overhauling the entity annotation here anyway, it'd be good if the annotation could be extractor-specific. So one could say "this entity is an example for RegexEntityExtractor, but not for DIETClassifier". Right now, because the domain object is not available at the right place in the code, one has to annotate at least one example of an entity that is extracted by RegexEntityExtractor and by doing so DIETClassifier will try to extract it as well.

We don't necessarily have to do all these changes together, but we have to decide what the right format would be in the end, so we don't jump back-and-forth too often.

though I do not have changed that code block

…raoulvm/rasa into 2.8.14-test_multi_entities_only

raoulvm · 2022-02-08T11:15:43Z

Has been merged as #10773 by @JEM-Mosig

raoulvm and others added 8 commits November 25, 2021 12:09

Allow multiple entities for one token

ba76878

Formatted w/black

a2b5380

and removed debug output

Merge branch '2.8.x' into 2.8.14-test_multi_entities_only

1f923e2

line-length=88

a528aee

Merge branch '2.8.14-test_multi_entities_only' of https://github.com/…

80430f0

…raoulvm/rasa into 2.8.14-test_multi_entities_only

black --line-length 88

8620ca2

to pass `poetry run flake8 rasa tests --extend-ignore D`

add changelog

1b15ef8

Fix docstring liniting in test.py:1091

4603677

"D415 First line should end with a period, question mark, or exclamation point" With best regards to @erohmensing ...

raoulvm added 2 commits November 26, 2021 09:57

Change test to allow None extractor

cde992a

to work with overlapping entities.

Spelling in changelog

29262b4

Revert accidential revert.

829195b

raoulvm added a commit to raoulvm/rasa that referenced this pull request Nov 26, 2021

Ported from 2.8.x

978b220

As original didn't change the changes from RasaHQ#10394 are applied without change

raoulvm mentioned this pull request Nov 26, 2021

3.0.0 test multi entities only #10404

Closed

4 tasks

Some docstring text to force

60dcf2e

redoing canceled test at https://github.com/RasaHQ/rasa/runs/4332261715?check_suite_focus=true

ka-bu mentioned this pull request Dec 3, 2021

fix rasa test when one token has multiple entity annotations #10451

Closed

7 tasks

raoulvm and others added 3 commits December 8, 2021 18:10

Merge branch '2.8.x' into 2.8.14-test_multi_entities_only

15d909e

Merge branch '2.8.x' into 2.8.14-test_multi_entities_only

0b5aac3

init version of multiple entity writer

ca3bcb6

raoulvm marked this pull request as ready for review January 19, 2022 15:52

raoulvm requested a review from a team as a code owner January 19, 2022 15:52

raoulvm requested a review from a team January 19, 2022 15:52

raoulvm requested review from wochinge and JEM-Mosig and removed request for a team January 19, 2022 15:52

raoulvm and others added 7 commits January 19, 2022 16:59

satisfy make lint

558de89

though I do not have changed that code block

reduce complexity for mypy test

b0c50d6

assert starts AND ends are equal

1287d1c

Merge branch '2.8.x' into 2.8.14-test_multi_entities_only

0ad61f3

activate the change :-)

d296d49

Merge branch '2.8.14-test_multi_entities_only' of https://github.com/…

6b36dcf

…raoulvm/rasa into 2.8.14-test_multi_entities_only

make generate_message more readable

1142e08

m-vdb requested review from a team and jupyterjazz and removed request for wochinge and a team January 21, 2022 13:57

TyDunn removed the request for review from jupyterjazz January 21, 2022 14:06

Merge branch '2.8.x' into 2.8.14-test_multi_entities_only

f27af50

JEM-Mosig mentioned this pull request Jan 28, 2022

Allow single token to be annotated by multiple entities in test stories #10773

Merged

4 tasks

raoulvm closed this Feb 8, 2022

hsm207 mentioned this pull request Feb 10, 2022

prepared release of version 2.8.24 #10853

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow multiple entities for one token #10394

Allow multiple entities for one token #10394

raoulvm commented Nov 25, 2021 •

edited

Loading

raoulvm commented Nov 25, 2021 •

edited

Loading

raoulvm commented Nov 26, 2021 •

edited

Loading

ancalita commented Nov 26, 2021

samsucik commented Nov 26, 2021

raoulvm commented Nov 26, 2021

raoulvm commented Nov 26, 2021

samsucik commented Nov 26, 2021

raoulvm commented Nov 26, 2021

JEM-Mosig commented Dec 6, 2021 •

edited

Loading

raoulvm commented Feb 8, 2022

Allow multiple entities for one token #10394

Allow multiple entities for one token #10394

Conversation

raoulvm commented Nov 25, 2021 • edited Loading

raoulvm commented Nov 25, 2021 • edited Loading

raoulvm commented Nov 26, 2021 • edited Loading

ancalita commented Nov 26, 2021

samsucik commented Nov 26, 2021

raoulvm commented Nov 26, 2021

raoulvm commented Nov 26, 2021

samsucik commented Nov 26, 2021

raoulvm commented Nov 26, 2021

JEM-Mosig commented Dec 6, 2021 • edited Loading

raoulvm commented Feb 8, 2022

raoulvm commented Nov 25, 2021 •

edited

Loading

raoulvm commented Nov 25, 2021 •

edited

Loading

raoulvm commented Nov 26, 2021 •

edited

Loading

JEM-Mosig commented Dec 6, 2021 •

edited

Loading