Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow multiple entities for one token #10394

Closed
wants to merge 23 commits into from

Conversation

raoulvm
Copy link
Contributor

@raoulvm raoulvm commented Nov 25, 2021

Proposed changes:
Allow multiple entities to be annotated for the same word/tokens.

When using entity extractors that support generating multiple entities from a single expression, the test stories fail as there is no way to annotate multiple entity_types and entity_values

New annotation option is

stories:
  - story: Some story
    steps:
      - user: |
          I would like work for [Rasa][{"entity":"chatbot_software", "value":"rasa"},{"entity":"software_company", "value":"rasa"},{"entity":"employer", "value":"rasa"}]
        intent: apply_job

As the extractors as pipeline components neither have access to domain during training, nor to the dialog tracker during inference, they can't judge which characteristic (or attribute) of the word "Rasa" is important for answering a question. If the entity extractor could access the domain knowledge, it would probably be able to filter the possible entity types to those defined with use_entities in the domain (e.g. employer).

If it could access the tracker, and the preceding question was a closed question, the possible answers could also limit the applicable entity types.

In this PR I have changed the training data parser to be able to cope with the format shown above, and also changed the readerwriter.py to be able to generate such training data messages if the entities found have exact same start and end positions.

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

raoulvm and others added 8 commits November 25, 2021 12:09
and removed debug output
to pass `poetry run flake8 rasa tests --extend-ignore D`
"D415 First line should end with a period, question mark,
or exclamation point"
With best regards to @erohmensing ...
@raoulvm
Copy link
Contributor Author

raoulvm commented Nov 25, 2021

@ancalita Do you know why I get linting errors in "old" code? And if I change that, some other old lines are complained about.
Thanks
Raoul

EDIT seems to had to do with my line-length setting. After changing that globally I could run poetry run flake8 rasa test --extend-ignore D without getting old code complained about.

@raoulvm
Copy link
Contributor Author

raoulvm commented Nov 26, 2021

Hi @samsucik I was told by @hsm207 that you are the dev on duty today.
Could you explain why extractors is None (and thus it returns True here) if it was [CRFEntityExtractor.name] provided here?

I do not understand what is happening here.

Thanks
Raoul

EDIT --------------------------

I was in the wrong line. It gets explicitly passed None in the failing test, which is this one:
https://github.com/RasaHQ/rasa/blob/2.8.x/tests/nlu/test_evaluation.py#L210

Question now changes to: Why should None Extractors not support overlapping entities?

@ancalita
Copy link
Member

After changing that globally I could run poetry run flake8 rasa test --extend-ignore D without getting old code complained about.

@raoulvm Great! In addition, you can also use make lint command in your Terminal for a more comprehensive check.

@samsucik
Copy link
Contributor

@raoulvm I checked your changes now. I'm glad you found a workaround for the failing test. And 'm excited about the proposed improvement, thanks a lot for that!

As for going ahead with this, I have to think about it a bit more and discuss with others because it'd likely have a broader impact. Some points that come to my mind are:

  • When an entity can be "multi-annotated", the validity of this annotation is strongly coupled with the models used. I.e. it'd be invalid if we only have DIET in the pipeline (or a different extractor that assigns only a single entity label to a token), but would be valid if we've got a RegexExtractor.
  • Once an entity can be "multi-annotated" in test stories, a natural follow-up is "why not in training stories?". And that leads us to the thick line drawn between extractors like DIET vs ones like RegexExtractor -- namely, the latter one not requiring annotations in training data. Additionally, we'd be also drawing another line, this time between training and test story format.
  • The exact annotation format needs to work with all special cases, i.e. we'd need to triple-check the regexes (one immediate candidate edge case to check is when a language doesn't have words separated by spaces and you've got 2 entities coming one right after another). But this is ultimately just a technicality.
  • If we decide to go with the change, then it'll have to become part of 3.x sooner or later, which means we want to be certain about it.

@raoulvm
Copy link
Contributor Author

raoulvm commented Nov 26, 2021

HI @ancalita , you should consider cleaning "old stuff" anyhow. If you run the lint against the whole repo (no diff) you get 3000+ doc-string complaints.

(rasa2source) #:~/PyProjects/rasa$ poetry run flake8 rasa tests --select D | wc -l
3720

@raoulvm
Copy link
Contributor Author

raoulvm commented Nov 26, 2021

Hi @samsucik

it'd be invalid if we only have DIET in the pipeline.

I just ran a test with the mood bot including new multi-annotated entities in the NLU Training Data , with DIET as the entity extractor. DIET is training on the last item in the list only.

Once an entity can be "multi-annotated" in test stories, a natural follow-up is "why not in training stories?"

Good question. It should.

And that leads us to the thick line drawn between extractors like DIET vs ones like RegexExtractor -- namely, the latter one not requiring annotations in training data.

Actually that line is already there. There are intent training data containing entities, and there are Regex Expressions and Lookups, which are separated. If you include other ontological tools like mine it is separated even more, as those data do not fit into the nlu files at all.

For the other details, you are perfectly right: I do not speak any non-whitespace language, and I am no fan of regex for literal lookups (regards to @koaning for his splendid flashtext extractor idea which I used as a basis for the hierarchies) so I wasn't very deep into those and did not make test cases for them beyond what is already in the tests.

@samsucik
Copy link
Contributor

Thanks for digging into the questions so quickly. As you can see, there are some rough edges to overcome if this was to become part of 2.8 (and later 3.x) and, more importantly, if these changes were to be maintained and further built upon in future iterations.

Regarding the line between DIET and RegexExtractor: Indeed, that line already exists (sorry for not being 100% clear in my previous reply). But, in my opinion, we should try to not make these arbitrary lines even worse.

All in all, I think it's important that you're exploring how far you can go with 2.8. The other part is then taking a step back and seeing what's the high-level problem that you're trying to solve, as we discussed previously 🙂

raoulvm added a commit to raoulvm/rasa that referenced this pull request Nov 26, 2021
As original didn't change the changes from RasaHQ#10394
are applied without change
@raoulvm
Copy link
Contributor Author

raoulvm commented Nov 26, 2021

Test with 3.0 see here #10404

@raoulvm raoulvm mentioned this pull request Nov 26, 2021
4 tasks
@JEM-Mosig
Copy link

JEM-Mosig commented Dec 6, 2021

Thank you @raoulvm for creating this PR (and the discussion on Friday)!

We should use this also to resolve this issue. And we also have to check that this doesn't break Rasa X (especially if we'd allow multiple entities to be annotated in training stories). Since we're overhauling the entity annotation here anyway, it'd be good if the annotation could be extractor-specific. So one could say "this entity is an example for RegexEntityExtractor, but not for DIETClassifier". Right now, because the domain object is not available at the right place in the code, one has to annotate at least one example of an entity that is extracted by RegexEntityExtractor and by doing so DIETClassifier will try to extract it as well.

We don't necessarily have to do all these changes together, but we have to decide what the right format would be in the end, so we don't jump back-and-forth too often.

@raoulvm raoulvm marked this pull request as ready for review January 19, 2022 15:52
@raoulvm raoulvm requested a review from a team as a code owner January 19, 2022 15:52
@raoulvm raoulvm requested a review from a team January 19, 2022 15:52
@raoulvm raoulvm requested review from wochinge and JEM-Mosig and removed request for a team January 19, 2022 15:52
@m-vdb m-vdb requested review from a team and jupyterjazz and removed request for wochinge and a team January 21, 2022 13:57
@TyDunn TyDunn removed the request for review from jupyterjazz January 21, 2022 14:06
@raoulvm
Copy link
Contributor Author

raoulvm commented Feb 8, 2022

Has been merged as #10773 by @JEM-Mosig

@raoulvm raoulvm closed this Feb 8, 2022
@hsm207 hsm207 mentioned this pull request Feb 10, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants