Standardize testing output: Response Selection #5824

Merged (21 commits into master, May 18, 2020)

Conversation

@tabergma (Contributor) commented May 14, 2020:

Proposed changes:

  • Plot a histogram with the confidence distribution for response selection (a minimal sketch of the idea follows below)
  • Write correct and incorrect predictions for response selection to disk

related to #5748
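
A minimal sketch of what the histogram step amounts to, assuming plain matplotlib; the function name, default file name, and input format here are illustrative, not Rasa's actual API:

import matplotlib.pyplot as plt

def plot_confidence_histogram(confidences, output_file="response_histogram.png"):
    # Hypothetical helper: bucket response selection confidences into 20 bins
    # between 0 and 1 and save the plot to disk
    plt.hist(confidences, bins=20, range=(0.0, 1.0))
    plt.xlabel("Confidence")
    plt.ylabel("Number of predictions")
    plt.title("Response selection confidence distribution")
    plt.savefig(output_file)
    plt.close()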

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@tabergma requested a review from dakshvar22 on May 14, 2020 13:43
@@ -604,7 +604,7 @@ def plot_story_evaluation(
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
import matplotlib.pyplot as plt
from rasa.nlu.test import plot_confusion_matrix
A contributor commented on this import:
It's a bit unlikely, but this could break someone's code that relies on this function in glue code of their own. Should we add a deprecation warning, or keep a dummy function in rasa.nlu.test which just calls the one in rasa.utils.plotting?
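
For illustration, a minimal sketch of the suggested shim, assuming the real function now lives in rasa.utils.plotting (its exact signature is an assumption here):

import warnings
from rasa.utils.plotting import plot_confusion_matrix as _plot_confusion_matrix

def plot_confusion_matrix(*args, **kwargs):
    # Dummy function kept in rasa.nlu.test so existing glue code keeps working
    warnings.warn(
        "rasa.nlu.test.plot_confusion_matrix is deprecated; "
        "use rasa.utils.plotting.plot_confusion_matrix instead.",
        DeprecationWarning,
    )
    return _plot_confusion_matrix(*args, **kwargs)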

@tabergma (Contributor, Author) replied May 15, 2020:

Not sure that is needed, as this will be released with 2.0. I also removed two CLI options for naming the histogram and confusion matrix files, which are not useful anymore. If you think it is needed, I'll keep those and add a deprecation warning.

@tabergma (Contributor, Author):

@tmbo Just wanted to double check if this is fine.

@dakshvar22 (Contributor) commented May 18, 2020:

@tabergma I pushed a small fix to add the response_key to the intent part of the response selection results, so that when viewing the error file the dict looks like this:

{
    "text": "How do I know someone is actually doing something for my offsets?",
    "intent_target": "faq/is-this-legit",
    "response": "All offset projects have to pass a rigorous verification process set by https://goldstandard.org .",
    "response_prediction": {
      "name": "The cost per ton depends on the project you donate to. Some projects can remove a ton of CO2 for less than $1!",
      "confidence": 0.15156520903110504
    }
}

instead of

{
    "text": "How do I know someone is actually doing something for my offsets?",
    "intent_target": "faq",
    "response": "All offset projects have to pass a rigorous verification process set by https://goldstandard.org .",
    "response_prediction": {
      "name": "The cost per ton depends on the project you donate to. Some projects can remove a ton of CO2 for less than $1!",
      "confidence": 0.15156520903110504
    }
}

Otherwise every intent would just be "faq", which is a bit unhelpful.
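
A sketch of the label change being described; the helper name and delimiter constant are assumptions for illustration, while the "/" join itself is what the examples above show:

RESPONSE_KEY_DELIMITER = "/"

def full_intent_target(intent: str, response_key: str) -> str:
    # e.g. ("faq", "is-this-legit") -> "faq/is-this-legit"
    if response_key:
        return f"{intent}{RESPONSE_KEY_DELIMITER}{response_key}"
    return intent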

@dakshvar22 (Contributor):

@tabergma In the confusion matrix, should we use the intent/response_key as the row and column labels instead of the actual responses? The mapping is deterministic, since there is only one response per intent/response_key, and it would make the confusion matrix fit well when there is a large number of unique responses. What do you think?
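
A hedged sketch of what that replacement could look like, reusing the sklearn helpers already imported in this diff; the text_to_key mapping is an assumed input (well defined, since each intent/response_key has exactly one response):

from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

def confusion_with_response_keys(targets, predictions, text_to_key):
    # Replace each full response text with its short intent/response_key,
    # e.g. "All offset projects have to pass ..." -> "faq/is-this-legit"
    target_keys = [text_to_key[t] for t in targets]
    predicted_keys = [text_to_key.get(p, p) for p in predictions]
    labels = unique_labels(target_keys, predicted_keys)
    return confusion_matrix(target_keys, predicted_keys, labels=labels), labels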

@tabergma (Contributor, Author):

@dakshvar22 Good idea. Will try it now.

@dakshvar22 (Contributor):

Ahh, that would need a bigger refactor, because getting the intent/response_key of the predicted response is not trivial right now. Let's keep it as a separate issue.

@tabergma (Contributor, Author):

😄 Ok. As an alternative, we could use a legend that maps each actual response to A, B, etc. What do you think?
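
Purely as an illustration of the legend idea (hypothetical helper; string.ascii_uppercase caps it at 26 responses):

import string

def letter_legend(responses):
    # Map each long response text to a short label A, B, C, ... for the axes
    # and return the legend so the full texts can be printed alongside the plot
    legend = dict(zip(string.ascii_uppercase, responses))
    short_labels = {response: letter for letter, response in legend.items()}
    return short_labels, legend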

@dakshvar22 (Contributor):

I would say let's replace them directly with the intent/response_key in a separate issue, since that is the ideal solution.

@dakshvar22 (Contributor) left a review:

Just a small key change to avoid the ambiguity. Looks great! 💯

Two review threads on rasa/nlu/test.py (outdated, resolved)
@tmbo merged commit dd954c3 into master on May 18, 2020
@tmbo deleted the response-testing-output branch on May 18, 2020 13:39
indam23 pushed a commit that referenced this pull request on Jul 27, 2022:
* create plotting utils

* write errors and successes for responses

* fix file names

* update plot filenames

* add changelog entry

* add missing docstrings

* update tests

* fix type

* address deepsource issues

* use io_utils

* fix call to write_text_file

* added complete intent and response key to the 'intent' key

* Update rasa/nlu/test.py

Co-authored-by: Daksh Varshneya <[email protected]>

* Update rasa/nlu/test.py

Co-authored-by: Daksh Varshneya <[email protected]>

* use intent_target key for confusion matrix

Co-authored-by: Daksh <[email protected]>
Co-authored-by: Roberto <[email protected]>