Clean up and log to file Conversation level performance measures. #8000
Labels
area:rasa-oss 🎡
Anything related to the open source Rasa framework
type:enhancement ✨
Additions of new features or changes to existing ones, should be doable in a single PR
Description of Problem:
When running `rasa test`, performance measures for the core model are printed to the console and/or logged to `results/story_report.json`. The console output is split into two main result blocks: CONVERSATION (or E2E when evaluating end-to-end data) and ACTION level performance.

Digging into the CONVERSATION level measures, what is being computed is not very informative or useful. Because of how these metrics are computed, precision is always 1.0 (unless no stories are correct, in which case it is 0), F1-score is just the harmonic mean of 1 and the recall (recall is not printed to the console), and recall equals accuracy. Additionally, the in-data fraction is always the same in the CONVERSATION block as in the ACTION block. The only measure that would currently be helpful to display in the CONVERSATION block is accuracy (number of correct stories / total stories).
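The degeneracy described above follows directly from how conversation-level labels are set up: every story is *supposed* to pass, so there are no true negatives and any passing story is a true positive. A minimal sketch (hypothetical helper, not Rasa's actual code) makes this concrete:

```python
# Hypothetical sketch (not Rasa's implementation) of why conversation-level
# metrics collapse when every true label is "story correct".

def conversation_metrics(story_results):
    """story_results: list of bools, True if the whole story was reproduced correctly."""
    total = len(story_results)
    correct = sum(story_results)
    # All true labels are positive, so a predicted-correct story can never be
    # a false positive: precision is 1.0 whenever any story passes, else 0.0.
    precision = 1.0 if correct > 0 else 0.0
    # With all true labels positive, recall reduces to plain accuracy.
    recall = accuracy = correct / total
    # F1 is then just the harmonic mean of 1.0 and the recall.
    f1 = (2 * precision * recall / (precision + recall)) if correct > 0 else 0.0
    return {"correct": correct, "total": total, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1-score": f1}

print(conversation_metrics([True, True, True, False, False]))
# precision stays 1.0; recall and accuracy are both 0.6; F1 is 0.75
```

Only `correct` and `accuracy` carry independent information here, which motivates the proposal below.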
Overview of the Solution:
I propose only printing Correct and Accuracy at the CONVERSATION level.
Additionally, I propose to include in `results/story_report.json` an additional field for conversation level accuracy:
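For illustration, the new report entry might look like the following sketch; the `conversation_accuracy` key and its sub-fields are assumptions for discussion, not an agreed-upon schema:

```python
import json

# Hypothetical shape of the extra entry in results/story_report.json.
# Field names ("conversation_accuracy", "correct", "total") are placeholders.
report_fragment = {
    "conversation_accuracy": {
        "accuracy": 0.6,   # correct stories / total stories
        "correct": 3,
        "total": 5,
    },
}
print(json.dumps(report_fragment, indent=2))
```

Keeping only the counts and the accuracy avoids logging the degenerate precision/recall/F1 values discussed above.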