Hi authors,
Thank you for this great work for our community!
I notice that accuracy is the metric used to measure how well various models recognize hallucinations, as shown in Table 5.
However, the hallucination cases are significantly imbalanced in many subsets.
For example, in the QA dataset, only 19.5% of instances are hallucinations. Similarly, around 18% of instances in the General dataset are hallucinations.
Thus, high accuracy alone does not demonstrate that the models identify hallucinated content well, since a model could reach roughly 80% accuracy simply by always predicting non-hallucination.
Precision/Recall/F1 scores over the positive samples (the hallucination samples) may be more suitable for this scenario.
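To make the concern concrete, here is a minimal sketch in plain Python. The labels are synthetic: the total of 1000 instances is a hypothetical number chosen only to match the 19.5% hallucination rate quoted for the QA subset, and `binary_metrics` is an illustrative helper, not part of the benchmark's code.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision/recall/F1 for the positive (hallucination) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Convention assumed here: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Synthetic labels (assumption): 195 hallucinations out of 1000 instances.
y_true = [1] * 195 + [0] * 805
# Trivial baseline that always predicts "non-hallucination".
y_pred = [0] * len(y_true)

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On these synthetic labels the baseline reaches an accuracy of 0.805 while its recall and F1 on the hallucination class are both 0, which is why the positive-class scores seem more informative here.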
Again, this is great work, and thanks for sharing!
Best,
Xuansheng
Thanks for your suggestion. Some reviewers also mentioned that reporting P/R/F1 scores alongside accuracy would make our results more convincing. We therefore recommend reporting P/R/F1 scores when using our benchmark.