
Is accuracy a good choice for evaluating hallucination detection/recognition performance? #20

Open
JacksonWuxs opened this issue Jan 16, 2024 · 1 comment

Comments

@JacksonWuxs

Hi authors,

This is definitely great work for our community!

I notice that accuracy serves as the metric for measuring the performance of various models on hallucination recognition, as shown in Table 5.
However, I find that the hallucination cases are significantly imbalanced in many subsets.
For example, only 19.5% of instances in the QA subset are hallucinations, and around 18% of instances in the General subset are hallucinations.
Thus, high accuracy does not demonstrate that a model identifies hallucinated content well, since a model that constantly predicts non-hallucination would already reach roughly 80% accuracy on these subsets.
Precision/Recall/F1 scores over the positive class (the hallucination samples) may be more suitable for this scenario; see the sketch below.
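To make this concrete, here is a minimal sketch (using scikit-learn and synthetic labels that mirror the 19.5% hallucination rate of the QA subset; the numbers and label encoding are illustrative, not taken from the benchmark itself) of how a constant non-hallucination predictor looks under each metric:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Synthetic labels mirroring the QA subset's class balance:
# 1 = hallucination (~19.5% of instances), 0 = non-hallucination.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.195).astype(int)

# A degenerate "model" that never flags a hallucination.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=1, average="binary", zero_division=0
)

print(f"Accuracy:  {acc:.3f}")   # ~0.80 -- looks strong
print(f"Precision: {prec:.3f}")  # 0.00  -- the hallucination class is never caught
print(f"Recall:    {rec:.3f}")   # 0.00
print(f"F1:        {f1:.3f}")    # 0.00
```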

Again, this is great work, and thanks for sharing!

Best,
Xuansheng

@turboLJY
Member

Thanks for your suggestion. Some reviewers also mentioned that reporting P/R/F1 scores in addition to accuracy would make our results more convincing. Therefore, we suggest adding P/R/F1 scores when using our benchmark.
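For reference, a minimal sketch of what reporting all four metrics together could look like (assuming the judgments have already been mapped to binary labels with hallucination as the positive class; that mapping depends on the evaluation script and is not shown here):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report_metrics(y_true, y_pred, pos_label=1):
    """Accuracy plus P/R/F1 over the hallucination (positive) class.

    Expects binarized labels, e.g. 1 = hallucination, 0 = non-hallucination;
    how model outputs are converted to these labels is benchmark-specific.
    """
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label=pos_label, average="binary", zero_division=0
    )
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# Toy example (1 = hallucination, 0 = non-hallucination).
print(report_metrics([1, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 1]))
```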
