I find that in the QA data, the right answers come from HotpotQA and are short, whereas the constructed hallucinated answers are longer, usually a full sentence.
I suspect this may introduce a length bias when the data is used for hallucination detection.
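To make the concern concrete, a rough sketch like the following could measure the length gap. The path `data/qa_data.json`, the field names `right_answer` / `hallucinated_answer`, and the one-JSON-object-per-line layout are assumptions about the released file format, so adjust them to match the actual data.

```python
import json

def average_answer_lengths(path="data/qa_data.json"):
    """Compare average word counts of right vs. hallucinated answers (assumed field names)."""
    right_lens, hallu_lens = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:  # assumes one JSON object per line
            sample = json.loads(line)
            right_lens.append(len(sample["right_answer"].split()))
            hallu_lens.append(len(sample["hallucinated_answer"].split()))
    return sum(right_lens) / len(right_lens), sum(hallu_lens) / len(hallu_lens)

if __name__ == "__main__":
    right_avg, hallu_avg = average_answer_lengths()
    print(f"avg right answer length:        {right_avg:.1f} words")
    print(f"avg hallucinated answer length: {hallu_avg:.1f} words")
```

A large gap between the two averages would support the suspicion that answer length alone is a usable signal for the detector.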
Thank you for raising the issue. We have also noticed this potential problem in HaluEval. In our hallucination detection experiments, we randomly select either the hallucinated or the normal output (e.g., an answer) of each sample for classification. We ask the model to focus on whether the content of the output contains hallucinations, so the impact of response length should be relatively minor. You can also follow our latest work, HaluEval 2.0, where we have constructed a brand-new dataset for evaluating hallucinations: "The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models."
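For readers unfamiliar with that setup, a minimal sketch of the per-sample random selection might look like the following. The field names, labels, and 50/50 split are assumptions for illustration, not the repository's exact evaluation code.

```python
import random

def build_detection_examples(samples, seed=0):
    """For each sample, randomly pick the hallucinated or the right answer
    and record the ground-truth label the detector should predict."""
    rng = random.Random(seed)
    examples = []
    for sample in samples:
        if rng.random() < 0.5:
            answer, label = sample["hallucinated_answer"], "Yes"  # contains hallucination
        else:
            answer, label = sample["right_answer"], "No"          # no hallucination
        examples.append({"question": sample["question"], "answer": answer, "label": label})
    return examples
```

Because each classification decision sees only one answer at a time, the detector cannot directly compare a short right answer against a long hallucinated one, which is why the length effect should be limited.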