How to calculate all BLEU scores during evaluation #37
The current codebase uses NLTK to calculate the BLEU-4 score. However, BLEU-1 to BLEU-n can be easily implemented if you want to do that yourself. If you don't want to do that, you can simply use NLTK, which provides a nice interface for this (see code below). Here is how the BLEU score computation is defined:
For example, by default NLTK's `sentence_bleu` and `corpus_bleu` compute BLEU-4 with uniform weights. Having said that, if you want to compute specific n-gram BLEU scores, you have to pass a `weights` tuple of your own:
To compute BLEU-1, use `weights=(1, 0, 0, 0)`.
To compute BLEU-2, use `weights=(0.5, 0.5, 0, 0)`.
To compute BLEU-3, use `weights=(1/3, 1/3, 1/3, 0)`.
To compute BLEU-4, use `weights=(0.25, 0.25, 0.25, 0.25)`.
Here is a demonstration using a toy example adapted from the NLTK webpage. Note how the BLEU score keeps decreasing as we increase the n-gram order. Refer to this page for more information on the NLTK BLEU score implementation.
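The demonstration code did not survive in this copy of the thread; here is a minimal sketch along the same lines (the sentences below are made up for illustration, not the ones from the NLTK page):

```python
from nltk.translate.bleu_score import sentence_bleu

# Toy sentences (made up for illustration)
reference = ['the', 'cat', 'sat', 'on', 'the', 'mat']
hypothesis = ['the', 'cat', 'sat', 'on', 'a', 'mat']

# Pass a `weights` tuple to choose which n-gram precisions are averaged
bleu1 = sentence_bleu([reference], hypothesis, weights=(1, 0, 0, 0))
bleu2 = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5, 0, 0))
bleu3 = sentence_bleu([reference], hypothesis, weights=(1 / 3, 1 / 3, 1 / 3, 0))
bleu4 = sentence_bleu([reference], hypothesis, weights=(0.25, 0.25, 0.25, 0.25))

print(bleu1, bleu2, bleu3, bleu4)  # the scores decrease as n grows
```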
Thank you very much for your explanation.
@sgrvinod ping!
Oops, didn't see this. Yes, it's a good idea, I'll add it tomorrow with credit to you, thanks! I think the entire detailed explanation is too long for the Remarks section. I'll either link to your post here from the Remarks section, or add a question to the FAQ with your answer (and credit you), or both. You could also submit a pull request if you wish, and I'll make minor edits to it if needed.
@sgrvinod done!
Merged #52. |
@kmario23 Thanks for your brilliant explanation, I got the preprocessing to calculate BLEU with NLTK working. But I'm still confused: if I have 3 references and only 1 hypothesis, does the tool score the <ref, hyp> pairs one by one? And does it then take the mean of those scores, or the maximum?
Hello @forence, thanks! Contrary to our intuition, that's not how the BLEU score is computed. Luckily, the paper that proposed BLEU is very well written (and easy to understand). Please have a look at Section 2 of BLEU: a Method for Automatic Evaluation of Machine Translation for how they compute a Modified Unigram Precision, which is better than simple precision: all references are considered jointly, with each candidate n-gram's count clipped by its maximum count in any single reference.
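The clipping idea from Section 2 of the paper can be sketched directly (the function name below is mine, not from the paper or the repo). With the paper's degenerate candidate "the the the the the the the" and its two references, the modified unigram precision is 2/7 rather than 7/7:

```python
from collections import Counter

def modified_unigram_precision(references, hypothesis):
    """Credit each word at most as many times as it appears
    in any single reference (Section 2 of the BLEU paper)."""
    hyp_counts = Counter(hypothesis)
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

refs = [['the', 'cat', 'is', 'on', 'the', 'mat'],
        ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]
hyp = ['the'] * 7
print(modified_unigram_precision(refs, hyp))  # 2/7: 'the' occurs at most twice in one reference
```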
Hi,
Thanks for the well-documented code and tutorial. I trained my model from scratch using your code; now that I want to evaluate it, I'm not sure how to get all the BLEU scores, not just BLEU-4 as currently in eval.py.
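One way to do this is to collect `references` and `hypotheses` the same way eval.py already does for `corpus_bleu`, then call it once per weight vector. A sketch, assuming those two structures (the helper name and the toy data are mine):

```python
from nltk.translate.bleu_score import corpus_bleu

def all_bleu_scores(references, hypotheses):
    """BLEU-1 to BLEU-4 over the whole corpus.
    references: one list of reference token lists per image;
    hypotheses: one hypothesis token list per image."""
    weight_vectors = [
        (1.0, 0, 0, 0),            # BLEU-1
        (0.5, 0.5, 0, 0),          # BLEU-2
        (1 / 3, 1 / 3, 1 / 3, 0),  # BLEU-3
        (0.25, 0.25, 0.25, 0.25),  # BLEU-4
    ]
    return [corpus_bleu(references, hypotheses, weights=w) for w in weight_vectors]

# Toy usage with two "images"
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass']],
              [['a', 'cat', 'sits', 'on', 'a', 'mat']]]
hypotheses = [['a', 'dog', 'runs', 'on', 'grass'],
              ['a', 'cat', 'sits', 'on', 'the', 'mat']]
bleu1, bleu2, bleu3, bleu4 = all_bleu_scores(references, hypotheses)
```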