Evaluation script for Huggingface Causal models #13
base: master
Conversation
I've used this script while working on the MMLU eval for LLaMA models in lm-evaluation-harness. I've replicated the numbers for two of the LLaMA models, 7B & 13B. Results are here.
Hello @ollmer, it's me again (I discussed this with you on the other repo). The prompt here:
makes it clear that there is no space after "Answer:" when requesting the answer from the model. With a BPE model, the generated token for " B" will be "ĠB", id 347. The way you are getting the probs and computing the argmax is based on the logits of "A", "B", "C", "D" instead of the tokens with the space character " " in front of them. I don't know if you kept the same logic when making your PR https://github.com/EleutherAI/lm-evaluation-harness/pull/497/files
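As a concrete illustration of the mismatch, here is a minimal sketch using the GPT-2 tokenizer from transformers (the tokenizer choice is illustrative; any GPT-style BPE tokenizer shows the same effect):

```python
# Minimal sketch with the GPT-2 BPE tokenizer: the id the script looks up ("B") is not
# the id of the token the model actually generates after "Answer:" (" B", i.e. "ĠB").
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

print(tok("B").input_ids)   # [33]  -> bare "B"
print(tok(" B").input_ids)  # [347] -> "ĠB", the space-prefixed token the model emits
```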
Thanks for pointing it out! Could you provide the complete code snippet to reproduce the issue?
Just run your script with HF's model "Salesforce/xgen-7b-8k-base" (or any BPE-based model).
In the end, I'm not sure if Harness' MMLU is OK or not, but I had the feeling it was taking the same loglikelihood of "A", "B", "C" and "D".
Follow-up thought: for the logprob selection approach to be correct, we can use the following method. Encode the whole line of the possible answer (a sketch of this idea follows).
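A hedged, self-contained sketch of one reading of this idea (the model and prompt are placeholders, not from the script):

```python
# Sketch of "encode the whole line of the possible answer": tokenize the prompt together
# with each candidate answer and take the id of its final token, letting the tokenizer
# (BPE or SentencePiece) decide how the leading space is merged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"  # illustrative prompt
logits = model(tokenizer(prompt, return_tensors="pt").input_ids).logits[0, -1]

choices = ["A", "B", "C", "D"]
choice_ids = [tokenizer(prompt + " " + c).input_ids[-1] for c in choices]
print(choices[int(logits[torch.tensor(choice_ids)].argmax())])
```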
The point was to fully reproduce the original prompt format in lm-harness. Before this discussion, I assumed that logprob selection is numerically equivalent to calculating the loglikelihoods of 4 different samples with different choices. It should be that way, because all the tokens are exactly the same except the last one, so the loglikelihood of the last token in lm-harness ends up being the same as the logprob of that token in the OG approach. Plus, the lm-harness approach is suitable for choices longer than one token, because they're using the same code for a lot of different multiple-choice tasks.
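To make that equivalence concrete, here is a minimal sketch (the model and prompt are illustrative placeholders): for single-token choices, picking the largest next-token logit and picking the highest continuation loglikelihood select the same answer, since log-softmax is monotonic in the logits.

```python
# Compare the two scoring strategies for single-token answer choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # placeholder model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
with torch.no_grad():
    logits = model(tok(prompt, return_tensors="pt").input_ids).logits[0, -1]

choice_ids = torch.tensor([tok(" " + c).input_ids[-1] for c in ["A", "B", "C", "D"]])

# Logprob selection: argmax over the raw next-token logits of the four choice tokens.
pred_by_logit = int(logits[choice_ids].argmax())

# lm-harness style, reduced to one-token continuations: compare loglikelihoods.
pred_by_loglik = int(torch.log_softmax(logits, dim=-1)[choice_ids].argmax())

# log-softmax is monotonic, so the two always agree for one-token choices;
# they can only diverge when an answer choice spans several tokens.
assert pred_by_logit == pred_by_loglik
```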
For this repo, I think we are saying the same thing about the encoding. For Eleuther, I have not dug into the code enough, but when I get time I will print the intermediate steps, because I need to explain why we are not getting the same results.
@ollmer How about this solution:
to
Now, in the few-shot setting, no matter which LLM is used, we expect the model to output one of the four tokens.
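Whatever the exact change, a quick sanity check that the model really emits one of the four answer tokens after the few-shot prompt is to greedily generate a single token and inspect it (the model and prompt below are placeholders, not from the script):

```python
# Greedily generate one token after an "Answer:"-style prompt and inspect it.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # placeholder model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
input_ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(input_ids, max_new_tokens=1, do_sample=False)

print(repr(tok.decode(out[0, -1:])))  # a BPE model typically emits " B"-style tokens
                                      # with a leading space; strip() before comparing
```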
For others who might stumble here in the future: the current implementation is definitely wrong for some models/tokenizers. E.g. the GPT series would not work. SentencePiece also would not work unless add_dummy_prefix is turned on. The problem is here:

```python
probs = (
    torch.nn.functional.softmax(
        torch.tensor(
            [
                logits[tokenizer("A").input_ids[-1]],
                logits[tokenizer("B").input_ids[-1]],
                logits[tokenizer("C").input_ids[-1]],
                logits[tokenizer("D").input_ids[-1]],
            ]
        ).float(),
        dim=0,
    )
    .detach()
    .cpu()
    .numpy()
)
```

Instead, the space must be prepended when querying the logits (see the sketch after this comment). It might be helpful to play with the tiktokenizer web app and paste in the example (make sure to show whitespace and delete the last \n if it is present after pasting); you should have exactly 344 tokens. For GPT-2 as an example, we see that the correct next token to match the few-shot examples is " B" (with space), i.e. token 347, but the code above looks up the id of "B" without the leading space.
It seems like there is discussion in this issue noting the problem, but it was never resolved, and the code as it stands right now is buggy.
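A hedged sketch of the corrected lookup for a GPT-style BPE tokenizer, reusing the `tokenizer` and `logits` variables from the quoted snippet (SentencePiece tokenizers with add_dummy_prefix behave differently, as noted above):

```python
# Query the space-prefixed tokens " A"/" B"/" C"/" D", which are what a BPE model
# actually generates after a few-shot prompt ending in "Answer:".
probs = (
    torch.nn.functional.softmax(
        torch.tensor(
            [
                logits[tokenizer(" A").input_ids[-1]],
                logits[tokenizer(" B").input_ids[-1]],
                logits[tokenizer(" C").input_ids[-1]],
                logits[tokenizer(" D").input_ids[-1]],
            ]
        ).float(),
        dim=0,
    )
    .detach()
    .cpu()
    .numpy()
)
```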
Also, another reason this code will fail is that it hardcodes the max context length to 2048: `while input_ids.shape[-1] > 2048:` but e.g. GPT-2 has a max context length of 1024, so the code will violently fail and throw an error.
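One hedged way around the hardcoded limit, reusing the script's `model` variable (alternatively, `tokenizer.model_max_length` could be consulted), is to read the context window from the model config:

```python
# Read the context window from the model config instead of hardcoding 2048
# (GPT-2 reports 1024 here, the original LLaMA 2048).
max_len = getattr(model.config, "max_position_embeddings", 2048)

# ...then reuse the script's existing truncation loop with the derived limit:
# while input_ids.shape[-1] > max_len:
#     <drop one few-shot example and re-tokenize, as in the original script>
```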
Here, yes, but after this discussion it was fixed in lm_eval from Eleuther. And in OpenNMT-py we left-strip the space: https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/run_mmlu_opennmt.py#L183
Exactly (I noticed it too), but you know what, when you want to replicate the paper's numbers you just keep those in :) (seriously, it changes the results).
It was made using the existing FLAN eval script as a reference.
Minor changes: