
Evaluation script for Huggingface Causal models #13

Open
wants to merge 5 commits into master

Conversation

@ollmer commented May 28, 2023

This script was made using the existing FLAN eval script as a reference.
Minor changes:

  • load models in float16;
  • put the samples on the same device as the model;
  • use the last token of each option's tokenization instead of the first one;
  • save the aggregated results to a JSON file.

@ollmer (Author) commented May 28, 2023

I've used this script while working on the MMLU eval for LLaMA models in lm-evaluation-harness. I've replicated the numbers for two of the LLaMA models, 7B & 13B. Results are here.
Example usage, to reproduce the LLaMA-7B metrics: python evaluate_hf.py -m huggyllama/llama-7b

@vince62s commented Jul 3, 2023

Hello @ollmer, it's me again (I discussed with you on the other repo).
I think there is a mistake in this implementation for models using the GPT-2 pretokenizer (gpt2_pretok); let me explain.

The prompt here:

def format_example(df, idx, include_answer=True):
    prompt = df.iloc[idx, 0]
    k = df.shape[1] - 2
    for j in range(k):
        prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
    prompt += "\nAnswer:"
    if include_answer:
        prompt += " {}\n\n".format(df.iloc[idx, k + 1])
    return prompt

makes it clear that there is no space after "Answer:" when requesting the answer from the model.
However, since in the five-shot examples there is a space before each answer, the model will generate this space before the A/B/C/D.

When using a BPE model, the generated token for " B" will be "ĠB", id 347.

The way you get the probs and compute the argmax is based on the logits of "A", "B", "C", "D" instead of the tokens with the space character " " in front of them.

I don't know if you kept the same logic when making your PR https://github.com/EleutherAI/lm-evaluation-harness/pull/497/files
but I think it is wrong for BPE models that use this special pretokenization, like XGen.

@ollmer (Author) commented Jul 4, 2023

Thanks for pointing it out! Could you provide the complete code snippet to reproduce the issue?

@vince62s commented Jul 4, 2023

Just run your script with the HF model "Salesforce/xgen-7b-8k-base" (or any BPE-based model)
and print the logits of "A", "B", "C", "D", " A", " B", " C", " D" (note the leading space in the second set);
you will immediately see the issue. I had a look at the Eleuther lm harness and I think it has the same issue.
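
A minimal sketch of that check (assuming the Hugging Face transformers library), using the stock GPT-2 tokenizer as the BPE example:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for choice in ["A", "B", "C", "D"]:
    # compare the ids without and with the leading space ("ĠA", "ĠB", ... in the BPE vocab)
    print(choice, tok(choice).input_ids, tok(" " + choice).input_ids)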

@ollmer (Author) commented Jul 7, 2023

  1. In this repo, it's true that the space is not inserted at the end of the sample. I did it precisely the same way it was implemented in this repo for the FLAN models, which I used as a reference implementation when I started this PR. I agree that it could be a problem for some tokenizers. Maybe it should be changed to some tokenizer-agnostic way of encoding the choices. But adding the space at the end of the sample will not solve this problem, as you can see in the following illustration, because the trailing space is tokenized in a completely different way.

[image: tokenization of the sample with and without the trailing space]

  2. This issue is not the case for the Eleuther harness, because it prepends a space (https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/base.py#L749) to each answer before calculating the loglikelihood, like this: `" {}".format(choice)`, and calculates the likelihood of the full sample for each of the possible choices.

@vince62s commented Jul 7, 2023

  1. For this repo, I think just taking the logits of " A", " B", " C", " D" would work independently of the tokenizer.
    SentencePiece will encode " A" the same way as "A".
    BPE will encode " A" as "ĠA", the same way the model will answer.
    In the end it should work this way.

  2. For Eleuther, I don't understand your point here nor the one in the other thread. I thought the whole point of your PR was to align the Harness with the original implementation, hence using the logits of the 4 choices.

In the end I'm not sure whether the Harness' MMLU is OK or not, but I had the feeling it was taking the loglikelihood of "A", "B", "C" and "D".

@ollmer (Author) commented Jul 7, 2023

Follow-up thought: for the logprob selection approach to be correct, we can use the following method. Encode the whole line of the possible answer, Answer: A, where the choice is the last character, then use only the last token of this sequence. This should yield correct results for both tokenizers (assuming that the choice character always translates into exactly one token).
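
A rough sketch of that approach (hypothetical helper, assuming the tokenizer object from evaluate_hf.py):

def choice_token_id(tokenizer, choice):
    # encode the full answer line and keep only the final token id,
    # which should be the choice token regardless of how the tokenizer handles the space
    return tokenizer("Answer: {}".format(choice)).input_ids[-1]

choice_ids = [choice_token_id(tokenizer, c) for c in ["A", "B", "C", "D"]]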

@ollmer (Author) commented Jul 7, 2023

> 2. For Eleuther, I don't understand your point here nor the one in the other thread. I thought the whole point of your PR was to align the Harness with the original implementation, hence using the logits of the 4 choices.

The point was to fully reproduce the original prompt format in lm-harness. Before this discussion, I assumed that logprob selection is numerically equivalent to calculating the loglikelihoods of 4 different samples with different choices. It should be that way because all the tokens are exactly the same except the last one, so the loglikelihood of the last token in lm-harness ends up being the same as the logprob of that token in the OG approach. Plus, the lm-harness approach works for choices longer than 1 token, because they're using the same code for a lot of different multiple-choice tasks.
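
A short way to see the equivalence (notation mine): for prompt tokens $x_{1:n}$ and a one-token choice $c$,

$$\log P(x_{1:n}, c) = \sum_{t=1}^{n} \log P(x_t \mid x_{<t}) + \log P(c \mid x_{1:n}),$$

and the sum on the right-hand side is identical for all four choices, so the argmax over the full-sequence loglikelihoods is the same as the argmax over the last-token logprobs.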

@vince62s commented Jul 8, 2023

For this repo, I think we are saying the same thing: encoding Answer: A or A (and getting the argmax amongst A, B, C, D to compare to gold) is the same thing, except that it is quicker to get the logprob of a single token.

For Eleuther, I have not dug enough into the code, but when I get time I will print the intermediary steps, because I need to explain why we are not getting the same results.

@Heepo commented Jul 26, 2023

@ollmer How about this solution: append the space after "Answer:" instead of prepending it to "{}\n\n".format(df.iloc[idx, k + 1]).
Change this

for j in range(k):
    prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
prompt += "\nAnswer:"
if include_answer:
    prompt += " {}\n\n".format(df.iloc[idx, k + 1])

to

for j in range(k):
    prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
prompt += "\nAnswer: "
if include_answer:
    prompt += "{}\n\n".format(df.iloc[idx, k + 1])

Now in the few-shot setting, no matter which LLM is used, we expect the model to output one of the four tokens ['A', 'B', 'C', 'D'].
Then we don't use the tokenizer() or tokenizer.encode() method; we just use tokenizer.convert_tokens_to_ids() instead to acquire the logits of these four tokens.

>>> llama_tokenizer.convert_tokens_to_ids('A')
29909
>>> neox_tokenizer.convert_tokens_to_ids('A')
34

@karpathy commented Feb 27, 2024

For others who might stumble here in the future: the current implementation here is, at the moment, definitely wrong for some models/tokenizers. E.g. the GPT series would not work. SentencePiece also would not work unless add_dummy_prefix is turned on.

The problem is here:

probs = (
    torch.nn.functional.softmax(
        torch.tensor(
            [
                logits[tokenizer("A").input_ids[-1]],
                logits[tokenizer("B").input_ids[-1]],
                logits[tokenizer("C").input_ids[-1]],
                logits[tokenizer("D").input_ids[-1]],
            ]
        ).float(),
        dim=0,
    )
    .detach()
    .cpu()
    .numpy()
)

where instead the space must be prepended when querying the logits, e.g. logits[tokenizer(" A").input_ids[-1]].
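
In other words, a sketch of the corrected block (just the prepended-space variant of the code quoted above, not a merged fix):

probs = (
    torch.nn.functional.softmax(
        torch.tensor(
            [
                # note the leading space in each query, matching the token the model actually generates
                logits[tokenizer(" A").input_ids[-1]],
                logits[tokenizer(" B").input_ids[-1]],
                logits[tokenizer(" C").input_ids[-1]],
                logits[tokenizer(" D").input_ids[-1]],
            ]
        ).float(),
        dim=0,
    )
    .detach()
    .cpu()
    .numpy()
)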

It might be helpful to play with the tiktokenizer web app and copy paste in the example prompt, e.g.:

The following are multiple choice questions (with answers) about  abstract algebra.

Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
A. 0
B. 1
C. 2
D. 3
Answer: B

Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: B

Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: C

Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: A

Find the characteristic of the ring 2Z.
A. 0
B. 3
C. 12
D. 30
Answer: A

Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer:

(Make sure to show whitespace and delete the last \n if it is present after copy-paste. You should have exactly 344 tokens.)

For GPT-2 as an example, here we see that the correct next token to match the few-shot examples is " B" (with space), i.e. token 347. But tokenizer("B") would return token 33 instead of 347:

(Pdb) tokenizer("B")
{'input_ids': [33], 'attention_mask': [1]}
(Pdb) tokenizer(" B")
{'input_ids': [347], 'attention_mask': [1]}

It seems like there is discussion in this Issue noting this problem, but it was never resolved and the code as is right now is buggy.

@karpathy commented:

Another reason this code will fail is that it hardcodes the max context length to 2048:

while input_ids.shape[-1] > 2048:

but e.g. GPT-2 has a max context length of 1024, so the code will violently fail and throw an error.
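
One way to avoid the hard-coded value is to read the limit from the model config instead (a sketch; the attribute name varies across architectures, so the getattr fallback here is an assumption):

max_len = getattr(model.config, "max_position_embeddings", 2048)
while input_ids.shape[-1] > max_len:
    ...  # shrink the few-shot prompt and re-encode, as in the original loop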

@vince62s commented Feb 27, 2024

> It seems like there is discussion in this Issue noting this problem, but it was never resolved and the code as is right now is buggy.

Here yes, but after this discussion it was fixed in Eleuther's lm_eval. And in OpenNMT-py we left-strip the space: https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/run_mmlu_opennmt.py#L183

@karpathy commented:

I noticed one more bug:

[image: generated prompt showing a double space before the subject name]

The issue is that there is a double space in the intro line, right before the subject. This is because

def format_subject(subject):
    l = subject.split("_")
    s = ""
    for entry in l:
        s += " " + entry
    return s

See how it prepends " " at the beginning. But then

def gen_prompt(train_df, subject, k=-1):
    prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(
        format_subject(subject)
    )

See how there is already a space right after "about" in the string "about {}". This causes the double space.
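
A sketch of one possible fix is to join without the leading space (though, as the next comment points out, changing the prompt changes the numbers):

def format_subject(subject):
    # join the subject words with single spaces and no leading space,
    # so "about {}" produces exactly one space before the subject
    return " ".join(subject.split("_"))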

@vince62s commented:

Exactly (I noticed it too), but you know what, when you want to replicate the papers' numbers you just keep those in :) (seriously, it changes the results)
