What is the prompt and setting for GSM8K evaluation? #325
Comments
Same issue here.
Hi, sorry for the late reply. My args are:
Here are my args, but I only get 1.5% ACC on GSM8K. (Btw, I am using the model from Hugging Face.)
And there is a lot of repetition in the response:
The Hugging Face model is no different from Meta's LLaMA model. Are you using few-shot CoT or zero-shot CoT? Repetition is not a problem as long as you have a correct answer extractor (see https://github.com/kojima-takeshi188/zero_shot_cot/blob/main/utils.py), and I also notice this repetition effect with LLaMA. Vanilla LLaMA has not been fine-tuned with instructions and can only do completion.
Thanks for your quick response:
The ACC seems to be normal now, at about 7%.
Yes, I am also getting ~7%, which is still a bit lower than Meta's report. If you have any ideas on improving the 8-shot results to 11%, please let me know.
You can modify the answer cleaning like this:
Taking the first answer as the model's answer will help you get ~11.41% ACC. Btw, I used a modified tokenizer from the Alpaca team.
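The modified cleaner itself is not shown in the thread; below is a minimal sketch of what a "take the first answer" extractor could look like. The regex and the fallback behaviour are assumptions, not the actual FederatedScope or zero_shot_cot code.

```python
import re

def clean_answer(model_response: str) -> str:
    """Pull the predicted number out of a GSM8K completion.

    Taking the FIRST "The answer is ..." occurrence (rather than the last
    number in the text) ignores the repeated continuations that LLaMA tends
    to generate after the real answer.
    """
    text = model_response.replace(",", "")  # "1,234" -> "1234"
    answers = re.findall(r"The answer is\s*(-?\d+\.?\d*)", text)
    if answers:
        return answers[0]  # first answer, not answers[-1]
    # Fall back to the first bare number if the phrase is missing.
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return numbers[0] if numbers else ""
```

For example, `clean_answer("The answer is 9. The answer is 9. The answer is 9.")` returns `"9"` even though the model repeated itself.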
Thanks for the hints. I tried this new clean_answer() script, but the accuracy is still ~7% for llama-7b 8-shot. Could you please share more about the generation? Are you using https://github.com/kojima-takeshi188/zero_shot_cot ?
My generation args are:
Here is my evaluation code, which is built into an FL framework: https://github.com/alibaba/FederatedScope/blob/dev/llm/federatedscope/llm/eval/eval_for_gsm8k/eval.py Have you tried modifying the tokenizer following the Alpaca team?
Same issue here.
Hi @rayrayraykk, thanks for the information you shared!
I use the LLaMA tokenizer with some special tokens added, rather than a tokenizer whose weights have been modified through the Alpaca model.
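A rough sketch of what "the LLaMA tokenizer with some special tokens added" might look like with Hugging Face transformers is below. The literal token strings follow the Alpaca convention, and DefaultToken (asked about later in this thread) is assumed to be an enum wrapping them, so treat the exact values as assumptions.

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "decapoda-research/llama-7b-hf"  # checkpoint named later in this thread
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name)

# Alpaca-style special tokens (assumed values); DefaultToken.PAD_TOKEN.value
# would presumably be the "[PAD]" string here.
special_tokens = {}
if tokenizer.pad_token is None:
    special_tokens["pad_token"] = "[PAD]"
if tokenizer.eos_token is None:
    special_tokens["eos_token"] = "</s>"
if tokenizer.bos_token is None:
    special_tokens["bos_token"] = "<s>"
if tokenizer.unk_token is None:
    special_tokens["unk_token"] = "<unk>"

if special_tokens:
    tokenizer.add_special_tokens(special_tokens)
    # Grow the embedding matrix to cover any newly added token ids.
    model.resize_token_embeddings(len(tokenizer))
```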
Thanks! Have you aligned the results on the 13B model or other versions?
Sorry, I've not tried the 13B model.
Hi @rayrayraykk @HuangOwen, I just followed this repo: https://github.com/kojima-takeshi188/zero_shot_cot to evaluate the model from Hugging Face, but I only got 5% acc with 7B in zero-shot mode. My param setting is:
I think the problem is that the setting is zero-shot rather than few-shot CoT.
To be more specific, the setting in the original LLaMA paper is few-shot CoT (8-shot, I guess). You cannot reproduce the results or bring the accuracy closer with a zero-shot setting.
@HuangOwen, so you can get 11% acc with 8-shot CoT? I will try it now. The LLaMA paper gives two types of results, GSM8K and GSM8K+maj1@k; the second type uses k samples with k = 40. Does that mean 40-shot? Do you have any idea?
Thanks for your quick reply. I have another question in another dialog box.
Yes, you can achieve ~11% accuracy with 8-shot CoT. Remember to add the Alpaca special tokens, as mentioned above. maj1@k indicates "majority voting", which means that you generate answers for the same question k times and use majority voting to select the final answer.
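A minimal sketch of maj1@k as described here: sample the same question k times and majority-vote over the extracted answers. `generate_answer()` is a placeholder for whatever sampled generation call you already use, and `clean_answer()` is the extractor sketched earlier in the thread.

```python
from collections import Counter

def maj_at_k(question: str, k: int = 40) -> str:
    """Generate k sampled completions for one question and majority-vote."""
    votes = []
    for _ in range(k):
        completion = generate_answer(question)  # placeholder: sampled generation, temperature > 0
        answer = clean_answer(completion)       # answer extractor discussed above
        if answer:
            votes.append(answer)
    if not votes:
        return ""
    # The most common extracted answer wins.
    return Counter(votes).most_common(1)[0][0]
```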
What is DefaultToken.PAD_TOKEN.value here?
Hi @HuangOwen, do you have the script you used to reproduce llama-1 results on GSM8K? If you have done something similar for llama-2, it would be great!
I'm also having issues with this. So far my performance on GSM8K using llama-7B is about 5%, very far from 11%. Model: decapoda-research/llama-7b-hf. I'm using gpt-3.5-turbo to extract the numeric value from the model's response, and it works perfectly, so answer extraction is not the issue. Also, I'm using Windows and getting the following warning:
Thanks for the reply!
My tests are still running; I've sampled ~200 answers that give ~5% on the 4-shot prompt. I seriously doubt it will reach 12% once it is finished. Note that for each question I randomly sample 4 exemplars to compose the 4-shot prompt. My
max_new_tokens=512 may be better, even though math answers are usually short. I agree that reasoning tasks fit a lower temperature, but in general settings (QA, creative writing) the temperature for llama/vicuna is 0.7.
The prompt for zero-shot CoT looks like: "How many bolts in total does it take? Let's think step by step."
The prompt for 8-shot CoT looks like: "USER: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now? ASSISTANT: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9USER: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot? ASSISTANT: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5USER: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny? ASSISTANT: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8USER: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room? ASSISTANT: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29USER: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total? ASSISTANT: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39USER: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday? ASSISTANT: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33USER: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today? ASSISTANT: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6USER: Olivia has $23. She bought five bagels for $3 each. How much money does she have left? ASSISTANT: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8USER: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take? ASSISTANT:" (just ignore USER and ASSISTANT and replace them with your own role setting, like Q and A for example).
As for point 3, llama-7b is prone to outputting repetitions, and you can add repetition_penalty during generation. Finally, I recommend using vLLM to accelerate your generation, although the acc may get lower.
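Putting those suggestions together, a generation call might look roughly like the sketch below (reusing the `tokenizer` and `model` from the earlier snippet; the exact argument values the commenters used are not shown in the thread, so these are illustrative).

```python
# prompt = eight_shot_examples + question  (the 8-shot text quoted above)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,        # suggested above, even though answers are usually short
    do_sample=True,
    temperature=0.1,           # low temperature for reasoning (0.7 for general QA/creative use)
    repetition_penalty=1.1,    # illustrative value to curb LLaMA-7B's repetition
    pad_token_id=tokenizer.pad_token_id,
)

# Decode only the newly generated tokens, then run the answer cleaner on them.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
completion = tokenizer.decode(new_tokens, skip_special_tokens=True)
prediction = clean_answer(completion)
```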
Based on your suggestions, I'm running an experiment. Well... it is prepending all 8 prompt examples before the main answer, and sometimes it adds stuff like "### ADVANCED..." after the answer. Any ideas? Commenting out num_beams seems to free some memory, since I can now use the 8-shot prompt (I'm not sure what this parameter does).
What about using zero_shot_cot and seeing the result?
I'm running some experiments. I found that using
I got 7.5% with
How can I set up the special tokenizer mentioned here? What is DefaultToken?
Hi, if I understand correctly, do we have to take the last token in the generated output and assume that to be the answer? Or rather, do we check the last element of the list generated by the answer cleaner? I still can't find the prompt anywhere; does anyone know what the prompt is?
@surya-narayanan For the prompt, you can refer to the 8-shot prompt here: https://github.com/EleutherAI/lm-evaluation-harness/blob/ae79b1217aad7738b91e88a4017c86a5d5e45aa7/lm_eval/tasks/gsm8k/gsm8k-cot.yaml#L8
Hi, I am trying to reproduce the LLaMA results on the GSM8K dataset. I basically follow this repo: https://github.com/kojima-takeshi188/zero_shot_cot. However, the performance across LLaMA-7B/13B/30B is far from the paper's results. I can only get 7.13% for 8-shot with LLaMA-7B. May I know if anyone has reproduced the results, and what prompt are you using?