What is the prompt and setting for GSM8K evaluation? #325

Open · HuangOwen opened this issue Jun 5, 2023 · 32 comments

Labels: community-discussion, question (General questions about using Llama2), research-paper (Issues and questions relating to the published architecture or methodology)

Comments

@HuangOwen

Hi, I am trying to reproduce LLaMA's results on the GSM8K dataset. I basically follow this repo: https://github.com/kojima-takeshi188/zero_shot_cot. However, the performance across LLaMA-7B/13B/30B is far from the paper's results. I can only get 7.13% with 8-shot prompting on LLaMA-7B. May I know if anyone has reproduced the results, and what prompt are you using?

@rayrayraykk

rayrayraykk commented Jun 6, 2023

Same issue here.
Could you share your args for generate?

@HuangOwen
Author

Hi, sorry for the late reply. My args are:
temperature = 0.8, top_p = 0.95, max_seq_len = 512, max_batch_size = 1
The few-shot prompt is from https://github.com/kojima-takeshi188/zero_shot_cot
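For readers trying to reproduce this, here is a minimal sketch of how those settings might be wired into a Hugging Face generation loop. It is an illustration, not the exact script used above: the checkpoint name and the FEW_SHOT_EXEMPLARS placeholder are assumptions, and the real 8-shot exemplars come from the zero_shot_cot repo.

    # Rough sketch only: build an 8-shot CoT prompt and sample with
    # temperature=0.8 / top_p=0.95.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "huggyllama/llama-7b"  # assumed LLaMA-7B checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto")

    FEW_SHOT_EXEMPLARS = "..."  # the 8 "Q: ... A: ... The answer is N." blocks from the repo
    question = "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
    prompt = f"{FEW_SHOT_EXEMPLARS}\n\nQ: {question}\nA:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        do_sample=True,      # required for temperature/top_p to have any effect
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=512,
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    print(completion)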

@rayrayraykk

rayrayraykk commented Jun 13, 2023

Here are my args, but I only get 1.5% accuracy on GSM8K. (BTW, I am using the model from Hugging Face.)

        response = self.model.generate(input_ids=input_ids,
                                       attention_mask=attention_mask,
                                       max_new_tokens=512,
                                       top_p=0.95,
                                       temperature=0.8)
        response = \
            self.tokenizer.decode(response[0][input_ids.shape[1]:],
                                  skip_special_tokens=True)

And there is a lot of repetition in the response:

Q: Alisa biked 12 miles per hour for 4.5 hours. Stanley biked at 10 miles per hour for 2.5 hours. How many miles did Alisa and Stanley bike in total?
A:Alisa biked 12 miles per hour for 4.5 hours. So she biked 12 * 4.5 = 54 miles. Stanley biked 10 miles per hour for 2.5 hours. So he biked 10 * 2.5 = 25 miles. So in total they biked 54 + 25 = 79 miles. The answer is 79.

Q: There are 100 students in the class. 20 students are absent. How many students are in the class?
A: There are 100 students originally. 20 students are absent. So 100 - 20 = 80. The answer is 80.

Q: There are 100 students in the class. 20 students are absent. How many students are in the class?
A: There are 100 students originally. 20 students are absent. So 100 - 20 = 80. The answer is 80.

Q: There are 100 students in the class. 20 students are absent. How many students are in the class?
A: There are 100 students originally. 20 students are absent. So 100 - 20 = 80. The answer is 80.


@HuangOwen
Author

The Hugging Face model is no different from Meta's LLaMA model. Are you using few-shot CoT or zero-shot CoT? Repetition is not a problem as long as you have a correct answer extractor (see https://github.com/kojima-takeshi188/zero_shot_cot/blob/main/utils.py); I have also noticed this repetition effect with LLaMA. Vanilla LLaMA has not been fine-tuned with instructions and can only do completion.
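As a hedged sketch of what such an extractor can look like when the model keeps generating extra Q/A pairs: cut the completion at the first hallucinated "Q:" and then pull out the number. The "Q:" delimiter and the "The answer is" trigger are assumptions matching the few-shot format used in this thread, not the exact utils.py code.

    import re

    def extract_answer(completion, trigger="The answer is"):
        # Keep only the first answer block; anything after a hallucinated "Q:" is noise.
        completion = completion.split("Q:")[0]
        # If the trigger appears, look only at the text after it.
        if trigger in completion:
            completion = completion.split(trigger)[-1]
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        return numbers[-1] if numbers else "[invalid]"

    # On the repeated output shown above:
    # extract_answer("... they biked 54 + 25 = 79 miles. The answer is 79.\n\nQ: There are 100 ...")
    # returns "79"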

@rayrayraykk

Thanks for your quick response.
I modified the answer_cleansing function:

def answer_cleansing(args, pred):

    print("pred_before : " + pred)

    if args.method in ("few_shot", "few_shot_cot"):
        pred = pred.lower()
        split = args.direct_answer_trigger_for_fewshot.lower()
        preds = pred.split(split)
        answer_flag = True if len(preds) > 1 else False
        if answer_flag:
            pred = preds[1]
        else:
            pred = preds[-1]
...

The accuracy seems normal now, at about 7%.

@HuangOwen
Author

Yes, I am also getting ~7%, which is still a bit lower than Meta's report. If you have any ideas for improving the 8-shot results to 11%, please let me know.

@rayrayraykk

rayrayraykk commented Jun 14, 2023

You can modify the answer cleaning like this:

def clean_answer(model_pred):
    model_pred = model_pred.lower()
    preds = model_pred.split(ANSWER_TRIGGER.lower())
    answer_flag = True if len(preds) > 1 else False
    if answer_flag:
        # Pick first answer with flag
        pred = preds[1]
    else:
        # Pick last number without flag
        pred = preds[-1]

    pred = pred.replace(",", "")
    pred = [s for s in re.findall(r'-?\d+\.?\d*', pred)]

    if len(pred) == 0:
        return INVALID_ANS

    if answer_flag:
        # choose the first element in list
        pred = pred[0]
    else:
        # choose the last element in list
        pred = pred[-1]

    # (For arithmetic tasks) if a word ends with period, it will be omitted ...
    if pred[-1] == ".":
        pred = pred[:-1]

    return pred

Taking the first answer as the model's answer will help you get ~11.41% accuracy. BTW, I used a modified tokenizer from the Alpaca team.
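For completeness, here is a small usage sketch showing how clean_answer() above could be scored against the GSM8K gold labels. The "#### N" suffix is the dataset's own answer format; the ANSWER_TRIGGER and INVALID_ANS values are assumptions chosen to be consistent with the snippet.

    ANSWER_TRIGGER = "The answer is"  # assumed trigger, matching the few-shot exemplars
    INVALID_ANS = "[invalid]"         # assumed sentinel returned by clean_answer()

    def extract_gold_answer(answer_field):
        # GSM8K gold answers end with a line like "#### 72"
        return answer_field.split("####")[-1].strip().replace(",", "")

    def is_correct(model_output, answer_field):
        pred = clean_answer(model_output)  # clean_answer() as defined above
        return pred != INVALID_ANS and pred == extract_gold_answer(answer_field)

    # accuracy = sum(is_correct(o, ex["answer"]) for o, ex in zip(outputs, test_set)) / len(test_set)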

@HuangOwen
Author

Thanks for the hints. I tried this new clean_answer() script, but the accuracy is still ~7% for LLaMA-7B 8-shot. Could you please share more about the generation? Are you using https://github.com/kojima-takeshi188/zero_shot_cot?

@rayrayraykk

rayrayraykk commented Jun 15, 2023

My generation args are:

generate_kwargs = dict(max_new_tokens=512, top_p=0.95, temperature=0.8)

Here is my evaluation code which is built in an FL framework: https://github.com/alibaba/FederatedScope/blob/dev/llm/federatedscope/llm/eval/eval_for_gsm8k/eval.py

Have you tried modifying the tokenizer following the Alpaca team?

@SAOHPRWHG

Same issue here.
Thanks for your information! How about the performance on 13B, 30B, and 65B?

@JiayiFu

JiayiFu commented Jun 30, 2023

Hi @rayrayraykk , thanks for the information you shared!
But there is one thing I am a little confused about. Regarding the modified tokenizer from the Alpaca team: are you referring to tokenizer weights that were modified through Alpaca instruction fine-tuning? I understand that if we use the parameters of a model after fine-tuning, it wouldn't be the result of the original LLaMA model, right?
If we do not use the modified model, could the 8-shot accuracy on GSM8K be consistent with what's reported in the paper? Thanks!

@rayrayraykk

rayrayraykk commented Jul 4, 2023

I use the LLaMA tokenizer with some special tokens added, not a tokenizer whose weights have been modified through the Alpaca model.

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        cache_dir=cache_dir,
        model_max_length=tok_len,
        padding_side="right",
        use_fast=False,
    )

    special_tokens = dict()
    if tokenizer.pad_token is None:
        special_tokens["pad_token"] = DefaultToken.PAD_TOKEN.value
    if tokenizer.eos_token is None:
        special_tokens["eos_token"] = DefaultToken.EOS_TOKEN.value
    if tokenizer.bos_token is None:
        special_tokens["bos_token"] = DefaultToken.BOS_TOKEN.value
    if tokenizer.unk_token is None:
        special_tokens["unk_token"] = DefaultToken.UNK_TOKEN.value

    num_new_tokens = tokenizer.add_special_tokens(special_tokens)
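DefaultToken is not defined in this snippet. As an assumption (modeled on the Stanford Alpaca convention rather than quoted from the FederatedScope source), it is roughly an enum like the following:

    from enum import Enum

    class DefaultToken(Enum):
        # Assumed values following the Alpaca convention; check the FederatedScope
        # source for the exact strings before relying on them.
        PAD_TOKEN = "[PAD]"
        EOS_TOKEN = "</s>"
        BOS_TOKEN = "<s>"
        UNK_TOKEN = "<unk>"

    # With this definition, DefaultToken.PAD_TOKEN.value == "[PAD]", which is the
    # string passed to tokenizer.add_special_tokens() above.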

@JiayiFu

JiayiFu commented Jul 5, 2023

Thanks! Have you checked whether the results align for the 13B model or other versions?

@rayrayraykk

Sorry, I haven't tried the 13B model.

@kkchenlight

Hi @rayrayraykk @HuangOwen , I just followed this repo: https://github.com/kojima-takeshi188/zero_shot_cot to evaluate the model from Hugging Face, but I only got 5% accuracy with 7B in zero-shot mode. My parameter settings are:

  1. I just take the max-probability output (greedy decoding); no temperature or top_k is set.
  2. For zero-shot, I set max_new_tokens to 16; I don't think it needs a 512-token sequence length for zero-shot.

Is there anything that would help bring the accuracy closer to the paper? Thanks for your time.

@HuangOwen
Author

I think the problem is max_new_tokens, which should be at least 256 even in the zero-shot setting (LLaMA needs room to lay out its reasoning). In addition, if you add a CoT prefix to the prompt ("Let's think step by step") and do not add any few-shot exemplars, ~5% accuracy seems OK for 7B.

@HuangOwen
Author

To be more specific, the setting in the original LLaMA paper is few-shot CoT (8-shot, I guess). You will not be able to reproduce the results or get the accuracy closer to the paper with a zero-shot setting.

@kkchenlight

@HuangOwen, so you can get 11% accuracy with 8-shot CoT? I will try it now. The LLaMA paper gives two types of results, GSM8K and GSM8K + maj1@k; the second uses k samples with k = 40. Does that mean 40-shot? Do you have any idea?

@kkchenlight

Thanks for your quick reply. I have another question in another thread.

@HuangOwen
Author

HuangOwen commented Jul 12, 2023

Yes, you can achieve ~11% accuracy with 8-shot CoT. Remember to add the Alpaca special tokens mentioned above. maj1@k indicates "majority voting": you generate an answer for the same question k times and use majority voting to select the final answer.
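A minimal sketch of maj1@k under that description, assuming a generate_fn that returns one sampled completion and an extract_fn that returns the numeric answer (both hypothetical helpers, e.g. the extractor sketched earlier in this thread):

    from collections import Counter

    def majority_vote(prompt, generate_fn, extract_fn, k=40):
        # Sample k completions at a non-zero temperature, extract each numeric
        # answer, and return the most common one.
        answers = []
        for _ in range(k):
            completion = generate_fn(prompt)
            answer = extract_fn(completion)
            if answer != "[invalid]":
                answers.append(answer)
        if not answers:
            return "[invalid]"
        return Counter(answers).most_common(1)[0][0]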

@JianqiaoLu

What is DefaultToken.PAD_TOKEN.value here?

@kushalj001

Hi @HuangOwen, do you have the script you used to reproduce llama-1 results on GSM8k? If you have done something similar for llama-2, it would be great!
Thanks

@fernandotenorio

fernandotenorio commented Sep 5, 2023

I'm also having issues with this. So far my performance on GSM8K using LLaMA-7B is about 5%, very far from 11%.
I'm using few-shot prompting with 4 exemplars (the same ones found in the repo above).

Model: decapoda-research/llama-7b-hf
temperature=0.8
num_beams=4

I'm using gpt-3.5-turbo to extract the numeric value from the model's response, and it works perfectly, so answer extraction is not the issue.

Also, I'm on Windows and getting the following warning:
"do_sample is set to False. However, temperature is set to 0.8 -- this flag is only used in sample-based generation models. You should set do_sample=True or unset temperature."

@JianqiaoLu

  1. Turn off beam search and use do_sample = True.
  2. A lower temperature actually gives better results; try temperature = 0.1, for example.
  3. gpt-3.5-turbo for extracting the answer is unnecessary; you can just take the last number (including fractions, ints, and floats) in the output as your final answer (see the sketch below).
  4. 8-shot looks like the classical setting; why do you use 4-shot?
  5. In my setting, zero-shot CoT of LLaMA-1 on the GSM8K test set is approximately 7%, while 8-shot CoT is 12%.
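For point 3, a hedged sketch of "take the last number in the completion as the prediction" (the fraction handling is an assumption, not quoted code):

    import re

    # Matches ints, decimals, and simple fractions like 3/4 (commas stripped first).
    NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?(?:/\d+)?")

    def last_number(completion):
        matches = NUMBER_RE.findall(completion.replace(",", ""))
        return matches[-1] if matches else None

    # last_number("5 + 4 = 9. The answer is 9")  -> "9"
    # last_number("half of 7 is 7/2 = 3.5")      -> "3.5"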

@fernandotenorio

Thanks for the reply!

  1. I will try that.
  2. Is that true for reasoning tasks?
  3. llama-7b is messy and outputs crazy stuff, so I'm using the GPT API just to make sure.
  4. Memory limits; I'm running on an RTX 3060. I'll try increasing to 6.
  5. Can you give an example of a prompt for both cases? For zero-shot CoT, do you just add "let's think in steps" after the question?

My tests are still running; I've sampled ~200 answers that give ~5% on the 4-shot prompt. I seriously doubt it will reach 12% after it is finished. Note that for each question I randomly sample 4 exemplars to compose the 4-shot prompt. My max_new_tokens is set to 256.

@JianqiaoLu

max_new_tokens = 512 may work better, even though math answers are usually short.

I agree that reasoning tasks fit a lower temperature, but in general settings (QA, creative writing), the temperature for LLaMA/Vicuna is 0.7.

A prompt for zero-shot CoT looks like: "How many bolts in total does it take? Let's think step by step."

A prompt for 8-shot CoT looks like: "USER: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now? ASSISTANT: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9USER: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot? ASSISTANT: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5USER: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny? ASSISTANT: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8USER: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room? ASSISTANT: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29USER: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total? ASSISTANT: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39USER: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday? ASSISTANT: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33USER: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today? ASSISTANT: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6USER: Olivia has $23. She bought five bagels for $3 each. How much money does she have left? ASSISTANT: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8USER: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take? ASSISTANT:" (just ignore the USER and ASSISTANT markers and replace them with your own role setting, like Q and A, for example)

For point 3, llama-7b is prone to outputting repetitions; you can actually add repetition_penalty during generation.

Finally, I recommend using vLLM to accelerate your generation, although the accuracy may get a bit lower.
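If you take the vLLM suggestion, here is a minimal sketch (the checkpoint name and sampling values are assumptions mirroring the settings discussed in this thread; a stop string is used instead of repetition_penalty, whose availability depends on the vLLM version):

    from vllm import LLM, SamplingParams

    llm = LLM(model="huggyllama/llama-7b")  # assumed checkpoint; use your own weights

    prompt = "..."  # the 8-shot CoT prompt shown above, ending with the final "ASSISTANT:"
    sampling = SamplingParams(
        temperature=0.1,   # low temperature, as suggested for reasoning tasks
        top_p=0.95,
        max_tokens=512,
        stop=["Q:"],       # cut off before the model hallucinates the next question
    )

    outputs = llm.generate([prompt], sampling)
    print(outputs[0].outputs[0].text)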

@fernandotenorio

fernandotenorio commented Sep 5, 2023

Based on your suggestions I'm running an experiment with temperature=0.1, repetition_penalty=1.2, do_sample=True and top_p=0.75 (num_beams is commented out). I'm also using 8-shot prompting with max_new_tokens = 512.

Well... it prepends all 8 prompt examples before the main answer, and sometimes it adds stuff like "### ADVANCED..." after the answer. I can't see how one would extract the answer from this mess; even ChatGPT may struggle, since the prompt examples already contain "the answer is x". After the first 100 answers I ran the accuracy script, and it got only 1 correct... so I don't really know what I'm doing wrong.

Any ideas? Commenting out num_beams seems to free some memory, since I can now use the 8-shot prompt (I'm not sure what this parameter does).

@JianqiaoLu

What about using zero_shot_cot and seeing the result?

@albertodepaola added the research-paper and question labels on Sep 6, 2023
@fernandotenorio

fernandotenorio commented Sep 6, 2023

I'm running some experiments. I found that using top_k=1 with top_p=0 gives a more stable model, so I fixed those and I'm varying max_new_tokens and the number of shots n. So far the best performance is 5.2% with n=4 (I'm running n=8, not seeing much improvement).
What is the tokenizer change that is discussed here? My tokenizer setup is just:

tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")

@fernandotenorio

I got 7.5% with

temperature = 0.1,
top_p = 0,
top_k = 1,
max_new_tokens = 512,
repetition_penalty = 1.2

How can I set up the special tokenizer mentioned here? What is DefaultToken?

@surya-narayanan

surya-narayanan commented Mar 1, 2024

Hi, if I understand correctly, do we take the last token in the generated text and assume that to be the answer?

Or rather, do we check the last element of the list generated by pred = [s for s in re.findall(r'-?\d+\.?\d*', pred)], stripping a trailing period if it has one?

I still can't find the prompt anywhere. Does anyone know what the prompt is?

@Praful932

@surya-narayanan
You can do a simple regex search, as is done in the LM Evaluation Harness.

For the prompt, you can refer to the same 8-shot prompt here: https://github.com/EleutherAI/lm-evaluation-harness/blob/ae79b1217aad7738b91e88a4017c86a5d5e45aa7/lm_eval/tasks/gsm8k/gsm8k-cot.yaml#L8
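As a rough illustration of that regex-based extraction (an approximation of the idea only; check the harness YAML linked above for the exact patterns it uses):

    import re

    def harness_style_extract(completion):
        # First try a strict "The answer is N" pattern, then fall back to the
        # last number in the completion.
        strict = re.search(r"The answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", completion)
        if strict:
            return strict.group(1).replace(",", "")
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        return numbers[-1] if numbers else ""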
