Model | Mode | Acc (%) | No answer (%) | Total | Reason Lens |
---|---|---|---|---|---|
claude-3-5-sonnet-20241022 | greedy | 96.66 | 0 | 1319 | 352.89 |
gpt-4o-2024-08-06 | greedy | 96.21 | 0 | 1319 | 462.06 |
o1-mini-2024-09-12 | greedy | 96.06 | 0 | 1319 | 335.77 |
Llama-3.1-405B-Inst@hyperbolic | greedy | 95.98 | 0.08 | 1319 | 421.83 |
Llama-3.1-405B-Inst@sambanova | greedy | 95.91 | 0.08 | 1319 | 464.76 |
Llama-3.1-405B-Inst-fp8@together | greedy | 95.91 | 0.08 | 1319 | 365.07 |
claude-3-5-sonnet-20240620 | greedy | 95.6 | 0 | 1319 | 465.19 |
claude-3-opus-20240229 | greedy | 95.6 | 0 | 1319 | 410.62 |
Mistral-Large-2 | greedy | 95.53 | 0 | 1319 | 391.07 |
gpt-4o-2024-05-13 | greedy | 95.38 | 0 | 1319 | 479.98 |
gemini-1.5-flash-exp-0827 | greedy | 95.01 | 0 | 681 | 515.13 |
gemini-1.5-pro-exp-0801 | greedy | 95 | 0 | 1319 | 298.8 |
claude-3-5-haiku-20241022 | greedy | 94.47 | 0 | 1319 | 455.78 |
gpt-4o-mini-2024-07-18 | greedy | 94.24 | 0 | 1319 | 463.71 |
Meta-Llama-3.1-70B-Instruct | greedy | 94.16 | 0.08 | 1319 | 453.94 |
deepseek-v2-chat-0628 | greedy | 93.93 | 0 | 1319 | 495.52 |
deepseek-v2-coder-0614 | greedy | 93.78 | 0 | 1319 | 566.89 |
gemini-1.5-pro | greedy | 93.4 | 0 | 1319 | 389.17 |
Meta-Llama-3-70B-Instruct | greedy | 93.03 | 0 | 1319 | 352.05 |
Qwen2-72B-Instruct | greedy | 92.65 | 0 | 1319 | 375.96 |
deepseek-v2.5-0908 | greedy | 92.49 | 0 | 1319 | 490.46 |
deepseek-v2-coder-0724 | greedy | 91.51 | 0 | 1319 | 494.62 |
claude-3-sonnet-20240229 | greedy | 91.51 | 0 | 1319 | 762.69 |
gemini-1.5-flash | greedy | 91.36 | 0 | 1319 | 344.61 |
gemma-2-27b-it | greedy | 90.22 | 0 | 1319 | 364.68 |
claude-3-haiku-20240307 | greedy | 88.78 | 0 | 1319 | 587.65 |
gemma-2-9b-it | greedy | 87.41 | 0 | 1319 | 394.83 |
reka-core-20240501 | greedy | 87.41 | 0.08 | 1319 | 414.7 |
Athene-70B | greedy | 86.66 | 0.3 | 1319 | 253.53 |
Yi-1.5-34B-Chat | greedy | 84.08 | 0.08 | 1319 | 553.47 |
Meta-Llama-3.1-8B-Instruct | greedy | 84 | 0.38 | 1319 | 511.97 |
Mistral-Nemo-Instruct-2407 | greedy | 82.79 | 0 | 1319 | 349.81 |
yi-large-preview | greedy | 82.64 | 0 | 1319 | 514.25 |
Phi-3.5-mini-instruct | greedy | 82.03 | 1.21 | 1319 | 665.69 |
gpt-3.5-turbo-0125 | greedy | 80.36 | 0 | 1319 | 350.97 |
command-r-plus | greedy | 80.14 | 0.08 | 1319 | 294.08 |
Qwen2-7B-Instruct | greedy | 80.06 | 0 | 1319 | 452.6 |
yi-large | greedy | 80.06 | 0 | 1319 | 479.87 |
Meta-Llama-3-8B-Instruct | greedy | 78.47 | 0 | 1319 | 429.39 |
Yi-1.5-9B-Chat | greedy | 76.42 | 0.08 | 1319 | 485.39 |
Phi-3-mini-4k-instruct | greedy | 75.51 | 0 | 1319 | 462.53 |
reka-flash-20240226 | greedy | 74.68 | 0.45 | 1319 | 460.06 |
Mixtral-8x7B-Instruct-v0.1 | greedy | 70.13 | 2.27 | 1319 | 361.12 |
Llama-3-Instruct-8B-SimPO-v0.2 | greedy | 57.54 | 2.05 | 1319 | 505.25 |
command-r | greedy | 52.99 | 0 | 1319 | 294.43 |
gemma-2-2b-it | greedy | 51.63 | 0.38 | 1319 | 420.05 |
Qwen2-1.5B-Instruct | greedy | 43.37 | 4.78 | 1319 | 301.67 |
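
The Acc and No answer columns appear to be percentages of Total, so they can be converted back to approximate item counts. The snippet below is a minimal sketch of that arithmetic, not part of the original evaluation code. The helper name `to_count` is hypothetical, and the percentage interpretation is an assumption inferred from the numbers (for example, 0.08% of 1319 is about one item). Note that the gemini-1.5-flash-exp-0827 row reports a Total of 681 rather than 1319, so its accuracy is computed over a smaller set.

```python
def to_count(percent: float, total: int) -> int:
    """Convert a percentage value from the table into an approximate item count.

    Assumption: the column is a percentage of `total`, rounded to two decimals.
    """
    return round(percent / 100 * total)


# (model, Acc, No answer, Total) copied from three rows of the table above.
rows = [
    ("claude-3-5-sonnet-20241022", 96.66, 0.0, 1319),
    ("gemini-1.5-flash-exp-0827", 95.01, 0.0, 681),
    ("Qwen2-1.5B-Instruct", 43.37, 4.78, 1319),
]

for model, acc, no_answer, total in rows:
    correct = to_count(acc, total)
    unanswered = to_count(no_answer, total)
    print(f"{model}: {correct}/{total} correct, {unanswered} without an answer")
```

Comparing counts rather than percentages makes it easier to see how few items separate adjacent rows near the top of the table.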