This benchmark evaluates large language models (LLMs) on 436 NYT Connections puzzles. Three different prompts are used, none optimized for LLMs through prompt engineering. Each puzzle is assessed in both uppercase and lowercase form.
Model | Score |
---|---|
o1-preview | 87.1 |
o1-mini | 42.2 |
Multi-turn ensemble | 37.8 |
GPT-4 Turbo | 28.3 |
GPT-4o | 26.5 |
Llama 3.1 405B | 26.3 |
Claude 3.5 Sonnet (2024-10-22) | 25.9 |
Claude 3 Opus | 24.8 |
Grok Beta | 23.7 |
Gemini 1.5 Pro (Sept) | 22.7 |
Gemma 2 27B | 18.8 |
Mistral Large 2 | 17.4 |
Qwen 2.5 72B | 14.8 |
Claude 3.5 Haiku | 13.7 |
DeepSeek-V2.5 | 9.9 |
- A temperature setting of 0 was used.
- Partial credit is awarded if the puzzle isn't fully solved (see the scoring sketch after this list).
- Only one attempt is allowed per puzzle. Humans solving puzzles on the NYT website get four attempts and a notification when they're one step away from the solution.
- Multi-turn ensemble is my unpublished system. It combines multiple LLMs, multi-turn dialogues, and other proprietary techniques. It is slower and more costly to run, but it performs very well: it outperforms non-o1 LLMs on MMLU-Pro and GPQA.
- Claude 3.5 Haiku added, scoring 13.7.
- Claude 3.5 Sonnet (2024-10-22) added. Improves to 25.9 from the previous Sonnet's 24.4.
- Grok Beta added. Improves to 23.7 from its predecessor's 21.3. It is described as an "experimental language model with state-of-the-art reasoning capabilities, best for complex and multi-step use cases. It is the successor of Grok 2 with enhanced context length."
- Follow @lechmazur on X (Twitter) for other upcoming benchmarks and more.
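To make the partial-credit rule concrete, below is a minimal Python sketch of single-attempt scoring. It assumes one point per exactly matched group (a quarter of the puzzle each); the function name `score_attempt`, the weighting, and the sample words are illustrative assumptions, not the benchmark's published code.

```python
# Illustrative sketch only: the benchmark's actual scoring code is unpublished.
# Assumption: a puzzle is four groups of four words, the model gets one attempt,
# and each exactly matched group earns a quarter of the puzzle's credit.

def score_attempt(gold_groups, proposed_groups):
    """Return the fraction of the puzzle solved in a single attempt (0.0-1.0)."""
    gold = {frozenset(g) for g in gold_groups}
    # A proposed group earns credit only if it matches a gold group exactly;
    # a group with even one misplaced word earns nothing.
    correct = sum(1 for g in proposed_groups if frozenset(g) in gold)
    return correct / len(gold_groups)

gold = [
    {"BASS", "SOLE", "TROUT", "PIKE"},      # fish
    {"MARS", "VENUS", "PLUTO", "MERCURY"},  # celestial bodies
    {"RED", "BLUE", "GREEN", "YELLOW"},     # colors
    {"OAK", "ELM", "PINE", "BIRCH"},        # trees
]
proposed = [
    {"BASS", "SOLE", "TROUT", "PIKE"},
    {"MARS", "VENUS", "PLUTO", "MERCURY"},
    {"RED", "BLUE", "GREEN", "OAK"},        # YELLOW and OAK swapped:
    {"ELM", "PINE", "BIRCH", "YELLOW"},     # both of these groups score 0
]
print(score_attempt(gold, proposed))  # 0.5
```

Under a rule like this, a single swapped word invalidates both affected groups, which is why one misstep can cost half a puzzle's credit in a single-attempt setting.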