NYT Connections LLM Benchmark

This benchmark evaluates large language models (LLMs) using 436 NYT Connections puzzles. Three different prompts, not optimized for LLMs through prompt engineering, are used. Both uppercase and lowercase puzzles are assessed.
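Since each puzzle is run under three prompts and in both casings, one way to picture the setup is as a small grid of test cases per puzzle. The templates below are placeholders, not the benchmark's actual (unpublished) prompts; a minimal sketch:

```python
from itertools import product

# Hypothetical prompt templates -- the benchmark's three real prompts
# are not published in this README.
PROMPTS = [
    "Group these words into four groups of four related words: {words}",
    "Solve this NYT Connections puzzle. Words: {words}",
    "Find four groups of four connected words: {words}",
]

def build_cases(words):
    """Return every (prompt template, casing) variant for one puzzle:
    3 templates x 2 casings = 6 rendered prompts."""
    cases = []
    for template, casing in product(PROMPTS, (str.upper, str.lower)):
        rendered = template.format(words=", ".join(casing(w) for w in words))
        cases.append(rendered)
    return cases
```

Each model would then be queried once per rendered prompt, and the results aggregated into the scores below.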

Chart

[Leaderboard chart: NYT Connections (436 puzzles)]

Leaderboard

| Model | Score |
| --- | --- |
| o1-preview | 87.1 |
| o1-mini | 42.2 |
| Multi-turn ensemble | 37.8 |
| GPT-4 Turbo | 28.3 |
| GPT-4o | 26.5 |
| Llama 3.1 405B | 26.3 |
| Claude 3.5 Sonnet (2024-10-22) | 25.9 |
| Claude 3 Opus | 24.8 |
| Grok Beta | 23.7 |
| Gemini 1.5 Pro (Sept) | 22.7 |
| Gemma 2 27B | 18.8 |
| Mistral Large 2 | 17.4 |
| Qwen 2.5 72B | 14.8 |
| Claude 3.5 Haiku | 13.7 |
| DeepSeek-V2.5 | 9.9 |

Notes

  • A temperature setting of 0 was used.
  • Partial credit is awarded if the puzzle isn't completely solved.
  • Only one attempt is allowed per puzzle. Humans solving puzzles on the NYT website get four attempts and a notification when they're one step away from the solution.
  • Multi-turn ensemble is my unpublished system. It utilizes multiple LLMs, multi-turn dialogues, and other proprietary techniques. It is slower and more costly to run, but it performs very well: it outperforms non-o1 LLMs on MMLU-Pro and GPQA.
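The exact partial-credit formula isn't published in this README. A minimal sketch of one plausible scheme, assuming credit is proportional to the number of fully correct groups out of four:

```python
def partial_score(predicted_groups, answer_groups):
    """Fraction of answer groups the model got exactly right.

    NOTE: this is an illustrative assumption, not the benchmark's
    documented scoring rule. Each fully correct group of four earns
    an equal share of the credit; a perfect solve scores 1.0.
    """
    answer_sets = [frozenset(g) for g in answer_groups]
    predicted_sets = [frozenset(g) for g in predicted_groups]
    correct = sum(1 for g in predicted_sets if g in answer_sets)
    return correct / len(answer_sets)
```

Under this sketch, a model that nails two of the four groups but scrambles the rest would receive 0.5 for that puzzle rather than 0.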

Updates and Other Benchmarks

  • Claude 3.5 Haiku added, scoring 13.7.
  • Claude 3.5 Sonnet (2024-10-22) added. Improves to 25.9 from 24.4.
  • Grok Beta added. Improves from 21.3 to 23.7. It's described as an "experimental language model with state-of-the-art reasoning capabilities, best for complex and multi-step use cases. It is the successor of Grok 2 with enhanced context length."
  • Follow @lechmazur on X (Twitter) for other upcoming benchmarks and more.