This benchmark evaluates large language models (LLMs) on 436 NYT Connections puzzles. Three different prompts are used, none optimized for LLMs through prompt engineering. Each puzzle is assessed in both uppercase and lowercase form.
Model | Score |
---|---|
o1-preview | 87.1 |
o1-mini | 42.2 |
Multi-turn ensemble | 37.8 |
GPT-4 Turbo | 28.3 |
GPT-4o | 26.5 |
Llama 3.1 405B | 26.3 |
Claude 3.5 Sonnet (2024-10-22) | 25.9 |
Claude 3 Opus | 24.8 |
Grok Beta | 23.7 |
Gemini 1.5 Pro (Sept) | 22.7 |
Gemma 2 27B | 18.8 |
Mistral Large 2 | 17.4 |
Qwen 2.5 72B | 14.8 |
Claude 3.5 Haiku | 13.7 |
DeepSeek-V2.5 | 9.9 |
- A temperature setting of 0 was used.
- Partial credit is awarded if the puzzle isn't fully solved (see the scoring sketch after this list).
- Only one attempt is allowed per puzzle. Humans solving puzzles on the NYT website get four attempts and a notification when they're one step away from the solution.
- Multi-turn ensemble is my unpublished system. It combines multiple LLMs, multi-turn dialogues, and other proprietary techniques. It is slower and more costly to run, but it performs very well: it outperforms non-o1 LLMs on MMLU-Pro and GPQA.
- Claude 3.5 Haiku added, scoring 13.7.
- Claude 3.5 Sonnet (2024-10-22) added. Improves to 25.9 from the previous Sonnet's 24.4.
- Grok Beta added. Improves to 23.7 from its predecessor's 21.3. It is described as an "experimental language model with state-of-the-art reasoning capabilities, best for complex and multi-step use cases. It is the successor of Grok 2 with enhanced context length."
- Follow @lechmazur on X (Twitter) for other upcoming benchmarks and more.
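To make the partial-credit rule concrete, below is a minimal Python sketch of single-attempt scoring. It assumes one point per exactly matched group (a quarter of the puzzle each); the function name `score_attempt`, the weighting, and the sample words are illustrative assumptions, not the benchmark's published code.

```python
# Illustrative sketch only: the benchmark's actual scoring code is unpublished.
# Assumption: a puzzle is four groups of four words, the model gets one attempt,
# and each exactly matched group earns a quarter of the puzzle's credit.

def score_attempt(gold_groups, proposed_groups):
    """Return the fraction of the puzzle solved in a single attempt (0.0-1.0)."""
    gold = {frozenset(g) for g in gold_groups}
    # A proposed group earns credit only if it matches a gold group exactly;
    # a group with even one misplaced word earns nothing.
    correct = sum(1 for g in proposed_groups if frozenset(g) in gold)
    return correct / len(gold_groups)

gold = [
    {"BASS", "SOLE", "TROUT", "PIKE"},      # fish
    {"MARS", "VENUS", "PLUTO", "MERCURY"},  # celestial bodies
    {"RED", "BLUE", "GREEN", "YELLOW"},     # colors
    {"OAK", "ELM", "PINE", "BIRCH"},        # trees
]
proposed = [
    {"BASS", "SOLE", "TROUT", "PIKE"},
    {"MARS", "VENUS", "PLUTO", "MERCURY"},
    {"RED", "BLUE", "GREEN", "OAK"},        # YELLOW and OAK swapped:
    {"ELM", "PINE", "BIRCH", "YELLOW"},     # both of these groups score 0
]
print(score_attempt(gold, proposed))  # 0.5
```

Under a rule like this, a single swapped word invalidates both affected groups, which is why one misstep can cost half a puzzle's credit in a single-attempt setting.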