Reflect Leaderboard
Which model should I use for reflect() operations?
Want to see another model here? Open a GitHub issue.
| Rank | Model | Provider | Total Score (Quality + Speed + Cost) | Quality (LoComo accuracy) | Speed (end-to-end per agentic call) | Cost ($ per 1M tokens, input/output) |
|---|---|---|---|---|---|---|
| 1 🏆 | openai/gpt-oss-120b | Groq | 86.6 | 94.2 (94%) | 69.4 (2.2 s/call) | 84.7 ($0.10 / $0.16) |
| 2 🥈 | openai/gpt-oss-20b | Groq | 86.3 | 94.2 (94%) | 64.3 (2.8 s/call) | 91.7 ($0.05 / $0.08) |
| 3 🥉 | gemini-2.5-flash-lite | Google | 81.5 | 88.4 (88%) | 67.7 (2.4 s/call) | 76.9 ($0.10 / $0.40) |
| 4 | gpt-4o-mini | OpenAI | 77.0 | 93.8 (94%) | 41.7 (7.0 s/call) | 69.0 ($0.15 / $0.60) |
| 5 | gpt-4.1-nano | OpenAI | 75.2 | 80.2 (80%) | 59.5 (3.4 s/call) | 81.6 ($0.07 / $0.30) |
| 6 | gpt-4.1-mini | OpenAI | 74.9 | 88.0 (88%) | 47.2 (5.6 s/call) | 69.0 ($0.15 / $0.60) |
| 7 | gpt-5-nano | OpenAI | 71.6 | 91.7 (92%) | 18.2 (22.5 s/call) | 80.0 ($0.05 / $0.40) |
| 8 | gemini-2.5-flash | Google | 70.4 | 85.5 (86%) | 52.9 (4.4 s/call) | 39.2 ($0.30 / $2.50) |
| 9 | gemini-3-flash-preview | Google | 67.6 | 93.0 (93%) | 27.2 (13.4 s/call) | 33.3 ($0.50 / $3.00) |
| 10 | llama-3.1-8b-instant | Groq | 65.0 | 82.2 (82%) | 7.6 (61.1 s/call) | 91.7 ($0.05 / $0.08) |
| 11 | gpt-5.4 | OpenAI | 56.0 | 76.9 (77%) | 33.9 (9.8 s/call) | 9.1 ($2.50 / $15.00) |
| 12 | gpt-5-mini | OpenAI | 53.6 | 70.7 (71%) | 18.0 (22.8 s/call) | 44.4 ($0.25 / $2.00) |
| 13 | gpt-5.2 | OpenAI | 53.2 | 71.9 (72%) | 34.1 (9.7 s/call) | 10.3 ($1.75 / $14.00) |
| 14 | llama-3.3-70b-versatile | Groq | 53.0 | 73.1 (73%) | 6.2 (76.0 s/call) | 50.4 ($0.59 / $0.79) |
About This Benchmark
Hindsight v0.4.13

| Metric | Value shown | How it is measured |
|---|---|---|
| Quality | % accuracy | Accuracy on the LoComo long-term conversation benchmark (conv-43, 242 questions). Each question is answered using a single reflect() call and judged by gemini-2.5-flash-lite. The retain/ingestion step always uses gemini-2.5-flash as a constant baseline — only the reflect step tests the model under evaluation. |
| Speed | avg latency (s/call) | Mean end-to-end wall-clock time per reflect() call (arithmetic average across all 242 questions). This is not a single LLM inference time: each reflect() call runs a full agentic pipeline internally, potentially with several model calls. The model under test powers the answer-generation step; the retain model (gemini-2.5-flash) is fixed. Tests were run on a local MacBook with a 1 Gbps internet connection from Europe; results will vary with your network conditions and server-side load. |
| Cost | $ input / $ output per 1M tokens | Published list prices (USD per million tokens) as advertised by each provider. Prices may have changed since testing. |
| Total Score | 0 – 100 | Weighted composite: Quality 60% + Speed 25% + Cost 15%. Reliability is not included since the reflect endpoint does not involve schema conformance risk. Speed uses 100 × 5 / (5 + latency_s) (5s reference — calibrated for reflect() call durations); Cost uses 100 × 0.001 / (0.001 + cost_per_req). |
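The composite formula above can be sketched in a few lines of Python. The function names are illustrative (not part of Hindsight); the weights (60/25/15) and the speed curve with its 5 s reference point come directly from the table, and the numbers check out against the gpt-oss-120b row.

```python
# Total Score composite as described in the methodology table.
# Weights: Quality 60%, Speed 25%, Cost 15%.

def speed_score(latency_s: float, ref_s: float = 5.0) -> float:
    """Map mean per-call latency to 0-100; 5 s reference calibrated for reflect()."""
    return 100 * ref_s / (ref_s + latency_s)

def cost_score(cost_per_req: float, ref_cost: float = 0.001) -> float:
    """Map estimated $ per request to 0-100 using the $0.001 reference."""
    return 100 * ref_cost / (ref_cost + cost_per_req)

def total_score(quality: float, speed: float, cost: float) -> float:
    """Weighted composite of the three 0-100 component scores."""
    return 0.60 * quality + 0.25 * speed + 0.15 * cost

# Example: reproduce the openai/gpt-oss-120b row (2.2 s/call, scores 94.2 / 84.7)
print(round(speed_score(2.2), 1))                    # 69.4
print(round(total_score(94.2, 69.4, 84.7), 1))       # 86.6
```

Plugging in the leaderboard's own component scores reproduces the published Total Score, which is a quick way to sanity-check the weights.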


