
Reflect Leaderboard

Which model should I use for reflect() operations?

Want to see another model here? Open a GitHub issue.
| Rank | Model | Provider | Total Score (Quality + Speed + Cost) | Quality (LoComo accuracy) | Speed (end-to-end per agentic call) | Cost ($ per 1M tokens, input/output) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 🏆 | openai/gpt-oss-120b | Groq | 86.6 | 94.2 (94% accuracy) | 69.4 (2.2 s/call) | 84.7 ($0.10 / $0.16) |
| 2 🥈 | openai/gpt-oss-20b | Groq | 86.3 | 94.2 (94% accuracy) | 64.3 (2.8 s/call) | 91.7 ($0.05 / $0.08) |
| 3 🥉 | gemini-2.5-flash-lite | Google | 81.5 | 88.4 (88% accuracy) | 67.7 (2.4 s/call) | 76.9 ($0.10 / $0.40) |
| 4 | gpt-4o-mini | OpenAI | 77.0 | 93.8 (94% accuracy) | 41.7 (7.0 s/call) | 69.0 ($0.15 / $0.60) |
| 5 | gpt-4.1-nano | OpenAI | 75.2 | 80.2 (80% accuracy) | 59.5 (3.4 s/call) | 81.6 ($0.07 / $0.30) |
| 6 | gpt-4.1-mini | OpenAI | 74.9 | 88.0 (88% accuracy) | 47.2 (5.6 s/call) | 69.0 ($0.15 / $0.60) |
| 7 | gpt-5-nano | OpenAI | 71.6 | 91.7 (92% accuracy) | 18.2 (22.5 s/call) | 80.0 ($0.05 / $0.40) |
| 8 | gemini-2.5-flash | Google | 70.4 | 85.5 (86% accuracy) | 52.9 (4.4 s/call) | 39.2 ($0.30 / $2.50) |
| 9 | gemini-3-flash-preview | Google | 67.6 | 93.0 (93% accuracy) | 27.2 (13.4 s/call) | 33.3 ($0.50 / $3.00) |
| 10 | llama-3.1-8b-instant | Groq | 65.0 | 82.2 (82% accuracy) | 7.6 (61.1 s/call) | 91.7 ($0.05 / $0.08) |
| 11 | gpt-5.4 | OpenAI | 56.0 | 76.9 (77% accuracy) | 33.9 (9.8 s/call) | 9.1 ($2.50 / $15.00) |
| 12 | gpt-5-mini | OpenAI | 53.6 | 70.7 (71% accuracy) | 18.0 (22.8 s/call) | 44.4 ($0.25 / $2.00) |
| 13 | gpt-5.2 | OpenAI | 53.2 | 71.9 (72% accuracy) | 34.1 (9.7 s/call) | 10.3 ($1.75 / $14.00) |
| 14 | llama-3.3-70b-versatile | Groq | 53.0 | 73.1 (73% accuracy) | 6.2 (76.0 s/call) | 50.4 ($0.59 / $0.79) |

About This Benchmark

Hindsight v0.4.13
| Metric | Value shown | How it is measured |
| --- | --- | --- |
| Quality | % accuracy | Accuracy on the LoComo long-term conversation benchmark (conv-43, 242 questions). Each question is answered with a single reflect() call and judged by gemini-2.5-flash-lite. The retain/ingestion step always uses gemini-2.5-flash as a constant baseline; only the reflect step exercises the model under evaluation. |
| Speed | avg latency (s/call) | Mean end-to-end wall-clock time per reflect() call (arithmetic average across all 242 questions). This is not a single LLM inference time: each reflect() call runs a full agentic pipeline internally, with potentially several model calls. The model under test powers the answer-generation step; the retain model (gemini-2.5-flash) is fixed. Tests were run on a local MacBook over a 1 Gbps internet connection from Europe, so results will vary with your network conditions and server-side load. |
| Cost | $ input / $ output per 1M tokens | Published list prices (USD per million tokens) as advertised by each provider. Prices may have changed since testing. |
| Total Score | 0–100 | Weighted composite: Quality 60% + Speed 25% + Cost 15%. Reliability is not included because the reflect endpoint carries no schema-conformance risk. The Speed subscore is 100 × 5 / (5 + latency_s) (5 s reference, calibrated for reflect() call durations); the Cost subscore is 100 × 0.001 / (0.001 + cost_per_req). |
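The Total Score arithmetic can be sketched in a few lines of Python. The subscore values below are taken straight from the rank-1 leaderboard row; `cost_per_req` is the benchmark's measured dollar cost per reflect() call, which this page does not list, so `cost_score` is only exercised with an illustrative value:

```python
def speed_score(latency_s: float, ref_s: float = 5.0) -> float:
    """Speed subscore: 100 * ref / (ref + latency), with a 5 s reference."""
    return 100 * ref_s / (ref_s + latency_s)

def cost_score(cost_per_req: float) -> float:
    """Cost subscore: 100 * 0.001 / (0.001 + cost_per_req)."""
    return 100 * 0.001 / (0.001 + cost_per_req)

def total_score(quality: float, speed: float, cost: float) -> float:
    """Weighted composite: Quality 60% + Speed 25% + Cost 15%."""
    return 0.60 * quality + 0.25 * speed + 0.15 * cost

# openai/gpt-oss-120b (rank 1): 2.2 s/call reproduces its Speed subscore,
# and its three published subscores reproduce its Total Score.
print(round(speed_score(2.2), 1))               # 69.4
print(round(total_score(94.2, 69.4, 84.7), 1))  # 86.6
```

Note how the Speed curve gives 50 points at exactly 5 s/call and decays gently beyond it, which is why the 60+ second Llama runs still score above zero.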