Reflect Leaderboard
Which model should I use for reflect() operations?
Want to see another model here? Open a GitHub issue.
| Rank | Model | Provider | Total Score (Quality + Speed + Cost) | Quality (LoComo accuracy) | Speed (end-to-end per agentic call) | Cost ($ per 1M tokens, input/output) |
|---|---|---|---|---|---|---|
| 1 🏆 | openai/gpt-oss-120b | Groq | 86.6 | 94.2 (94%) | 69.4 (2.2 s/call) | 84.7 ($0.10 / $0.16) |
| 2 🥈 | openai/gpt-oss-20b | Groq | 86.3 | 94.2 (94%) | 64.3 (2.8 s/call) | 91.7 ($0.05 / $0.08) |
| 3 🥉 | gemini-2.5-flash-lite | Google | 81.5 | 88.4 (88%) | 67.7 (2.4 s/call) | 76.9 ($0.10 / $0.40) |
| 4 | gpt-4o-mini | OpenAI | 77.0 | 93.8 (94%) | 41.7 (7.0 s/call) | 69.0 ($0.15 / $0.60) |
| 5 | gpt-4.1-nano | OpenAI | 75.2 | 80.2 (80%) | 59.5 (3.4 s/call) | 81.6 ($0.07 / $0.30) |
| 6 | gpt-4.1-mini | OpenAI | 74.9 | 88.0 (88%) | 47.2 (5.6 s/call) | 69.0 ($0.15 / $0.60) |
| 7 | gpt-5-nano | OpenAI | 71.6 | 91.7 (92%) | 18.2 (22.5 s/call) | 80.0 ($0.05 / $0.40) |
| 8 | gemini-2.5-flash | Google | 70.4 | 85.5 (86%) | 52.9 (4.4 s/call) | 39.2 ($0.30 / $2.50) |
| 9 | gemini-3-flash-preview | Google | 67.6 | 93.0 (93%) | 27.2 (13.4 s/call) | 33.3 ($0.50 / $3.00) |
| 10 | llama-3.1-8b-instant | Groq | 65.0 | 82.2 (82%) | 7.6 (61.1 s/call) | 91.7 ($0.05 / $0.08) |
| 11 | gpt-5.4 | OpenAI | 56.0 | 76.9 (77%) | 33.9 (9.8 s/call) | 9.1 ($2.50 / $15.00) |
| 12 | gpt-5-mini | OpenAI | 53.6 | 70.7 (71%) | 18.0 (22.8 s/call) | 44.4 ($0.25 / $2.00) |
| 13 | gpt-5.2 | OpenAI | 53.2 | 71.9 (72%) | 34.1 (9.7 s/call) | 10.3 ($1.75 / $14.00) |
| 14 | llama-3.3-70b-versatile | Groq | 53.0 | 73.1 (73%) | 6.2 (76.0 s/call) | 50.4 ($0.59 / $0.79) |
About This Benchmark
Hindsight v0.4.13

| Metric | Value shown | How it is measured |
|---|---|---|
| Quality | % accuracy | Accuracy on the LoComo long-term conversation benchmark (conv-43, 242 questions). Each question is answered using a single reflect() call and judged by gemini-2.5-flash-lite. The retain/ingestion step always uses gemini-2.5-flash as a constant baseline — only the reflect step tests the model under evaluation. |
| Speed | avg latency (s/call) | Mean end-to-end wall-clock time per reflect() call (arithmetic average across all 242 questions). This is not a single LLM inference time: each reflect() call runs a full agentic pipeline internally, potentially with several model calls. The model under test powers the answer-generation step; the retain model (gemini-2.5-flash) is fixed. Tests were run on a local MacBook with a 1 Gbps internet connection from Europe; results will vary with your network conditions and server-side load. |
| Cost | $ input / $ output per 1M tokens | Published list prices (USD per million tokens) as advertised by each provider. Prices may have changed since testing. |
| Total Score | 0 – 100 | Weighted composite: Quality 60% + Speed 25% + Cost 15%. Reliability is not included since the reflect endpoint does not involve schema conformance risk. Speed uses 100 × 5 / (5 + latency_s) (5s reference — calibrated for reflect() call durations); Cost uses 100 × 0.001 / (0.001 + cost_per_req). |
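The composite formula above can be sketched in a few lines of Python. The function names are illustrative (not part of Hindsight); the weights (60/25/15) and the speed curve with its 5 s reference point come directly from the table, and the numbers check out against the gpt-oss-120b row.

```python
# Total Score composite as described in the methodology table.
# Weights: Quality 60%, Speed 25%, Cost 15%.

def speed_score(latency_s: float, ref_s: float = 5.0) -> float:
    """Map mean per-call latency to 0-100; 5 s reference calibrated for reflect()."""
    return 100 * ref_s / (ref_s + latency_s)

def cost_score(cost_per_req: float, ref_cost: float = 0.001) -> float:
    """Map estimated $ per request to 0-100 using the $0.001 reference."""
    return 100 * ref_cost / (ref_cost + cost_per_req)

def total_score(quality: float, speed: float, cost: float) -> float:
    """Weighted composite of the three 0-100 component scores."""
    return 0.60 * quality + 0.25 * speed + 0.15 * cost

# Example: reproduce the openai/gpt-oss-120b row (2.2 s/call, scores 94.2 / 84.7)
print(round(speed_score(2.2), 1))                    # 69.4
print(round(total_score(94.2, 69.4, 84.7), 1))       # 86.6
```

Plugging in the leaderboard's own component scores reproduces the published Total Score, which is a quick way to sanity-check the weights.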


