Retain Leaderboard

Which model should I use for retain() and observation consolidation?

Want to see another model here? Open a GitHub issue.
Total Score is a weighted composite of Quality (LoComo accuracy), Speed (latency and throughput), Cost ($ per 1M tokens), and Reliability (schema conformance).

| Rank | Model | Provider | Total Score | Quality (LoComo accuracy) | Speed (latency · throughput) | Cost ($ input/$ output per 1M tokens) | Reliability (schema conformance) |
| ---: | --- | --- | ---: | --- | --- | --- | --- |
| 1 🏆 | openai/gpt-oss-20b | Groq | 81.2 | 83.9 (84%) | 57.0 (7.5s · 2434 tok/s) | 91.7 ($0.05/$0.08) | 100.0 (50/50 tests) |
| 2 🥈 | gpt-4.1-nano | OpenAI | 79.7 | 87.2 (87%) | 54.0 (8.5s · 263 tok/s) | 81.6 ($0.07/$0.30) | 100.0 (50/50 tests) |
| 3 🥉 | openai/gpt-oss-120b | Groq | 79.7 | 84.7 (85%) | 55.4 (8.1s · 1604 tok/s) | 84.7 ($0.10/$0.16) | 100.0 (50/50 tests) |
| 4 | gpt-4o-mini | OpenAI | 74.3 | 81.0 (81%) | 52.5 (9.0s · 183 tok/s) | 69.0 ($0.15/$0.60) | 100.0 (50/50 tests) |
| 5 | llama-3.3-70b-versatile | Groq | 73.7 | 85.5 (86%) | 67.4 (4.8s · 306 tok/s) | 50.4 ($0.59/$0.79) | 84.0 (42/50 tests) |
| 6 | gemini-2.5-flash-lite | Google | 73.3 | 84.7 (85%) | 38.7 (15.8s · 621 tok/s) | 76.9 ($0.10/$0.40) | 96.0 (48/50 tests) |
| 7 | gpt-4.1-mini | OpenAI | 73.2 | 86.4 (86%) | 39.3 (15.4s · 229 tok/s) | 69.0 ($0.15/$0.60) | 100.0 (50/50 tests) |
| 8 | gpt-5-nano | OpenAI | 66.5 | 83.9 (84%) | 7.9 (117.3s · 342 tok/s) | 80.0 ($0.05/$0.40) | 100.0 (50/50 tests) |
| 9 | gpt-5-mini | OpenAI | 63.0 | 89.7 (90%) | 12.7 (68.5s · 373 tok/s) | 44.4 ($0.25/$2.00) | 100.0 (50/50 tests) |
| 10 | llama-3.1-8b-instant | Groq | 62.2 | 84.3 (84%) | 34.6 (18.9s · 29 tok/s) | 91.7 ($0.05/$0.08) | 10.0 (5/50 tests) |
| 11 | gemini-2.5-flash | Google | 60.7 | 85.5 (86%) | 14.5 (58.9s · 314 tok/s) | 39.2 ($0.30/$2.50) | 100.0 (50/50 tests) |
| 12 | gpt-5.4 | OpenAI | 57.1 | 86.8 (87%) | 22.3 (34.8s · 260 tok/s) | 9.1 ($2.50/$15.00) | 100.0 (50/50 tests) |
| 13 | gemini-3-flash-preview | Google | 56.6 | 83.5 (84%) | 8.4 (109.3s · 41 tok/s) | 33.3 ($0.50/$3.00) | 96.0 (48/50 tests) |
| 14 | gpt-5.2 | OpenAI | 56.2 | 83.5 (84%) | 23.1 (33.4s · 239 tok/s) | 10.3 ($1.75/$14.00) | 100.0 (50/50 tests) |

Models Not Viable for This Task

These models were tested but scored zero successes: they could not produce output conforming to the required JSON schema.

  • gemma3:1b
  • gemma3:12b
  • qwen2.5:0.5b
  • qwen2.5:3b
  • smollm2:1.7b
  • gemma3:270m
  • deepseek-r1:1.5b
  • granite3.1-dense:2b
  • llama3.2:latest
  • ministral-3:3b
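
Schema conformance here means the raw model output must parse as JSON and match the expected structure. A minimal sketch of such a check, using only the standard library; the `facts`/`text`/`type` shape is illustrative, not Hindsight's actual retain schema:

```python
import json


def conforms(raw: str) -> bool:
    """Return True if `raw` is valid JSON matching an illustrative
    fact-extraction shape: {"facts": [{"text": str, "type": str}, ...]}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    facts = data.get("facts")
    if not isinstance(facts, list):
        return False
    return all(
        isinstance(f, dict)
        and isinstance(f.get("text"), str)
        and isinstance(f.get("type"), str)
        for f in facts
    )


# A run's reliability score is simply successes / total, as a percentage.
outputs = [
    '{"facts": [{"text": "Alice moved to Oslo", "type": "event"}]}',
    "Sure! Here are the facts: ...",  # prose instead of JSON -> failure
]
score = 100 * sum(conforms(o) for o in outputs) / len(outputs)
```

Small models tend to fail this check by wrapping the JSON in prose or markdown fences, or by emitting the wrong field types, which is why they score 0 above.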

About This Benchmark

Hindsight v0.4.13
| Metric | Value shown | How it is measured |
| --- | --- | --- |
| Quality | % accuracy | Accuracy on the LoComo long-term conversation benchmark (conv-43, 242 questions). The model under test powers the retain step (fact extraction and schema structuring). Recall and answer generation use a fixed gemini-2.5-flash-lite baseline, and answers are judged by gemini-2.5-flash-lite. The score is the percentage of the 242 questions answered correctly, including adversarial questions that test whether the model hallucinates information that was never mentioned. |
| Speed | latency (s) · tok/s | Mean end-to-end latency per request (arithmetic average across all successful requests) and output throughput (tokens/second), measured during the fact-extraction benchmark. Tests were run on a local MacBook with a 1 Gbps internet connection from Europe; results will vary with your network conditions, geographic proximity to the provider's servers, and server-side load at the time of testing. Note: this benchmark does not enforce or simulate rate limits, so actual throughput may be lower in production depending on your subscription tier and the provider's rate-limiting policies. |
| Cost | $ input / $ output per 1M tokens | Published list prices (USD per million tokens) for input and output tokens, as advertised by each provider at the time of testing; prices may have changed since then. Subscription-based models are scored separately on value relative to their monthly fee. Local models have no per-token cost and always score 100 on this dimension. |
| Reliability | success / total tests | The fraction of fact-extraction requests that returned valid JSON conforming to the required schema. A failed request is one that timed out, returned an HTTP error, or produced malformed or schema-invalid JSON. Models are tested on 50 diverse conversation scenarios; a perfect score means 50/50 valid responses. |
| Total Score | 0–100 | Weighted composite: Quality 40% + Speed 25% + Cost 20% + Reliability 15%. Each dimension is normalised to a 0–100 scale before weighting. Speed uses the formula `100 × 10 / (10 + latency_s)`, so a 10-second response scores 50; Cost uses `100 × 0.001 / (0.001 + cost_per_req)`, so only genuinely free models approach 100. |
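
The composite can be reproduced directly from those formulas. A sketch, assuming each dimension is already normalised to its 0–100 scale as described above:

```python
def speed_score(latency_s: float) -> float:
    # 100 * 10 / (10 + latency): 0 s -> 100, 10 s -> 50, slower decays toward 0.
    return 100 * 10 / (10 + latency_s)


def cost_score(cost_per_req: float) -> float:
    # 100 * 0.001 / (0.001 + cost): only near-free requests approach 100.
    return 100 * 0.001 / (0.001 + cost_per_req)


def total_score(quality: float, speed: float, cost: float, reliability: float) -> float:
    # Weighted composite; all inputs on a 0-100 scale.
    return 0.40 * quality + 0.25 * speed + 0.20 * cost + 0.15 * reliability
```

The speed curve explains why slow reasoning models rank poorly despite high quality: at 60 s of latency the speed score is already down to about 14, and quality's 40% weight cannot fully compensate.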