# Retain Leaderboard

Which model should I use for `retain()` and observation consolidation?
| Rank | Model | Provider | Total Score | Quality (LoComo accuracy) | Speed (latency · throughput) | Cost ($ per 1M tokens, in/out) | Reliability (schema conformance) |
|---|---|---|---|---|---|---|---|
| 1 🏆 | openai/gpt-oss-20b | Groq | 81.2 | 83.9 (84%) | 57.0 (7.5s · 2434 tok/s) | 91.7 ($0.05/$0.08) | 100.0 (50/50) |
| 2 🥈 | gpt-4.1-nano | OpenAI | 79.7 | 87.2 (87%) | 54.0 (8.5s · 263 tok/s) | 81.6 ($0.07/$0.30) | 100.0 (50/50) |
| 3 🥉 | openai/gpt-oss-120b | Groq | 79.7 | 84.7 (85%) | 55.4 (8.1s · 1604 tok/s) | 84.7 ($0.10/$0.16) | 100.0 (50/50) |
| 4 | gpt-4o-mini | OpenAI | 74.3 | 81.0 (81%) | 52.5 (9.0s · 183 tok/s) | 69.0 ($0.15/$0.60) | 100.0 (50/50) |
| 5 | llama-3.3-70b-versatile | Groq | 73.7 | 85.5 (86%) | 67.4 (4.8s · 306 tok/s) | 50.4 ($0.59/$0.79) | 84.0 (42/50) |
| 6 | gemini-2.5-flash-lite | Google | 73.3 | 84.7 (85%) | 38.7 (15.8s · 621 tok/s) | 76.9 ($0.10/$0.40) | 96.0 (48/50) |
| 7 | gpt-4.1-mini | OpenAI | 73.2 | 86.4 (86%) | 39.3 (15.4s · 229 tok/s) | 69.0 ($0.15/$0.60) | 100.0 (50/50) |
| 8 | gpt-5-nano | OpenAI | 66.5 | 83.9 (84%) | 7.9 (117.3s · 342 tok/s) | 80.0 ($0.05/$0.40) | 100.0 (50/50) |
| 9 | gpt-5-mini | OpenAI | 63.0 | 89.7 (90%) | 12.7 (68.5s · 373 tok/s) | 44.4 ($0.25/$2.00) | 100.0 (50/50) |
| 10 | llama-3.1-8b-instant | Groq | 62.2 | 84.3 (84%) | 34.6 (18.9s · 29 tok/s) | 91.7 ($0.05/$0.08) | 10.0 (5/50) |
| 11 | gemini-2.5-flash | Google | 60.7 | 85.5 (86%) | 14.5 (58.9s · 314 tok/s) | 39.2 ($0.30/$2.50) | 100.0 (50/50) |
| 12 | gpt-5.4 | OpenAI | 57.1 | 86.8 (87%) | 22.3 (34.8s · 260 tok/s) | 9.1 ($2.50/$15.00) | 100.0 (50/50) |
| 13 | gemini-3-flash-preview | Google | 56.6 | 83.5 (84%) | 8.4 (109.3s · 41 tok/s) | 33.3 ($0.50/$3.00) | 96.0 (48/50) |
| 14 | gpt-5.2 | OpenAI | 56.2 | 83.5 (84%) | 23.1 (33.4s · 239 tok/s) | 10.3 ($1.75/$14.00) | 100.0 (50/50) |
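To translate the per-million-token prices in the Cost column into a per-request dollar cost, multiply each price by the token counts of a typical retain call. A quick sketch (the 2,000-input / 500-output token counts are illustrative assumptions, not measured values from the benchmark):

```python
def cost_per_request(in_price: float, out_price: float,
                     in_tokens: int, out_tokens: int) -> float:
    """Cost in USD for one request, given $/1M-token list prices."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# gpt-4.1-nano at $0.07 in / $0.30 out per 1M tokens (prices from the table above)
cost = cost_per_request(0.07, 0.30, 2_000, 500)
print(f"${cost:.5f}")  # ≈ $0.00029 per request
```

At these token counts even the priciest model in the table stays well under a cent per call, which is why Quality and Speed dominate the ranking in practice.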
## Models Not Viable for This Task

These models were tested but recorded zero successful requests: none could produce JSON conforming to the required schema.
- ✗ gemma3:1b
- ✗ gemma3:12b
- ✗ qwen2.5:0.5b
- ✗ qwen2.5:3b
- ✗ smollm2:1.7b
- ✗ gemma3:270m
- ✗ deepseek-r1:1.5b
- ✗ granite3.1-dense:2b
- ✗ llama3.2:latest
- ✗ ministral-3:3b
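A failure of this kind means the model's raw output cannot even be parsed into the expected structure. A minimal stdlib-only sketch of what such a conformance check looks like (the `facts` schema here is a hypothetical illustration, not Hindsight's actual retain schema):

```python
import json

# Hypothetical required shape:
# {"facts": [{"subject": str, "predicate": str, "object": str}, ...]}
REQUIRED_FACT_KEYS = {"subject", "predicate", "object"}

def conforms(raw: str) -> bool:
    """Return True if raw text parses as JSON matching the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    facts = data.get("facts") if isinstance(data, dict) else None
    if not isinstance(facts, list):
        return False
    return all(
        isinstance(f, dict) and REQUIRED_FACT_KEYS <= f.keys()
        for f in facts
    )

# A model that answers in prose instead of JSON fails the check:
print(conforms("Sure! Here are the facts: ..."))  # False
print(conforms('{"facts": [{"subject": "Alice", '
               '"predicate": "lives_in", "object": "Paris"}]}'))  # True
```

Small local models often wrap their JSON in markdown fences or chatty preambles, which is enough to fail a strict check like this on every request.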
## About This Benchmark

Hindsight v0.4.13

| Metric | Value shown | How it is measured |
|---|---|---|
| Quality | % accuracy | Accuracy on the LoComo long-term conversation benchmark (conv-43, 242 questions). The model under test powers the retain step (fact extraction and schema structuring). Recall and answer generation use a fixed gemini-2.5-flash-lite baseline, and answers are judged by gemini-2.5-flash-lite. The score represents the percentage of questions answered correctly out of 242, including adversarial questions that test whether the model hallucinates information that was never mentioned. |
| Speed | latency (s) · tok/s | Mean end-to-end latency per request (arithmetic average across all successful requests) and output throughput (tokens/second), measured during the fact-extraction benchmark. Tests were run on a local MacBook with a 1 Gbps internet connection from Europe. Results will vary depending on your network conditions, geographic proximity to the provider's servers, and server-side load at the time of testing. Note: this benchmark does not enforce or simulate rate limits. Actual throughput may be lower in production depending on your subscription tier and the provider's rate-limiting policies. |
| Cost | $ input / $ output per 1M tokens | Published list prices (USD per million tokens) for input and output tokens, as advertised by each provider at the time of testing. Prices may have changed since then. Subscription-based models are scored separately on value relative to their monthly fee. Local models have no per-token cost and always score 100 on this dimension. |
| Reliability | success / total tests | The fraction of fact-extraction requests that returned valid JSON conforming to the required schema. A failed request is one that timed out, returned an HTTP error, or produced malformed / schema-invalid JSON. Models are tested on 50 diverse conversation scenarios; a perfect score means 50/50 valid responses. |
| Total Score | 0 – 100 | Weighted composite: Quality 40% + Speed 25% + Cost 20% + Reliability 15%. Each dimension is normalised to a 0–100 scale before weighting. Speed uses the formula `100 × 10 / (10 + latency_s)`, so a 10-second response scores 50; Cost uses `100 × 0.001 / (0.001 + cost_per_req)`, so only genuinely free models approach 100. |
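The normalisation and weighting above can be sketched in a few lines of Python. These are illustrative re-implementations of the published formulas, not the benchmark's actual scoring code:

```python
def speed_score(latency_s: float) -> float:
    """Speed sub-score: 100 × 10 / (10 + latency_s). A 10 s response scores 50."""
    return 100 * 10 / (10 + latency_s)

def cost_score(cost_per_req: float) -> float:
    """Cost sub-score: 100 × 0.001 / (0.001 + cost_per_req). Free models score 100."""
    return 100 * 0.001 / (0.001 + cost_per_req)

def total_score(quality: float, speed: float,
                cost: float, reliability: float) -> float:
    """Weighted composite: Quality 40% + Speed 25% + Cost 20% + Reliability 15%."""
    return 0.40 * quality + 0.25 * speed + 0.20 * cost + 0.15 * reliability

print(speed_score(10.0))   # 50.0 — matches the worked example in the table
print(cost_score(0.0))     # 100.0 — a local, per-token-free model
```

Both sub-scores are saturating curves: halving latency (or cost) near zero gains far more points than halving it when it is already large, which rewards fast, cheap models without letting any single dimension dominate.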


