Retain Leaderboard

Which model should I use for retain() and observation consolidation?

Want to see another model here? Open a GitHub issue.
Total Score is a weighted composite of Quality (LoComo accuracy), Speed (latency and throughput), Cost ($ per 1M tokens), and Reliability (schema conformance).

| Rank | Model | Provider | Total Score | Quality (LoComo accuracy) | Speed (latency · throughput) | Cost ($ input/$ output per 1M tokens) | Reliability (schema conformance) |
| ---: | --- | --- | ---: | --- | --- | --- | --- |
| 1 🏆 | openai/gpt-oss-20b | Groq | 81.2 | 83.9 (84%) | 57.0 (7.5s · 2434 tok/s) | 91.7 ($0.05/$0.08) | 100.0 (50/50 tests) |
| 2 🥈 | gpt-4.1-nano | OpenAI | 79.7 | 87.2 (87%) | 54.0 (8.5s · 263 tok/s) | 81.6 ($0.07/$0.30) | 100.0 (50/50 tests) |
| 3 🥉 | openai/gpt-oss-120b | Groq | 79.7 | 84.7 (85%) | 55.4 (8.1s · 1604 tok/s) | 84.7 ($0.10/$0.16) | 100.0 (50/50 tests) |
| 4 | gpt-4o-mini | OpenAI | 74.3 | 81.0 (81%) | 52.5 (9.0s · 183 tok/s) | 69.0 ($0.15/$0.60) | 100.0 (50/50 tests) |
| 5 | llama-3.3-70b-versatile | Groq | 73.7 | 85.5 (86%) | 67.4 (4.8s · 306 tok/s) | 50.4 ($0.59/$0.79) | 84.0 (42/50 tests) |
| 6 | gemini-2.5-flash-lite | Google | 73.3 | 84.7 (85%) | 38.7 (15.8s · 621 tok/s) | 76.9 ($0.10/$0.40) | 96.0 (48/50 tests) |
| 7 | gpt-4.1-mini | OpenAI | 73.2 | 86.4 (86%) | 39.3 (15.4s · 229 tok/s) | 69.0 ($0.15/$0.60) | 100.0 (50/50 tests) |
| 8 | gpt-5-nano | OpenAI | 66.5 | 83.9 (84%) | 7.9 (117.3s · 342 tok/s) | 80.0 ($0.05/$0.40) | 100.0 (50/50 tests) |
| 9 | gpt-5-mini | OpenAI | 63.0 | 89.7 (90%) | 12.7 (68.5s · 373 tok/s) | 44.4 ($0.25/$2.00) | 100.0 (50/50 tests) |
| 10 | llama-3.1-8b-instant | Groq | 62.2 | 84.3 (84%) | 34.6 (18.9s · 29 tok/s) | 91.7 ($0.05/$0.08) | 10.0 (5/50 tests) |
| 11 | gemini-2.5-flash | Google | 60.7 | 85.5 (86%) | 14.5 (58.9s · 314 tok/s) | 39.2 ($0.30/$2.50) | 100.0 (50/50 tests) |
| 12 | gpt-5.4 | OpenAI | 57.1 | 86.8 (87%) | 22.3 (34.8s · 260 tok/s) | 9.1 ($2.50/$15.00) | 100.0 (50/50 tests) |
| 13 | gemini-3-flash-preview | Google | 56.6 | 83.5 (84%) | 8.4 (109.3s · 41 tok/s) | 33.3 ($0.50/$3.00) | 96.0 (48/50 tests) |
| 14 | gpt-5.2 | OpenAI | 56.2 | 83.5 (84%) | 23.1 (33.4s · 239 tok/s) | 10.3 ($1.75/$14.00) | 100.0 (50/50 tests) |

Models Not Viable for This Task

These models were tested but scored zero successes: they could not produce output conforming to the required JSON schema.

  • gemma3:1b
  • gemma3:12b
  • qwen2.5:0.5b
  • qwen2.5:3b
  • smollm2:1.7b
  • gemma3:270m
  • deepseek-r1:1.5b
  • granite3.1-dense:2b
  • llama3.2:latest
  • ministral-3:3b
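
Schema conformance here means the raw model output must parse as JSON and match the expected structure. A minimal sketch of such a check, using only the standard library; the `facts`/`text`/`type` shape is illustrative, not Hindsight's actual retain schema:

```python
import json


def conforms(raw: str) -> bool:
    """Return True if `raw` is valid JSON matching an illustrative
    fact-extraction shape: {"facts": [{"text": str, "type": str}, ...]}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    facts = data.get("facts")
    if not isinstance(facts, list):
        return False
    return all(
        isinstance(f, dict)
        and isinstance(f.get("text"), str)
        and isinstance(f.get("type"), str)
        for f in facts
    )


# A run's reliability score is simply successes / total, as a percentage.
outputs = [
    '{"facts": [{"text": "Alice moved to Oslo", "type": "event"}]}',
    "Sure! Here are the facts: ...",  # prose instead of JSON -> failure
]
score = 100 * sum(conforms(o) for o in outputs) / len(outputs)
```

Small models tend to fail this check by wrapping the JSON in prose or markdown fences, or by emitting the wrong field types, which is why they score 0 above.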

About This Benchmark

Hindsight v0.4.13
| Metric | Value shown | How it is measured |
| --- | --- | --- |
| Quality | % accuracy | Accuracy on the LoComo long-term conversation benchmark (conv-43, 242 questions). The model under test powers the retain step (fact extraction and schema structuring). Recall and answer generation use a fixed gemini-2.5-flash-lite baseline, and answers are judged by gemini-2.5-flash-lite. The score is the percentage of the 242 questions answered correctly, including adversarial questions that test whether the model hallucinates information that was never mentioned. |
| Speed | latency (s) · tok/s | Mean end-to-end latency per request (arithmetic average across all successful requests) and output throughput (tokens/second), measured during the fact-extraction benchmark. Tests were run on a local MacBook with a 1 Gbps internet connection from Europe; results will vary with your network conditions, geographic proximity to the provider's servers, and server-side load at the time of testing. Note: this benchmark does not enforce or simulate rate limits, so actual throughput may be lower in production depending on your subscription tier and the provider's rate-limiting policies. |
| Cost | $ input / $ output per 1M tokens | Published list prices (USD per million tokens) for input and output tokens, as advertised by each provider at the time of testing; prices may have changed since then. Subscription-based models are scored separately on value relative to their monthly fee. Local models have no per-token cost and always score 100 on this dimension. |
| Reliability | success / total tests | The fraction of fact-extraction requests that returned valid JSON conforming to the required schema. A failed request is one that timed out, returned an HTTP error, or produced malformed or schema-invalid JSON. Models are tested on 50 diverse conversation scenarios; a perfect score means 50/50 valid responses. |
| Total Score | 0–100 | Weighted composite: Quality 40% + Speed 25% + Cost 20% + Reliability 15%. Each dimension is normalised to a 0–100 scale before weighting. Speed uses the formula `100 × 10 / (10 + latency_s)`, so a 10-second response scores 50; Cost uses `100 × 0.001 / (0.001 + cost_per_req)`, so only genuinely free models approach 100. |
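
The composite can be reproduced directly from those formulas. A sketch, assuming each dimension is already normalised to its 0–100 scale as described above:

```python
def speed_score(latency_s: float) -> float:
    # 100 * 10 / (10 + latency): 0 s -> 100, 10 s -> 50, slower decays toward 0.
    return 100 * 10 / (10 + latency_s)


def cost_score(cost_per_req: float) -> float:
    # 100 * 0.001 / (0.001 + cost): only near-free requests approach 100.
    return 100 * 0.001 / (0.001 + cost_per_req)


def total_score(quality: float, speed: float, cost: float, reliability: float) -> float:
    # Weighted composite; all inputs on a 0-100 scale.
    return 0.40 * quality + 0.25 * speed + 0.20 * cost + 0.15 * reliability
```

The speed curve explains why slow reasoning models rank poorly despite high quality: at 60 s of latency the speed score is already down to about 14, and quality's 40% weight cannot fully compensate.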