
Embeddings Leaderboard

Which embedding model works best with Hindsight? Rankings cover retrieval quality, speed, and cost: the embedding model affects both memory storage (retain()) and search (recall()).
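To make concrete why one model sits on both paths, here is a minimal, self-contained sketch of an embedding-backed memory bank. Only the method names retain() and recall() come from this page; the Bank class, the toy embed() function, and the cosine-similarity search are illustrative stand-ins, not Hindsight's actual implementation:

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic toy embedding (bag of characters). A real deployment
    would call the embedding model being benchmarked here instead."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class Bank:
    """Illustrative in-memory fact store."""

    def __init__(self) -> None:
        self.facts: list[tuple[str, list[float]]] = []

    def retain(self, fact: str) -> None:
        # The embedding model runs once per fact at ingestion time;
        # the resulting vector is what gets stored.
        self.facts.append((fact, embed(fact)))

    def recall(self, query: str, k: int = 3) -> list[str]:
        # The *same* model embeds the query on every search, and results
        # are ranked by cosine similarity against the stored vectors.
        q = embed(query)
        ranked = sorted(
            self.facts,
            key=lambda f: -sum(a * b for a, b in zip(q, f[1])),
        )
        return [fact for fact, _ in ranked[:k]]

bank = Bank()
bank.retain("the sky is blue")
bank.retain("grass is green")
print(bank.recall("the sky is blue", k=1))
```

Because the model is invoked at both retain() and recall() time, its quality, latency, and price all count twice, which is why the leaderboard scores all three.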

Want to see another embedding model here? Open a GitHub issue.
Sorted by Total Score (MRR + Speed + Cost), descending. MRR is Mean Reciprocal Rank; R@K is Recall at K. The Speed and Cost columns show the 0–100 score with the underlying latency (avg s/recall) and list price ($ per 1M tokens) in parentheses.

| Rank | Embedding Model | Dims | Provider | Total Score | MRR | R@1 | R@3 | R@5 | Speed | Cost |
|------|-----------------|------|----------|-------------|-----|-----|-----|-----|-------|------|
| 1 🏆 | BGE Small EN v1.5 (`BAAI/bge-small-en-v1.5`) | 384 | Local CPU | 73.8 | 0.770 | 65.3% | 86.8% | 91.0% | 32.4 (2.08 s/call) | 100.0 (free) |
| 2 🥈 | BGE Base EN v1.5 (`BAAI/bge-base-en-v1.5`) | 768 | Local CPU | 73.6 | 0.775 | 66.7% | 85.7% | 92.9% | 28.7 (2.48 s/call) | 100.0 (free) |
| 3 🥉 | all-MiniLM-L6-v2 (`sentence-transformers/all-MiniLM-L6-v2`) | 384 | Local CPU | 72.9 | 0.754 | 62.6% | 86.8% | 92.8% | 34.5 (1.90 s/call) | 100.0 (free) |
| 4 | BGE Large EN v1.5 (`BAAI/bge-large-en-v1.5`) | 1024 | Local CPU | 69.7 | 0.736 | 61.7% | 82.6% | 91.0% | 21.0 (3.77 s/call) | 100.0 (free) |
| 5 | text-embedding-3-small | 1536 | OpenAI | 69.6 | 0.764 | 63.2% | 89.5% | 94.7% | 23.9 (3.19 s/call) | 83.3 ($0.02/1M tok) |
| 6 | embed-english-light-v3.0 | 384 | Cohere | 67.9 | 0.790 | 69.5% | 88.0% | 92.2% | 33.5 (1.99 s/call) | 50.0 ($0.10/1M tok) |
| 7 | embed-english-v3.0 | 1024 | Cohere | 66.1 | 0.764 | 64.5% | 86.1% | 91.6% | 33.9 (1.95 s/call) | 50.0 ($0.10/1M tok) |

About This Benchmark

| Metric | Value shown | How it is measured |
|--------|-------------|--------------------|
| MRR | 0–1 | Mean Reciprocal Rank: for each question the rank of the first relevant fact in the recall() results is recorded, then averaged as 1/rank across all questions. Higher is better; 1.0 means the relevant fact is always the top result. |
| Total Score | 0–100 | Weighted composite: MRR 70% + Speed 15% + Cost 15%. MRR is scaled directly to 0–100 (MRR × 100). Speed uses 100 × 1 / (1 + latency_s) (1 s reference). Cost uses 100 × 0.10 / (0.10 + price_per_1m_tokens) ($0.10/1M reference). R@K metrics are shown for reference but not included in the total score. |
| R@K | % of questions | Recall at K: the fraction of questions where at least one relevant fact appears in the top-K results returned by recall(). R@1 asks "is the very first result relevant?"; R@5 asks "is any of the top 5 results relevant?" |
| Cost | $ per 1M tokens | Published list price per 1M tokens; the same rate applies to both ingestion (embedding facts during retain()) and queries (embedding the search query on each recall()). Local models score 100 (free). OpenAI text-embedding-3-small: $0.02/1M tokens. Cohere embed-english-*-v3.0: $0.10/1M tokens. Score formula: 100 × 0.10 / (0.10 + price), with a $0.10/1M reference, so Cohere scores ~50 and OpenAI small scores ~83. |
| Latency | avg s/recall | Mean wall-clock time per recall() call, including reranking. All models run with the MiniLM-L6 cross-encoder reranker on CPU inside Docker; latency would be lower with GPU support. |
| Setup | - | Benchmark uses the LoComo long-term conversation dataset (conv-43). 178 non-adversarial questions are annotated per model; questions with zero relevant facts are skipped, leaving ~165–171 evaluated per model. Each embedding model ingests the conversation into its own bank (a separate volume, since dimensions are fixed at schema creation). Models above 2000 dimensions are excluded due to pgvector's HNSW index limit. Fixed reranker: MiniLM-L6 (cross-encoder/ms-marco-MiniLM-L-6-v2) for all runs. Candidates per recall: 300 (budget=mid). Ground truth is annotated per model: each model gets its own GT file, because the retain LLM is non-deterministic and phrases facts differently each run. Annotation uses gemini-2.5-flash to identify relevant facts from a high-budget recall on each model's own bank. |
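The scoring formulas above can be sketched directly in code. This is a minimal reimplementation, not the benchmark harness itself: the function names and input shapes are illustrative, but the weights (70/15/15) and reference constants (1 s latency, $0.10/1M tokens) are the ones documented in the table:

```python
def mrr(first_relevant_ranks: list[int]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant fact."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def recall_at_k(relevant_positions: list[list[int]], k: int) -> float:
    """Fraction of questions with at least one relevant fact in the top K.

    relevant_positions holds, per question, the 1-based positions of all
    relevant facts in the recall() results (empty list if none appeared).
    """
    hits = sum(1 for pos in relevant_positions if any(p <= k for p in pos))
    return hits / len(relevant_positions)

def total_score(mrr_value: float, latency_s: float, price_per_1m: float) -> float:
    """Weighted composite: MRR 70% + Speed 15% + Cost 15%."""
    mrr_score = mrr_value * 100                      # MRR scaled to 0-100
    speed_score = 100 * 1 / (1 + latency_s)          # 1 s reference latency
    cost_score = 100 * 0.10 / (0.10 + price_per_1m)  # $0.10/1M reference
    return 0.70 * mrr_score + 0.15 * speed_score + 0.15 * cost_score

# Reproducing the BGE Small EN v1.5 row: MRR 0.770, 2.08 s/call, free
print(round(total_score(0.770, 2.08, 0.0), 1))  # -> 73.8
```

Plugging in the text-embedding-3-small row (MRR 0.764, 3.19 s/call, $0.02/1M) likewise reproduces its 69.6 total, which shows how a cheap paid API can still trail a free local model once the cost and speed terms are weighted in.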