
Embeddings Leaderboard

Which embedding model works best with Hindsight? Rankings cover retrieval quality, speed, and cost: the embedding model affects both memory storage (retain()) and search (recall()).
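To make concrete why one model sits on both paths, here is a minimal, self-contained sketch of an embedding-backed memory bank. Only the method names retain() and recall() come from this page; the Bank class, the toy embed() function, and the cosine-similarity search are illustrative stand-ins, not Hindsight's actual implementation:

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic toy embedding (bag of characters). A real deployment
    would call the embedding model being benchmarked here instead."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class Bank:
    """Illustrative in-memory fact store."""

    def __init__(self) -> None:
        self.facts: list[tuple[str, list[float]]] = []

    def retain(self, fact: str) -> None:
        # The embedding model runs once per fact at ingestion time;
        # the resulting vector is what gets stored.
        self.facts.append((fact, embed(fact)))

    def recall(self, query: str, k: int = 3) -> list[str]:
        # The *same* model embeds the query on every search, and results
        # are ranked by cosine similarity against the stored vectors.
        q = embed(query)
        ranked = sorted(
            self.facts,
            key=lambda f: -sum(a * b for a, b in zip(q, f[1])),
        )
        return [fact for fact, _ in ranked[:k]]

bank = Bank()
bank.retain("the sky is blue")
bank.retain("grass is green")
print(bank.recall("the sky is blue", k=1))
```

Because the model is invoked at both retain() and recall() time, its quality, latency, and price all count twice, which is why the leaderboard scores all three.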

Want to see another embedding model here? Open a GitHub issue.
Sorted by Total Score (MRR + Speed + Cost), descending. MRR is Mean Reciprocal Rank; R@K is Recall at K. The Speed and Cost columns show the 0–100 score with the underlying latency (avg s/recall) and list price ($ per 1M tokens) in parentheses.

| Rank | Embedding Model | Dims | Provider | Total Score | MRR | R@1 | R@3 | R@5 | Speed | Cost |
|------|-----------------|------|----------|-------------|-----|-----|-----|-----|-------|------|
| 1 🏆 | BGE Small EN v1.5 (`BAAI/bge-small-en-v1.5`) | 384 | Local CPU | 73.8 | 0.770 | 65.3% | 86.8% | 91.0% | 32.4 (2.08 s/call) | 100.0 (free) |
| 2 🥈 | BGE Base EN v1.5 (`BAAI/bge-base-en-v1.5`) | 768 | Local CPU | 73.6 | 0.775 | 66.7% | 85.7% | 92.9% | 28.7 (2.48 s/call) | 100.0 (free) |
| 3 🥉 | all-MiniLM-L6-v2 (`sentence-transformers/all-MiniLM-L6-v2`) | 384 | Local CPU | 72.9 | 0.754 | 62.6% | 86.8% | 92.8% | 34.5 (1.90 s/call) | 100.0 (free) |
| 4 | BGE Large EN v1.5 (`BAAI/bge-large-en-v1.5`) | 1024 | Local CPU | 69.7 | 0.736 | 61.7% | 82.6% | 91.0% | 21.0 (3.77 s/call) | 100.0 (free) |
| 5 | text-embedding-3-small | 1536 | OpenAI | 69.6 | 0.764 | 63.2% | 89.5% | 94.7% | 23.9 (3.19 s/call) | 83.3 ($0.02/1M tok) |
| 6 | embed-english-light-v3.0 | 384 | Cohere | 67.9 | 0.790 | 69.5% | 88.0% | 92.2% | 33.5 (1.99 s/call) | 50.0 ($0.10/1M tok) |
| 7 | embed-english-v3.0 | 1024 | Cohere | 66.1 | 0.764 | 64.5% | 86.1% | 91.6% | 33.9 (1.95 s/call) | 50.0 ($0.10/1M tok) |

About This Benchmark

| Metric | Value shown | How it is measured |
|--------|-------------|--------------------|
| MRR | 0–1 | Mean Reciprocal Rank: for each question the rank of the first relevant fact in the recall() results is recorded, then averaged as 1/rank across all questions. Higher is better; 1.0 means the relevant fact is always the top result. |
| Total Score | 0–100 | Weighted composite: MRR 70% + Speed 15% + Cost 15%. MRR is scaled directly to 0–100 (MRR × 100). Speed uses 100 × 1 / (1 + latency_s) (1 s reference). Cost uses 100 × 0.10 / (0.10 + price_per_1m_tokens) ($0.10/1M reference). R@K metrics are shown for reference but not included in the total score. |
| R@K | % of questions | Recall at K: the fraction of questions where at least one relevant fact appears in the top-K results returned by recall(). R@1 asks "is the very first result relevant?"; R@5 asks "is any of the top 5 results relevant?" |
| Cost | $ per 1M tokens | Published list price per 1M tokens; the same rate applies to both ingestion (embedding facts during retain()) and queries (embedding the search query on each recall()). Local models score 100 (free). OpenAI text-embedding-3-small: $0.02/1M tokens. Cohere embed-english-*-v3.0: $0.10/1M tokens. Score formula: 100 × 0.10 / (0.10 + price), with a $0.10/1M reference, so Cohere scores ~50 and OpenAI small scores ~83. |
| Latency | avg s/recall | Mean wall-clock time per recall() call, including reranking. All models run with the MiniLM-L6 cross-encoder reranker on CPU inside Docker; latency would be lower with GPU support. |
| Setup | - | Benchmark uses the LoComo long-term conversation dataset (conv-43). 178 non-adversarial questions are annotated per model; questions with zero relevant facts are skipped, leaving ~165–171 evaluated per model. Each embedding model ingests the conversation into its own bank (a separate volume, since dimensions are fixed at schema creation). Models above 2000 dimensions are excluded due to pgvector's HNSW index limit. Fixed reranker: MiniLM-L6 (cross-encoder/ms-marco-MiniLM-L-6-v2) for all runs. Candidates per recall: 300 (budget=mid). Ground truth is annotated per model: each model gets its own GT file, because the retain LLM is non-deterministic and phrases facts differently each run. Annotation uses gemini-2.5-flash to identify relevant facts from a high-budget recall on each model's own bank. |
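The scoring formulas above can be sketched directly in code. This is a minimal reimplementation, not the benchmark harness itself: the function names and input shapes are illustrative, but the weights (70/15/15) and reference constants (1 s latency, $0.10/1M tokens) are the ones documented in the table:

```python
def mrr(first_relevant_ranks: list[int]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant fact."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def recall_at_k(relevant_positions: list[list[int]], k: int) -> float:
    """Fraction of questions with at least one relevant fact in the top K.

    relevant_positions holds, per question, the 1-based positions of all
    relevant facts in the recall() results (empty list if none appeared).
    """
    hits = sum(1 for pos in relevant_positions if any(p <= k for p in pos))
    return hits / len(relevant_positions)

def total_score(mrr_value: float, latency_s: float, price_per_1m: float) -> float:
    """Weighted composite: MRR 70% + Speed 15% + Cost 15%."""
    mrr_score = mrr_value * 100                      # MRR scaled to 0-100
    speed_score = 100 * 1 / (1 + latency_s)          # 1 s reference latency
    cost_score = 100 * 0.10 / (0.10 + price_per_1m)  # $0.10/1M reference
    return 0.70 * mrr_score + 0.15 * speed_score + 0.15 * cost_score

# Reproducing the BGE Small EN v1.5 row: MRR 0.770, 2.08 s/call, free
print(round(total_score(0.770, 2.08, 0.0), 1))  # -> 73.8
```

Plugging in the text-embedding-3-small row (MRR 0.764, 3.19 s/call, $0.02/1M) likewise reproduces its 69.6 total, which shows how a cheap paid API can still trail a free local model once the cost and speed terms are weighted in.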