Embeddings Leaderboard
Which embedding model works best with Hindsight? Rankings cover retrieval quality, speed, and cost – the embedding model affects both memory storage (retain()) and search (recall()).
Want to see another embedding model here? Open a GitHub issue.
| Rank | Embedding Model | Provider | Total Score (MRR + Speed + Cost) | MRR (Mean Reciprocal Rank) | R@1 (Recall at 1) | R@3 (Recall at 3) | R@5 (Recall at 5) | Speed (avg s/recall) | Cost ($ per 1M tokens) |
|---|---|---|---|---|---|---|---|---|---|
| 1 🥇 | BGE Small EN v1.5 (BAAI/bge-small-en-v1.5, 384-dim) | Local CPU | 73.8 | 0.770 | 65.3% | 86.8% | 91.0% | 32.4 (2.08 s/call) | 100.0 (free) |
| 2 🥈 | BGE Base EN v1.5 (BAAI/bge-base-en-v1.5, 768-dim) | Local CPU | 73.6 | 0.775 | 66.7% | 85.7% | 92.9% | 28.7 (2.48 s/call) | 100.0 (free) |
| 3 🥉 | all-MiniLM-L6-v2 (sentence-transformers/all-MiniLM-L6-v2, 384-dim) | Local CPU | 72.9 | 0.754 | 62.6% | 86.8% | 92.8% | 34.5 (1.90 s/call) | 100.0 (free) |
| 4 | BGE Large EN v1.5 (BAAI/bge-large-en-v1.5, 1024-dim) | Local CPU | 69.7 | 0.736 | 61.7% | 82.6% | 91.0% | 21.0 (3.77 s/call) | 100.0 (free) |
| 5 | text-embedding-3-small (1536-dim) | OpenAI | 69.6 | 0.764 | 63.2% | 89.5% | 94.7% | 23.9 (3.19 s/call) | 83.3 ($0.02/1M tok) |
| 6 | embed-english-light-v3.0 (384-dim) | Cohere | 67.9 | 0.790 | 69.5% | 88.0% | 92.2% | 33.5 (1.99 s/call) | 50.0 ($0.10/1M tok) |
| 7 | embed-english-v3.0 (1024-dim) | Cohere | 66.1 | 0.764 | 64.5% | 86.1% | 91.6% | 33.9 (1.95 s/call) | 50.0 ($0.10/1M tok) |
About This Benchmark
| Metric | Value shown | How it is measured |
|---|---|---|
| MRR | 0 – 1 | Mean Reciprocal Rank – for each question, the rank of the first relevant fact in the recall() results is recorded, then averaged as 1/rank across all questions. Higher is better; 1.0 means the relevant fact is always the top result. |
| Total Score | 0 – 100 | Weighted composite: MRR 70% + Speed 15% + Cost 15%. MRR is scaled directly to 0–100 (MRR × 100). Speed uses 100 × 1 / (1 + latency_s) (1 s reference). Cost uses 100 × 0.10 / (0.10 + price_per_1m_tokens) ($0.10/1M reference). R@K metrics are shown for reference but not included in the total score. |
| R@K | % of questions | Recall at K – the fraction of questions where at least one relevant fact appears in the top-K results returned by recall(). R@1 asks: "is the very first result relevant?" R@5 asks: "is any of the top 5 results relevant?" |
| Cost | $ per 1M tokens | Published list price per 1M tokens – the same rate applies to both ingestion (embedding facts during retain()) and queries (embedding the search query on each recall()). Local models score 100 (free). OpenAI text-embedding-3-small: $0.02/1M tokens. Cohere embed-english-*-v3.0: $0.10/1M tokens. Score formula: 100 × 0.10 / (0.10 + price) with a $0.10/1M reference, so Cohere scores ~50 and OpenAI small scores ~83. |
| Latency | avg s/recall | Mean wall-clock time per recall() call, including reranking. All models run with the MiniLM-L6 cross-encoder reranker on CPU inside Docker – latency would be lower with GPU support. |
| Setup | – | Benchmark uses the LoComo long-term conversation dataset (conv-43). 178 non-adversarial questions are annotated per model; questions with zero relevant facts are skipped, leaving ~165–171 evaluated per model. Each embedding model ingests the conversation into its own bank (separate volume, since dimensions are fixed at schema creation). Models above 2000 dimensions are excluded due to pgvector's HNSW index limit. Fixed reranker: MiniLM-L6 (cross-encoder/ms-marco-MiniLM-L-6-v2) for all runs. Candidates per recall: 300 (budget=mid). Ground truth is annotated per model – each model gets its own GT file, because the retain LLM is non-deterministic and phrases facts differently each run. Annotation uses gemini-2.5-flash to identify relevant facts from a high-budget recall on each model's own bank. |
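The MRR and R@K definitions above can be sketched directly. This is an illustrative implementation, not the benchmark's actual harness; `results_per_question` and `relevant_per_question` are hypothetical stand-ins for the ranked recall() outputs and the annotated ground-truth facts for each question:

```python
def mrr(results_per_question, relevant_per_question):
    """Mean of 1/rank of the first relevant result per question (0 if none found)."""
    scores = []
    for results, relevant in zip(results_per_question, relevant_per_question):
        rank = next((i + 1 for i, r in enumerate(results) if r in relevant), None)
        scores.append(1 / rank if rank else 0.0)
    return sum(scores) / len(scores)

def recall_at_k(results_per_question, relevant_per_question, k):
    """Fraction of questions with at least one relevant fact in the top-k results."""
    hits = sum(
        any(r in relevant for r in results[:k])
        for results, relevant in zip(results_per_question, relevant_per_question)
    )
    return hits / len(results_per_question)

# Toy example: two questions; q1's relevant fact is ranked 2nd, q2's is ranked 1st.
results = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"x"}]
print(mrr(results, relevant))              # → 0.75  ((1/2 + 1/1) / 2)
print(recall_at_k(results, relevant, 1))   # → 0.5   (only q2 hits at rank 1)
print(recall_at_k(results, relevant, 3))   # → 1.0
```

This makes the relationship between the columns concrete: R@1 equals the fraction of questions contributing a full 1.0 to MRR, which is why the two track each other closely in the leaderboard.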


