Reranker Leaderboard
Which reranker surfaces the most relevant facts at the top of recall() results?
| Rank | Reranker | Provider | Total Score | MRR | R@1 | R@3 | R@5 | Cost (score, $/search) | Speed (score, s/call) |
|---|---|---|---|---|---|---|---|---|---|
| 1 🏆 | MiniLM-L6 (`cross-encoder/ms-marco-MiniLM-L-6-v2`) | Local CPU | 73.2 | 0.788 | 69.7% | 85.5% | 90.3% | 100.0 (Free) | 20.4 (3.91 s/call) |
| 2 🥈 | Jina Reranker v2 Multilingual (`jina_ai/jina-reranker-v2-base-multilingual`) | Jina AI | 69.2 | 0.670 | 54.5% | 74.6% | 81.8% | 100.0 ($0.0000/search) | 48.5 (1.06 s/call) |
| 3 🥉 | Cohere Rerank Multilingual v3 (`rerank-multilingual-v3.0`) | Cohere | 68.6 | 0.784 | 69.1% | 86.1% | 89.7% | 33.3 ($0.0020/search) | 57.7 (0.73 s/call) |
| 4 | Cohere Rerank English v3 (`rerank-english-v3.0`) | Cohere | 68.4 | 0.784 | 70.3% | 83.0% | 88.5% | 33.3 ($0.0020/search) | 56.8 (0.76 s/call) |
| 5 | FlashRank MiniLM-L12 (`ms-marco-MiniLM-L-12-v2`) | Local CPU | 68.1 | 0.707 | 60.6% | 75.8% | 83.0% | 100.0 (Free) | 24.1 (3.16 s/call) |
| 6 | Cohere Rerank v4 Fast (`rerank-v4.0-fast`) | Cohere | 63.6 | 0.747 | 65.5% | 81.2% | 87.3% | 28.6 ($0.0025/search) | 46.6 (1.14 s/call) |
| 7 | Voyage Rerank 2 (`voyage/rerank-2`) | Voyage AI | 62.0 | 0.545 | 40.0% | 60.6% | 73.9% | 100.0 ($0.0000/search) | 59.4 (0.68 s/call) |
| 8 | Cohere Rerank v4 Pro (`rerank-v4.0-pro`) | Cohere | 57.5 | 0.678 | 55.1% | 77.0% | 81.8% | 16.7 ($0.0050/search) | 50.0 (1.00 s/call) |
| 9 | Voyage Rerank 2 Lite (`voyage/rerank-2-lite`) | Voyage AI | 54.1 | 0.430 | 28.5% | 48.5% | 62.4% | 100.0 ($0.0000/search) | 60.1 (0.66 s/call) |
| 10 | No reranker (RRF baseline) | — | 38.5 | 0.183 | 6.7% | 14.5% | 29.7% | 100.0 (Free) | 71.3 (0.40 s/call) |
About This Benchmark
| Metric | Value shown | How it is measured |
|---|---|---|
| MRR | 0 – 1 | Mean Reciprocal Rank — for each question the rank of the first relevant fact in the recall() results is recorded, then averaged as 1/rank across all questions. Higher is better; 1.0 means the relevant fact is always the top result. |
| Total Score | 0 – 100 | Weighted composite: MRR 70% + Speed 15% + Cost 15%. MRR is scaled directly to 0–100 (MRR × 100). Speed uses 100 × 1 / (1 + latency_s) (1s reference). Cost uses 100 × 0.001 / (0.001 + price_per_search) — free rerankers score 100, Cohere ($0.002/search) scores ~33. R@K metrics are shown for reference but not included in the total score, since MRR already captures ranking quality. |
| Cost | $ per recall() | Published list price per reranker API call (search). Free and local rerankers score 100. Cohere charges $2.00 per 1,000 searches ($0.002/call). Prices may have changed since testing. |
| R@K | % of questions | Recall at K — the fraction of questions where at least one relevant fact appears in the top-K results returned by recall(). R@1 asks: "is the very first result relevant?" — the reranker must put the right fact in position 1 to score a point. R@5 asks: "is any of the top 5 results relevant?" — giving the reranker more chances to surface the right fact. A good reranker scores high on all three; a poor one only improves when K is large. |
| Latency | avg s/recall | Mean wall-clock time per recall() call, including reranking. Local models (MiniLM, FlashRank) use sentence-transformers cross-encoders running on CPU inside Docker — latency would be significantly lower on a machine with GPU support. |
| Setup | — | Benchmark uses the LoComo long-term conversation dataset (conv-43, 165 questions with annotated ground truth). All rerankers share the same ingested bank (retained with gemini-2.5-flash). Candidates per recall: 300 (budget=mid). Ground truth was annotated by having gemini-2.5-flash identify relevant facts from a high-budget recall on the same bank. |
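The scoring formulas above can be sketched in a few lines of Python. This is an illustrative reimplementation from the stated formulas, not the benchmark's actual code; the function names and example inputs are assumptions.

```python
# Illustrative sketch of the benchmark's stated metrics (not its actual code).

def mrr(first_relevant_ranks):
    """Mean Reciprocal Rank from 1-based ranks; 0 means no relevant fact found."""
    return sum(1.0 / r if r > 0 else 0.0 for r in first_relevant_ranks) / len(first_relevant_ranks)

def recall_at_k(first_relevant_ranks, k):
    """Fraction of questions whose first relevant fact lands in the top k."""
    return sum(1 for r in first_relevant_ranks if 0 < r <= k) / len(first_relevant_ranks)

def total_score(mrr_value, latency_s, price_per_search):
    """Weighted composite: MRR 70% + Speed 15% + Cost 15%, each on a 0-100 scale."""
    mrr_score = mrr_value * 100
    speed_score = 100 * 1 / (1 + latency_s)                 # 1 s reference latency
    cost_score = 100 * 0.001 / (0.001 + price_per_search)   # free rerankers score 100
    return 0.70 * mrr_score + 0.15 * speed_score + 0.15 * cost_score

# Reproducing two leaderboard rows from their MRR, latency, and price:
print(round(total_score(0.788, 3.91, 0.0), 1))    # MiniLM-L6 → 73.2
print(round(total_score(0.784, 0.73, 0.002), 1))  # Cohere Rerank v3 → 68.6
```

Plugging the published MRR, latency, and price back into these formulas reproduces the Total Score column, which is a quick sanity check that the weighting is applied as described.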
Interpreting the results
Why does a small local model (MiniLM-L6) rank at the top?
A few factors combine here. First, the ground truth has implicit lexical bias: relevant facts were annotated from a high-budget RRF recall (RRF is partly BM25-based, favouring lexical overlap), then labelled by an LLM. MiniLM-L6 is a cross-encoder trained on MS MARCO QA pairs — short factual sentences with strong lexical signal — which closely matches the style of LoComo personal memory facts ("John plays basketball", "Tim collects vinyl").
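For reference, the RRF fusion used to build the annotation pool can be sketched as follows. The `k=60` smoothing constant and the toy ranked lists are illustrative assumptions, not the benchmark's retrieval code:

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each input ranking votes 1/(k + rank) per item."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["fact-3", "fact-1", "fact-7"]    # lexical (BM25) ranking
vector_hits = ["fact-1", "fact-9", "fact-3"]  # semantic (embedding) ranking
print(rrf([bm25_hits, vector_hits]))
# → ['fact-1', 'fact-3', 'fact-9', 'fact-7']
```

Because the BM25 leg contributes directly to the fused ranking, facts with strong lexical overlap are over-represented in the candidate pool the annotator saw, which is the bias described above.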
Second, commercial models like Cohere and Voyage are optimised for large-scale document retrieval (legal, financial, technical corpora). They may be over-calibrated for longer, denser texts and underperform on short conversational facts, especially with a candidate pool of only 300.
Third, with only 165 questions from a single conversation, the confidence intervals are wide: a handful of rank changes on individual questions can swing MRR by several points. The gap between MiniLM-L6 (0.788), Cohere v3 (0.784), and FlashRank (0.707) is likely not statistically significant.
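A rough percentile bootstrap over simulated per-question reciprocal ranks illustrates how wide a 95% interval is at n = 165. The data here is synthetic and the function names are illustrative; this is not an analysis of the actual benchmark runs:

```python
import random

def bootstrap_mrr_ci(reciprocal_ranks, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for MRR."""
    rng = random.Random(seed)
    n = len(reciprocal_ranks)
    means = sorted(
        sum(rng.choice(reciprocal_ranks) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Simulate 165 questions: mostly rank-1 hits, some rank-2/3 hits, some misses.
rng = random.Random(1)
rr = [rng.choice([1.0, 1.0, 1.0, 0.5, 0.33, 0.0]) for _ in range(165)]
lo, hi = bootstrap_mrr_ci(rr)
print(f"MRR {sum(rr) / len(rr):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With per-question scores this dispersed, the interval is on the order of ±0.05 MRR, comparable to the spread separating the top five rerankers on the leaderboard.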
Bottom line: these numbers reflect performance on one specific task — conversational personal-memory retrieval with short facts. They should not be read as a general reranker ranking. For document retrieval or technical corpora, commercial models will likely perform better.
Why does Cohere v4 Fast score higher than v4 Pro?
Counterintuitive but consistent with the domain mismatch above. v4 Pro is optimised for complex, nuanced relevance judgements across long documents. On short conversational facts it may over-think the ranking and penalise simple lexical matches that are actually correct. v4 Fast uses a lighter scoring head that behaves more like a cross-encoder on short texts — closer to what this benchmark rewards. This pattern (smaller/faster variant winning on short-text tasks) is common in cross-encoder literature.
Why do Voyage models score lower than expected?
Voyage Rerank 2 and 2-Lite are trained primarily on passage and document retrieval tasks (BEIR, MTEB). The LoComo facts are much shorter and more conversational than their training distribution. Voyage models also apply strong semantic generalisation, which can hurt precision when the relevant fact appears nearly verbatim among the retrieved candidates — exactly the situation that R@1 and MRR reward.