Hindsight/Benchmarks

Reranker Leaderboard

Which reranker surfaces the most relevant facts at the top of recall() results?

Cost and Speed sub-scores (0–100 scale) are shown in parentheses; Total Score is the weighted composite MRR 70% + Speed 15% + Cost 15%, detailed under "About This Benchmark". R@K = Recall at K.

| Rank | Reranker | Provider | Total Score | MRR | R@1 | R@3 | R@5 | Cost ($ per recall()) | Speed (avg s/recall) |
|------|----------|----------|-------------|-----|-----|-----|-----|-----------------------|----------------------|
| 1 🏆 | MiniLM-L6 (`cross-encoder/ms-marco-MiniLM-L-6-v2`) | Local CPU | 73.2 | 0.788 | 69.7% | 85.5% | 90.3% | Free (100.0) | 3.91 s/call (20.4) |
| 2 🥈 | Jina Reranker v2 Multilingual (`jina_ai/jina-reranker-v2-base-multilingual`) | Jina AI | 69.2 | 0.670 | 54.5% | 74.6% | 81.8% | $0.0000/search (100.0) | 1.06 s/call (48.5) |
| 3 🥉 | Cohere Rerank Multilingual v3 (`rerank-multilingual-v3.0`) | Cohere | 68.6 | 0.784 | 69.1% | 86.1% | 89.7% | $0.0020/search (33.3) | 0.73 s/call (57.7) |
| 4 | Cohere Rerank English v3 (`rerank-english-v3.0`) | Cohere | 68.4 | 0.784 | 70.3% | 83.0% | 88.5% | $0.0020/search (33.3) | 0.76 s/call (56.8) |
| 5 | FlashRank MiniLM-L12 (`ms-marco-MiniLM-L-12-v2`) | Local CPU | 68.1 | 0.707 | 60.6% | 75.8% | 83.0% | Free (100.0) | 3.16 s/call (24.1) |
| 6 | Cohere Rerank v4 Fast (`rerank-v4.0-fast`) | Cohere | 63.6 | 0.747 | 65.5% | 81.2% | 87.3% | $0.0025/search (28.6) | 1.14 s/call (46.6) |
| 7 | Voyage Rerank 2 (`voyage/rerank-2`) | Voyage AI | 62.0 | 0.545 | 40.0% | 60.6% | 73.9% | $0.0000/search (100.0) | 0.68 s/call (59.4) |
| 8 | Cohere Rerank v4 Pro (`rerank-v4.0-pro`) | Cohere | 57.5 | 0.678 | 55.1% | 77.0% | 81.8% | $0.0050/search (16.7) | 1.00 s/call (50.0) |
| 9 | Voyage Rerank 2 Lite (`voyage/rerank-2-lite`) | Voyage AI | 54.1 | 0.430 | 28.5% | 48.5% | 62.4% | $0.0000/search (100.0) | 0.66 s/call (60.1) |
| 10 | No reranker (RRF baseline) | n/a | 38.5 | 0.183 | 6.7% | 14.5% | 29.7% | Free (100.0) | 0.40 s/call (71.3) |

About This Benchmark

| Metric | Value shown | How it is measured |
|--------|-------------|--------------------|
| MRR | 0–1 | Mean Reciprocal Rank: for each question the rank of the first relevant fact in the recall() results is recorded, then averaged as 1/rank across all questions. Higher is better; 1.0 means the relevant fact is always the top result. |
| Total Score | 0–100 | Weighted composite: MRR 70% + Speed 15% + Cost 15%. MRR is scaled directly to 0–100 (MRR × 100). Speed uses 100 × 1 / (1 + latency_s) (1 s reference). Cost uses 100 × 0.001 / (0.001 + price_per_search): free rerankers score 100, Cohere ($0.002/search) scores ~33. R@K metrics are shown for reference but not included in the total score, since MRR already captures ranking quality. |
| Cost | $ per recall() | Published list price per reranker API call (search). Free and local rerankers score 100. Cohere charges $2.00 per 1,000 searches ($0.002/call). Prices may have changed since testing. |
| R@K | % of questions | Recall at K: the fraction of questions where at least one relevant fact appears in the top-K results returned by recall(). R@1 asks "is the very first result relevant?", so the reranker must put the right fact in position 1 to score a point. R@5 asks "is any of the top 5 results relevant?", giving the reranker more chances to surface the right fact. A good reranker scores high on all three; a poor one only improves when K is large. |
| Latency | avg s/recall | Mean wall-clock time per recall() call, including reranking. Local models (MiniLM, FlashRank) use sentence-transformers cross-encoders running on CPU inside Docker; latency would be significantly lower on a machine with GPU support. |
| Setup | | Benchmark uses the LoComo long-term conversation dataset (conv-43, 165 questions with annotated ground truth). All rerankers share the same ingested bank (retained with gemini-2.5-flash). Candidates per recall: 300 (budget=mid). Ground truth was annotated by having gemini-2.5-flash identify relevant facts from a high-budget recall on the same bank. |
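The metric and scoring formulas above can be sketched in a few lines of Python. The function names and example data are illustrative (this is not the benchmark's actual harness); only the formulas themselves come from the table.

```python
def reciprocal_rank(results, relevant):
    """1/rank of the first relevant fact in the result list, or 0 if none appears."""
    for rank, fact_id in enumerate(results, start=1):
        if fact_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(results, relevant, k):
    """1 if any relevant fact appears in the top-k results, else 0."""
    return 1.0 if any(f in relevant for f in results[:k]) else 0.0

def total_score(mrr, latency_s, price_per_search):
    """Weighted composite: MRR 70% + Speed 15% + Cost 15%."""
    speed = 100 * 1 / (1 + latency_s)                 # 1 s reference latency
    cost = 100 * 0.001 / (0.001 + price_per_search)   # free rerankers score 100
    return 0.70 * (mrr * 100) + 0.15 * speed + 0.15 * cost

# Reproducing two leaderboard rows from their published sub-metrics:
print(round(total_score(0.788, 3.91, 0.0), 1))    # MiniLM-L6 -> 73.2
print(round(total_score(0.784, 0.73, 0.002), 1))  # Cohere Multilingual v3 -> 68.6
```

Averaging `reciprocal_rank` over all 165 questions gives the MRR column, and averaging `recall_at_k` gives the R@K columns.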

Interpreting the results

Why does a small local model (MiniLM-L6) rank at the top?

A few factors combine here. First, the ground truth has implicit lexical bias: relevant facts were annotated from a high-budget RRF recall (RRF is partly BM25-based, favouring lexical overlap), then labelled by an LLM. MiniLM-L6 is a cross-encoder trained on MS MARCO QA pairs — short factual sentences with strong lexical signal — which closely matches the style of LoComo personal memory facts ("John plays basketball", "Tim collects vinyl").

Second, commercial models like Cohere and Voyage are optimised for large-scale document retrieval (legal, financial, technical corpora). They may be over-calibrated for longer, denser texts and underperform on short conversational facts, especially with a pool of only 300 candidates.

Third, with only 165 questions from a single conversation the confidence intervals are wide — a handful of rankings on common questions can swing MRR by several points. The gap between MiniLM-L6 (0.788), Cohere v3 (0.784), and FlashRank (0.707) is likely not statistically significant.
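A quick way to see why a 0.788 vs 0.784 gap is fragile at this sample size is to bootstrap the per-question reciprocal ranks. The data below is simulated (the benchmark's raw per-question results are not published); only the resampling procedure is the point.

```python
# Bootstrap a confidence interval for the MRR gap between two rerankers.
# Per-question reciprocal ranks are SIMULATED from one shared distribution,
# standing in for two rerankers with near-identical behaviour.
import random

random.seed(0)
N = 165                              # questions, as in the benchmark
RANKS = [1, 2, 3, 5, 10]             # rank of the first relevant fact
WEIGHTS = [0.70, 0.12, 0.08, 0.06, 0.04]

rr_a = [1.0 / random.choices(RANKS, WEIGHTS)[0] for _ in range(N)]
rr_b = [1.0 / random.choices(RANKS, WEIGHTS)[0] for _ in range(N)]

def mrr(rr):
    return sum(rr) / len(rr)

def bootstrap_ci(a, b, iters=5000, alpha=0.05):
    """CI for MRR(a) - MRR(b), resampling questions with replacement."""
    diffs = []
    for _ in range(iters):
        idx = [random.randrange(N) for _ in range(N)]
        diffs.append(mrr([a[i] for i in idx]) - mrr([b[i] for i in idx]))
    diffs.sort()
    return diffs[int(alpha / 2 * iters)], diffs[int((1 - alpha / 2) * iters)]

lo, hi = bootstrap_ci(rr_a, rr_b)
print(f"observed gap: {mrr(rr_a) - mrr(rr_b):+.3f}, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

If the interval straddles zero, as it typically does for gaps this small at N=165, the observed ranking difference is within sampling noise.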

Bottom line: these numbers reflect performance on one specific task — conversational personal-memory retrieval with short facts. They should not be read as a general reranker ranking. For document retrieval or technical corpora, commercial models will likely perform better.

Why does Cohere v4 Fast score higher than v4 Pro?

Counterintuitive but consistent with the domain mismatch above. v4 Pro is optimised for complex, nuanced relevance judgements across long documents. On short conversational facts it may over-think the ranking and penalise simple lexical matches that are actually correct. v4 Fast uses a lighter scoring head that behaves more like a cross-encoder on short texts — closer to what this benchmark rewards. This pattern (smaller/faster variant winning on short-text tasks) is common in cross-encoder literature.

Why do Voyage models score lower than expected?

Voyage Rerank 2 and 2-Lite are trained primarily on passage- and document-retrieval tasks (BEIR, MTEB). The LoComo facts are much shorter and more conversational than their training distribution. Voyage models also apply strong semantic generalisation, which can hurt precision when the relevant fact appears nearly verbatim among the retrieved candidates, exactly the situation that R@1 and MRR reward.