Hindsight/Benchmarks

Reranker Leaderboard

Which reranker surfaces the most relevant facts at the top of recall() results?

Cost and Speed sub-scores (0–100 scale) are shown in parentheses; Total Score is the weighted composite MRR 70% + Speed 15% + Cost 15%, detailed under "About This Benchmark". R@K = Recall at K.

| Rank | Reranker | Provider | Total Score | MRR | R@1 | R@3 | R@5 | Cost ($ per recall()) | Speed (avg s/recall) |
|------|----------|----------|-------------|-----|-----|-----|-----|-----------------------|----------------------|
| 1 🏆 | MiniLM-L6 (`cross-encoder/ms-marco-MiniLM-L-6-v2`) | Local CPU | 73.2 | 0.788 | 69.7% | 85.5% | 90.3% | Free (100.0) | 3.91 s/call (20.4) |
| 2 🥈 | Jina Reranker v2 Multilingual (`jina_ai/jina-reranker-v2-base-multilingual`) | Jina AI | 69.2 | 0.670 | 54.5% | 74.6% | 81.8% | $0.0000/search (100.0) | 1.06 s/call (48.5) |
| 3 🥉 | Cohere Rerank Multilingual v3 (`rerank-multilingual-v3.0`) | Cohere | 68.6 | 0.784 | 69.1% | 86.1% | 89.7% | $0.0020/search (33.3) | 0.73 s/call (57.7) |
| 4 | Cohere Rerank English v3 (`rerank-english-v3.0`) | Cohere | 68.4 | 0.784 | 70.3% | 83.0% | 88.5% | $0.0020/search (33.3) | 0.76 s/call (56.8) |
| 5 | FlashRank MiniLM-L12 (`ms-marco-MiniLM-L-12-v2`) | Local CPU | 68.1 | 0.707 | 60.6% | 75.8% | 83.0% | Free (100.0) | 3.16 s/call (24.1) |
| 6 | Cohere Rerank v4 Fast (`rerank-v4.0-fast`) | Cohere | 63.6 | 0.747 | 65.5% | 81.2% | 87.3% | $0.0025/search (28.6) | 1.14 s/call (46.6) |
| 7 | Voyage Rerank 2 (`voyage/rerank-2`) | Voyage AI | 62.0 | 0.545 | 40.0% | 60.6% | 73.9% | $0.0000/search (100.0) | 0.68 s/call (59.4) |
| 8 | Cohere Rerank v4 Pro (`rerank-v4.0-pro`) | Cohere | 57.5 | 0.678 | 55.1% | 77.0% | 81.8% | $0.0050/search (16.7) | 1.00 s/call (50.0) |
| 9 | Voyage Rerank 2 Lite (`voyage/rerank-2-lite`) | Voyage AI | 54.1 | 0.430 | 28.5% | 48.5% | 62.4% | $0.0000/search (100.0) | 0.66 s/call (60.1) |
| 10 | No reranker (RRF baseline) | n/a | 38.5 | 0.183 | 6.7% | 14.5% | 29.7% | Free (100.0) | 0.40 s/call (71.3) |

About This Benchmark

| Metric | Value shown | How it is measured |
|--------|-------------|--------------------|
| MRR | 0–1 | Mean Reciprocal Rank: for each question the rank of the first relevant fact in the recall() results is recorded, then averaged as 1/rank across all questions. Higher is better; 1.0 means the relevant fact is always the top result. |
| Total Score | 0–100 | Weighted composite: MRR 70% + Speed 15% + Cost 15%. MRR is scaled directly to 0–100 (MRR × 100). Speed uses 100 × 1 / (1 + latency_s) (1 s reference). Cost uses 100 × 0.001 / (0.001 + price_per_search): free rerankers score 100, Cohere ($0.002/search) scores ~33. R@K metrics are shown for reference but not included in the total score, since MRR already captures ranking quality. |
| Cost | $ per recall() | Published list price per reranker API call (search). Free and local rerankers score 100. Cohere charges $2.00 per 1,000 searches ($0.002/call). Prices may have changed since testing. |
| R@K | % of questions | Recall at K: the fraction of questions where at least one relevant fact appears in the top-K results returned by recall(). R@1 asks "is the very first result relevant?", so the reranker must put the right fact in position 1 to score a point. R@5 asks "is any of the top 5 results relevant?", giving the reranker more chances to surface the right fact. A good reranker scores high on all three; a poor one only improves when K is large. |
| Latency | avg s/recall | Mean wall-clock time per recall() call, including reranking. Local models (MiniLM, FlashRank) use sentence-transformers cross-encoders running on CPU inside Docker; latency would be significantly lower on a machine with GPU support. |
| Setup | | Benchmark uses the LoComo long-term conversation dataset (conv-43, 165 questions with annotated ground truth). All rerankers share the same ingested bank (retained with gemini-2.5-flash). Candidates per recall: 300 (budget=mid). Ground truth was annotated by having gemini-2.5-flash identify relevant facts from a high-budget recall on the same bank. |
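The metric and scoring formulas above can be sketched in a few lines of Python. The function names and example data are illustrative (this is not the benchmark's actual harness); only the formulas themselves come from the table.

```python
def reciprocal_rank(results, relevant):
    """1/rank of the first relevant fact in the result list, or 0 if none appears."""
    for rank, fact_id in enumerate(results, start=1):
        if fact_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(results, relevant, k):
    """1 if any relevant fact appears in the top-k results, else 0."""
    return 1.0 if any(f in relevant for f in results[:k]) else 0.0

def total_score(mrr, latency_s, price_per_search):
    """Weighted composite: MRR 70% + Speed 15% + Cost 15%."""
    speed = 100 * 1 / (1 + latency_s)                 # 1 s reference latency
    cost = 100 * 0.001 / (0.001 + price_per_search)   # free rerankers score 100
    return 0.70 * (mrr * 100) + 0.15 * speed + 0.15 * cost

# Reproducing two leaderboard rows from their published sub-metrics:
print(round(total_score(0.788, 3.91, 0.0), 1))    # MiniLM-L6 -> 73.2
print(round(total_score(0.784, 0.73, 0.002), 1))  # Cohere Multilingual v3 -> 68.6
```

Averaging `reciprocal_rank` over all 165 questions gives the MRR column, and averaging `recall_at_k` gives the R@K columns.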

Interpreting the results

Why does a small local model (MiniLM-L6) rank at the top?

A few factors combine here. First, the ground truth has implicit lexical bias: relevant facts were annotated from a high-budget RRF recall (RRF is partly BM25-based, favouring lexical overlap), then labelled by an LLM. MiniLM-L6 is a cross-encoder trained on MS MARCO QA pairs — short factual sentences with strong lexical signal — which closely matches the style of LoComo personal memory facts ("John plays basketball", "Tim collects vinyl").

Second, commercial models like Cohere and Voyage are optimised for large-scale document retrieval (legal, financial, technical corpora). They may be over-calibrated for longer, denser texts and underperform on short conversational facts, especially with a pool of only 300 candidates.

Third, with only 165 questions from a single conversation the confidence intervals are wide — a handful of rankings on common questions can swing MRR by several points. The gap between MiniLM-L6 (0.788), Cohere v3 (0.784), and FlashRank (0.707) is likely not statistically significant.
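A quick way to see why a 0.788 vs 0.784 gap is fragile at this sample size is to bootstrap the per-question reciprocal ranks. The data below is simulated (the benchmark's raw per-question results are not published); only the resampling procedure is the point.

```python
# Bootstrap a confidence interval for the MRR gap between two rerankers.
# Per-question reciprocal ranks are SIMULATED from one shared distribution,
# standing in for two rerankers with near-identical behaviour.
import random

random.seed(0)
N = 165                              # questions, as in the benchmark
RANKS = [1, 2, 3, 5, 10]             # rank of the first relevant fact
WEIGHTS = [0.70, 0.12, 0.08, 0.06, 0.04]

rr_a = [1.0 / random.choices(RANKS, WEIGHTS)[0] for _ in range(N)]
rr_b = [1.0 / random.choices(RANKS, WEIGHTS)[0] for _ in range(N)]

def mrr(rr):
    return sum(rr) / len(rr)

def bootstrap_ci(a, b, iters=5000, alpha=0.05):
    """CI for MRR(a) - MRR(b), resampling questions with replacement."""
    diffs = []
    for _ in range(iters):
        idx = [random.randrange(N) for _ in range(N)]
        diffs.append(mrr([a[i] for i in idx]) - mrr([b[i] for i in idx]))
    diffs.sort()
    return diffs[int(alpha / 2 * iters)], diffs[int((1 - alpha / 2) * iters)]

lo, hi = bootstrap_ci(rr_a, rr_b)
print(f"observed gap: {mrr(rr_a) - mrr(rr_b):+.3f}, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

If the interval straddles zero, as it typically does for gaps this small at N=165, the observed ranking difference is within sampling noise.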

Bottom line: these numbers reflect performance on one specific task — conversational personal-memory retrieval with short facts. They should not be read as a general reranker ranking. For document retrieval or technical corpora, commercial models will likely perform better.

Why does Cohere v4 Fast score higher than v4 Pro?

Counterintuitive but consistent with the domain mismatch above. v4 Pro is optimised for complex, nuanced relevance judgements across long documents. On short conversational facts it may over-think the ranking and penalise simple lexical matches that are actually correct. v4 Fast uses a lighter scoring head that behaves more like a cross-encoder on short texts — closer to what this benchmark rewards. This pattern (smaller/faster variant winning on short-text tasks) is common in cross-encoder literature.

Why do Voyage models score lower than expected?

Voyage Rerank 2 and 2-Lite are trained primarily on passage- and document-retrieval tasks (BEIR, MTEB). The LoComo facts are much shorter and more conversational than their training distribution. Voyage models also apply strong semantic generalisation, which can hurt precision when the relevant fact appears nearly verbatim among the retrieved candidates, exactly the situation that R@1 and MRR reward.