Reranking is the step most RAG pipelines skip, then regret.
A retrieval step (vector search, BM25, or both) pulls candidate chunks from your document store. It’s fast, but imprecise. A reranker takes that candidate list and re-scores each query-chunk pair with a more expensive cross-encoder pass that attends over the query and the chunk together, putting the most relevant results at the top before they reach the LLM. Skip the reranker, and your LLM context window fills with approximately-right chunks instead of the actually-right ones. Add a reranker, and you pay in latency.
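To make the retrieve-then-rerank flow concrete, here is a minimal sketch. The scorer is a toy word-overlap function standing in for a cross-encoder, and the names `toy_overlap_scorer` and `rerank` are illustrative, not part of any library.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes (query, document) pairs and returns one relevance score per pair.
Scorer = Callable[[Sequence[Tuple[str, str]]], List[float]]


def toy_overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in for a cross-encoder: fraction of query words found in the document."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        scores.append(len(q_words & d_words) / max(len(q_words), 1))
    return scores


def rerank(
    query: str, candidates: Sequence[str], scorer: Scorer, top_k: int = 3
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_k by descending score."""
    pairs = [(query, doc) for doc in candidates]
    scores = scorer(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Pretend these came back from vector search or BM25, in retrieval order.
candidates = [
    "Rerankers add latency to a pipeline",
    "How to reset a forgotten password",
    "Steps to reset your router",
]
top = rerank("reset forgotten password", candidates, toy_overlap_scorer, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

In a real pipeline the scorer would be a model call. With Sentence Transformers, a loaded `CrossEncoder` exposes a `predict` method that accepts a list of (query, document) pairs, so it should slot in where the toy scorer sits; treat that wiring as something to verify against the library version you run.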
The Ettin Reranker Family is designed to reduce that latency cost significantly, without the licensing fees or API dependency of commercial alternatives. Whether it succeeds depends on your hardware and your tolerance for author-reported benchmarks.
What Was Released
Hugging Face published six CrossEncoder models on May 19, 2026, developed by Tom Aarsen, the Sentence Transformers library maintainer. The six sizes: 17M, 32M, 68M, 150M, 400M, and 1B parameters. All six are built on the Ettin ModernBERT encoder architecture (the same family described in arXiv:2412.13663, December 2024), and all were distilled from a single teacher model, `mixedbread-ai/mxbai-rerank-large-v2`, a 1.54B-parameter model, using pointwise MSE distillation.
Apache 2.0 license. No API call required. Deploy the weights locally.
Training data: `cross-encoder/ettin-reranker-v1-data`, an open dataset released alongside the models. That last detail matters more than the license alone: it means teams can inspect what the model was optimized for, and adapt the distillation approach for domain-specific ranking requirements using the published training recipe.
The context window is 8,192 tokens, inherited from ModernBERT. That’s sufficient for most retrieval chunks but worth confirming against your average document size if you’re working with long-form technical documents or contracts.
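A quick way to check your chunks against that limit is sketched below. The query and the chunk share the window, so the check sums both. The whitespace token counter is a rough stand-in for the model's real tokenizer, and the allowance of 3 special tokens per pair is an assumption, not a figure from the release post.

```python
from typing import Callable, Iterable, List

MAX_TOKENS = 8192  # context window stated for the Ettin rerankers


def whitespace_token_count(text: str) -> int:
    """Crude stand-in: real subword tokenizers usually produce more tokens than this."""
    return len(text.split())


def fits_context(
    query: str,
    document: str,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
    special_tokens: int = 3,  # assumed overhead for separator/classifier tokens
) -> bool:
    """True if the query and document together fit in one reranker input."""
    return count_tokens(query) + count_tokens(document) + special_tokens <= max_tokens


def oversized_chunks(query: str, chunks: Iterable[str]) -> List[int]:
    """Indexes of chunks that would be truncated when paired with this query."""
    return [i for i, chunk in enumerate(chunks) if not fits_context(query, chunk)]


chunks = ["short clause " * 10, "very long contract section " * 3000]
print(oversized_chunks("termination notice period", chunks))
```

Swapping `count_tokens` for a call into the model's own tokenizer gives an exact answer; the structure of the check stays the same.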
The Six Models: Size, Speed, and What the Release Post Claims
Here’s the size-versus-performance trade-off as reported by the releasing team. These figures are from the official Hugging Face release post and have not been independently evaluated by Epoch AI or any third party as of this publication.
The 17M model is the throughput story. Per the release post, it processes approximately 7,517 pairs per second on an H100 GPU, and reportedly outperforms the widely-used 33M MiniLM-L12-v2 by 0.051 NDCG@10 on MTEB Retrieval benchmarks, with roughly half the parameters. That’s the profile you want for high-volume, latency-sensitive pipelines where you’re scoring hundreds of document pairs per query.
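To translate a pairs-per-second figure into a per-query budget, the arithmetic is simple division. The sketch below assumes the reported throughput holds at your batch size, which is optimistic: peak throughput is typically measured on large batches, and a single query with a few dozen candidates may not saturate the GPU.

```python
def rerank_latency_ms(candidates_per_query: int, pairs_per_second: float) -> float:
    """Ideal time to score one query's candidates, ignoring batching and I/O overhead."""
    return candidates_per_query / pairs_per_second * 1000.0


REPORTED_PAIRS_PER_SECOND = 7517  # release-post figure for the 17M model on an H100

for n in (50, 100, 500):
    latency = rerank_latency_ms(n, REPORTED_PAIRS_PER_SECOND)
    print(f"{n} candidates: {latency:.1f} ms")
```

At the reported rate, 100 candidates cost roughly 13 ms in the ideal case. Substitute the throughput you measure on your own hardware to get a number you can plan around.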
The 150M model targets the middle ground. According to the release post, it’s 2.3x faster than comparable ModernBERT-base rerankers. If you’re already running a ModernBERT-based reranker and paying the latency cost, this is the variant to benchmark first.
The 1B model is the accuracy ceiling. Per Hugging Face, it achieves performance within 0.0001 NDCG@10 of the 1.54B teacher model while running 2.4x faster. That’s essentially matching a significantly larger model with a smaller footprint. If your pipeline currently uses mxbai-rerank-large-v2 and you want the same quality at reduced compute cost, this is the variant to test.
The catch is that every performance figure in this section comes from the releasing team’s own evaluation. That’s not unusual for open-source model releases. It does mean you should run your own domain-specific benchmark before committing to a production deployment.
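If you want to run that benchmark yourself, NDCG@10 is straightforward to compute from graded relevance labels. This is a minimal sketch using the linear-gain form of DCG; some benchmarks use an exponential gain instead, so check which variant your reference numbers use before comparing. The relevance lists at the bottom are made-up illustration data.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Linear-gain DCG: relevance at 1-based position p is discounted by log2(p + 1)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k for one query.

    ranked_relevances holds the labels of all judged documents for the query,
    in the order the system ranked them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


def mean_ndcg(per_query: Sequence[Sequence[float]], k: int = 10) -> float:
    """Average NDCG@k across queries."""
    return sum(ndcg_at_k(q, k) for q in per_query) / len(per_query)


# Hypothetical labels for two queries under two rerankers (1 = relevant, 0 = not).
baseline = [[0, 1, 0, 1], [1, 0, 0, 0]]
candidate = [[1, 1, 0, 0], [1, 0, 0, 0]]
delta = mean_ndcg(candidate) - mean_ndcg(baseline)
print(f"NDCG@10 delta: {delta:+.4f}")
```

Run both rerankers over the same labeled queries from your own domain and compare the means; a difference measured this way is more informative for your pipeline than a leaderboard figure.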
The Architecture Insight That Matters for Latency
ModernBERT’s unpadded input propagation, not raw parameter efficiency alone, is the architectural feature responsible for the throughput gains. Traditional BERT-based encoders pad all sequences in a batch to the same length, wasting compute on padding tokens. ModernBERT removes padding before the transformer layers, packs the remaining tokens into a single sequence while tracking where each input starts and ends, and restores the padded layout only at the output if needed, so attention runs on real tokens only. At high batch sizes with variable-length inputs, which is exactly what a reranker sees in a production RAG pipeline, the savings compound.
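The size of the padding problem is easy to estimate from sequence lengths alone. The sketch below counts tokens processed with and without padding for one batch, and builds the cumulative offsets a packed layout needs to keep sequences separate. The lengths are made up, and token counts are only a proxy for compute, since attention cost also grows with sequence length.

```python
from itertools import accumulate
from typing import List, Sequence


def padded_token_count(lengths: Sequence[int]) -> int:
    """Tokens processed when every sequence is padded to the longest in the batch."""
    return max(lengths) * len(lengths)


def unpadded_token_count(lengths: Sequence[int]) -> int:
    """Tokens processed when only real tokens are kept."""
    return sum(lengths)


def cumulative_offsets(lengths: Sequence[int]) -> List[int]:
    """Start/end boundaries of each sequence inside the packed batch."""
    return [0, *accumulate(lengths)]


lengths = [12, 50, 200, 512]  # hypothetical query+chunk lengths in one batch
padded = padded_token_count(lengths)
real = unpadded_token_count(lengths)
wasted_fraction = 1 - real / padded

print(f"padded: {padded} tokens, real: {real} tokens")
print(f"share of padded batch that is padding: {wasted_fraction:.0%}")
print(f"packed boundaries: {cumulative_offsets(lengths)}")
```

The more your chunk lengths vary within a batch, the larger the wasted share, which is why mixed-length reranking workloads benefit most.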
For teams running on CPU or lower-end GPU hardware: the H100 throughput figures in the release post don’t translate directly. The relative advantages of unpadded propagation hold across hardware, but the absolute throughput numbers will be substantially lower. If you’re running on A100 or below, benchmark on your actual hardware before making deployment decisions based on the 7,517 pairs/sec figure.
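A small timing harness is enough to get a pairs-per-second number for your own hardware. This sketch takes any scoring callable, so it runs here with a dummy scorer; `measure_pairs_per_second` and its parameters are illustrative names, and the batch size and repeat counts are arbitrary defaults to tune.

```python
import time
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def measure_pairs_per_second(
    score_fn: Callable[[Sequence[Pair]], Sequence[float]],
    pairs: Sequence[Pair],
    batch_size: int = 32,
    warmup_batches: int = 2,
    repeats: int = 3,
) -> float:
    """Best-of-N throughput for score_fn over the given pairs."""
    batches: List[Sequence[Pair]] = [
        pairs[i : i + batch_size] for i in range(0, len(pairs), batch_size)
    ]
    # Warm-up runs absorb one-time costs such as lazy initialization.
    for batch in batches[:warmup_batches]:
        score_fn(batch)

    best_seconds = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for batch in batches:
            score_fn(batch)
        best_seconds = min(best_seconds, time.perf_counter() - start)
    return len(pairs) / max(best_seconds, 1e-9)


def dummy_scorer(batch: Sequence[Pair]) -> List[float]:
    """Placeholder for a real model call."""
    return [float(len(q) + len(d)) for q, d in batch]


sample_pairs = [("what is reranking", "reranking re-scores candidates " * 20)] * 256
throughput = measure_pairs_per_second(dummy_scorer, sample_pairs)
print(f"{throughput:,.0f} pairs/sec")
```

Use pairs sampled from your real traffic, since length distribution drives the result. If your scoring function leaves results on the GPU, make sure it forces them back to the host before returning, or the timer will stop before the work finishes.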
The Training Recipe Opportunity
Most model releases give you a model. Ettin gives you the training recipe too.
The `cross-encoder/ettin-reranker-v1-data` dataset and the published distillation approach mean that teams with domain-specific ranking requirements (legal document retrieval, scientific literature search, enterprise knowledge bases with specialist terminology) can apply the same pointwise MSE distillation against their own labeled data. You don’t have to use the general-domain model as-is.
This is genuinely unusual. Commercial reranker APIs don’t expose their training pipelines. Most open-source model releases publish weights without methodology. The full distillation recipe being open means Ettin isn’t just a model; it’s a starting point for fine-tuned domain specialists.
The practical constraint: you need a supervision signal, either teacher scores from mxbai-rerank-large-v2 itself or human-labeled relevance judgments. That’s non-trivial to produce for specialist domains. But if your organization has labeled relevance data from a prior retrieval project, the distillation pipeline is now open-source and accessible.
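The pointwise MSE objective itself is small: for each (query, document) pair, the student's score is pushed toward the teacher's score, one pair at a time. The sketch below shows the idea with a deliberately tiny student, a linear function of a single made-up feature, trained by plain gradient descent. It illustrates the loss only; it is not the published recipe, whose hyperparameters are not covered here.

```python
from typing import List, Sequence, Tuple


def mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Pointwise MSE: mean squared gap between student and teacher scores."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


def distill_linear_student(
    features: Sequence[float],
    teacher_scores: Sequence[float],
    lr: float = 0.5,
    steps: int = 500,
) -> Tuple[float, float]:
    """Fit score = w * x + b to the teacher's scores by gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        preds = [w * x + b for x in features]
        errors = [p - t for p, t in zip(preds, teacher_scores)]
        grad_w = sum(2 * e * x for e, x in zip(errors, features)) / n
        grad_b = sum(2 * e for e in errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: one feature per pair, and the teacher's score for that pair.
features: List[float] = [0.0, 0.25, 0.5, 0.75, 1.0]
teacher_scores: List[float] = [-0.5, 0.0, 0.5, 1.0, 1.5]

loss_before = mse([0.0] * len(features), teacher_scores)
w, b = distill_linear_student(features, teacher_scores)
loss_after = mse([w * x + b for x in features], teacher_scores)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.6f}")
```

In an actual run the student is the cross-encoder and the features are the raw text pairs, but the target is the same: match the teacher's score for each pair independently, with no pairwise or listwise comparison between documents.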
Deployment Decision Framework
The choice across six sizes comes down to three variables: your latency budget, your accuracy floor, and your inference hardware.
For edge or local deployment with tight latency requirements, the 17M or 32M models are the practical range. You’ll trade some NDCG@10 points but stay within interactive response times on commodity hardware.
For server-side deployment where accuracy matters more than throughput per query, the 400M or 1B models are the range to evaluate. The 1B model’s near-teacher performance makes it the default recommendation if you have H100-class hardware and accuracy is the priority.
For general-purpose RAG pipelines on mid-tier GPU hardware, the 150M variant is the most defensible starting point. The 2.3x speed advantage over ModernBERT-base rerankers gives you room to add the reranking step without blowing your latency budget, and the parameter count is small enough to co-locate with your embedding model on the same GPU.
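The three cases above can be written down as a lookup, which is handy if you want the shortlist to live in a benchmark script. The profile names are invented for this sketch; the size lists come straight from the recommendations above.

```python
from typing import Dict, List

# Profile names are illustrative labels, not terms from the release post.
SIZES_TO_BENCHMARK: Dict[str, List[str]] = {
    "edge_low_latency": ["17M", "32M"],
    "general_mid_tier_gpu": ["150M"],
    "server_accuracy_first": ["400M", "1B"],
}


def sizes_for_profile(profile: str) -> List[str]:
    """Return the model sizes worth benchmarking first for a deployment profile."""
    if profile not in SIZES_TO_BENCHMARK:
        known = ", ".join(sorted(SIZES_TO_BENCHMARK))
        raise ValueError(f"unknown profile {profile!r}; expected one of: {known}")
    return SIZES_TO_BENCHMARK[profile]


print(sizes_for_profile("general_mid_tier_gpu"))
```

The lookup only narrows the field. The final choice still depends on the throughput and NDCG@10 you measure for each shortlisted size.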
TJS Synthesis
Run your own benchmark before deploying in production. The Ettin models are from a credible releasing team (Tom Aarsen maintains Sentence Transformers, one of the most widely-used retrieval libraries in the Python ecosystem), and the architecture choices have a sound technical rationale. But the performance figures are self-reported. “Epoch evaluation pending” isn’t a hedge; it’s a genuine gap. A model that claims +0.051 NDCG@10 over MiniLM-L12-v2 on MTEB Retrieval might perform differently on your document distribution, your query types, or your hardware configuration. The 150M variant is the recommended starting point for most pipelines: it’s the size where the speed/accuracy trade-off is most defensible without independent verification. Wait for Epoch AI or a reproducible community benchmark before committing the 1B model to a high-stakes production workflow.