Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning AI updates on arXiv.org

December 31, 2025, Tech Jacks Solutions

arXiv:2512.23087v1 Announce Type: cross
Abstract: Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$ where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large, and moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically-pruned “safe” vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.
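
A rough way to build intuition for the $(1-p)$ scaling is the softmax identity $\partial \log p_i / \partial z_i = 1 - p_i$: a small error in a token's own logit moves its log-probability by about $(1-p_i)$ times that error. The sketch below checks this numerically and then applies a simple pruning step. It is not the paper's proof or algorithm. The single-logit perturbation, the function names, the `min_prob` threshold rule, and the example logits are all assumptions made for illustration, since the abstract does not specify how the "safe" vocabulary is chosen.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def logprob_shift(logits, index, delta):
    """Change in log p[index] when only logits[index] is perturbed by delta."""
    perturbed = list(logits)
    perturbed[index] += delta
    return log_softmax(perturbed)[index] - log_softmax(logits)[index]

def prune_vocabulary(logits, min_prob):
    """Hypothetical pruning rule (an assumption, not the paper's rule):
    keep tokens with p >= min_prob, always keep the most likely token,
    and renormalize over the kept set. Pruned tokens get log-prob -inf."""
    logp = log_softmax(logits)
    top = max(logp)
    threshold = math.log(min_prob)
    keep = [lp >= threshold or lp == top for lp in logp]
    kept_logp = iter(log_softmax([z for z, k in zip(logits, keep) if k]))
    return [next(kept_logp) if k else float("-inf") for k in keep]

# Illustrative logits: one dominant token and a thin tail.
logits = [8.0, 2.0, 0.0, -4.0]
delta = 1e-3  # stand-in for a small numerical discrepancy in one logit
probs = [math.exp(lp) for lp in log_softmax(logits)]

for i, p in enumerate(probs):
    ratio = logprob_shift(logits, i, delta) / delta
    print(f"token {i}: p={p:.2e}  shift/delta={ratio:.4f}  1-p={1 - p:.4f}")

# Tokens below the threshold are excluded; the rest are renormalized.
pruned = prune_vocabulary(logits, min_prob=1e-3)
print(pruned)
```

In this toy setting the dominant token's log-probability barely moves under the perturbation, while the tail tokens absorb almost all of it, which is the asymmetry the abstract describes. The threshold here is fixed for simplicity; a dynamic rule would recompute the kept set at every decoding step.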
