Enabling MoE on the Edge via Importance-Driven Expert Scheduling

AI updates on arXiv.org

November 21, 2025 | Tech Jacks Solutions

arXiv:2508.18983v2 Announce Type: replace
Abstract: The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
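
The substitution idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say how importance or functional similarity is measured, so the code below assumes the normalized router gate weight as an importance proxy, a precomputed pairwise similarity table, and a fixed threshold. The names `plan_experts`, `Plan`, `threshold`, and `similarity` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    use: list = field(default_factory=list)          # expert ids actually executed
    fetched: list = field(default_factory=list)      # experts that must be copied to the GPU
    substituted: dict = field(default_factory=dict)  # original id -> cached stand-in


def plan_experts(gate_weights, cached, similarity, top_k=2, threshold=0.3):
    """Choose experts for one token (illustrative sketch only).

    gate_weights: dict expert_id -> router weight (assumed importance proxy)
    cached:       set of expert ids currently resident in GPU memory
    similarity:   dict (a, b) -> similarity score (assumed to be precomputed)
    """
    # Experts the router activates: the top_k by gate weight.
    activated = sorted(gate_weights, key=gate_weights.get, reverse=True)[:top_k]
    total = sum(gate_weights[e] for e in activated)
    plan = Plan()
    for e in activated:
        importance = gate_weights[e] / total
        if e in cached:
            plan.use.append(e)  # cache hit, no transfer needed
            continue
        # Stand-in candidates: cached experts not already used for this token.
        candidates = sorted(
            c for c in cached if c not in plan.use and c not in activated
        )
        if importance < threshold and candidates:
            # Low-importance miss: reuse the most similar cached expert.
            stand_in = max(candidates, key=lambda c: similarity.get((e, c), 0.0))
            plan.use.append(stand_in)
            plan.substituted[e] = stand_in
        else:
            # High-importance miss: pay the transfer cost to keep accuracy.
            plan.use.append(e)
            plan.fetched.append(e)
    return plan


if __name__ == "__main__":
    weights = {0: 0.6, 1: 0.2, 2: 0.1, 3: 0.1}
    sim = {(1, 2): 0.9, (1, 3): 0.4}
    plan = plan_experts(weights, cached={2, 3}, similarity=sim)
    print(plan.use, plan.fetched, plan.substituted)  # [0, 2] [0] {1: 2}
```

In this toy run, expert 0 carries most of the gate weight and is fetched despite the cache miss, while the low-weight expert 1 is replaced by cached expert 2, so only one transfer is needed instead of two.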
