arXiv:2512.09427v2 Announce Type: replace-cross
Abstract: Device-memory management is a key bottleneck for serving large language models (LLMs) on accelerators whose memory has poor small-granularity random-access bandwidth (e.g., LPDDR5-class). Existing approaches either statically pre-allocate worst-case KV-cache per request, wasting substantial device memory, or rely on fine-grained paging that assumes high random-access tolerance and is therefore ill-suited to LPDDR-style systems. We present ODMA, an on-demand memory allocation framework for LLM serving on random-access-constrained device memory (RACM) platforms such as LPDDR5-based Cambricon MLUs. ODMA builds on generation-length prediction while addressing distribution drift and heavy-tailed request lengths via dynamic bucket partitioning and a large-bucket safeguard: bucket boundaries are periodically re-learned from online histograms, and high-uncertainty or overflowed requests fall back to a reserved large bucket for robustness. On Alpaca and Google-NQ, ODMA improves S3’s predictor accuracy from 98.60% to 99.55% and from 82.68% to 93.36%, respectively. Serving DeepSeek-R1-Distill-Qwen-7B on four Cambricon MLU370-X4 accelerators, ODMA increases device-memory utilization from 55.05% to 72.45% on Alpaca and from 42.54% to 61.79% on Google-NQ, and boosts throughput by 23% and 27% over a static pre-allocation baseline. These results show that predictor-driven, hardware-aware allocation can unlock efficient LLM serving on RACM accelerators without hardware changes, complementing paging-centric designs tailored to HBM systems.
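To make the two mechanisms the abstract names concrete, here is a minimal Python sketch: bucket boundaries periodically re-learned from an online record of observed generation lengths, plus a reserved large bucket that absorbs low-confidence predictions and overflowed requests. All class and method names, the quantile-based re-learning rule, and the parameter values are hypothetical illustrations, not the paper's actual MLU implementation.

```python
import numpy as np


class BucketAllocator:
    """Sketch of ODMA-style dynamic bucket partitioning (hypothetical names)."""

    def __init__(self, num_buckets=4, max_len=4096, relearn_every=1000):
        self.num_buckets = num_buckets
        self.max_len = max_len              # capacity of the reserved large bucket
        self.relearn_every = relearn_every
        self.observed = []                  # online history of true generation lengths
        # Evenly spaced boundaries until enough data arrives to re-learn.
        self.boundaries = list(
            np.linspace(max_len / num_buckets, max_len, num_buckets, dtype=int)
        )

    def record(self, true_len):
        """Feed back an observed generation length; re-learn periodically."""
        self.observed.append(true_len)
        if len(self.observed) % self.relearn_every == 0:
            self._relearn()

    def _relearn(self):
        """Re-derive boundaries from recent lengths (sliding window).

        Equal-mass quantiles track distribution drift; the top boundary is
        pinned at max_len so the large bucket always covers the heavy tail.
        """
        window = np.array(self.observed[-10 * self.relearn_every:])
        interior = np.quantile(
            window, np.linspace(0, 1, self.num_buckets + 1)[1:-1]
        )
        self.boundaries = sorted(int(q) for q in interior) + [self.max_len]

    def choose_bucket(self, predicted_len, confident=True):
        """Map a predicted length to a bucket size; low-confidence predictions
        fall back to the reserved large bucket for robustness."""
        if not confident:
            return self.max_len
        for boundary in self.boundaries:
            if predicted_len <= boundary:
                return boundary
        return self.max_len

    def on_overflow(self):
        """A request that outgrows its bucket migrates to the large bucket
        (a stand-in for whatever remediation the real system performs)."""
        return self.max_len


if __name__ == "__main__":
    alloc = BucketAllocator()
    print(alloc.choose_bucket(300))      # 1024 under the initial even split
    for n in np.random.randint(50, 600, size=1000):
        alloc.record(int(n))             # triggers one re-learning pass
    print(alloc.choose_bucket(300))      # tighter bucket after re-learning
```

Equal-mass quantiles keep the learned buckets balanced as the request-length distribution drifts, while the pinned top boundary gives heavy-tailed requests a safe landing spot; whether ODMA uses quantiles specifically is an assumption of this sketch.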

Author: Tech Jacks Solutions
