PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression

AI updates on arXiv.org

January 1, 2026, Tech Jacks Solutions

arXiv:2512.24449v1 Announce Type: cross
Abstract: Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge because of the substantial memory requirements of the key-value (KV) cache, which can grow to several gigabytes as sequence length and batch size increase. In this paper, we present PackKV, a generic and efficient KV cache management framework optimized for long-context generation. PackKV introduces novel lossy compression techniques specifically tailored to the characteristics of KV cache data, featuring a careful co-design of compression algorithms and system architecture. Our approach is compatible with the dynamically growing nature of the KV cache while preserving high computational efficiency. Experimental results show that, at the same minimal accuracy drop as state-of-the-art quantization methods, PackKV achieves, on average, a 153.2% higher memory reduction rate for the K cache and 179.6% for the V cache. Furthermore, PackKV delivers extremely high execution throughput, effectively eliminating decompression overhead and accelerating the matrix-vector multiplication operation. Specifically, compared to cuBLAS matrix-vector multiplication kernels, PackKV achieves an average throughput improvement of 75.7% for K and 171.7% for V across A100 and RTX Pro 6000 GPUs, while demanding less GPU memory bandwidth. Code is available at https://github.com/BoJiang03/PackKV
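
To make the memory growth mentioned in the abstract concrete, here is a rough back-of-the-envelope sketch of uncompressed KV cache size. It is not taken from the paper: the function name `kv_cache_bytes` and the model configuration (32 layers, 8 KV heads, head dimension 128, fp16 storage) are hypothetical values chosen for illustration, and the formula assumes a standard decoder that stores one K and one V vector per token, per layer, per KV head.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate uncompressed KV cache size in bytes.

    Assumes one K and one V tensor per layer, each of shape
    (batch_size, num_kv_heads, seq_len, head_dim).
    """
    per_tensor = batch_size * num_kv_heads * seq_len * head_dim * bytes_per_elem
    return 2 * num_layers * per_tensor  # factor 2: K cache plus V cache


# Hypothetical configuration, not taken from the paper.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

single = kv_cache_bytes(seq_len=32_768, batch_size=1, **config)
batched = kv_cache_bytes(seq_len=32_768, batch_size=8, **config)

print(f"batch 1: {single / 2**30:.1f} GiB")   # batch 1: 4.0 GiB
print(f"batch 8: {batched / 2**30:.1f} GiB")  # batch 8: 32.0 GiB
```

Under these assumptions the cache grows linearly in both sequence length and batch size, which is why a compression scheme applied to K and V directly reduces the dominant memory term in long-context inference.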
