FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design AI updates on arXiv.org

January 23, 2026, Tech Jacks Solutions

arXiv:2601.15710v1 Announce Type: cross
Abstract: We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing the SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA at 7nm reach 4.71$\times$, 6.55$\times$, and 4.13$\times$, respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23$\times$ and extends the context window by 64$\times$, delivering 1.10$\times$/4.86$\times$ lower end-to-end latency and 5.21$\times$/6.27$\times$ higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.
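The abstract's case for treating prefill and decode differently can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from FlexLLM: the function name, the 2048-wide layer, the 512-token prompt, and the 2-byte (BF16-sized) operands are all illustrative assumptions. It only shows the general, well-known pattern that a dense layer applied to many tokens at once (prefill) performs far more arithmetic per byte moved than the same layer applied to a single token (decode).

```python
def linear_layer_intensity(num_tokens, d_in, d_out,
                           weight_bytes=2.0, act_bytes=2.0):
    """Estimate FLOPs per byte moved for one dense layer.

    Hypothetical helper for illustration only. Assumes the weights are
    read once and every input/output activation is moved once.
    """
    # One multiply and one add per weight, per token.
    flops = 2 * num_tokens * d_in * d_out
    # Weight traffic is independent of the token count;
    # activation traffic grows linearly with it.
    bytes_moved = (d_in * d_out * weight_bytes
                   + num_tokens * (d_in + d_out) * act_bytes)
    return flops / bytes_moved


HIDDEN = 2048        # assumed layer width, for illustration
PROMPT_TOKENS = 512  # assumed prompt length, for illustration

# Prefill: the whole prompt passes through the layer together.
prefill = linear_layer_intensity(PROMPT_TOKENS, HIDDEN, HIDDEN)
# Decode: one new token per step.
decode = linear_layer_intensity(1, HIDDEN, HIDDEN)

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.3f} FLOPs/byte")
```

Under these assumptions the prefill figure is a few hundred times the decode figure, which is one way to see why a design might favor weight reuse across tokens in one stage and streaming dataflow in the other. The actual trade-offs in FlexLLM depend on details the abstract does not give.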
