Large language models (LLMs) are increasingly being deployed on edge devices (hardware that processes data locally near its source, such as smartphones, laptops, and robots). Running LLMs on these devices supports advanced AI and real-time services, but their massive size, often billions of parameters, demands significant memory and computational power, limiting widespread adoption. Low-bit quantization, a technique that compresses models and reduces memory demands, offers a solution by enabling more efficient operation.
Recent advances in low-bit quantization have made mixed-precision matrix multiplication (mpGEMM) viable for LLMs. In mpGEMM, the two matrix operands can use different formats, such as int8*int1, int8*int2, or FP16*int4. By combining a variety of precision levels, mpGEMM strikes a balance among speed, memory efficiency, and computational accuracy.
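To make the computation concrete, here is a minimal NumPy sketch of one such combination, FP16 activations times int4 weights. The group size and per-group scale layout are illustrative assumptions rather than any particular library's API:

```python
import numpy as np

def mpgemm_fp16_int4(a_fp16, w_int4, scales, group=64):
    """Reference mpGEMM: FP16 activations times group-quantized int4 weights.

    a_fp16: (M, K) float16 activations
    w_int4: (K, N) int8 array holding 4-bit values in [-8, 7]
    scales: (K // group, N) float16 per-group quantization scales
    """
    K, N = w_int4.shape
    # On hardware without asymmetric (mixed-format) units, the int4 operand
    # must first be widened to FP16; only then can a standard GEMM run.
    w_fp16 = (w_int4.astype(np.float16).reshape(K // group, group, N)
              * scales[:, None, :]).reshape(K, N)
    return a_fp16 @ w_fp16
```

The widening step in the middle is exactly the dequantization overhead discussed later in this post; native mpGEMM support would let the hardware consume the 4-bit operand directly.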
However, most hardware supports only symmetric computations, in which both operands share the same format, which creates challenges for mixed-precision calculations during general matrix multiplication (GEMM), a critical operation for LLMs. Overcoming these hardware limitations is essential to fully benefit from mpGEMM and to support asymmetric computations.
To unlock the potential of low-bit quantization on resource-constrained edge devices, hardware must natively support mpGEMM. To address this, we developed three approaches spanning computing kernels and hardware architectures:

- Ladder, a data type compiler that separates data storage from computation so that custom low-bit formats can run on existing hardware.
- T-MAC, a lookup table (LUT)-based kernel that performs mpGEMM on CPUs without dequantization or multiplication.
- LUT Tensor Core, a software-hardware codesign that brings native LUT-based mpGEMM support to hardware accelerators.

The following sections describe these techniques in detail.
Cutting-edge hardware accelerators, such as GPUs, TPUs, and specialized chips, are designed to speed up computationally intensive tasks like deep learning by efficiently handling large-scale operations. These accelerators increasingly integrate lower-bit computing units into their architectures, moving from FP32 to FP16 and even FP8.
However, constraints in chip area and hardware costs limit the availability of these units for standard data types. For instance, the NVIDIA V100 Tensor Core GPU supports only FP16, while the A100 supports int2, int4, and int8 but not newer formats like FP8 or OCP-MXFP. Additionally, the rapid development of LLMs often outpaces hardware upgrades, leaving many new data formats unsupported and complicating deployment.
Meanwhile, although hardware accelerators may lack compute support for custom data types, their memory systems store data as fixed-width blocks that can hold any format; the data can then be converted to a supported type for computation. For instance, NF4 tensors can be converted into FP16 or FP32 for floating-point operations.
Building on these insights, we developed the Ladder data type compiler, a method to separate data storage from computation, enabling broader support for custom data types. It bridges the gap between emerging custom data formats and the precision types supported by current hardware.
Ladder offers a flexible system for converting between algorithm-specific and hardware-supported data types without data loss. For low-bit applications, it optimizes performance by translating low-bit data into the most efficient formats for the hardware being used. As shown in Figure 1, this includes mapping low-bit computations to supported instructions and efficiently managing data storage across the memory hierarchy.
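As a rough illustration of this storage/compute separation, a custom 4-bit type can be kept as packed fixed-width bytes in memory and decoded to FP16 only when the hardware needs to compute on it. The codebook below is a made-up stand-in, not the real NF4 levels:

```python
import numpy as np

# Hypothetical 16-entry codebook standing in for the NF4 quantization
# levels (illustrative values, not the real NF4 constants).
CODEBOOK = np.linspace(-1.0, 1.0, 16).astype(np.float16)

def pack_4bit(codes):
    """Storage: keep 4-bit codes two per byte, i.e., as fixed-width
    blocks that any memory system can hold regardless of logical type."""
    codes = codes.astype(np.uint8)
    return (codes[0::2] << 4) | codes[1::2]

def decode_to_fp16(packed):
    """Computation: expand the packed codes into FP16 values so they can
    feed the floating-point units the hardware actually provides."""
    hi, lo = (packed >> 4) & 0xF, packed & 0xF
    codes = np.empty(hi.size * 2, dtype=np.uint8)
    codes[0::2], codes[1::2] = hi, lo
    return CODEBOOK[codes]

codes = np.random.randint(0, 16, size=128)   # a tensor of NF4-style codes
w_fp16 = decode_to_fp16(pack_4bit(codes))    # ready for FP16 GEMM
```

Ladder automates and optimizes this kind of mapping across the memory hierarchy instead of leaving it to hand-written conversion code.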
Evaluations of Ladder on NVIDIA and AMD GPUs show that it outperforms existing deep neural network (DNN) compilers for natively supported data types. It also handles custom data types not supported by GPUs, achieving speedups of up to 14.6 times.
As the first system to support custom low-precision data types for running DNNs on modern hardware accelerators, Ladder provides researchers with flexibility in optimizing data types. It also enables hardware developers to support a wider range of data types without requiring hardware modifications.
Deploying low-bit quantized LLMs on edge devices often requires dequantizing models to ensure hardware compatibility. However, this approach has two major drawbacks: the conversion step adds computation and memory traffic during inference, and the matrix multiplication still runs at the higher precision, so it forfeits much of the efficiency gain that low-bit quantization promises.
To address these challenges, we introduce T-MAC, a novel lookup table (LUT)-based method that enables mpGEMM without dequantization or multiplication.
T-MAC replaces traditional multiplication operations with bit-wise table lookups, offering a unified and scalable solution for mpGEMM. It incorporates techniques to reduce the size of tables and store them directly on the chip, minimizing the overhead of accessing data from memory. By eliminating dequantization and lowering computational costs, T-MAC enables efficient inference of low-bit LLMs on resource-constrained edge devices. Figure 2 illustrates T-MAC’s architecture.
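The following is a minimal NumPy sketch of the table-lookup idea for the simplest case of 1-bit weights (values in {-1, +1}) and a group size of 4. It illustrates the principle rather than T-MAC's actual kernels, which operate on packed bits with SIMD instructions and on-chip tables:

```python
import numpy as np

def lut_matvec_1bit(activations, packed_weights, group=4):
    """Matrix-vector product with 1-bit weights via table lookups.

    activations:    (K,) floating-point activation vector
    packed_weights: (N, K // group) ints; each entry is a 4-bit index
                    whose bits encode the signs of 4 consecutive weights
    """
    K = activations.shape[0]
    n_groups = K // group

    # Precompute, for each group of 4 activations, the partial sums for
    # all 2^4 possible sign patterns: table[g, idx] = sum_j s_j * a[4g + j].
    signs = np.array([[1 if (idx >> j) & 1 else -1 for j in range(group)]
                      for idx in range(2 ** group)], dtype=activations.dtype)
    table = activations.reshape(n_groups, group) @ signs.T      # (n_groups, 16)

    # The matrix-vector product is now gathers and additions: one lookup
    # replaces four multiply-accumulate operations.
    rows = np.arange(n_groups)
    return np.array([table[rows, w_row].sum() for w_row in packed_weights])
```

Running additional passes of the same table over each weight bit plane (a bit-serial scheme) extends the idea to higher weight bit widths.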
Performance evaluations of T-MAC on low-bit models demonstrated substantial benefits in efficiency and speed. On the Surface Laptop 7 with the Qualcomm Snapdragon X Elite chipset, T-MAC achieved token-generation rates that far exceed average human reading speed, outperforming llama.cpp by 4 to 5 times and doubling the speed of a dedicated NPU accelerator. Even on lower-end devices like the Raspberry Pi 5, T-MAC enabled the 3B BitNet-b1.58 model to generate 11 tokens per second. It also proved highly power-efficient, matching llama.cpp's generation rate while using only one-quarter to one-sixth of the CPU cores.
These results establish T-MAC as a practical solution for deploying LLMs on edge devices with standard CPUs, without relying on GPUs or NPUs. T-MAC allows LLMs to run efficiently on resource-constrained devices, expanding their applicability across a wider range of scenarios.
While T-MAC and Ladder optimize mpGEMM on existing CPU and GPU architectures, improving computational efficiency, they cannot match the performance of dedicated hardware accelerators with built-in LUT support. Achieving significant improvements in performance, power, and area (PPA) requires overcoming four key challenges: the overhead of precomputing the lookup tables, the cost of storing them, support for the many precision combinations used in mpGEMM, and integration with existing GPU microarchitectures and software stacks.
In response, we developed LUT Tensor Core, a software-hardware codesign for low-bit LLM inference. To address precomputation overhead in conventional LUT-based methods, we introduce techniques like software-based data flow graph (DFG) transformation, operator fusion, and table symmetrization to optimize table precomputation and storage. Additionally, we propose a hardware design with an elongated tiling shape to support table reuse and a bit-serial design to handle various precision combinations in mpGEMM.
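As a rough sketch of why table symmetrization halves storage (assuming sign-symmetric weight patterns, as in the 1-bit example above): negating every weight in a group negates its partial sum, so only half of the 2^g table entries need to be kept on chip and the rest follow by negation:

```python
import numpy as np

def half_table(activation_group):
    """Store only 2^(g-1) of the 2^g lookup entries for one activation group."""
    g = activation_group.shape[0]
    signs = np.array([[1 if (idx >> j) & 1 else -1 for j in range(g)]
                      for idx in range(2 ** (g - 1))])   # patterns with the top bit clear
    return signs @ activation_group                      # shape (2^(g-1),)

def lookup(table_half, idx, g):
    """Recover any full-table entry: complementing all g bits of idx
    flips every sign, which simply negates the stored partial sum."""
    half = 2 ** (g - 1)
    return table_half[idx] if idx < half else -table_half[2 ** g - 1 - idx]
```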
To integrate with existing GPU microarchitectures and software stacks, we extended the matrix multiply-accumulate (MMA) instruction set with new LMMA instructions and developed a cuBLAS-like software stack for easy integration into existing DNN frameworks. We also created a compiler for end-to-end execution planning on GPUs with LUT Tensor Core. This design and workflow, illustrated in Figure 3, enabled the quick and seamless adoption of LUT Tensor Core.
Testing LUT Tensor Core on low-bit LLMs, such as BitNet and Llama, showed significant performance gains, achieving 6.93 times the inference speed while using just 38.3% of the area of a traditional Tensor Core. With nearly identical model accuracy, this results in a 20.9-fold increase in computational density and an 11.2-fold boost in energy efficiency. As AI models grow in scale and complexity, LUT Tensor Core enables low-bit LLMs to be applied in new and diverse scenarios.
We believe the LUT technique could drive a paradigm shift in AI model inference. Traditional methods rely on multiplication and accumulation operations, whereas LUT implementations provide higher transistor density, greater throughput per chip area, lower energy costs, and better scalability. As large models adopt low-bit quantization, the LUT method could become the standard for system and hardware design, advancing the next generation of AI hardware innovation.
Low-bit quantization improves the efficiency of running large models on edge devices while also enabling model scaling by reducing the bits used to represent each parameter. This scaling enhances model capabilities, generality, and expressiveness, as shown by the BitNet model, which starts from a low-bit configuration and scales up from there.
Technologies like T-MAC, Ladder, and LUT Tensor Core provide solutions for running low-bit quantized LLMs, supporting efficient operation across edge devices and encouraging researchers to design and optimize LLMs using low-bit quantization. By reducing memory and computational demands, low-bit LLMs could power embodied AI systems, such as robots, enabling dynamic perception and real-time environmental interaction.
T-MAC and Ladder are open source and available on GitHub. We invite you to test and explore these innovations in AI technology with Microsoft Research.