Feature Request: T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge #8485

Closed
@sorasoras

Description

Prerequisites

- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

https://arxiv.org/pdf/2407.00088

T-MAC (Table-based Matrix-Activation Computation) is a method for efficiently deploying low-bit, weight-quantized Large Language Models (LLMs) on CPU-only edge devices. Key aspects of T-MAC:
**Purpose:** T-MAC addresses the challenge of deploying weight-quantized LLMs on resource-constrained edge devices, focusing on efficient mixed-precision matrix multiplication (mpGEMM) without relying on GPUs.

**Core technique:** A lookup-table (LUT) based approach supports mpGEMM directly, without weight dequantization, by transforming traditional data-type-centric multiplication into bit-wise table lookups.
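
To make the idea concrete, here is a minimal scalar sketch of LUT-based mpGEMM, assuming unsigned n-bit weights, fp32 activations, and a group size of 4. This is my own illustration of the general technique, not the actual T-MAC kernel (which is vectorized and quantizes the tables). For every group of 4 activations, all 16 possible partial sums are precomputed once; after that, each weight bit-plane costs only table lookups and additions.

```cpp
// Minimal scalar sketch of a LUT-based mixed-precision dot product.
// Assumptions (mine, for illustration): unsigned `bits`-bit weights,
// fp32 activations, group size g = 4. The real T-MAC kernel is
// vectorized, quantizes its tables, and reuses them across rows.
#include <cstdint>
#include <cstdio>
#include <vector>

// dot(w, a) where each w[i] is in [0, 2^bits) and a.size() % 4 == 0.
float lut_dot(const std::vector<uint8_t>& w,
              const std::vector<float>& a, int bits) {
    const int g = 4, groups = static_cast<int>(a.size()) / g;
    float acc = 0.0f;
    for (int gi = 0; gi < groups; ++gi) {
        // Precompute all 2^4 = 16 subset sums of this activation group.
        // In T-MAC such a table is built once per group and shared
        // across every output row, amortizing its cost.
        float table[16];
        for (int p = 0; p < 16; ++p) {
            float s = 0.0f;
            for (int i = 0; i < g; ++i)
                if ((p >> i) & 1) s += a[gi * g + i];
            table[p] = s;
        }
        // One lookup per weight bit-plane: the per-element multiplies
        // are gone, and cost grows linearly with weight bit-width.
        for (int b = 0; b < bits; ++b) {
            int idx = 0;
            for (int i = 0; i < g; ++i)
                idx |= ((w[gi * g + i] >> b) & 1) << i;
            // The single scale by 2^b could be folded into the tables.
            acc += static_cast<float>(1 << b) * table[idx];
        }
    }
    return acc;
}

int main() {
    std::vector<uint8_t> w = {3, 1, 0, 2, 2, 3, 1, 0};   // 2-bit weights
    std::vector<float>   a = {0.5f, -1.0f, 2.0f, 0.25f,
                              1.5f,  0.0f, -0.5f, 3.0f};
    std::printf("lut_dot = %f\n", lut_dot(w, a, /*bits=*/2));  // 3.5
    return 0;
}
```

Because each extra weight bit adds just one more round of lookups per group, the cost grows linearly with weight bit-width, which is the scaling property listed under Key features below.
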
**Performance improvements:**

- Up to 4x higher throughput compared to llama.cpp
- 70% reduction in energy consumption
- For the BitNet-b1.58-3B model:
  - 30 tokens/s with a single core on M2 Ultra
  - 71 tokens/s with eight cores on M2 Ultra
  - 11 tokens/s on a Raspberry Pi 5

**Key features:**

- Scales linearly with weight bit-width
- Eliminates multiplications and reduces additions
- Supports multiple activation types (fp8, fp16, int8) using fast table-lookup and add instructions

**Implementation techniques:**

- LUT-centric data layout for efficient on-chip memory usage
- Table quantization and mirror consolidation to reduce table size
- Use of tbl (ARM NEON) / pshuf (x86) instructions for fast in-register table lookup on CPUs; a sketch follows this list
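
To illustrate the tbl/pshuf point above, here is a small self-contained C++ example (again my own sketch, not T-MAC code) using the SSSE3 intrinsic `_mm_shuffle_epi8`, i.e. the x86 `pshufb` instruction, to perform 16 parallel lookups into a 16-entry byte table held in a single vector register; the ARM NEON counterpart is the `tbl` instruction (e.g. the `vqtbl1q_u8` intrinsic). Table contents and indices are made up for demonstration; compile with `-mssse3`.

```cpp
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (pshufb)
#include <cstdint>
#include <cstdio>

int main() {
    // A 16-entry, 8-bit lookup table held in one 128-bit register.
    alignas(16) uint8_t table[16];
    for (int i = 0; i < 16; ++i) table[i] = static_cast<uint8_t>(i * 3);

    // Sixteen 4-bit indices, one per byte lane. The high bit of each
    // lane must be clear: pshufb writes 0 to any lane whose index byte
    // has its top bit set.
    alignas(16) uint8_t idx[16] = {5, 0, 15, 7, 3, 3, 9, 1,
                                   2, 8, 14, 6, 4, 10, 12, 11};

    __m128i t = _mm_load_si128(reinterpret_cast<const __m128i*>(table));
    __m128i x = _mm_load_si128(reinterpret_cast<const __m128i*>(idx));
    __m128i r = _mm_shuffle_epi8(t, x);  // 16 table lookups at once

    alignas(16) uint8_t out[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), r);
    for (int i = 0; i < 16; ++i) std::printf("%u ", out[i]);
    std::printf("\n");
    return 0;
}
```

A 16-entry byte table fits in one vector register, so a lookup costs a single instruction; this is presumably why the per-group tables are quantized to small integers and kept at 16 entries, and why mirror consolidation (as I understand it, exploiting sign symmetry to store only half the table) further reduces the footprint.
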
**Evaluation:**

- Tested on edge devices including the Apple M2 Ultra, Jetson AGX Orin, Surface Book 3, and Raspberry Pi 5
- Up to 6.6x kernel speedup (3.6x on average) compared to llama.cpp
- 2.8x end-to-end inference speedup for a 2-bit Llama-2-7B model

**Significance:** T-MAC provides a practical way to deploy LLMs on edge devices using widely available CPUs, making CPU inference speed comparable to, and in some cases better than, the GPU on the same device.

**Availability:** The T-MAC system is open-sourced and available on GitHub for further development and implementation.

Motivation

This looks like a good addition to the current BitNet b1.58 (1.58-bit) support, to speed it up even further.

Possible Implementation

https://github.com/microsoft/T-MAC
