Feature Request: T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge #8485

Closed
@sorasoras

Description

Prerequisites

- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

https://arxiv.org/pdf/2407.00088

T-MAC (Table-based Matrix-Activation Computation) is a method for efficiently deploying low-bit, weight-quantized Large Language Models (LLMs) on CPU-only edge devices. Key aspects of T-MAC:
**Purpose:** T-MAC addresses the challenge of deploying weight-quantized LLMs on resource-constrained edge devices, focusing on efficient mixed-precision matrix multiplication (mpGEMM) without relying on GPUs.

**Core technique:** A lookup-table (LUT) based approach supports mpGEMM directly, without weight dequantization, by transforming traditional data-type-centric multiplication into bit-wise table lookups.
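
To make the idea concrete, here is a minimal scalar sketch of LUT-based mpGEMM, assuming unsigned n-bit weights, fp32 activations, and a group size of 4. This is my own illustration of the general technique, not the actual T-MAC kernel (which is vectorized and quantizes the tables). For every group of 4 activations, all 16 possible partial sums are precomputed once; after that, each weight bit-plane costs only table lookups and additions.

```cpp
// Minimal scalar sketch of a LUT-based mixed-precision dot product.
// Assumptions (mine, for illustration): unsigned `bits`-bit weights,
// fp32 activations, group size g = 4. The real T-MAC kernel is
// vectorized, quantizes its tables, and reuses them across rows.
#include <cstdint>
#include <cstdio>
#include <vector>

// dot(w, a) where each w[i] is in [0, 2^bits) and a.size() % 4 == 0.
float lut_dot(const std::vector<uint8_t>& w,
              const std::vector<float>& a, int bits) {
    const int g = 4, groups = static_cast<int>(a.size()) / g;
    float acc = 0.0f;
    for (int gi = 0; gi < groups; ++gi) {
        // Precompute all 2^4 = 16 subset sums of this activation group.
        // In T-MAC such a table is built once per group and shared
        // across every output row, amortizing its cost.
        float table[16];
        for (int p = 0; p < 16; ++p) {
            float s = 0.0f;
            for (int i = 0; i < g; ++i)
                if ((p >> i) & 1) s += a[gi * g + i];
            table[p] = s;
        }
        // One lookup per weight bit-plane: the per-element multiplies
        // are gone, and cost grows linearly with weight bit-width.
        for (int b = 0; b < bits; ++b) {
            int idx = 0;
            for (int i = 0; i < g; ++i)
                idx |= ((w[gi * g + i] >> b) & 1) << i;
            // The single scale by 2^b could be folded into the tables.
            acc += static_cast<float>(1 << b) * table[idx];
        }
    }
    return acc;
}

int main() {
    std::vector<uint8_t> w = {3, 1, 0, 2, 2, 3, 1, 0};   // 2-bit weights
    std::vector<float>   a = {0.5f, -1.0f, 2.0f, 0.25f,
                              1.5f,  0.0f, -0.5f, 3.0f};
    std::printf("lut_dot = %f\n", lut_dot(w, a, /*bits=*/2));  // 3.5
    return 0;
}
```

Because each extra weight bit adds just one more round of lookups per group, the cost grows linearly with weight bit-width, which is the scaling property listed under Key features below.
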
**Performance improvements:**

- Up to 4x higher throughput compared to llama.cpp
- 70% reduction in energy consumption
- For the BitNet-b1.58-3B model:
  - 30 tokens/s with a single core on M2 Ultra
  - 71 tokens/s with eight cores on M2 Ultra
  - 11 tokens/s on a Raspberry Pi 5

**Key features:**

- Scales linearly with weight bit-width
- Eliminates multiplications and reduces additions
- Supports multiple activation types (fp8, fp16, int8) using fast table-lookup and add instructions

**Implementation techniques:**

- LUT-centric data layout for efficient on-chip memory usage
- Table quantization and mirror consolidation to reduce table size
- Use of tbl (ARM NEON) / pshuf (x86) instructions for fast in-register table lookup on CPUs; a sketch follows this list
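
To illustrate the tbl/pshuf point above, here is a small self-contained C++ example (again my own sketch, not T-MAC code) using the SSSE3 intrinsic `_mm_shuffle_epi8`, i.e. the x86 `pshufb` instruction, to perform 16 parallel lookups into a 16-entry byte table held in a single vector register; the ARM NEON counterpart is the `tbl` instruction (e.g. the `vqtbl1q_u8` intrinsic). Table contents and indices are made up for demonstration; compile with `-mssse3`.

```cpp
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (pshufb)
#include <cstdint>
#include <cstdio>

int main() {
    // A 16-entry, 8-bit lookup table held in one 128-bit register.
    alignas(16) uint8_t table[16];
    for (int i = 0; i < 16; ++i) table[i] = static_cast<uint8_t>(i * 3);

    // Sixteen 4-bit indices, one per byte lane. The high bit of each
    // lane must be clear: pshufb writes 0 to any lane whose index byte
    // has its top bit set.
    alignas(16) uint8_t idx[16] = {5, 0, 15, 7, 3, 3, 9, 1,
                                   2, 8, 14, 6, 4, 10, 12, 11};

    __m128i t = _mm_load_si128(reinterpret_cast<const __m128i*>(table));
    __m128i x = _mm_load_si128(reinterpret_cast<const __m128i*>(idx));
    __m128i r = _mm_shuffle_epi8(t, x);  // 16 table lookups at once

    alignas(16) uint8_t out[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), r);
    for (int i = 0; i < 16; ++i) std::printf("%u ", out[i]);
    std::printf("\n");
    return 0;
}
```

A 16-entry byte table fits in one vector register, so a lookup costs a single instruction; this is presumably why the per-group tables are quantized to small integers and kept at 16 entries, and why mirror consolidation (as I understand it, exploiting sign symmetry to store only half the table) further reduces the footprint.
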
**Evaluation:**

- Tested on edge devices including the Apple M2 Ultra, Jetson AGX Orin, Surface Book 3, and Raspberry Pi 5
- Up to 6.6x kernel speedup (3.6x on average) compared to llama.cpp
- 2.8x end-to-end inference speedup for a 2-bit Llama-2-7B model

**Significance:** T-MAC provides a practical way to deploy LLMs on edge devices using widely available CPUs, making CPU inference speed comparable to, and in some cases better than, the GPU on the same device.

**Availability:** The T-MAC system is open-sourced and available on GitHub for further development and implementation.

Motivation

This looks like a good addition to the current BitNet b1.58 (1.58-bit) support, to speed it up even further.

Possible Implementation

https://github.com/microsoft/T-MAC
