How to make mobile phones run large AI models 4 to 5 times faster

In the rapidly evolving world of AI, the demand to run large AI models on edge devices like mobile phones, computers, and even the Raspberry Pi is growing. However, deploying these models efficiently on resource-limited hardware that has only a CPU remains a significant hurdle. Traditionally, dedicated accelerators like NPUs and GPUs have been the go-to solution. But what if we could achieve similar or even superior performance using just a CPU? That's where T-MAC, a new technology from Microsoft Research Asia, comes in. T-MAC accelerates large AI models on phones, letting them run 4 to 5 times faster on a CPU alone.

The Problem: Running Large AI Models on Phones

When we try to run AI models on phones or small computers, we run into two big problems: memory and compute. Large models need a lot of both. To help with this, engineers often use a technique called model "quantization": shrinking the model by storing its weights with fewer bits. While this saves memory, it can slow the model down because of how the math is done. Normally, the low-bit weights have to be converted back to higher-precision values (a step called dequantization) before the CPU can multiply them, and that conversion costs time and energy.
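To see why that conversion hurts, here is a minimal C sketch (our own illustration, not code from T-MAC) of symmetric 4-bit quantization, where the conventional path turns every weight back into a float before multiplying:

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal sketch (not T-MAC itself): symmetric 4-bit quantization.
 * Each weight is stored as an integer in [-8, 7] plus one shared scale. */
static int8_t quantize(float w, float scale) {
    int q = (int)(w / scale + (w >= 0 ? 0.5f : -0.5f));  /* round to nearest */
    if (q > 7)  q = 7;
    if (q < -8) q = -8;
    return (int8_t)q;
}

/* Conventional path: every multiply first converts the 4-bit weight
 * back to a float (dequantization), which dominates the cost. */
static float dot_dequant(const int8_t *qw, float scale,
                         const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += (scale * qw[i]) * x[i];  /* dequantize, then multiply */
    return acc;
}

int main(void) {
    float w[4] = {0.12f, -0.50f, 0.31f, 0.07f}, x[4] = {1, 2, 3, 4};
    float scale = 0.1f;
    int8_t qw[4];
    for (int i = 0; i < 4; i++) qw[i] = quantize(w[i], scale);
    printf("approx dot = %f\n", dot_dequant(qw, scale, x, 4));
    return 0;
}
```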

The Solution: T-MAC Technology

Instead of this slow detour, T-MAC uses a "lookup table" (LUT) method to do the calculations. The model never needs to convert low-bit weights back to higher precision at all. This saves time and power, letting the model run faster and use less energy. With T-MAC, phones and small devices can run AI models at speeds that can even beat specialized hardware like NPUs.
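Here is a simplified C sketch of the idea (our own illustration with assumed 2-bit value levels, not T-MAC's actual kernel). Because a 2-bit weight can take only 4 values, all possible products with each activation can be precomputed once; the dot product then becomes pure lookups and adds. In a real model, the table is built once per input vector and reused across every row of the weight matrix, which is what amortizes the setup cost:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* The 4 value levels a 2-bit weight can represent (assumed here). */
    const float levels[4] = {-1.0f, -0.33f, 0.33f, 1.0f};

    float   x[8]  = {0.5f, -1.2f, 0.3f, 2.0f, -0.7f, 1.1f, 0.0f, 0.4f};
    uint8_t qw[8] = {3, 0, 2, 1, 3, 2, 0, 1};  /* 2-bit weight codes */

    /* Build the table once: lut[i][c] = x[i] * levels[c].
     * Reused for every weight row, so the multiplies are amortized. */
    float lut[8][4];
    for (int i = 0; i < 8; i++)
        for (int c = 0; c < 4; c++)
            lut[i][c] = x[i] * levels[c];

    /* Inference-time dot product: no multiplies, no dequantization,
     * just table lookups and additions. */
    float acc = 0.0f;
    for (int i = 0; i < 8; i++)
        acc += lut[i][qw[i]];

    printf("dot = %f\n", acc);
    return 0;
}
```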

How T-MAC Works: The Innovation Behind the Speed

At the core of T-MAC's innovation is a lookup table (LUT)-based computing paradigm that replaces the traditional multiply-accumulate (MAC) approach. This shift lets T-MAC perform low-bit computations directly through table lookups, eliminating the inefficient dequantization step that other systems require. The resulting drop in multiply and add operations is the key to T-MAC's speed gains.
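The sketch below shows a grouped version of this trick for 1-bit {-1, +1} weights (a simplified scalar rendering of our own; T-MAC's real kernels are vectorized). Four weight bits at a time index a 16-entry table of precomputed signed sums of activations, so n multiply-adds collapse into n/4 lookups plus adds:

```c
#include <stdint.h>
#include <stdio.h>

#define G 4  /* group size: G weight bits index one table entry */

/* For each group of G activations, precompute the 2^G possible signed
 * sums. A G-bit slice of the weight row then selects one entry. */
static void build_table(const float *x, float table[1 << G]) {
    for (int p = 0; p < (1 << G); p++) {
        float s = 0.0f;
        for (int j = 0; j < G; j++)
            s += (p >> j & 1) ? x[j] : -x[j];  /* bit=1 -> +x, bit=0 -> -x */
        table[p] = s;
    }
}

int main(void) {
    float x[8] = {0.5f, -1.2f, 0.3f, 2.0f, -0.7f, 1.1f, 0.25f, 0.4f};
    /* 1-bit weights packed G per byte: here w = {+,-,+,+, -,-,+,-}
     * (LSB = first weight in the group). */
    uint8_t wbits[2] = {0x0D, 0x04};  /* 0b1101, 0b0100 */

    float table0[1 << G], table1[1 << G];
    build_table(x,     table0);  /* table for activations 0..3 */
    build_table(x + G, table1);  /* table for activations 4..7 */

    /* 8-element dot product in just 2 lookups and 1 add. */
    float dot = table0[wbits[0]] + table1[wbits[1]];
    printf("dot = %f\n", dot);
    return 0;
}
```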

For example, when large models were run on a Surface AI PC equipped with the latest Qualcomm Snapdragon X Elite chipset, T-MAC showed impressive results: the 3B BitNet-b1.58 model produced up to 48 tokens per second, the 2-bit 7B Llama model up to 30 tokens per second, and the 4-bit 7B Llama model up to 20 tokens per second. These figures not only highlight T-MAC's efficiency but also show that it can outperform NPUs in certain scenarios. When the llama-2-7B-4bit model is deployed, the NPU generates 10.4 tokens per second, while the CPU with T-MAC reaches 12.6 tokens per second using just two cores, and up to 22 tokens per second with more.

Technical Details: How T-MAC Optimizes Performance

The efficiency of T-MAC lies in how it handles low-bit matrix multiplication from a bit-centric perspective. Unlike traditional methods, which need a custom kernel for each data type, T-MAC designs an optimal data layout for a single bit and then scales to higher bit widths by stacking bit planes. This simplifies the computation and reduces the complexity of mixed-precision operations. A hedged sketch of what such stacking can look like follows below.
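In this illustration (our own simplification; real weights also carry scales and zero-points), a 2-bit weight w = 2*b1 + b0 is split into two 1-bit planes, each handled by the same 1-bit kernel, and the partial results are recombined by weighting the high plane by two:

```c
#include <stdint.h>
#include <stdio.h>

/* One "1-bit kernel": dot product of x with a 0/1 bit plane.
 * (In T-MAC this role is played by the LUT kernel.) */
static float dot_bitplane(const float *x, const uint8_t *bits, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        if (bits[i]) s += x[i];
    return s;
}

int main(void) {
    float   x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    uint8_t w[4] = {3, 0, 2, 1};           /* 2-bit weight codes 0..3 */

    uint8_t b0[4], b1[4];                  /* split into bit planes */
    for (int i = 0; i < 4; i++) {
        b0[i] = w[i] & 1;                  /* low bit  */
        b1[i] = (w[i] >> 1) & 1;           /* high bit */
    }

    /* Stack: dot(x, w) = 2 * dot(x, plane1) + dot(x, plane0). */
    float dot = 2.0f * dot_bitplane(x, b1, 4) + dot_bitplane(x, b0, 4);
    printf("dot = %f (direct: %f)\n", dot,
           3*1.0f + 0*2.0f + 2*3.0f + 1*4.0f);  /* both print 13 */
    return 0;
}
```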

Additionally, T-MAC uses the CPU's highly efficient table-lookup instructions (TBL on Arm NEON, PSHUFB on x86), which significantly improve random memory access performance. The technology also optimizes data flow and memory usage by keeping lookup tables in fast on-chip memory, rearranging weights for better cache hit rates, and choosing a matrix tiling scheme that maximizes data reuse.
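For intuition, here is a minimal x86 example (our own, not from T-MAC) of a PSHUFB-based lookup: `_mm_shuffle_epi8` performs 16 parallel lookups into a 16-entry byte table held entirely in a register, so the table never touches memory during the lookup. Compile with `-mssse3` on GCC or Clang; on Arm, the NEON `TBL` instruction plays the same role. The table values here are arbitrary example bytes:

```c
#include <stdint.h>
#include <stdio.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

int main(void) {
    /* 16-entry lookup table, e.g. precomputed (quantized) partial sums. */
    uint8_t table[16] = { 0,  3,  6,  9, 12, 15, 18, 21,
                         24, 27, 30, 33, 36, 39, 42, 45};
    /* 16 four-bit indices (one per byte, values 0..15). */
    uint8_t idx[16]   = { 5,  0, 15,  2,  7,  7,  1,  9,
                          3, 12,  4, 11,  8, 14,  6, 10};

    __m128i t = _mm_loadu_si128((const __m128i *)table);
    __m128i i = _mm_loadu_si128((const __m128i *)idx);
    __m128i r = _mm_shuffle_epi8(t, i);  /* 16 lookups in one PSHUFB */

    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, r);
    for (int k = 0; k < 16; k++) printf("%d ", out[k]);
    printf("\n");
    return 0;
}
```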

Performance Comparison: T-MAC vs. Traditional Methods

When we compare T-MAC to existing methods such as llama.cpp, the speed gains are clear. T-MAC makes 4-bit down to 1-bit matrix multiplication up to 11 times faster than llama.cpp, depending on the device. T-MAC also scales as the bit width shrinks: the fewer bits the model uses, the faster it runs, a kind of scaling the legacy method can't match.

For a low-end device like the Raspberry Pi 5, T-MAC can generate 11 tokens per second with the 3B BitNet-b1.58 model. This shows that T-MAC works well on both high-end PCs and low-end devices, making it a flexible and powerful tool for AI.

Power Efficiency: Reducing Energy Consumption with T-MAC

In addition to its speed advantages, T-MAC offers significant power savings. It needs only a quarter to a sixth as many cores as traditional methods to reach the same throughput, which directly cuts energy consumption. This matters most on mobile and edge devices, where battery life and power budgets are critical.

Conclusion: The Future of AI on Edge Devices

T-MAC is a big step forward for AI on small devices. Using a clever lookup table method, it allows large AI models to run faster and use less power. This opens up new ways to use AI on phones, small PCs, and other devices that don’t have the space or power for large GPUs or NPUs.

Microsoft Research Asia has made T-MAC open source so anyone can try it out and use it in their own AI work. As AI continues to grow, tools like T-MAC will help bring AI to more places and make it faster and easier to use on all types of devices. The future of AI in phones looks bright with faster speeds and smarter power usage thanks to new technologies like T-MAC.

