GPU vs CPU: CPU is a better choice, at least for certain use cases

Owning your own infrastructure offers numerous advantages, but when it comes to fine-tuning a 7-billion-parameter language model (or bigger), costs can escalate rapidly. Unless you employ quantization techniques, you typically need high-end GPUs like the Nvidia H100, which come at a significant expense. Fine-tuning a 7B-parameter model demands around 160 GB of memory, which means buying multiple H100 GPUs with 80 GB of RAM each, or opting for the H100 NVL variant with 188 GB of RAM, at a cost ranging between 25,000 EUR and 35,000 EUR.
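As a back-of-envelope check (assuming full fine-tuning with a standard mixed-precision AdamW setup, our assumption rather than a stated recipe), most of that memory goes to weights, gradients, and optimizer state:

```python
# Rough memory estimate for full fine-tuning with mixed-precision AdamW.
# Assumes bf16 weights and gradients plus an fp32 master copy and two
# fp32 optimizer moments; activation memory comes on top of this.
params = 7e9

weights = params * 2   # bf16 weights
grads = params * 2     # bf16 gradients
master = params * 4    # fp32 master copy of weights
adam_m = params * 4    # fp32 first moment
adam_v = params * 4    # fp32 second moment

total_gb = (weights + grads + master + adam_m + adam_v) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB; activations push this toward 160 GB
```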

GPUs inherently excel at parallel computation compared to CPUs, yet CPUs offer the advantage of addressing much larger amounts of relatively inexpensive RAM. Although CPU RAM is slower than GPU memory, fine-tuning a 7B-parameter model within a reasonable timeframe is still achievable.

We successfully fine-tuned a Mistral AI 7B model on CPU in a timeframe comparable to our previous run on an Nvidia RTX 4090 using 4-bit quantization. While quantization can yield satisfactory results, our CPU-fine-tuned model came out ahead in quality while maintaining acceptable inference times.
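For comparison, a 4-bit setup of the kind we used on the RTX 4090 looks roughly like this with the Hugging Face transformers and bitsandbytes stack (a minimal sketch; the model id and settings are illustrative, not our exact configuration):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes, computing in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU
)
```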

Our setup comprises two Intel Xeon 4516Y+ processors (Emerald Rapids), each with 24 cores/48 threads running at 2.20–3.70 GHz with 45 MB of cache, and each drawing 185 W of power. Cooling posed challenges, particularly for the memory, but remained manageable.
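On a dual-socket machine like this, how threads are laid out matters for training throughput; the following is a minimal starting point (the values and the KMP_AFFINITY setting are assumptions to tune for your own hardware, not our benchmarked configuration):

```python
import os

# Thread settings must be in the environment before PyTorch loads its
# OpenMP runtime. 2 sockets x 24 physical cores = 48 threads; hyper-threads
# rarely help for bandwidth-bound training workloads.
os.environ.setdefault("OMP_NUM_THREADS", "48")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")  # Intel OpenMP only

import torch

torch.set_num_threads(48)
print(torch.__config__.parallel_info())  # shows the thread setup PyTorch sees
```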

Recent Intel Xeon processors, such as Sapphire Rapids and its successors, feature new instruction-set extensions like Intel Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel AVX512-VNNI) and Intel Advanced Matrix Extensions (Intel AMX), which boost deep-learning performance on Intel CPUs. Intel Extension for PyTorch (IPEX), a PyTorch library designed to exploit these extensions, lets us take advantage of them with minimal code changes.
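Here is a minimal sketch of how IPEX slots into a training script, assuming a Hugging Face model and a standard AdamW loop (the model id and hyperparameters are placeholders, not our exact setup):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# ipex.optimize applies weight-layout and operator optimizations and, on
# supported CPUs, AMX/VNNI-accelerated kernels; bf16 is the natural dtype
# on Sapphire Rapids and later.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

# Training steps then run under bf16 autocast on CPU.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    ...  # forward / backward / optimizer.step() as usual
```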

While CPUs may not be suitable for high-load scenarios like chatbots serving millions of users, due to the performance gap, they offer unmatched flexibility for enterprise use cases. These scenarios, though less demanding, still require fine-tuning language models on proprietary data.

Thanks to their general-purpose nature, CPUs are easier to virtualize and share, making them more adaptable to diverse workloads. Additionally, we can now affordably fine-tune larger models, such as a Mistral AI 22B-parameter model, simply by expanding our server's RAM capacity.

From a cost perspective, the disparity is significant: our server costs one-third of an H100 NVL, making it the more economical choice.


Conclusion

While high-end GPUs like the Nvidia H100 offer powerful computation for fine-tuning large language models, their cost can be prohibitive. CPUs provide a cost-effective alternative with ample RAM capacity. Our setup, powered by Intel Xeon processors and accelerated with Intel AVX512-VNNI and Intel AMX, showcases the evolving capabilities of CPU-based systems. While GPUs excel in high-load scenarios like real-time chatbots, CPUs offer flexibility and cost-effectiveness. By balancing performance and cost, businesses can effectively leverage AI technologies for innovation and insights.


This article was originally published on Medium on 8 May 2024 and is republished here with permission.
