arxiv:2407.07304

Inference Performance Optimization for Large Language Models on CPUs

Published on Jul 10, 2024 · Submitted by akhaliq on Jul 11, 2024
#2 Paper of the day

Abstract

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate constraints imposed by hardware resources, optimizing inference performance is necessary. In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while ensuring precision. We propose a distributed inference optimization approach and implement it based on oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPU, and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.
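
The abstract does not say how the KV cache is made smaller. As a rough illustration only, and assuming a generic technique rather than the paper's actual method, the sketch below stores each cached key or value vector as int8 with its own floating-point scale, which roughly halves storage relative to 16-bit floats while bounding the rounding error per element. The function names `quantize_row`, `dequantize_row`, and `cache_bytes` are hypothetical and do not come from xFasterTransformer.

```python
from typing import List, Tuple


def quantize_row(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization of one vector, with one scale per vector.

    Assumption: one scale per cached vector; the paper may use another scheme.
    """
    max_abs = max((abs(x) for x in row), default=0.0)
    # Avoid division by zero for an all-zero (or empty) vector.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp into the int8 range [-127, 127].
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]


def cache_bytes(num_values: int, bytes_per_value: int, num_scales: int = 0) -> int:
    """Storage in bytes: the values plus one 4-byte float per scale."""
    return num_values * bytes_per_value + num_scales * 4


if __name__ == "__main__":
    row = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_row(row)
    restored = dequantize_row(q, scale)
    print("codes:", q)
    print("max abs error:", max(abs(a - b) for a, b in zip(row, restored)))

    # Hypothetical cache of 1024 vectors with 128 values each.
    fp16 = cache_bytes(1024 * 128, 2)
    int8 = cache_bytes(1024 * 128, 1, num_scales=1024)
    print("fp16 bytes:", fp16, "int8 bytes:", int8)
```

With one scale per vector, the overhead of the scales is small compared with the savings on the values, and each element's reconstruction error is at most half a quantization step.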

Community

Paper submitter

Would these performance gains be useful on a single CPU with a batch size of 1, or would the gains be insignificant compared to a multi-CPU setup with a high batch count? Cheers

Paper author

Yes, you are right: this solution can benefit both a single CPU and CPU server clusters. If your model is small, you can use just one socket. If your model is big, like 70B, and you have a good network connection, you can use the whole CPU cluster to run inference across multiple servers. BTW, a CPU server has a large memory capacity, so large batches can be supported too.

What types of CPUs and configurations will you be focusing on in your future research?

Paper author

We will work on Granite Rapids (Intel's successor to Emerald Rapids, an Intel 3 process microarchitecture for enthusiasts and servers) + MCR DIMM.

