Sizing Servers for LLM Inference: GPU vs CPU Considerations
Whitepaper · 8 min read


Sarah van der Merwe
Chief Executive Officer

Training models requires supercomputers. Running them (inference) can often be done on standard enterprise gear.

VRAM is King

To load a 70B parameter model quantized to int4, you need roughly 35GB for the weights alone (70B parameters × 0.5 bytes each), plus headroom for the KV cache and activations — call it ~40GB of VRAM in practice. A single 40GB NVIDIA A100 is a tight fit; a pair of 48GB L40S cards gives comfortable headroom. You don't need NVLink for inference unless the model is sharded across many GPUs.
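The sizing arithmetic above can be sketched as a small helper. The 20% overhead factor is an assumption standing in for KV cache and activation memory; the real figure depends on context length and batch size.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int,
                     overhead_frac: float = 0.2) -> float:
    """Rough VRAM estimate: weight bytes plus a fractional overhead.

    overhead_frac = 0.2 is an illustrative assumption; KV cache growth
    with long contexts or large batches can exceed it.
    """
    weights_gb = params_billions * bits_per_weight / 8  # B params x bytes/param = GB
    return weights_gb * (1 + overhead_frac)

# 70B model at int4: 35 GB of weights, ~42 GB with overhead
print(round(estimate_vram_gb(70, 4)))
```

Running this for a 70B int4 model lands at ~42GB, which is why a single 40GB card is tight and a 48GB card (or two) is more comfortable.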

CPU Inference?

With the AVX-512 and AMX extensions on Intel Sapphire Rapids, CPU-based inference is viable for batch workloads where sub-100ms latency isn't required. This lets you use cheap system RAM (1TB+) instead of expensive HBM.
