Sizing Servers for LLM Inference: GPU vs CPU Considerations
Training models requires supercomputers. Running them (inference) can often be done on standard enterprise gear.
VRAM is King
To load a 70B-parameter model quantized to int4, you need roughly 35GB for the weights alone (70B × 0.5 bytes), plus headroom for the KV cache and activations, so budget ~40GB of VRAM or more. A single NVIDIA A100 40GB is tight; an 80GB card or two 48GB L40S cards give comfortable headroom. You don't need NVLink for inference unless the model is split across many GPUs.
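The sizing rule above can be sketched as a quick back-of-the-envelope calculator. The 20% overhead factor is an assumption for KV cache and activations; real overhead grows with batch size and context length.

```python
def estimate_vram_gb(n_params_billions: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% overhead.

    The overhead factor is an assumed rule of thumb, not a fixed cost:
    KV-cache memory scales with batch size and context length.
    """
    # 1B params at 8 bits = 1 GB, so scale by bits/8.
    weight_gb = n_params_billions * bits_per_weight / 8
    return weight_gb * overhead

# 70B model at int4: 35 GB of weights, ~42 GB with overhead.
print(round(estimate_vram_gb(70, 4), 1))  # → 42.0
```

The same function shows why fp16 serving of the same model (70B × 2 bytes = 140GB of weights) pushes you into multi-GPU territory.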
CPU Inference?
With AVX-512 and AMX extensions on Intel Sapphire Rapids, CPU-based inference is viable for batch workloads where sub-100ms latency isn't required. The trade-off lets you use cheap system RAM (1TB+) instead of expensive HBM, at the cost of much lower memory bandwidth and therefore lower token throughput.
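The bandwidth trade-off can be made concrete: autoregressive decoding is memory-bound, since generating each token requires streaming all the weights through the compute units once. A rough upper bound on single-stream decode speed is therefore memory bandwidth divided by model size. The bandwidth figures below are assumed ballpark numbers for illustration, not measured values.

```python
def decode_tokens_per_sec(model_size_gb: float, mem_bw_gbps: float) -> float:
    """Upper bound on single-stream decode throughput.

    Each generated token reads every weight once, so the ceiling is
    memory bandwidth / model size. Real throughput is lower; batching
    amortizes the weight reads and raises aggregate throughput.
    """
    return mem_bw_gbps / model_size_gb

MODEL_GB = 35  # 70B model at int4

# Assumed ballpark bandwidths:
#   8-channel DDR5-4800 server socket: ~307 GB/s theoretical peak
#   A100 40GB HBM2e: ~1555 GB/s
print(round(decode_tokens_per_sec(MODEL_GB, 307), 1))   # CPU ceiling, ~8.8 tok/s
print(round(decode_tokens_per_sec(MODEL_GB, 1555), 1))  # GPU ceiling, ~44.4 tok/s
```

The roughly 5x gap in the ceiling is why CPUs suit throughput-oriented batch jobs rather than interactive chat: per-stream latency is dominated by bandwidth, which system RAM cannot match.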