Sizing Servers for LLM Inference: GPU vs CPU Considerations
Training models requires supercomputers. Running them (inference) can often be done on standard enterprise gear.
VRAM is King
To load a 70B-parameter model quantized to int4, you need roughly 35GB for the weights alone (70B × 0.5 bytes), plus headroom for the KV cache and activations, so budget ~40GB of VRAM or more. A single NVIDIA A100 40GB is tight; an 80GB card or two 48GB L40S cards give comfortable headroom. You don't need NVLink for inference unless the model is split across many GPUs.
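The sizing rule above can be sketched as a quick back-of-the-envelope calculator. The 20% overhead factor is an assumption for KV cache and activations; real overhead grows with batch size and context length.

```python
def estimate_vram_gb(n_params_billions: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% overhead.

    The overhead factor is an assumed rule of thumb, not a fixed cost:
    KV-cache memory scales with batch size and context length.
    """
    # 1B params at 8 bits = 1 GB, so scale by bits/8.
    weight_gb = n_params_billions * bits_per_weight / 8
    return weight_gb * overhead

# 70B model at int4: 35 GB of weights, ~42 GB with overhead.
print(round(estimate_vram_gb(70, 4), 1))  # → 42.0
```

The same function shows why fp16 serving of the same model (70B × 2 bytes = 140GB of weights) pushes you into multi-GPU territory.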
CPU Inference?
With AVX-512 and AMX extensions on Intel Sapphire Rapids, CPU-based inference is viable for batch workloads where sub-100ms latency isn't required. The trade-off lets you use cheap system RAM (1TB+) instead of expensive HBM, at the cost of much lower memory bandwidth and therefore lower token throughput.
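The bandwidth trade-off can be made concrete: autoregressive decoding is memory-bound, since generating each token requires streaming all the weights through the compute units once. A rough upper bound on single-stream decode speed is therefore memory bandwidth divided by model size. The bandwidth figures below are assumed ballpark numbers for illustration, not measured values.

```python
def decode_tokens_per_sec(model_size_gb: float, mem_bw_gbps: float) -> float:
    """Upper bound on single-stream decode throughput.

    Each generated token reads every weight once, so the ceiling is
    memory bandwidth / model size. Real throughput is lower; batching
    amortizes the weight reads and raises aggregate throughput.
    """
    return mem_bw_gbps / model_size_gb

MODEL_GB = 35  # 70B model at int4

# Assumed ballpark bandwidths:
#   8-channel DDR5-4800 server socket: ~307 GB/s theoretical peak
#   A100 40GB HBM2e: ~1555 GB/s
print(round(decode_tokens_per_sec(MODEL_GB, 307), 1))   # CPU ceiling, ~8.8 tok/s
print(round(decode_tokens_per_sec(MODEL_GB, 1555), 1))  # GPU ceiling, ~44.4 tok/s
```

The roughly 5x gap in the ceiling is why CPUs suit throughput-oriented batch jobs rather than interactive chat: per-stream latency is dominated by bandwidth, which system RAM cannot match.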