A practical GPU buying guide for on-premise AI
VRAM, throughput, power and budget: how to buy the right GPUs the first time.
Buying GPUs for on-premise AI is one of the most consequential infrastructure decisions an organisation can make. Get it right and you have a self-sufficient, cost-efficient inference and fine-tuning platform that compounds in value over time. Get it wrong and you spend months in GPU return queues or, worse, run models that are too large to fit in memory. This guide walks through every dimension you need to evaluate — VRAM, throughput, power, cooling and total cost of ownership — so you can choose with confidence the first time.
VRAM is the first and hardest constraint
Before any other specification, ask: how many gigabytes of VRAM does my target model require? A 7-billion-parameter model in 16-bit precision occupies roughly 14 GB for its weights alone; a 70-billion-parameter model needs approximately 140 GB. Quantisation to 4-bit cuts those figures by about 75 %, but it introduces quality trade-offs that must be validated for your use case. The cardinal rule is simple: if the model does not fit in VRAM, the runtime either fails with an out-of-memory error or offloads part of the model to system RAM, and throughput collapses by one to two orders of magnitude. Always size VRAM with headroom, at least 20 % free, for the key-value cache that grows with context length.
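The arithmetic above can be sketched in a few lines. This is a rough estimate, not a profiler: it assumes 1 GB = 10^9 bytes, reads "20 % free" as "weights may occupy at most 80 % of the card", and uses a standard key-value cache formula (two tensors per layer). The function names and the layer, head and context values in the example are illustrative assumptions, not figures from a specific model card.

```python
def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for the weights alone: parameters x bytes per parameter."""
    return params_billion * bits_per_param / 8


def min_vram_gb(params_billion: float, bits_per_param: int,
                headroom: float = 0.20) -> float:
    """Smallest VRAM size that holds the weights and still leaves
    `headroom` (a fraction of total VRAM) free. Assumed interpretation."""
    return weights_gb(params_billion, bits_per_param) / (1 - headroom)


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, batch: int = 1,
                bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value tensor per layer (factor 2)."""
    values = 2 * layers * kv_heads * head_dim * context_tokens * batch
    return values * bytes_per_value / 1e9


if __name__ == "__main__":
    for params in (7, 70):
        for bits in (16, 4):
            print(f"{params}B @ {bits}-bit: "
                  f"weights {weights_gb(params, bits):.1f} GB, "
                  f"size VRAM >= {min_vram_gb(params, bits):.1f} GB")
    # Illustrative 70B-class shape (assumed): 80 layers, 8 KV heads, head_dim 128.
    print(f"KV cache, 8192 tokens: {kv_cache_gb(80, 8, 128, 8192):.2f} GB")
```

The cache grows linearly with both context length and batch size, so it is worth re-running the estimate with the longest context and highest concurrency you expect to serve.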
Consumer vs data-centre GPUs
The GPU market splits into consumer cards and data-centre accelerators, and the distinction matters for on-premise AI. Consumer GPUs such as the NVIDIA RTX 4090 offer 24 GB of GDDR6X at an exceptionally low price per gigabyte of VRAM and can run models like Llama 3 70B in 4-bit on a two-card setup. They are excellent for small teams, R&D labs and budget-first deployments. However, they lack ECC memory, are not designed for 24/7 rack operation, and are subject to NVIDIA's GeForce driver licence, which restricts data-centre deployment. Data-centre GPUs (the L4, L40S, A100 and H100/H200) are built for continuous duty cycles, carry ECC memory for numerical integrity, and are supported by enterprise SLAs. The L4 (24 GB) is cost-efficient for inference; the L40S (48 GB) handles mid-size models well; the A100 80 GB and H100/H200 (80 GB+) are the standard for large-model fine-tuning and high-throughput serving. Privonis designs deployments around data-centre GPUs precisely because European enterprise clients require that reliability guarantee.
- RTX 4090 — 24 GB GDDR6X, ~1 008 GB/s bandwidth, best price-per-VRAM for dev workloads.
- L4 — 24 GB GDDR6, PCIe form factor, low power (72 W), ideal for inference appliances.
- L40S — 48 GB GDDR6, high FP8 throughput, the workhorse for mid-size models at scale.
- A100 80 GB — 80 GB HBM2e, NVLink support, the proven production standard for large models.
- H100 / H200 — 80–141 GB HBM3/3e, transformer engine with FP8, maximum throughput available.
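To compare the cards listed above against a VRAM requirement, a minimal sketch follows. The VRAM figures come from the list; the `cards_needed` helper is hypothetical, and it makes the simplifying assumption that a model can be split evenly across cards with no per-card overhead, which real multi-GPU runtimes do not guarantee.

```python
import math

# VRAM per card in GB, taken from the list above (H100 and H200 split out).
GPU_VRAM_GB = {
    "RTX 4090": 24,
    "L4": 24,
    "L40S": 48,
    "A100 80 GB": 80,
    "H100": 80,
    "H200": 141,
}


def cards_needed(required_gb: float) -> dict:
    """Minimum number of each card whose combined VRAM covers required_gb.
    Assumes an even split with no per-card overhead (an optimistic bound)."""
    return {name: math.ceil(required_gb / vram)
            for name, vram in GPU_VRAM_GB.items()}


if __name__ == "__main__":
    # 70B model in 4-bit: 35 GB of weights, sized to leave 20 % of VRAM free.
    required = 35 / 0.8
    for name, count in cards_needed(required).items():
        print(f"{name}: {count} card(s) for {required:.2f} GB")
```

Treat the result as a lower bound on card count; price, power and interconnect then decide between the options that pass.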
Single-GPU vs multi-GPU strategies
A single high-VRAM GPU keeps the stack simple: no tensor-parallelism configuration, no NVLink fabric to manage, and a smaller failure surface. Start with a single GPU whenever the model fits and your throughput target is reachable. When it is not, either because the model is too large or because you need to serve dozens of concurrent users, you will need to spread the workload across multiple GPUs. NVLink dramatically outperforms PCIe for inter-GPU bandwidth (900 GB/s on H100-generation NVLink vs roughly 128 GB/s bidirectional, or about 64 GB/s per direction, on a PCIe 5.0 x16 link), which is critical for tensor parallelism. If your budget forces PCIe-only multi-GPU, prefer pipeline parallelism over tensor parallelism to minimise cross-device traffic.
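The decision rule in this section can be written down as a small function. This is a deliberately simplified reading of the paragraph above: the function name, its parameters and the returned labels are hypothetical, and `model_gb` is assumed to already include KV-cache headroom.

```python
def choose_strategy(model_gb: float, gpu_vram_gb: float,
                    throughput_met_on_one_gpu: bool,
                    has_nvlink: bool) -> str:
    """Pick a deployment strategy following the rule of thumb in the text."""
    fits = model_gb <= gpu_vram_gb
    if fits and throughput_met_on_one_gpu:
        # Simplest stack: no parallelism configuration at all.
        return "single GPU"
    if has_nvlink:
        # High inter-GPU bandwidth makes per-layer synchronisation affordable.
        return "multi-GPU, tensor parallelism"
    # PCIe only: keep cross-device traffic low by splitting layer-wise.
    return "multi-GPU, pipeline parallelism"


if __name__ == "__main__":
    print(choose_strategy(17.5, 24, True, False))
    print(choose_strategy(175, 80, True, True))
    print(choose_strategy(175, 80, True, False))
```

A real deployment has more options than these three (running independent replicas for throughput, for example), so use this only as a first filter.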
Power, cooling and rack planning
Data-centre GPUs draw between 72 W (L4) and 700 W (H100 SXM5) each. An eight-H100 DGX system can pull about 10 kW from the wall under sustained load. Before ordering hardware, confirm that your data centre or server room can deliver the necessary power circuits and provide adequate cooling: supply air held within the server vendor's specified inlet range (ASHRAE recommends 18–27 °C), or direct liquid cooling for the densest configurations. Overlooking power density is the single most common cause of deployment delays in on-premise AI projects.
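A quick power-budget check can catch an undersized circuit before the purchase order goes out. The sketch below is an assumption-laden estimate: the host overhead figure (CPUs, fans, NICs, PSU losses) and the 80 % continuous-load margin are hypothetical placeholders that should be replaced with values from your server vendor and your electrician.

```python
def system_power_w(gpu_count: int, gpu_tdp_w: float,
                   host_overhead_w: float) -> float:
    """Estimated wall power: GPUs at full TDP plus everything else in the chassis."""
    return gpu_count * gpu_tdp_w + host_overhead_w


def fits_circuit(load_w: float, circuit_capacity_w: float,
                 continuous_margin: float = 0.80) -> bool:
    """True if the load stays within the assumed usable share of the circuit."""
    return load_w <= circuit_capacity_w * continuous_margin


if __name__ == "__main__":
    # Eight 700 W GPUs plus an assumed 4 400 W for the rest of the system.
    load = system_power_w(8, 700, 4400)
    print(f"Estimated load: {load / 1000:.1f} kW")
    print("Fits an 11 kW circuit:", fits_circuit(load, 11_000))
    # Nearly all electrical power ends up as heat the cooling must remove.
    print(f"Heat to remove: about {load * 3.412:.0f} BTU/h")
```

The same load figure feeds the cooling calculation, since the room must reject essentially all of it as heat.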
Buy vs rent: the TCO calculation
Cloud GPU rental is operationally convenient but expensive at scale. An H100 instance on a major cloud provider costs roughly €3–4 per GPU-hour, which translates to roughly €26 000–35 000 per GPU per year at continuous utilisation. The same GPU purchased outright costs €25 000–35 000 and typically has a three-to-five-year useful life. Once server, power and maintenance costs are included, the break-even point for high-utilisation workloads falls between twelve and eighteen months, after which on-premise is typically the cheaper option. Privonis helps clients build this TCO model before committing to either path, because the right answer depends on utilisation rate, amortisation period, and the value of data sovereignty to the business.
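The break-even reasoning can be checked with a minimal model. The hourly rate and GPU price ranges come from the paragraph above; the server share, yearly running cost and function names are hypothetical inputs chosen for illustration, so substitute your own quotes before drawing conclusions.

```python
HOURS_PER_YEAR = 8760


def cloud_cost_per_year(rate_eur_per_hour: float,
                        utilisation: float = 1.0) -> float:
    """Yearly rental cost for one GPU at the given utilisation (0 to 1)."""
    return rate_eur_per_hour * HOURS_PER_YEAR * utilisation


def break_even_months(capex_eur: float, opex_eur_per_year: float,
                      rate_eur_per_hour: float, utilisation: float = 1.0):
    """Months until the purchase has paid for itself, or None if it never does."""
    saving_per_year = (cloud_cost_per_year(rate_eur_per_hour, utilisation)
                       - opex_eur_per_year)
    if saving_per_year <= 0:
        return None  # Renting is cheaper at this utilisation.
    return 12 * capex_eur / saving_per_year


if __name__ == "__main__":
    # Assumed: EUR 30 000 GPU + EUR 10 000 server share, EUR 3 000/year to run.
    for utilisation in (1.0, 0.5, 0.1):
        months = break_even_months(40_000, 3_000, 3.5, utilisation)
        label = "never" if months is None else f"{months:.1f} months"
        print(f"Utilisation {utilisation:.0%}: break-even {label}")
```

Running it across several utilisation levels shows how quickly the case for buying weakens once the hardware sits idle.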
The GPU you can afford to run continuously will always outperform the GPU you rent sporadically. Utilisation is the true performance multiplier.
Practical buying checklist
- Define your largest target model and compute VRAM requirement at your desired precision.
- Add 20 % VRAM headroom for the KV cache and future model updates.
- Verify power circuit capacity and cooling before specifying GPU count.
- Prefer ECC data-centre GPUs for 24/7 production; consumer cards are acceptable for R&D.
- Model multi-GPU interconnect (NVLink vs PCIe) before deciding on parallelism strategy.
- Build a 24-month TCO comparing purchase, depreciation, power and maintenance against cloud rental.
- Engage a vendor — such as Privonis — that can validate the full stack: GPU, server, OS, inference runtime and monitoring.
GPU procurement is not a one-time purchase; it is the foundation of your AI infrastructure roadmap. Investing the time to model VRAM requirements, power constraints and total cost of ownership before you buy will save months of rework and tens of thousands of euros. If you would like a free architecture review for your on-premise AI project, the Privonis team is ready to help.