Technology · 16 April 2026 · 7 min read

Quantization and fast inference on your own hardware

How to fit bigger models on smaller GPUs and serve them fast.

The first reaction many engineers have when they look at the hardware requirements for a state-of-the-art large language model is sticker shock. A 70-billion-parameter model stored at full FP32 precision needs roughly 280 GB of GPU memory for its weights alone (70 billion parameters × 4 bytes each). That is more than most organisations have in a single server, and far more than they want to provision just to answer employee queries. Quantization is the technique that makes these numbers tractable, and understanding it is essential for anyone designing an on-premise AI stack.

What quantization actually does

A neural network is ultimately a very large collection of numbers — the weights learned during training. By default those weights are stored as 32-bit floating-point values (FP32), each consuming 4 bytes of memory. Quantization replaces high-precision numbers with lower-precision representations: 16-bit floats (FP16 or BF16), 8-bit integers (INT8) or even 4-bit integers (INT4). The memory footprint shrinks proportionally, and on hardware with native support for lower-precision arithmetic, inference also gets faster.
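
To make the idea concrete, here is a minimal sketch of symmetric "absmax" quantization, one of the simplest schemes. It is an illustration only: the function names and the sample weights are made up for this example, and production methods add per-group scales, outlier handling and calibration data on top of this basic step.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale (absmax scheme)."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer levels."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.8213, -1.27, 0.0349, 0.4561, -0.5608]

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)

# Worst-case round-trip error at each precision.
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))

print("INT8 levels:", q8, "max error:", err8)
print("INT4 levels:", q4, "max error:", err4)
```

The round-trip error is bounded by half the scale, so halving the bit width from 8 to 4 makes the worst-case error roughly 18 times larger in this toy setup (the ratio of 127 to 7 levels). That is the information loss the rest of this article is about.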

  • FP16 / BF16: half-precision floats. Virtually lossless for most tasks; the go-to choice for production deployments where accuracy is critical. Memory savings: 2x versus FP32.
  • INT8: 8-bit integers, typically produced by post-training quantization (PTQ) methods such as GPTQ or LLM.int8(). Modest quality degradation on complex reasoning; negligible on most practical tasks. Memory savings: 4x versus FP32.
  • INT4: 4-bit integers, the frontier of aggressive quantization. Formats and methods such as GGUF Q4_K_M and AWQ deliver surprisingly good quality for their size. Memory savings: 8x versus FP32, with acceptable degradation for chat and summarisation workloads.
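
The savings above follow directly from bytes per weight. A rough calculator, assuming 1 GB = 10^9 bytes and counting weights only (the helper name is ours, not from any library):

```python
# Nominal storage cost per weight, in bytes.
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billion, precision):
    """Approximate weight storage in GB; excludes KV cache and activations."""
    return params_billion * BYTES_PER_WEIGHT[precision]


for precision in ("FP32", "FP16", "INT8", "INT4"):
    print(f"70B at {precision}: {weight_memory_gb(70, precision):.0f} GB")
# 70B at FP32: 280 GB
# 70B at FP16: 140 GB
# 70B at INT8: 70 GB
# 70B at INT4: 35 GB
```

Treat these as lower bounds. Real quantized formats also store scales and other metadata alongside the integer weights, so files on disk are somewhat larger than the nominal figure.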

The quality versus size trade-off

Quantization is not free. Every bit you remove is information discarded, and at some point that shows up as degraded output: hallucinations, reasoning errors or loss of nuance. The practical finding from Privonis deployments is that the trade-off is surprisingly favourable for most enterprise tasks. A 70B model quantized to INT4 typically outperforms a 13B model at FP16, even though the two need broadly similar GPU memory (about 35 GB versus 26 GB of weights). When in doubt, pick the largest model that fits in memory, then the highest precision at which it still fits.

Figure: Memory requirements for a 70B-parameter model at different precision levels. INT4 makes it possible to run on a single high-end workstation GPU.

Choosing the right quantization is less about the number of bits and more about matching the model capacity to the task: a well-chosen INT4 70B usually beats a careless FP16 13B.

Inference servers: where the throughput comes from

Running a quantized model is only half the story. Serving it efficiently under concurrent load requires an inference server that understands the structure of transformer attention. The dominant open-source option today is vLLM, which introduced PagedAttention, a memory management technique borrowed from operating-system virtual memory. Instead of reserving one contiguous KV-cache region per request, sized for the longest possible sequence, the server splits the cache into small fixed-size blocks and allocates them on demand. That lets it keep many more requests in flight without wasting GPU memory on space that is reserved but never used. In practice this can improve throughput by 10–30x over a naive single-request loop, depending on the model and the workload.
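
The memory argument is easy to see with a toy model. The sketch below is not vLLM's implementation; it only illustrates the bookkeeping idea, and the class name, the block size of 16 tokens and the request lengths are assumptions made for the example.

```python
import math


def contiguous_slots(seq_lens, max_seq_len):
    """Token slots reserved when every request pre-allocates max_seq_len."""
    return len(seq_lens) * max_seq_len


def paged_slots(seq_lens, block_size):
    """Token slots reserved when memory is handed out in fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


class BlockTable:
    """Toy block table: maps each request to the physical blocks it owns."""

    def __init__(self, num_physical_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> number of tokens stored

    def append_token(self, request_id):
        table = self.tables.setdefault(request_id, [])
        n = self.lengths.get(request_id, 0)
        # Allocate a new block only when the last one is full (or none exists).
        if n == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free.pop())
        self.lengths[request_id] = n + 1

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


seq_lens = [37, 512, 120, 9]   # hypothetical current lengths of four requests
print(contiguous_slots(seq_lens, max_seq_len=2048))   # 8192
print(paged_slots(seq_lens, block_size=16))           # 704
```

With these made-up lengths, contiguous pre-allocation reserves 8192 token slots while block-based allocation reserves 704. The waste per request is at most one partially filled block, which is what frees up room for more concurrent requests.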

Other notable options include llama.cpp (CPU-friendly, excellent for smaller models on commodity hardware), Ollama (developer-friendly wrapper around llama.cpp), TGI from Hugging Face (strong support for Hugging Face model formats) and TensorRT-LLM from NVIDIA (highest throughput on NVIDIA hardware, at the cost of a more complex compilation pipeline). Privonis evaluates and benchmarks all of these for each customer configuration.

Batching and throughput

GPUs achieve peak efficiency when processing many operations simultaneously; that is what they were designed for. Continuous batching (also called in-flight batching or iteration-level scheduling) lets an inference server combine tokens from multiple concurrent requests into a single forward pass and admit new requests as soon as others finish, rather than waiting for a whole batch to complete. This dramatically improves utilisation. Without batching, a single user query might use 5 % of your GPU capacity; with continuous batching, you can push utilisation to 70–80 % under real-world traffic patterns. For an enterprise with dozens of concurrent users, the difference between a batching-aware server and a naive one can mean the difference between needing one GPU server or four.
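
A small simulation shows why scheduling at the iteration level matters. This is a simplified model, assuming every decoding step costs the same and each request produces one token per step; the function names and request lengths are invented for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Freed slots are refilled from the queue at every decoding iteration."""
    queue = deque(output_lens)
    active = []   # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1   # one forward pass produces one token per active request
        # Drop requests that just produced their final token.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical workload: two long answers mixed with six short ones.
output_lens = [100, 5, 5, 5, 100, 5, 5, 5]

print(static_batching_steps(output_lens, batch_size=4))       # 200
print(continuous_batching_steps(output_lens, batch_size=4))   # 105
```

In the static case the short requests finish after 5 steps but their slots sit idle until the long request in the same batch completes. With continuous batching the second long request starts at step 6 instead of step 101, so the same work finishes in 105 steps rather than 200.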

Figure: Cost per token as a function of concurrent users, with and without batching. Continuous batching flattens the cost-per-token curve as concurrent users scale, a critical factor in on-premise TCO calculations.

Picking the right quantization for your GPU

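The decision tree is simpler than it looks. Start with your GPU memory budget, subtract headroom for the inference server's runtime overhead and the KV cache (typically 4–8 GB, more for long contexts or high concurrency), then find the largest model that fits at the highest precision level. A few practical reference points, based on weight size at 2 bytes per parameter for FP16, 1 for INT8 and 0.5 for INT4: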

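  • 24 GB VRAM (e.g. RTX 4090, A5000): comfortably runs a 7B model at FP16 or a 13B model at INT8; a 34B model at INT4 (about 17 GB of weights) fits with a modest context window. A 13B model at FP16 needs about 26 GB and does not fit.
  • 48 GB VRAM (e.g. RTX 6000 Ada, A6000): runs a 13B model at FP16, a 34B model at INT8, or a 70B model at INT4. A 34B model at FP16 needs about 68 GB and does not fit.
  • 2 × 80 GB (e.g. A100 pair via NVLink): runs a 70B model at FP16, or a 140B model at INT4, with tensor parallelism across both GPUs.
  • CPU-only (no GPU): llama.cpp with a Q4_K_M 7B or 13B model is viable for low-concurrency developer tooling; expect 5–15 tokens/s.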
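
The selection rule can be written down in a few lines. This sketch is one reading of it: try the largest model first, then the highest precision at which it fits. The candidate model sizes, the 6 GB default headroom and the function name are assumptions for illustration, and only weight memory is counted.

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}
PRECISION_ORDER = ["FP16", "INT8", "INT4"]   # highest precision first
MODEL_SIZES_B = [140, 70, 34, 13, 7]         # billions of parameters, largest first


def pick_config(vram_gb, headroom_gb=6.0):
    """Return (params_billion, precision) for the largest model that fits,
    at the highest precision that fits, or None if nothing fits."""
    budget = vram_gb - headroom_gb
    for size in MODEL_SIZES_B:
        for precision in PRECISION_ORDER:
            if size * BYTES_PER_WEIGHT[precision] <= budget:
                return size, precision
    return None


for vram in (24, 48, 160):
    print(f"{vram} GB ->", pick_config(vram))
# 24 GB -> (34, 'INT4')
# 48 GB -> (70, 'INT4')
# 160 GB -> (140, 'INT8')
```

Because the rule favours model size over precision, it is a starting point rather than a verdict. A long context window or many concurrent users enlarges the KV cache and calls for a bigger `headroom_gb`, which can push the answer down a size.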

Putting it together with Privonis

Selecting a quantization format and an inference server is engineering work that requires profiling on your specific hardware with your specific workload. Privonis handles that benchmarking as part of every deployment: we run throughput tests, measure output quality on a representative sample of your real prompts, and deliver a configuration that maximises performance within your hardware budget. The result is a production inference stack that your team can operate without a specialist ML engineer on call. If you are ready to explore what fits your environment, our team is happy to run the numbers with you.
