Technology · 19 May 2026 · 7 min read

Fine-tuning open models on your own data

When prompting is not enough: how to specialise an open model on your domain — privately.

Large language models arrive pre-trained on vast swathes of the public internet. That breadth makes them impressively general-purpose — but general-purpose is not the same as expert. When your business needs a model that understands your internal taxonomy, writes in your house style, or reasons about proprietary processes, three adaptation paths open up: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. Each has its place, and choosing the right one — or the right combination — can make the difference between a prototype and a production system. Privonis helps European organisations navigate that choice and execute it entirely within their own infrastructure.

Three paths to domain adaptation

Prompt engineering costs nothing beyond trial and error, but it runs into a hard wall: you can only fit so much context in a window, and the model may simply lack the domain knowledge you need. RAG sidesteps the context limit by retrieving relevant chunks from a knowledge base at query time and handing them to the model. It is powerful and surprisingly cheap, but retrieval quality caps answer quality — if the right chunk is not found, the model cannot reason about it.

Figure: Comparison of prompt engineering, RAG, and fine-tuning workflows. Retrieval-augmented generation adds a search step before inference; fine-tuning bakes knowledge into the weights.
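
To make the retrieval step concrete, here is a minimal sketch of RAG's "search, then prompt" flow. It is a toy under stated assumptions: it ranks chunks by bag-of-words cosine similarity rather than the embedding search a production system would normally use, and the knowledge base, function names, and prompt wording are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Prepend the retrieved chunks to the question before it goes to the model."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base, invented for illustration.
KNOWLEDGE_BASE = [
    "Refund requests must be filed within 30 days of purchase.",
    "The Riga office is open on weekdays from 9 to 17.",
    "Enterprise contracts renew annually unless cancelled in writing.",
]

QUESTION = "How many days do I have to file a refund request?"
print(build_prompt(QUESTION, KNOWLEDGE_BASE, k=1))
```

The model only ever sees what `retrieve` returns, which is why a missed chunk cannot be recovered later in the pipeline.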

Fine-tuning takes a different approach: it updates the model’s weights on your curated dataset so that domain knowledge becomes intrinsic. The result is a model that answers from internalised expertise rather than retrieved snippets. It typically performs better on style-sensitive tasks, structured outputs, and latency-critical pipelines where you cannot afford an extra retrieval round-trip. The downside is cost — both in GPU time and in data preparation — so it is worth reaching for when the other two methods have plateaued.

When fine-tuning is the right call

  • Your outputs must follow a precise format (clinical notes, legal clauses, structured JSON) that prompt templates cannot reliably enforce.
  • The model consistently lacks domain vocabulary, acronyms, or product names that never appeared in its pre-training corpus.
  • Latency requirements rule out a retrieval hop on every request.
  • You want to compress a complex, multi-shot prompt into zero-shot behaviour for cost and speed.
  • You are distilling a larger model into a smaller, cheaper one for edge or on-premise deployment.

LoRA and QLoRA: fine-tuning without a data-centre budget

Full fine-tuning updates every weight in the model, which is prohibitively expensive for models with tens of billions of parameters. Low-rank adaptation (LoRA) sidesteps this by freezing the original weights and injecting small trainable low-rank matrices alongside them, typically in the attention layers. The number of trainable parameters drops by a factor of 100 or more, yet on many tasks the resulting model comes close to full fine-tuning quality. QLoRA adds quantisation to the mix: the frozen base model is loaded in 4-bit precision, cutting GPU memory requirements so sharply that a 70-billion-parameter model can be fine-tuned on a single 80 GB A100.
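
The parameter saving can be checked with a few lines of arithmetic. The sketch below assumes the standard LoRA formulation, in which a frozen weight matrix W (d_out × d_in) is augmented with two factors, A (rank × d_in) and B (d_out × rank), and the layer output becomes Wx + (alpha / rank)·B(Ax). The layer size, rank, and helper names are illustrative assumptions, not values from any particular model.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix W (d_out x d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update: Wx + (alpha/rank) * B(Ax)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical layer: a square 4096 x 4096 projection with adapter rank 8.
d, r = 4096, 8
full = full_params(d, d)               # 16,777,216 frozen parameters
lora = lora_trainable_params(d, d, r)  # 65,536 trainable parameters
print(f"reduction factor for this matrix: {full / lora:.0f}x")

# With B initialised to zeros, the adapted layer starts out identical to the base layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]
print(lora_forward(W, A, B_zero, [1.0, 2.0], alpha=2, rank=1))
```

Only A and B receive gradients during training. Merging the adapter afterwards amounts to adding (alpha / rank)·BA to W once, so inference does not have to pay for the extra matrices.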

Figure: GPU memory savings from QLoRA compared with full fine-tuning. Loading the frozen base weights in 4-bit rather than 16-bit precision cuts their memory footprint by about 75%, making fine-tuning accessible on a single high-end GPU.
With QLoRA, a team that owns one A100 can often fine-tune a capable open model in an afternoon: no cloud account, no data leaving the building.
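
A back-of-the-envelope estimate shows where the memory saving comes from. This sketch counts only the memory needed to hold the base weights. It deliberately ignores adapter weights, optimizer state, activations, and quantisation metadata, all of which add overhead in a real run. The 80 GB figure assumes the larger A100 variant.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 70e9      # 70-billion-parameter base model
GPU_MEMORY_GB = 80   # assumption: the 80 GB A100 variant

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantised weights
saving = 1 - int4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"4-bit weights:  {int4_gb:.0f} GB")
print(f"saving on weights: {saving:.0%}")
print(f"fits on one GPU at 16-bit: {fp16_gb < GPU_MEMORY_GB}")
print(f"fits on one GPU at 4-bit:  {int4_gb < GPU_MEMORY_GB}")
```

At 16-bit precision the weights alone exceed a single 80 GB card. At 4-bit they leave headroom for the adapter, its optimizer state, and activations, although how much is needed depends on batch size and sequence length.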

Data preparation: the make-or-break step

Model quality is bounded by data quality. Before any training run, Privonis works with clients to curate a supervised dataset of input-output pairs that represent the exact behaviour they want. Typical sources include: reviewed customer interactions, corrected model outputs, expert-annotated documents, and synthetic data generated by a stronger teacher model and then filtered. Volume matters less than diversity and correctness — a thousand carefully vetted examples often outperform ten thousand noisy ones. Data cleaning pipelines handle deduplication, length trimming, and format normalisation before training begins.
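
As a rough illustration of the cleaning steps named above, the sketch below normalises, trims, and deduplicates a list of input-output pairs and writes them out as JSON Lines. The record schema (`input` / `output` keys), the character limit, and the case-insensitive duplicate rule are assumptions made for the example, not a description of any specific pipeline.

```python
import json
import unicodedata


def normalise(text: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_dataset(records, max_chars=2000):
    """Normalise, trim, and deduplicate {"input": ..., "output": ...} records."""
    seen = set()
    cleaned = []
    for rec in records:
        inp = normalise(rec.get("input", ""))
        out = normalise(rec.get("output", ""))
        if not inp or not out:
            continue  # drop pairs with a missing side
        inp, out = inp[:max_chars], out[:max_chars]  # length trimming by truncation
        key = (inp.lower(), out.lower())  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"input": inp, "output": out})
    return cleaned


def to_jsonl(records) -> str:
    """Serialise records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical raw examples, invented for illustration.
raw = [
    {"input": "  What is  LoRA? ", "output": "Low-rank adaptation."},
    {"input": "what is lora?", "output": "low-rank adaptation."},  # duplicate after normalising
    {"input": "", "output": "orphaned answer"},                    # incomplete pair
]
print(to_jsonl(clean_dataset(raw)))
```

Exact-match deduplication like this catches only verbatim repeats. Near-duplicates and factually wrong examples still need human review or a stricter similarity filter.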

Evaluation: knowing when you are done

Fine-tuning without rigorous evaluation is optimisation in the dark. A held-out evaluation set — never seen during training — measures whether the model has generalised or merely memorised. Metrics depend on the task: exact match and F1 for extraction tasks, ROUGE for summarisation, human preference ratings for open-ended generation. Privonis runs automated evals after every checkpoint and flags catastrophic forgetting — cases where the model gains domain skill but loses general reasoning — by including a standard benchmark sample in every evaluation suite.
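
Two of the metrics mentioned above are simple enough to sketch directly. The code below shows exact match and a token-level F1 of the kind commonly used for extraction tasks, plus a hypothetical `forgetting_flag` helper that compares a general-benchmark score before and after fine-tuning. The 0.02 tolerance and the example scores are arbitrary placeholders chosen for illustration.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def forgetting_flag(base_score: float, tuned_score: float, tolerance: float = 0.02) -> bool:
    """True if the general-benchmark score dropped by more than the tolerance."""
    return (base_score - tuned_score) > tolerance


print(exact_match(" Riga ", "riga"))
print(round(token_f1("invoice number 123", "invoice 123"), 2))
print(forgetting_flag(base_score=0.70, tuned_score=0.60))
```

Running the same functions on the held-out set after every checkpoint gives a curve for domain skill, while `forgetting_flag` on the benchmark sample gives an early warning when general ability starts to slip.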

The weights are yours

This is the point that often gets lost in discussions of cloud-hosted fine-tuning APIs: when you fine-tune through a third-party service, the resulting weights may be locked to that provider. With Privonis, the base model is open-weight, the training run happens on hardware you control, and the LoRA adapter or merged checkpoint is yours to keep, version, and deploy wherever you choose. That means no vendor lock-in, no per-token fee on a model you paid to train, and no risk of the provider retraining on your data. For European companies handling sensitive information, keeping the weights is more than a nice-to-have; it is often a governance requirement.
