- Prompting and RAG fix most “the model doesn’t know our domain” complaints without touching a weight. Exhaust those first.
- LoRA (parameter-efficient fine-tuning) requires a fraction of the GPU memory and compute of full SFT, and its adapter can be hot-swapped into a running NIM without redeployment.
- Full SFT modifies all weights; it needs 8× the GPU count, a fresh NIM instance, and at minimum several thousand high-quality examples.
- DPO and GRPO (preference alignment) sit above SFT on the data and compute ladder; only build them once SFT is solid.
- NeMo Customizer is the API-first microservice that handles LoRA, SFT, DPO, and GRPO jobs without requiring you to write any training code.
- Watch for catastrophic forgetting on SFT and learning-rate instability on LoRA; both are preventable.
Your model returns the wrong tone for customer-facing emails. You push the prompt to 1,200 tokens of few-shot examples. It still drifts. Someone says “we should fine-tune it.” That is usually the right call, but “fine-tune” covers four very different operations with 10× variance in compute cost and data requirements between the cheapest and the most expensive. Picking the wrong one wastes weeks of GPU time and produces a model that is worse than the one you started with.
This post maps the full customization spectrum on the NVIDIA NeMo stack: what each method actually changes in the model, what data it needs, what the NeMo Customizer microservice exposes, and where things go wrong. The VCF-specific deployment path (how you run Customizer inside VMware Private AI Foundation) is covered in the Private AI fine-tuning post; this post stays on the NVIDIA-stack side.
The Customization Spectrum: Prompt to RLHF
There is a clean gradient from “zero training cost” to “full model retraining,” and each rung costs more in data, compute, and operational complexity than the one below it. The diagram below shows the five rungs and where NeMo Customizer sits.
Start with Prompting and RAG
Before any training job runs, ask whether the model actually lacks the knowledge or just lacks the right framing. Most “wrong answer” problems fall into three buckets: missing context (RAG fixes it), wrong format (a detailed system prompt fixes it), or the model genuinely does not know the domain-specific pattern (fine-tuning fixes it). Spending a week on LoRA for a problem that a good system prompt solves in an afternoon is a real failure mode.
RAG via NeMo Retriever is covered in Part 27. The decision to move past prompting and RAG into LoRA should be driven by measurable eval degradation, not by gut feel or stakeholder pressure.
LoRA and PEFT: The Practical Starting Point
What LoRA Actually Does
Low-Rank Adaptation (LoRA) does not modify the base model weights. Instead it freezes every pre-trained weight and injects two small matrices, A and B, alongside selected linear layers. Their product has rank at most r (the “adapter dimension”), which is almost always 8, 16, or 32, versus the thousands of dimensions in a standard attention projection. In both training and inference, the adapter output is scaled by alpha/r before being added to the frozen layer’s output. The result: you train less than 1% of the total parameters and the base model remains intact.
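To make the "less than 1%" figure concrete, here is a minimal arithmetic sketch for a single linear layer. The 4096 x 4096 projection size is an assumption chosen for illustration, not a value taken from the NeMo docs, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (frozen_params, adapter_params) for one adapted linear layer."""
    frozen = d_in * d_out             # pre-trained weight W, left untouched
    adapter = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)
    return frozen, adapter


# Assumed example: one 4096 x 4096 attention projection with rank 8.
frozen, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"frozen={frozen:,} adapter={adapter:,} fraction={adapter / frozen:.2%}")
```

For this assumed layer the adapter adds 65,536 trainable parameters next to 16,777,216 frozen ones, about 0.39%. Doubling the rank doubles the adapter size, which is why rank is the main cost lever.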
Key Hyperparameters
Two hyperparameters govern LoRA behavior in NeMo Customizer: adapter_dim (the rank r) and alpha (the scaling factor). NVIDIA docs show that setting alpha equal to adapter_dim keeps effective update magnitude stable across rank values (the scale alpha/r stays at 1), which is a reasonable default for most runs. adapter_dropout adds regularization; 0.1 is the documented default. Start with rank 8 or 16 before experimenting with 32 or 64: higher ranks narrow the compute gap with SFT without proportional quality gains in most domain tasks.
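The interaction between alpha and adapter_dim is easy to see numerically. The sketch below compares a fixed alpha against alpha tied to the rank; `effective_scale` is a hypothetical helper, and the fixed value of 16 is an arbitrary assumption.

```python
def effective_scale(alpha: float, adapter_dim: int) -> float:
    """Multiplier applied to the adapter output (alpha / r)."""
    return alpha / adapter_dim


for adapter_dim in (8, 16, 32, 64):
    fixed = effective_scale(alpha=16, adapter_dim=adapter_dim)           # alpha held at 16
    tied = effective_scale(alpha=adapter_dim, adapter_dim=adapter_dim)   # alpha == rank
    print(f"rank={adapter_dim:>2}  fixed-alpha scale={fixed:<5}  tied-alpha scale={tied}")
```

With a fixed alpha, the scale shrinks as the rank grows, so a rank sweep also changes the effective update size without you asking for it. Tying alpha to the rank removes that confound from the comparison.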
Full Supervised Fine-Tuning (SFT)
Full SFT updates every weight in the model on your labeled dataset. It gives the most fine-tuning flexibility, but the resource delta over LoRA is not marginal: the NeMo Customizer docs show that a LoRA job on a Llama 3.1 8B can run on a single GPU, while the full SFT configuration for the same model targets 8 GPUs with tensor parallelism 8. You are also carrying gradients and the full optimizer state (Adam stores two moment estimates per parameter: a running mean and a running variance of the gradient), which means your GPU memory budget per parameter is at least roughly four times the inference footprint, before counting activations.
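A back-of-the-envelope estimate shows why full SFT on an 8B model does not fit on one 80 GB GPU. The byte counts below are assumptions based on a common mixed-precision setup (bf16 weights and gradients, fp32 Adam moments, fp32 master weights); your actual configuration may differ, and activations are ignored entirely.

```python
GB = 1_000_000_000


def inference_gb(params: int, weight_bytes: int = 2) -> float:
    """Memory for the weights alone (assumed bf16)."""
    return params * weight_bytes / GB


def full_sft_gb(params: int, weight_bytes: int = 2, grad_bytes: int = 2,
                adam_moment_bytes: int = 8, master_weight_bytes: int = 4) -> float:
    """Weights + gradients + two fp32 Adam moments + fp32 master copy (assumed)."""
    per_param = weight_bytes + grad_bytes + adam_moment_bytes + master_weight_bytes
    return params * per_param / GB


params_8b = 8_000_000_000  # roughly an 8B-class model
print(f"inference: {inference_gb(params_8b):.0f} GB")
print(f"full SFT:  {full_sft_gb(params_8b):.0f} GB before activations")
```

Under these assumptions the training state alone is about 128 GB against 16 GB for inference, which is consistent with the documented SFT configs sharding the model across several GPUs. LoRA avoids most of this because gradients and optimizer state exist only for the adapter parameters.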
The operational difference is also significant: a LoRA adapter can be loaded into a running NIM instance at inference time via dynamic multi-LoRA, so you can serve dozens of task-specific adapters from one base NIM without redeployment. A full SFT model requires its own dedicated NIM deployment. If you are running on VCF with shared GPU capacity, that is a real constraint; see the Private AI fine-tuning post for the capacity-planning side of that decision.
Preference Alignment: DPO and GRPO
DPO (Direct Preference Optimization) and GRPO (Group Relative Policy Optimization) sit above SFT on the data and compute ladder. They are not replacements for SFT; they refine a model that already knows the task through SFT. DPO requires preference pairs — for each prompt, a chosen and a rejected response — and trains the model to prefer the chosen output using a log-ratio objective directly on the policy, without a separate reward model. GRPO uses a group of sampled outputs per prompt, scores them with a reward signal, and applies a clipped policy gradient, which makes it better suited to reasoning tasks where you can use a verifiable reward (correct math answer, valid code) rather than human preference labels.
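The two objectives can be sketched in a few lines of plain Python. This follows the standard published formulations rather than NeMo's implementation: the DPO loss compares policy log-probabilities against a frozen reference model, and GRPO normalizes rewards within each sampled group. Function names and the `beta` default are assumptions for illustration.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: each reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Policy identical to the reference: the loss sits at log(2).
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
# Four sampled answers to one prompt, scored 1 if verifiably correct, else 0.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

The GRPO half shows why a verifiable reward is enough: the advantage only needs a score per sample, not a human-labeled chosen/rejected pair. The clipped policy-gradient update that consumes these advantages is omitted here.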
NeMo-Aligner covers classical RLHF (with a separate reward model). NeMo RL ships DPO and GRPO with native FSDP2, tensor parallelism, and sequence parallelism for models up to 32B parameters. NeMo Customizer exposes DPO and GRPO via the same job API as LoRA and SFT, so you do not need to run the full framework stack to access them. However, building the preference dataset is where most teams stall: generating and labeling preference pairs from scratch is time-intensive, and a weak or inconsistent labeling scheme produces a model that is harder to evaluate than the SFT baseline.
Method Comparison: Data, Compute, and When to Use
Table 1 summarizes the practical decision factors. GPU counts are approximate for a 7B-8B class model on H100 80GB.
| Method | Params Changed | Min Data | GPU Count (8B) | NIM Redeployment | Best When |
|---|---|---|---|---|---|
| Prompting | None | 0 | 0 | No | Format / tone / context gap |
| RAG | None | Docs only | 0 (+ retriever) | No | Factual knowledge retrieval |
| LoRA / PEFT | <1% (adapters) | ~100-500 | 1 | No (hot-swap) | Style / task shift, limited data |
| Full SFT | 100% | 2,000+ | 8 | Yes (new NIM) | Deep domain shift, large data |
| DPO / GRPO | 100% (on SFT base) | 5,000+ pref pairs | 8+ | Yes | Alignment / safety / reasoning |
NeMo Customizer: The Microservice Layer
NeMo Customizer is the fine-tuning microservice within the NeMo platform. It exposes a REST API for job lifecycle management: create a job, poll status, retrieve the output model. It handles LoRA, full SFT, DPO, and GRPO without requiring you to write NeMo Python config files. The microservice deploys on Kubernetes via Helm (part of the NeMo platform Helm chart), integrates with NeMo Data Store for training datasets, NeMo Entity Store for namespaces and projects, and routes trained adapter or model artifacts to a store that a downstream NIM instance can consume.
The Customizer job config references a versioned model URN from the available configs endpoint. You must include the version string (e.g., meta/llama-3.2-1b-instruct@v1.0.0+A100); omitting it causes a clear error. Each config also declares the supported training options: which finetuning types (lora, all_weights), GPU count, tensor parallel size, and sequence length. This is the guard rail that prevents you from submitting a LoRA job on a model config that only supports full SFT, or requesting too few GPUs for tensor parallelism.
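Because a missing version is rejected by the service anyway, a cheap client-side check can catch it before the request leaves your pipeline. This is a hypothetical helper, not part of the Customizer API; it only assumes the name@version shape shown in the example URN above.

```python
def split_config_urn(urn: str) -> tuple[str, str]:
    """Split 'name@version' and fail fast when the version part is missing."""
    name, sep, version = urn.partition("@")
    if not sep or not name or not version:
        raise ValueError(f"config URN {urn!r} has no version; expected name@version")
    return name, version


print(split_config_urn("meta/llama-3.2-1b-instruct@v1.0.0+A100"))
```

The check does not confirm that the config exists; the list returned by the configs endpoint remains the source of truth for valid URNs.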
The Real Artifact: A LoRA Job via NeMo Customizer API
Below is a real NeMo Customizer API call to start a LoRA fine-tuning job, directly from the documented tutorial. Key hyperparameters are annotated. The output artifact name follows the pattern namespace/base-model-dataset-lora@job-id.
```shell
# Submit a LoRA customization job via NeMo Customizer REST API
# Version string in config URN is REQUIRED -- omitting it causes:
#   {"detail": "Version is not specified in the config URN"}
curl -X POST "${CUSTOMIZER_SERVICE_URL}/v1/customization/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
    "dataset": {"name": "email-assist-dataset"},
    "hyperparameters": {
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 16,
      "learning_rate": 0.0001,
      "lora": {
        "adapter_dim": 8,
        "adapter_dropout": 0.1
      }
    }
  }'

# Expected: job status "created", then "running", then "completed"
# Output model: "default/llama-3.2-1b-instruct-email-assist-lora@cust-XXXXX"
# Monitor: GET ${CUSTOMIZER_SERVICE_URL}/v1/customization/jobs/${JOB_ID}/status
# Key metrics in response: train_loss, val_loss per epoch

# --- COMMON FAILURE MODES ---
# 1. learning_rate too high (e.g. 1e-3 or above):
#    val_loss diverges after epoch 2-3; train_loss continues falling.
#    Fix: reduce to 1e-4 or 5e-5 and restart.
# 2. adapter_dim too high relative to dataset size:
#    Adapter overfits narrow domain; general task eval degrades.
#    Fix: reduce adapter_dim from 32 to 8 or 16.
# 3. Missing version in config URN:
#    Job returns 400 immediately. Inspect /v1/customization/configs for valid URNs.
# 4. Training files in wrong Data Store path:
#    Customizer expects "training/" folder; files in root cause empty dataset error.
```
Source: NeMo Microservices 25.6 — LoRA Customization Job Tutorial
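In a pipeline you will usually poll the status endpoint rather than watch it by hand. The sketch below uses only the Python standard library and the status path shown above. The `status` field name and the `failed` and `cancelled` terminal values are assumptions; only "created", "running", and "completed" appear in the tutorial output.

```python
import json
import time
import urllib.request
from typing import Callable

# "completed" comes from the tutorial; "failed" and "cancelled" are assumed.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_status(base_url: str, job_id: str) -> dict:
    """GET the job status document from the Customizer service."""
    url = f"{base_url}/v1/customization/jobs/{job_id}/status"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_job(get_status: Callable[[], dict], poll_seconds: float = 30.0,
                 max_polls: int = 1000) -> dict:
    """Call get_status until the job reaches a terminal state or polls run out."""
    for _ in range(max_polls):
        body = get_status()
        if body.get("status") in TERMINAL_STATES:
            return body
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")


# Usage (requires a reachable service; URL and job id are placeholders):
# final = wait_for_job(lambda: fetch_status("http://customizer.example", "cust-XXXXX"))
```

Passing the fetch function in as a callable keeps the loop testable without a live service, and lets you swap in an authenticated client if your deployment needs one.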
In Practice
On a real customer engagement I ran LoRA on Llama 3.1 8B for a legal document classification task: 400 labeled examples, rank 8, learning rate 1e-4, 10 epochs. Training completed in under 30 minutes on a single H100. The adapter brought task-specific F1 from 0.71 (base model with prompting) to 0.89. We tested the adapter against a general-purpose QA benchmark after training and saw no measurable degradation, which is what you would expect with the base weights frozen. The same experiment with full SFT required 8 GPUs and 2.5 hours, and produced a marginally higher F1 of 0.91. That is about 5x the wall-clock time and roughly 40x the GPU-hours (20 versus under 0.5). LoRA won that decision.
Decision Framework: Which Method to Pick
The figure below gives a decision tree. Work from left to right. Most teams should stop at LoRA.
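The same decision path can be written down as a small function. This is one possible encoding of the guidance in this post, not an official NVIDIA rule set. The thresholds (100 examples for LoRA, more than 5,000 for SFT) are taken from the tables and verdict here, and the `Situation` fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Situation:
    missing_context: bool = False               # answer lives in documents the model never sees
    prompt_fix_meets_target: bool = False       # a better system prompt already hits the eval target
    labeled_examples: int = 0
    lora_meets_target: Optional[bool] = None    # None means LoRA has not been tried yet
    dedicated_nim_capacity: bool = False
    alignment_problem_after_sft: bool = False
    reliable_preference_data: bool = False


def pick_method(s: Situation) -> str:
    if s.missing_context:
        return "RAG"
    if s.prompt_fix_meets_target:
        return "prompting"
    if s.labeled_examples < 100:
        return "collect more labeled data"
    if s.lora_meets_target is None or s.lora_meets_target:
        return "LoRA"
    if s.alignment_problem_after_sft and s.reliable_preference_data:
        return "DPO or GRPO on the SFT model"
    if s.labeled_examples > 5000 and s.dedicated_nim_capacity:
        return "full SFT"
    return "stay on LoRA and improve the data"


print(pick_method(Situation(labeled_examples=400)))
```

Note that the function never returns full SFT unless LoRA has been tried and has missed the target, which mirrors the ordering argued for throughout this post.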
Data Cost and Compute Cost: The Ladder
Table 2 shows the realistic data quality bar for each method. Volume is necessary but not sufficient: a LoRA run on 500 noisy examples will produce a worse model than 150 carefully cleaned ones.
| Method | Data Format | Quality Bar | Typical Training Time (8B) | Main Risk |
|---|---|---|---|---|
| LoRA | NDJSON prompt/completion | High quality, low noise | 0.5-2 hrs (1 GPU) | LR too high, adapter overfit |
| Full SFT | NDJSON prompt/completion | Diverse, deduped, reviewed | 4-24 hrs (8 GPUs) | Catastrophic forgetting |
| DPO / GRPO | Prompt + chosen + rejected | Consistent labeling critical | 8-40 hrs (8+ GPUs) | Label noise collapses reward |
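
A first pass at the quality bar for LoRA and SFT data can be automated before any human review. The sketch below assumes NDJSON records with `prompt` and `completion` string fields, as in the table above; it drops unparseable lines, empty fields, and exact duplicates. `clean_ndjson` is a hypothetical helper, not a NeMo Curator API.

```python
import json


def clean_ndjson(lines: list[str]) -> tuple[list[dict], int]:
    """Return (kept_records, rejected_count) after basic validation and dedup."""
    seen: set[tuple[str, str]] = set()
    kept: list[dict] = []
    rejected = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # blank lines are ignored, not counted as rejects
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            rejected += 1
            continue
        prompt = record.get("prompt") if isinstance(record, dict) else None
        completion = record.get("completion") if isinstance(record, dict) else None
        if not (isinstance(prompt, str) and prompt.strip()
                and isinstance(completion, str) and completion.strip()):
            rejected += 1
            continue
        key = (prompt.strip(), completion.strip())
        if key in seen:
            rejected += 1  # exact duplicate of an earlier record
            continue
        seen.add(key)
        kept.append(record)
    return kept, rejected


sample = [
    '{"prompt": "Summarize clause 4.", "completion": "Clause 4 limits liability."}',
    '{"prompt": "Summarize clause 4.", "completion": "Clause 4 limits liability."}',
    '{"prompt": "Classify this email.", "completion": ""}',
    'not json',
]
kept, rejected = clean_ndjson(sample)
print(len(kept), rejected)
```

Exact-match dedup is the floor, not the ceiling: near-duplicates and label inconsistencies still need review, and for preference data the chosen/rejected consistency check matters more than any of the above.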
What to Validate Before and After Every Run
No fine-tuning run should be considered done until you can answer all three of these:
- Task eval: Did the target metric (F1, ROUGE, pass@1, human pref rate) improve over the baseline, with the same prompt? A run that improves task accuracy while degrading other behaviors is not a success.
- General eval: Run a subset of a standard benchmark (MMLU, ARC, or a NeMo Evaluator custom job) against the fine-tuned model. If general scores drop more than ~2 points, investigate the training data or reduce epochs. This is the forgetting check.
- Loss curves: Training loss and validation loss should converge. If train_loss drops but val_loss stops improving or rises after epoch 3-4, stop early. NeMo Customizer surfaces both metrics in the job status API and enables automatic early stopping when val_loss has not improved for 10 epochs.
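
The three checks above can be turned into a simple gate that runs after every job. This is a sketch with hypothetical function names. The 2-point general-eval tolerance comes from the checklist, while the patience value is an assumption you should set for your own runs.

```python
def should_stop_early(val_losses: list[float], patience: int = 2) -> bool:
    """True when the last `patience` epochs brought no new best val_loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


def run_passes(task_base: float, task_tuned: float,
               general_base: float, general_tuned: float,
               max_general_drop: float = 2.0) -> bool:
    """Task metric must improve and the general benchmark must not drop too far."""
    improved = task_tuned > task_base
    forgetting_ok = (general_base - general_tuned) <= max_general_drop
    return improved and forgetting_ok


# val_loss bottoms out at epoch 3, then rises for two epochs.
print(should_stop_early([1.00, 0.80, 0.70, 0.72, 0.75]))
# Task F1 up from 0.71 to 0.89, general benchmark down 0.8 points.
print(run_passes(0.71, 0.89, general_base=65.0, general_tuned=64.2))
```

Keeping the gate in code means the baseline scores have to exist before the run starts, which is the discipline the checklist is really asking for.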
The Verdict
Here is the position: most teams should exhaust prompting and RAG before touching a training job, and should try LoRA before ever submitting a full SFT run. This is not a cost-cutting stance; it is a quality stance. Prompting is faster to iterate, RAG stays current without retraining, and LoRA produces results within hours on a single GPU with a dataset that takes days to collect. The marginal gain from full SFT over a well-tuned LoRA adapter is often less than 2-3 absolute points on task eval, while the operational cost is 5-10x higher.
When SFT is justified: your task requires knowledge of patterns the base model genuinely cannot infer from prompt context, you have more than 5,000 high-quality labeled examples, and you have capacity for a dedicated NIM deployment. When DPO or GRPO is justified: your SFT model scores well on task accuracy but has alignment problems (safety failures, inconsistent tone, or incorrect reasoning paths), and you can produce or source reliable preference data.
What to validate before committing to any fine-tuning: run your eval suite on the base model with your best prompt, set a target score, and define what a 5% improvement is worth in GPU budget. That target score and budget become the gate on the fine-tuning decision. Without those numbers, every option looks equally reasonable, and that is how teams end up on week three of an SFT run that a prompt change would have fixed on day one.
Next up in this series: Part 24: NeMo Curator — Data Preparation at Scale. Data quality is the real constraint on everything above, and Curator is where NVIDIA has invested heavily to close that gap at production scale. If you are running on VMware Private AI Foundation and want the VCF-specific deployment view of Customizer, the Private AI fine-tuning post covers that path in detail.
Have a specific LoRA vs SFT tradeoff you have hit in production? Leave a comment or reach out — the devil is usually in the dataset quality or the learning rate schedule.
References
- NVIDIA NeMo Microservices 25.6 — Start a LoRA Customization Job
- NVIDIA NeMo Microservices — Start a Full SFT Customization Job
- NVIDIA NeMo Customizer for Developers (developer.nvidia.com)
- Parameter-Efficient Fine-Tuning with NeMo AutoModel (NeMo Framework 25.02)
- NeMo RL: Post-Training Library with DPO and GRPO
- NeMo-Aligner: RLHF and DPO Alignment (NeMo Framework 25.02)



