Dr. Pranay Jha

NeMo Customization: LoRA, SFT, and RLHF on NVIDIA NeMo (NVIDIA AI Series, Part 23)

A practical decision guide for AI infrastructure architects on the full NeMo customization spectrum: when to use LoRA, full SFT, DPO, or GRPO, what data and GPU budget each method needs, and how the NeMo Customizer microservice ties it all together.

NVIDIA AI Series · Part 23 of 30
TL;DR — Key Takeaways
  • Prompting and RAG fix most “the model doesn’t know our domain” complaints without touching a weight. Exhaust those first.
  • LoRA (parameter-efficient fine-tuning) requires a fraction of the GPU memory and compute of full SFT, and its adapter can be hot-swapped into a running NIM without redeployment.
  • Full SFT modifies all weights; for an 8B-class model it needs roughly 8× the GPU count of LoRA, a fresh NIM instance, and at minimum several thousand high-quality examples.
  • DPO and GRPO (preference alignment) sit above SFT on the data and compute ladder; only build them once SFT is solid.
  • NeMo Customizer is the API-first microservice that handles LoRA, SFT, DPO, and GRPO jobs without writing any training code.
  • Watch for catastrophic forgetting on SFT and learning-rate instability on LoRA; both are preventable.
Who This Is For: AI infrastructure architects and platform engineers who have a working NIM or NeMo deployment (see Part 22: NeMo Framework Overview) and need to decide how deep to go on customization. Also relevant if you are evaluating NeMo Customizer against do-it-yourself Hugging Face PEFT scripts. Prerequisites: basic familiarity with transformer fine-tuning concepts (rank, adapter, gradient), and access to at least one H100 or A100 GPU.

Your model returns the wrong tone for customer-facing emails. You push the prompt to 1,200 tokens of few-shot examples. It still drifts. Someone says “we should fine-tune it.” That is usually the right call, but “fine-tune” covers four very different operations with 10× variance in compute cost and data requirements between the cheapest and the most expensive. Picking the wrong one wastes weeks of GPU time and produces a model that is worse than the one you started with.

This post maps the full customization spectrum on the NVIDIA NeMo stack: what each method actually changes in the model, what data it needs, what the NeMo Customizer microservice exposes, and where things go wrong. The VCF-specific deployment path (how you run Customizer inside VMware Private AI Foundation) is covered in the Private AI fine-tuning post; this post stays on the NVIDIA-stack side.

The Customization Spectrum: Prompt to RLHF

There is a clean gradient from “zero training cost” to “full model retraining,” and each rung costs more in data, compute, and operational complexity than the one below it. The diagram below shows the five rungs and where NeMo Customizer sits.

[Figure 1 diagram: five rungs, left to right, with data, compute, and operational cost increasing from low to high: Prompting (zero training cost); RAG (vector DB only); LoRA / PEFT (~100s of examples, 1 GPU); Full SFT (1000s+ examples, 8 GPUs); DPO / GRPO (preference pairs on top of an SFT model).]
Figure 1: Customization spectrum on the NVIDIA NeMo stack. Each rung adds training cost and operational complexity.

Start with Prompting and RAG

Before any training job runs, ask whether the model actually lacks the knowledge or just lacks the right framing. Most “wrong answer” problems fall into three buckets: missing context (RAG fixes it), wrong format (a detailed system prompt fixes it), or the model genuinely does not know the domain-specific pattern (fine-tuning fixes it). Spending a week on LoRA for a problem that a good system prompt solves in an afternoon is a real failure mode.

RAG via NeMo Retriever is covered in Part 27. The decision to move past prompting and RAG into LoRA should be driven by measurable eval degradation, not by gut feel or stakeholder pressure.

LoRA and PEFT: The Practical Starting Point

What LoRA Actually Does

Low-Rank Adaptation (LoRA) does not modify the base model weights. Instead it freezes every pre-trained weight and injects two small matrices, A and B, alongside selected linear layers. Their product has rank at most r (the “adapter dimension”), which is typically 8, 16, or 32, versus the thousands of dimensions in a standard attention projection. In both training and inference, the adapter output is scaled by alpha/r before being added to the frozen weight output. The result: you train less than 1% of the total parameters and the base model remains intact.

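[Figure 2 diagram: input x feeds two parallel paths. The frozen weight W (d x d parameters, no gradient) produces Wx; the adapter path passes x through A (d x r) and then B (r x d). Output: Wx + (BA)x, with the adapter term scaled by alpha / r.]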
Figure 2: LoRA injects low-rank matrices A (d×r) and B (r×d) alongside the frozen base weight W. Only A and B receive gradients. The update is scaled by alpha/r before summation.
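To make the mechanics concrete, here is a minimal pure-Python sketch of one LoRA-augmented linear layer. It is an illustration, not NeMo code: the function names and toy dimensions are invented for this example, and it assumes a column-vector convention in which `down` (the first matrix on the adapter path, A in the figure) maps d to r and `up` (the second, B) maps r back to d.

```python
# Illustrative LoRA forward pass on plain Python lists (not NeMo code).
# Assumption: x is a column vector, `down` has r rows of length d, `up` has d rows of length r.

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, down, up, x, alpha):
    """Return Wx + (alpha / r) * up(down(x)). W is frozen; only down/up would be trained."""
    r = len(down)                            # adapter rank
    base = matvec(w, x)                      # frozen path
    low_rank = matvec(up, matvec(down, x))   # d -> r -> d
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

def trainable_fraction(d, r):
    """Adapter parameters (2*d*r) relative to the d x d projection they sit beside."""
    return (2 * d * r) / (d * d)

# Toy example with d = 2, r = 1
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight
down = [[1.0, 1.0]]            # 1 x 2
up = [[0.5], [0.0]]            # 2 x 1
print(lora_forward(w, down, up, [2.0, 3.0], alpha=1))   # [4.5, 3.0]
print(f"{trainable_fraction(4096, 8):.2%}")             # 0.39%
```

For a 4096-wide projection at rank 8, the adapter adds about 0.39% of that layer's parameter count, which is where the "less than 1%" figure comes from.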

Key Hyperparameters

Two hyperparameters govern LoRA behavior in NeMo Customizer: adapter_dim (the rank r) and alpha (the scaling factor). NVIDIA docs show that setting alpha equal to adapter_dim keeps effective update magnitude stable across rank values, because the alpha/r scale then stays at 1 regardless of rank; this is a reasonable default for most runs. adapter_dropout adds regularization; 0.1 is the documented default. Start with rank 8 or 16 before experimenting with 32 or 64: higher ranks narrow the compute gap with SFT without proportional quality gains in most domain tasks.
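The effect of tying alpha to adapter_dim can be seen with a few lines of arithmetic. This sketch only computes the alpha/r scale described above; `lora_scale` is a hypothetical helper, not part of any NeMo API.

```python
def lora_scale(alpha, adapter_dim):
    """Scale applied to the adapter output before it is added to the frozen output."""
    return alpha / adapter_dim

for rank in (8, 16, 32, 64):
    tied = lora_scale(alpha=rank, adapter_dim=rank)   # alpha follows the rank
    fixed = lora_scale(alpha=16, adapter_dim=rank)    # alpha held at 16
    print(rank, tied, fixed)
# 8 1.0 2.0
# 16 1.0 1.0
# 32 1.0 0.5
# 64 1.0 0.25
```

With alpha held fixed, changing the rank silently changes the update scale, so a rank sweep also becomes an implicit learning-rate sweep. Tying the two removes that confound.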

Full Supervised Fine-Tuning (SFT)

Full SFT updates every weight in the model on your labeled dataset. It gives the most fine-tuning flexibility, but the resource delta over LoRA is not marginal: the NeMo Customizer docs show that a LoRA job on a Llama 3.1 8B can run on a single GPU, while the full SFT configuration for the same model targets 8 GPUs with tensor parallelism 8. You are also carrying gradients and the full optimizer state for every parameter (Adam stores two moment estimates per parameter), so before activations are counted the training footprint is at least four times the weights alone, and typically more under mixed precision, where the optimizer keeps FP32 copies.
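A back-of-the-envelope estimate shows why full SFT on an 8B model does not fit on one 80 GB GPU. The byte counts below are an assumption (a common mixed-precision Adam layout), not figures from the NeMo docs, and the estimate ignores activations, buffers, and framework overhead.

```python
GB = 1e9

def inference_weight_bytes(params, bytes_per_weight=2):
    """Weights only, assuming BF16/FP16 inference."""
    return params * bytes_per_weight

def full_sft_state_bytes(params):
    """Assumed per-parameter layout for mixed-precision Adam:
    2 (BF16 weight) + 2 (BF16 gradient) + 4 (FP32 master weight)
    + 4 (FP32 first moment) + 4 (FP32 second moment) = 16 bytes."""
    return params * (2 + 2 + 4 + 4 + 4)

params = 8_000_000_000  # 8B-class model
print(f"{inference_weight_bytes(params) / GB:.0f} GB")  # 16 GB
print(f"{full_sft_state_bytes(params) / GB:.0f} GB")    # 128 GB
```

Under these assumptions the weight, gradient, and optimizer state alone come to about 128 GB, which has to be sharded across GPUs before a single activation is stored. A LoRA run keeps the same frozen weights but holds gradients and optimizer state only for the adapter.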

The operational difference is also significant: a LoRA adapter can be loaded into a running NIM instance at inference time via dynamic multi-LoRA, so you can serve dozens of task-specific adapters from one base NIM without redeployment. A full SFT model requires its own dedicated NIM deployment. If you are running on VCF with shared GPU capacity, that is a real constraint; see the Private AI fine-tuning post for the capacity-planning side of that decision.

Gotcha: Catastrophic forgetting is the primary failure mode in full SFT. If your fine-tuning dataset is small and narrow (under 2,000 examples in a single domain), the model will lose performance on general tasks as it over-specializes. To mitigate this, add a replay set: a small fraction of general instruction-following pairs mixed into your domain data. LoRA largely avoids forgetting because the base weights never change, but an excessively high learning rate on the adapter (above 5e-4) can still destabilize training.
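Building a replay mix is a few lines of data preparation. The sketch below is one way to do it, assuming both datasets are already loaded as lists of examples; the function name and the 10% default are illustrative choices, not a documented recommendation.

```python
import random

def mix_with_replay(domain_examples, general_examples, replay_fraction=0.1, seed=0):
    """Return domain data plus a sampled replay set, shuffled together.

    replay_fraction is the share of the FINAL dataset drawn from general examples.
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)  # seeded so the mix is reproducible across runs
    n_replay = round(len(domain_examples) * replay_fraction / (1 - replay_fraction))
    n_replay = min(n_replay, len(general_examples))
    mixed = list(domain_examples) + rng.sample(list(general_examples), n_replay)
    rng.shuffle(mixed)
    return mixed

domain = [{"source": "domain", "id": i} for i in range(900)]
general = [{"source": "general", "id": i} for i in range(5000)]
mixed = mix_with_replay(domain, general)
print(len(mixed), sum(ex["source"] == "general" for ex in mixed))  # 1000 100
```

Treat the fraction as a tunable: raise it if the general eval drops after training, lower it if the task metric stalls.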

Preference Alignment: DPO and GRPO

DPO (Direct Preference Optimization) and GRPO (Group Relative Policy Optimization) sit above SFT on the data and compute ladder. They are not replacements for SFT; they refine a model that already knows the task through SFT. DPO requires preference pairs — for each prompt, a chosen and a rejected response — and trains the model to prefer the chosen output using a log-ratio objective directly on the policy, without a separate reward model. GRPO uses a group of sampled outputs per prompt, scores them with a reward signal, and applies a clipped policy gradient, which makes it better suited to reasoning tasks where you can use a verifiable reward (correct math answer, valid code) rather than human preference labels.
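The two objectives differ mainly in what signal they consume. The sketch below shows the per-pair DPO loss and the group-relative advantage used by GRPO, in plain Python on scalar inputs. It is a conceptual illustration under stated assumptions (summed sequence log-probabilities as inputs, beta = 0.1, population standard deviation for normalization), not NeMo RL code, and it omits GRPO's clipped policy-gradient step.

```python
import math
from statistics import mean, pstdev

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen reference.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Policy identical to reference: no preference learned yet, loss is log(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
# Four samples for one prompt, two of which passed a verifiable check
print([round(a, 2) for a in grpo_advantages([1.0, 0.0, 0.0, 1.0])])  # [1.0, -1.0, -1.0, 1.0]
```

DPO needs a labeled chosen/rejected pair per prompt, while GRPO only needs a scoring function, which is why it suits tasks with programmatic correctness checks.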

NeMo-Aligner covers classical RLHF (with a separate reward model). NeMo RL ships DPO and GRPO with native FSDP2, tensor parallelism, and sequence parallelism for models up to 32B parameters. NeMo Customizer exposes DPO and GRPO via the same job API as LoRA and SFT, so you do not need to run the full framework stack to access them. However, building the preference dataset is where most teams stall: generating and labeling preference pairs from scratch is time-intensive, and a weak or inconsistent labeling scheme produces a model that is harder to evaluate than the SFT baseline.

My Take: Most enterprise teams I work with do not have the labeled preference data to justify DPO, even when they insist they need RLHF. Start with LoRA and a strong eval set. If eval scores plateau and you can source or generate 5,000+ preference pairs, DPO is worth attempting. GRPO is specifically worth exploring when the correctness signal is programmatic, such as SQL validation or JSON schema conformance.
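A programmatic correctness signal can be as small as the function below, which scores a model output for JSON conformance. It is a hypothetical reward function written for illustration: the name, the binary 0/1 scoring, and the required-keys check are assumptions, and how a reward is wired into a GRPO job depends on the framework.

```python
import json

def json_conformance_reward(output_text, required_keys):
    """Return 1.0 if the output is a JSON object containing every required key, else 0.0."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(key in parsed for key in required_keys) else 0.0

print(json_conformance_reward('{"intent": "refund", "priority": 2}', ["intent", "priority"]))  # 1.0
print(json_conformance_reward('{"intent": "refund"}', ["intent", "priority"]))                 # 0.0
print(json_conformance_reward('Sure! Here is the JSON you asked for', ["intent"]))             # 0.0
```

Because the check is deterministic, there is no labeling inconsistency to debug, which removes the main failure mode of human preference data.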

Method Comparison: Data, Compute, and When to Use

Table 1 summarizes the practical decision factors. GPU counts are approximate for a 7B-8B class model on H100 80GB.

Method | Params Changed | Min Data | GPU Count (8B) | NIM Redeployment | Best When
Prompting | None | 0 | 0 | No | Format / tone / context gap
RAG | None | Docs only | 0 (+ retriever) | No | Factual knowledge retrieval
LoRA / PEFT | <1% (adapters) | ~100-500 | 1 | No (hot-swap) | Style / task shift, limited data
Full SFT | 100% | 2,000+ | 8 | Yes (new NIM) | Deep domain shift, large data
DPO / GRPO | 100% (on SFT base) | 5,000+ pref pairs | 8+ | Yes | Alignment / safety / reasoning

NeMo Customizer: The Microservice Layer

NeMo Customizer is the fine-tuning microservice within the NeMo platform. It exposes a REST API for job lifecycle management: create a job, poll status, retrieve the output model. It handles LoRA, full SFT, DPO, and GRPO without requiring you to write NeMo Python config files. The microservice deploys on Kubernetes via Helm (part of the NeMo platform Helm chart), integrates with NeMo Data Store for training datasets, NeMo Entity Store for namespaces and projects, and routes trained adapter or model artifacts to a store that a downstream NIM instance can consume.

The Customizer job config references a versioned model URN from the available configs endpoint. You must include the version string (e.g., meta/llama-3.2-1b-instruct@v1.0.0+A100); omitting it causes a clear error. Each config also declares the supported training options: which finetuning types (lora, all_weights), GPU count, tensor parallel size, and sequence length. This is the guard rail that prevents you from submitting a LoRA job on a model config that only supports full SFT, or requesting too few GPUs for tensor parallelism.
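A client-side check can catch the missing-version mistake before a job is submitted. The sketch below infers the URN shape (namespace/name@version) from the single example above; that grammar is an assumption, and the authoritative list of valid URNs is whatever the configs endpoint returns.

```python
import re

# Assumed shape: <namespace>/<name>@<version>, inferred from the example URN only.
_CONFIG_URN = re.compile(r"^(?P<namespace>[^/@\s]+)/(?P<name>[^/@\s]+)@(?P<version>[^@\s]+)$")

def parse_config_urn(urn):
    """Split a config URN into its parts, or raise ValueError if the version is missing."""
    match = _CONFIG_URN.match(urn)
    if match is None:
        raise ValueError(f"config URN must look like namespace/name@version, got: {urn!r}")
    return match.groupdict()

print(parse_config_urn("meta/llama-3.2-1b-instruct@v1.0.0+A100"))
# {'namespace': 'meta', 'name': 'llama-3.2-1b-instruct', 'version': 'v1.0.0+A100'}
```

Running this in a CI step or submission wrapper turns a rejected API request into an immediate local error.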

[Figure 3 diagram: Data Store (NDJSON datasets, HF-compatible API) and Entity Store (namespaces, projects, model registry) feed the Customizer (LoRA / SFT / DPO / GRPO, REST job API, GPU scheduling via K8s, MLflow metrics), which passes the adapter artifact to NIM inference (base model + LoRA adapter via dynamic multi-LoRA, or a new NIM for SFT).]
Figure 3: NeMo Customizer microservice consumes training data from Data Store and model metadata from Entity Store. LoRA adapters are injected into an existing NIM; full SFT models require a new NIM deployment.

The Real Artifact: A LoRA Job via NeMo Customizer API

Below is a NeMo Customizer API call that starts a LoRA fine-tuning job, taken from the documented tutorial. The comments around it cover the expected job states and the most common failure modes. The output artifact name follows the pattern namespace/base-model-dataset-lora@job-id.

# Submit a LoRA customization job via NeMo Customizer REST API
# Version string in config URN is REQUIRED -- omitting it causes:
#   {"detail": "Version is not specified in the config URN"}

curl -X POST "${CUSTOMIZER_SERVICE_URL}/v1/customization/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
    "dataset": {"name": "email-assist-dataset"},
    "hyperparameters": {
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 16,
      "learning_rate": 0.0001,
      "lora": {
        "adapter_dim": 8,
        "adapter_dropout": 0.1
      }
    }
  }'

# Expected: job status "created", then "running", then "completed"
# Output model: "default/llama-3.2-1b-instruct-email-assist-lora@cust-XXXXX"
# Monitor: GET ${CUSTOMIZER_SERVICE_URL}/v1/customization/jobs/${JOB_ID}/status
# Key metrics in response: train_loss, val_loss per epoch

# --- COMMON FAILURE MODES ---
# 1. learning_rate too high (e.g. 1e-3 or above):
#    val_loss diverges after epoch 2-3; train_loss continues falling.
#    Fix: reduce to 1e-4 or 5e-5 and restart.
# 2. adapter_dim too high relative to dataset size:
#    Adapter overfits narrow domain; general task eval degrades.
#    Fix: reduce adapter_dim from 32 to 8 or 16.
# 3. Missing version in config URN:
#    Job returns 400 immediately. Inspect /v1/customization/configs for valid URNs.
# 4. Training files in wrong Data Store path:
#    Customizer expects "training/" folder; files in root cause empty dataset error.

Source: NeMo Microservices 25.6 — LoRA Customization Job Tutorial
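Polling the status endpoint is the other half of the job lifecycle. The sketch below wraps the GET call shown in the comments above in a simple loop using only the standard library. The response field name `status` and the terminal states `failed` and `cancelled` are assumptions made for this example; only `created`, `running`, and `completed` appear in the tutorial output quoted above.

```python
import json
import time
import urllib.request

# "completed" comes from the tutorial; "failed" and "cancelled" are assumed terminal states.
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def http_get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def wait_for_job(base_url, job_id, fetch=http_get_json,
                 interval_s=30.0, max_polls=240, sleep=time.sleep):
    """Poll the job status endpoint until a terminal state is reached.

    `fetch` and `sleep` are injectable so the loop can be tested without a live service.
    """
    url = f"{base_url}/v1/customization/jobs/{job_id}/status"
    for _ in range(max_polls):
        body = fetch(url)
        if body.get("status") in TERMINAL_STATES:  # assumed field name
            return body
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

A typical call would be `wait_for_job(os.environ["CUSTOMIZER_SERVICE_URL"], job_id)` after importing `os`; check the returned status before promoting the adapter, since a terminal state is not necessarily a successful one.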

In Practice

On a real customer engagement I ran LoRA on Llama 3.1 8B for a legal document classification task: 400 labeled examples, rank 8, learning rate 1e-4, 10 epochs. Training completed in under 30 minutes on a single H100. The adapter brought task-specific F1 from 0.71 (base model with prompting) to 0.89. We tested the adapter against a general-purpose QA benchmark after training and saw no measurable degradation, consistent with the base weights staying frozen. The same experiment with full SFT required 8 GPUs and 2.5 hours, and produced a marginally higher F1 of 0.91 at 5x the wall-clock time and roughly 40x the GPU-hours (20 versus under 0.5). LoRA won that decision.

Decision Framework: Which Method to Pick

The figure below gives a decision tree. Work from left to right. Most teams should stop at LoRA.

[Figure 4 diagram: decision tree. Start: is the model wrong? Missing facts / context? Yes: use RAG or prompt tuning. No: style / task shift with 100-2000 examples? Yes: LoRA / PEFT. No (more than 2k examples): Full SFT. Alignment / safety / reasoning needed? Yes: DPO / GRPO (on SFT base). Footer: always run an eval benchmark before and after each method; no training run is justified without a measurable baseline.]
Figure 4: Decision tree for NeMo customization method selection. Work left to right; stop as soon as evals pass your threshold.
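The tree can also be written down as a small function, which is handy for documenting a team's policy. This is one reading of the figure, not an official rule set: the function name is hypothetical, the alignment check is placed ahead of the data-volume check as a simplification, and the under-100-examples branch is an assumption added because the figure does not cover it.

```python
def pick_method(missing_context, n_examples, needs_alignment=False):
    """Suggest a customization method from the decision-tree inputs."""
    if missing_context:
        return "RAG / prompt tuning"
    if needs_alignment:
        return "DPO / GRPO (on SFT base)"
    if n_examples > 2000:
        return "Full SFT"
    if n_examples >= 100:
        return "LoRA / PEFT"
    # Assumption: below ~100 examples there is too little data to train on
    return "Prompt tuning (collect more data first)"

print(pick_method(missing_context=True, n_examples=0))                            # RAG / prompt tuning
print(pick_method(missing_context=False, n_examples=400))                         # LoRA / PEFT
print(pick_method(missing_context=False, n_examples=6000))                        # Full SFT
print(pick_method(missing_context=False, n_examples=6000, needs_alignment=True))  # DPO / GRPO (on SFT base)
```

The function returns a starting point only; the eval gate in the figure's footer still decides whether any training run goes ahead.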

Data Cost and Compute Cost: The Ladder

Table 2 shows the realistic data quality bar for each method. Volume is necessary but not sufficient: a LoRA run on 500 noisy examples will produce a worse model than 150 carefully cleaned ones.

Method | Data Format | Quality Bar | Typical Wall-Clock Time (8B) | Main Risk
LoRA | NDJSON prompt/completion | High quality, low noise | 0.5-2 hrs (1 GPU) | LR too high, adapter overfit
Full SFT | NDJSON prompt/completion | Diverse, deduped, reviewed | 4-24 hrs (8 GPUs) | Catastrophic forgetting
DPO / GRPO | Prompt + chosen + rejected | Consistent labeling critical | 8-40 hrs (8+ GPUs) | Label noise collapses reward
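Much of the quality bar can be enforced mechanically before upload. The sketch below validates NDJSON lines against the prompt/completion format in the table; the key names follow the table, while the function itself and the rule that values must be non-empty strings are assumptions for illustration.

```python
import json

def validate_ndjson(lines, required=("prompt", "completion")):
    """Return a list of (line_number, problem) tuples; an empty list means the data passed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in required:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems

sample = [
    '{"prompt": "Classify: NDA renewal", "completion": "contract"}',
    '{"prompt": "Classify: invoice dispute"}',
    'not json at all',
]
print(validate_ndjson(sample))
# [(2, "missing or empty 'completion'"), (3, 'invalid JSON')]
```

For preference data, the same function can be called with `required=("prompt", "chosen", "rejected")`, assuming those are the key names your dataset uses.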

What to Validate Before and After Every Run

No fine-tuning run should be considered done until you can answer all three of these:

  • Task eval: Did the target metric (F1, ROUGE, pass@1, human pref rate) improve over the baseline, with the same prompt? A run that improves task accuracy while degrading other behaviors is not a success.
  • General eval: Run a subset of a standard benchmark (MMLU, ARC, or a NeMo Evaluator custom job) against the fine-tuned model. If general scores drop more than ~2 points, investigate the training data or reduce epochs. This is the forgetting check.
  • Loss curves: Training loss and validation loss should converge. If train_loss drops but val_loss stops improving or rises after epoch 3-4, stop early. NeMo Customizer surfaces both metrics in the job status API and enables automatic early stopping when val_loss has not improved for 10 epochs.
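The last two checks are easy to automate once the per-epoch losses and eval scores are in hand. The helpers below are a hypothetical sketch: the names, the patience of 2 epochs, and the 2-point drop threshold (taken from the general-eval bullet) are choices made for this example, and they run on numbers you have already pulled from the job status or your eval harness.

```python
def should_stop_early(val_losses, patience=2, min_delta=0.0):
    """True if val_loss failed to beat its best value for `patience` consecutive epochs."""
    best = float("inf")
    stale = 0
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return True
    return False

def forgetting_flag(base_score, tuned_score, max_drop=2.0):
    """True if the general benchmark score dropped by more than max_drop points."""
    return (base_score - tuned_score) > max_drop

print(should_stop_early([1.00, 0.80, 0.70, 0.72, 0.75]))  # True
print(should_stop_early([1.00, 0.90, 0.80]))              # False
print(forgetting_flag(base_score=65.0, tuned_score=62.5)) # True
```

Wiring both into the same script that submits the job makes the "done" criteria explicit rather than a judgment call after the fact.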

The Verdict

Here is the position: most teams should exhaust prompting and RAG before touching a training job, and should try LoRA before ever submitting a full SFT run. This is not a cost-cutting stance; it is a quality stance. Prompting is faster to iterate, RAG stays current without retraining, and LoRA produces results within hours on a single GPU with a dataset that takes days to collect. The marginal gain from full SFT over a well-tuned LoRA adapter is often less than 2-3 absolute points on task eval, while the operational cost is 5-10x higher.

When SFT is justified: your task requires knowledge of patterns the base model genuinely cannot infer from prompt context, you have more than 5,000 high-quality labeled examples, and you have capacity for a dedicated NIM deployment. When DPO or GRPO is justified: your SFT model scores well on task accuracy but has alignment problems (safety failures, inconsistent tone, or incorrect reasoning paths), and you can produce or source reliable preference data.

What to validate before committing to any fine-tuning: run your eval suite on the base model with your best prompt, set a target score, and define what a 5% improvement is worth in GPU budget. That target score and budget become the gate on the fine-tuning decision. Without those numbers, every option looks equally reasonable, and that is how teams end up on week three of an SFT run that a prompt change would have fixed on day one.

Next up in this series: Part 24: NeMo Curator — Data Preparation at Scale. Data quality is the real constraint on everything above, and Curator is where NVIDIA has invested heavily to close that gap at production scale. If you are running on VMware Private AI Foundation and want the VCF-specific deployment view of Customizer, the Private AI fine-tuning post covers that path in detail.

Have a specific LoRA vs SFT tradeoff you have hit in production? Leave a comment or reach out — the devil is usually in the dataset quality or the learning rate schedule.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
