Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Fine-Tuning Models on VMware Private AI with NeMo Customizer: LoRA, Full SFT and When to Bother (Private AI Series, Part 27)

RAG is not always the answer. Here is how NeMo Customizer fine-tunes models on VMware Private AI, the difference between LoRA and full SFT, and an honest take on when customization beats retrieval.

VMware Private AI Series · Part 27 of 30

TL;DR · Key Takeaways

  • NeMo Customizer is the fine-tuning microservice in the Private AI stack, deployed as a custom resource by the NIM Operator alongside Data Store, Entity Store and Evaluator.
  • It supports LoRA, full SFT, DPO and GRPO. LoRA is the right default for almost everyone.
  • Reach for full SFT only on small models (1B to 8B) or when you must inject genuinely new knowledge or change fundamental behavior.
  • Most teams that ask for fine-tuning actually need RAG. Fine-tuning teaches style and format; retrieval supplies facts.
  • The customization workflow is a loop: dataset in Data Store, job in Customizer, score in Evaluator, register in Model Store, serve via NIM.

The most expensive mistake I see on Private AI is a team spending six weeks and a rack's worth of H100 hours on a full fine-tune of a 70B model to answer questions about their product catalog, when a RAG pipeline would have done it in an afternoon and stayed current automatically. So, before the how-to, the most important section: when not to do this at all.

Fine-tune or retrieve? Decide this first

Here is the rule that has never failed me. Fine-tuning changes how a model behaves: its tone, its output format, its willingness to follow a niche instruction style, its grasp of a specialized vocabulary. Retrieval changes what a model knows at query time: your documents, your current prices, this week’s policy. If the requirement is facts that change, that is RAG, full stop. If the requirement is a consistent voice, a structured output, or a domain dialect the base model fumbles, that is fine-tuning. Most real projects need a bit of both, and the order matters: get RAG working first, then fine-tune only the behavior gaps that remain.

What does the model actually need?

  • Knowledge that changes (docs, prices, policy)? Use RAG; do not fine-tune.
  • Behavior or format (tone, JSON, dialect)? Use LoRA, the default choice.
  • New knowledge in the weights (small model, deep change)? Use full SFT, on 1B to 8B models only.

The honest order:

  1. Try the base model with a good prompt.
  2. Add RAG for facts.
  3. LoRA for the behavior gaps left.
  4. Full SFT only if 1 to 3 fail.

Skipping straight to step 4 is how budgets disappear with little to show for it.
Work down the ladder. Most requirements are satisfied by step 2 or 3.
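The ladder can also be read as a small decision function. The sketch below is only an illustration of that ordering: `Requirement`, its fields, and `recommend` are hypothetical names, not part of NeMo Customizer or any VMware API, and the fallback to LoRA for larger models is an assumption based on the "1B to 8B only" guidance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirement:
    """Hypothetical description of what a project needs from the model."""
    prompt_alone_suffices: bool = False           # step 1: base model + good prompt
    needs_changing_facts: bool = False            # docs, prices, policy
    needs_behavior_change: bool = False           # tone, JSON output, domain dialect
    needs_new_knowledge_in_weights: bool = False  # deep change baked into the model
    model_size_b: float = 8.0                     # parameter count, in billions


def recommend(req: Requirement) -> List[str]:
    """Walk the ladder in order and return the techniques worth trying."""
    if req.prompt_alone_suffices:
        return ["prompt"]
    steps: List[str] = []
    if req.needs_changing_facts:
        steps.append("rag")
    if req.needs_behavior_change:
        steps.append("lora")
    if req.needs_new_knowledge_in_weights:
        # Assumption: above 8B, fall back to LoRA instead of full SFT.
        steps.append("full_sft" if req.model_size_b <= 8 else "lora")
    # Drop duplicates while keeping ladder order; default to prompting.
    return list(dict.fromkeys(steps)) or ["prompt"]


print(recommend(Requirement(needs_changing_facts=True, needs_behavior_change=True)))
```

A product-catalog question-answering requirement would set only `needs_changing_facts`, and the function would return RAG alone, which is the point of the opening anecdote.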

The four techniques, and which to pick

NeMo Customizer 25.8 supports four post-training methods. You do not need to master all of them. You need to know which one your problem maps to and skip the rest.

| Technique | What it changes | GPU cost | Reach for it when |
|---|---|---|---|
| LoRA | Small adapter, base weights frozen | Low | Almost always, especially 70B+ models |
| Full SFT | Every parameter | Very high | Small models, deep behavior or knowledge change |
| DPO | Preference alignment from chosen/rejected pairs | Medium | You have human preference data and want to align tone |
| GRPO | Reinforcement-style optimization to a reward | High | Reasoning or task-reward tuning, advanced cases |
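To make the cost gap between LoRA and full SFT concrete, here is the standard LoRA parameter count for a single weight matrix: a rank-r adapter on a d_in × d_out matrix trains r × (d_in + d_out) values instead of d_in × d_out. The 4096 × 4096 layer shape is an illustrative assumption, not a statement about which layers Customizer adapts; the rank of 16 mirrors the `adapter_dim` in this article's example job.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable values when the whole weight matrix is updated (full SFT)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values for a LoRA adapter: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)


# Illustrative layer shape (assumption) and the adapter_dim from the example job.
d_in, d_out, rank = 4096, 4096, 16
full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# prints: full: 16,777,216  lora: 131,072  ratio: 0.78%
```

Under these assumptions the adapter trains less than one percent of the values in that matrix, which is why the GPU cost column reads "Low" for LoRA and "Very high" for full SFT.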

The customization workflow on Private AI

On Private AI the NeMo microservices are deployed by the NIM Operator as custom resources, so the whole fine-tuning loop runs inside the same declarative platform as your serving. The pieces fit together as a cycle, not a one-shot job.

The fine-tuning loop: Data Store (curated dataset) → Customizer (LoRA / SFT job) → Evaluator (score vs baseline) → Model Store (register version) → NIM (serve adapter). A failed eval feeds back into the loop: more data, new hyperparameters, retrain.
Evaluator is the gate. A customization job that does not beat the baseline never reaches the Model Store.
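The gate can be expressed as a short control loop. This is a minimal sketch, not a client for the real services: `train`, `evaluate` and `register` are placeholder callables standing in for calls to Customizer, Evaluator and the model registry, and "higher score is better" is an assumption.

```python
from typing import Callable, List, Optional


def run_customization_loop(
    train: Callable[[int], str],
    evaluate: Callable[[str], float],
    register: Callable[[str], None],
    baseline_score: float,
    max_attempts: int = 3,
) -> Optional[str]:
    """Train, score, and register only a candidate that beats the baseline.

    train(attempt) returns a candidate model id, evaluate(model_id) returns a
    held-out score, register(model_id) records the winner. All three are
    hypothetical stand-ins, not real NeMo APIs.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = train(attempt)
        if evaluate(candidate) > baseline_score:
            register(candidate)
            return candidate
        # Failed eval: the next train() call is expected to use more data
        # or different hyperparameters.
    return None


# Usage with made-up scores: the first attempt fails the gate, the second passes.
scores = {"adapter-v1": 0.61, "adapter-v2": 0.74}
registered: List[str] = []
winner = run_customization_loop(
    train=lambda attempt: f"adapter-v{attempt}",
    evaluate=lambda model_id: scores[model_id],
    register=registered.append,
    baseline_score=0.70,
)
print(winner, registered)
```

The design choice worth copying is that `register` is only reachable through the score comparison, so there is no code path that promotes an unevaluated model.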

A LoRA job is a single declarative call against the Customizer API. You point it at a customization target, a dataset in the Data Store, and your hyperparameters. The output is a small adapter, often tens of megabytes, that a NIM serves on top of the frozen base model. That is the operational beauty of LoRA on this platform: you can host one base model and many adapters, swapping behaviors without reloading 140GB of weights (roughly what a 70B model occupies at 16-bit precision).

# Launch a LoRA customization job against the Customizer API
curl -X POST http://nemo-customizer/v1/customization/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {"name": "support-tone-v3"},
    "hyperparameters": {
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 3,
      "lora": {"adapter_dim": 16}
    }
  }'

# Track it, then evaluate before promoting
# (replace {job_id} with the id returned by the POST above)
curl http://nemo-customizer/v1/customization/jobs/{job_id}/status
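
For teams that would rather script this than shell out to curl, the same request can be assembled in Python with only the standard library. This is a sketch under assumptions: the host name, paths and JSON fields are copied from the curl example above, the function names are hypothetical, and the response format is not described in this article, so the helpers return it as parsed JSON without interpreting it.

```python
import json
import urllib.request

CUSTOMIZER_URL = "http://nemo-customizer"  # same host as the curl example


def build_lora_job(config: str, dataset_name: str,
                   epochs: int = 3, adapter_dim: int = 16) -> dict:
    """Return the JSON body from the curl example as a Python dict."""
    return {
        "config": config,
        "dataset": {"name": dataset_name},
        "hyperparameters": {
            "training_type": "sft",
            "finetuning_type": "lora",
            "epochs": epochs,
            "lora": {"adapter_dim": adapter_dim},
        },
    }


def submit_job(body: dict, base_url: str = CUSTOMIZER_URL) -> dict:
    """POST the job and return the parsed JSON response, whatever its shape."""
    request = urllib.request.Request(
        f"{base_url}/v1/customization/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def job_status(job_id: str, base_url: str = CUSTOMIZER_URL) -> dict:
    """GET the status document for a job id returned by submit_job."""
    url = f"{base_url}/v1/customization/jobs/{job_id}/status"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Build and inspect the payload without touching the network.
body = build_lora_job("meta/llama-3.1-8b-instruct", "support-tone-v3")
print(json.dumps(body, indent=2))
```

Keeping the payload builder separate from the network calls makes it easy to review or unit-test a job definition before any GPU time is spent.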

Disclaimer: fine-tuning runs are GPU-heavy and can starve serving workloads on a shared cluster. Run customization jobs in a separate namespace with its own quota, validate dataset quality before training, always gate promotion on an Evaluator score against a held-out set, and keep the base model and adapter versions pinned together so you can roll back.
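
One way to keep base model and adapter versions pinned together is to treat each promotion as a single immutable record. The sketch below is an illustration only: `Release`, `ReleaseHistory` and the version labels are hypothetical and do not correspond to any Model Store or NIM API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Release:
    """A base model and adapter that were evaluated and promoted together."""
    base_model: str
    base_version: str      # hypothetical version label
    adapter: str
    adapter_version: str
    eval_score: float      # recorded Evaluator score on the held-out set


class ReleaseHistory:
    """Append-only promotion log with one-step rollback."""

    def __init__(self) -> None:
        self._releases: List[Release] = []

    def promote(self, release: Release) -> None:
        self._releases.append(release)

    @property
    def current(self) -> Optional[Release]:
        return self._releases[-1] if self._releases else None

    def rollback(self) -> Optional[Release]:
        """Drop the latest release and return the one now in effect."""
        if self._releases:
            self._releases.pop()
        return self.current


history = ReleaseHistory()
v1 = Release("meta/llama-3.1-8b-instruct", "base-1", "support-tone", "v1", 0.72)
v2 = Release("meta/llama-3.1-8b-instruct", "base-1", "support-tone", "v2", 0.78)
history.promote(v1)
history.promote(v2)
previous = history.rollback()  # base and adapter revert as one unit
print(previous)
```

Because the record is frozen, nobody can quietly swap the adapter under an existing release; a change always means a new entry, and a rollback always restores a pair that was actually evaluated together.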

My take

Fine-tuning is the most over-requested and under-needed capability in enterprise AI. NeMo Customizer makes it genuinely accessible on Private AI, which is exactly why you should put guardrails around who gets to launch a full SFT job. My standing advice to clients: make LoRA the only self-service option, keep full SFT behind a review gate, and require an Evaluator score on every promotion. Data quality beats technique every time: a clean thousand-example dataset will out-train a noisy hundred-thousand-example one. And tie this back to lifecycle discipline from the MLOps post, because an untracked fine-tuned model is a liability the day someone asks what data went into it.
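
The launch rule in that advice is simple enough to encode directly. A minimal sketch, assuming a hypothetical `may_launch` check that some admission layer in front of Customizer would call; neither the function nor the enum exists in the product.

```python
from enum import Enum


class JobType(Enum):
    LORA = "lora"
    FULL_SFT = "full_sft"


def may_launch(job_type: JobType, review_approved: bool = False) -> bool:
    """LoRA is self-service; full SFT requires an approved review."""
    if job_type is JobType.LORA:
        return True
    return review_approved


print(may_launch(JobType.LORA))                            # True
print(may_launch(JobType.FULL_SFT))                        # False
print(may_launch(JobType.FULL_SFT, review_approved=True))  # True
```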

What is pushing you toward fine-tuning: behavior or knowledge? If you cannot answer that cleanly, you are probably not ready to start.



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
