TL;DR · Key Takeaways
- The built-in path (Model Store plus Model Runtime in Private AI Services 2.1) wins on time to value, governance, and staying inside the VCF support boundary.
- The DIY path (MLflow registry, KServe serving, vLLM on VKS) wins on experiment tracking, custom model formats, and ecosystem freedom, but you own every upgrade and interop break.
- Most teams should start built-in and graft MLflow on only for the training and experiment-tracking side the platform genuinely does not cover.
- The lifecycle loop is identical either way: register, validate, stage, promote, serve, monitor, retire. The tooling only moves the work around.
Two architects can deploy the same 8B model on VMware Private AI and end up with completely different operational lives. One opens Private AI Services, registers the model in the Model Store, and serves it from the Model Runtime before lunch. The other stands up MLflow, wires it to a KServe InferenceService on a VKS cluster, and spends three days on storage initializers, S3 credentials, and namespace RBAC. Both serve tokens. The real question is which lifecycle you want to own for the next two years, and the honest answer depends on how mature your ML practice already is, not on which stack has the longer feature list.
The two paths, at a glance
MLOps on Private AI is really a choice about where the model registry, the promotion gate, and the serving runtime live. Path A keeps all three inside the VMware-managed control plane that ships with Private AI Services. In Path B you assemble them yourself from the open-source stack the wider Kubernetes world standardised on in 2026. The next two sections cover each side of the split before we get into the trade-offs.
What the built-in lifecycle actually gives you
Private AI Services, introduced with Private AI Foundation 9.0 and matured in the 2.1 release, splits the lifecycle into two named services. The Model Store is the secure repository where models are curated, tested, and governed before anything reaches production, so only validated and policy-compliant models get an endpoint. The Model Runtime is the GenAI execution layer that handles active inference. The promotion gate between them is the part you would otherwise build by hand.
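To make "build it by hand" concrete, here is a minimal sketch of what a hand-rolled promotion gate tends to look like. Everything in it is an assumption for illustration: the `Candidate` fields, the three checks, and the 0.80 threshold are hypothetical, and this is not how the Model Store implements its policy checks.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A model version waiting at the gate (all fields are illustrative)."""
    name: str
    version: str
    eval_score: float        # e.g. accuracy on a held-out evaluation set
    license_approved: bool   # outcome of a legal or policy review
    scan_clean: bool         # outcome of an artifact security scan


@dataclass
class GateResult:
    promoted: bool
    reasons: list = field(default_factory=list)


def promotion_gate(candidate, min_eval_score=0.80):
    """Pass the candidate only if every check passes; collect all failures."""
    reasons = []
    if candidate.eval_score < min_eval_score:
        reasons.append(
            f"eval score {candidate.eval_score:.2f} below {min_eval_score:.2f}"
        )
    if not candidate.license_approved:
        reasons.append("license not approved")
    if not candidate.scan_clean:
        reasons.append("artifact scan reported findings")
    return GateResult(promoted=not reasons, reasons=reasons)


if __name__ == "__main__":
    result = promotion_gate(Candidate("llama-3-1-8b", "v3", 0.91, True, True))
    print(result.promoted, result.reasons)
```

The gate collects every failed check instead of stopping at the first one, so a rejected model comes back with the full list of things to fix. On the DIY path, a function like this usually ends up living in CI.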
The quiet advantage is the observability framework. It reports health and performance across the whole AI footprint, from the inference engine and GPU utilisation down to knowledge base indexing and agent calls, without you deploying a separate metrics stack. Pair that with the 2.1 artifact mirroring tool (AMT) and the same lifecycle runs inside an air-gapped environment, which is the scenario where a DIY stack hurts most. The cost is honesty about the ceiling: you serve what the runtime supports, in the formats it supports, and you do not get a first-class experiment-tracking server or a training-run history out of the box.
What DIY MLflow plus KServe buys you, and what it costs
The DIY path is the consensus 2026 Kubernetes inference stack dropped onto a VKS workload cluster: MLflow for the model registry and experiment tracking, KServe for serving (it wraps vLLM, Triton, and others behind one InferenceService CRD), vLLM as the actual LLM engine, Kueue for GPU and batch scheduling, and the NVIDIA GPU Operator underneath. KServe’s storage initializer pulls the model artifact from S3 and hands it to the predictor. It is powerful and genuinely flexible.
The cost is interop. The single most common way this path breaks is a version skew between the GPU Operator, the vGPU guest driver, and the CUDA build inside the KServe runtime image, which surfaces as a pod stuck in CrashLoopBackOff with a CUDA driver mismatch in the logs. Nothing in the open-source stack validates that matrix for you. The built-in runtime does, because Broadcom and NVIDIA ship and test it together. Here is the same model in both worlds.
```bash
# DIY path: a KServe InferenceService backed by vLLM on a VKS cluster
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b
  namespace: genai-prod
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      storageUri: s3://models/llama-3-1-8b/v3
      resources:
        limits:
          nvidia.com/gpu: "1"
EOF
# You now own the registry, the S3 bucket, the runtime image, and the GPU matrix.

# Built-in path: the model is registered once in the Model Store,
# then served from the Model Runtime. No manifest to hand-maintain,
# no separate registry, and the GPU stack is pre-validated.
```
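Since nothing in the DIY stack validates the driver matrix, a pre-flight check in CI is one way to catch the skew before a pod lands in CrashLoopBackOff. The sketch below covers only one edge of that matrix, the guest driver against the CUDA build in the runtime image. The function names are hypothetical and the minimum-driver table holds illustrative values; take the real numbers from NVIDIA's CUDA compatibility documentation for your release.

```python
def parse_version(text):
    """Turn '550.54.15' into (550, 54, 15) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical compatibility table: minimum Linux driver per CUDA major version.
# Illustrative values only; confirm against NVIDIA's published matrix.
MIN_DRIVER_FOR_CUDA_MAJOR = {
    11: "450.80.02",
    12: "525.60.13",
}


def check_skew(guest_driver, image_cuda, table=MIN_DRIVER_FOR_CUDA_MAJOR):
    """Return (ok, message) for a guest driver and a runtime image CUDA version."""
    cuda_major = parse_version(image_cuda)[0]
    if cuda_major not in table:
        return False, f"no entry for CUDA {cuda_major}; validate manually"
    required = table[cuda_major]
    if parse_version(guest_driver) < parse_version(required):
        return False, (
            f"driver {guest_driver} is older than {required} "
            f"required by CUDA {image_cuda}"
        )
    return True, "driver satisfies the runtime image"


if __name__ == "__main__":
    print(check_skew("550.54.15", "12.4"))
    print(check_skew("470.82.01", "12.1"))
```

An unknown CUDA major version fails closed on purpose: a missing row in your own matrix is a reason to stop and test, not a reason to assume compatibility.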
Built-in vs DIY: the comparison that matters
| Dimension | Built-in (Model Store plus Runtime) | DIY (MLflow plus KServe) |
|---|---|---|
| Time to first endpoint | Hours, mostly UI driven | Days, manifests and plumbing |
| Governance and promotion gate | Native, policy enforced in the Store | You build it from MLflow stages plus CI |
| Experiment tracking and training history | Limited, serving focused | First class, MLflow core strength |
| Model and format flexibility | What the runtime supports | Anything KServe can wrap |
| GPU and driver interop | Pre-validated by Broadcom and NVIDIA | Your problem, your test matrix |
| Observability | Built-in framework, GPU to agent | Assemble Prometheus, Grafana, exporters |
| Air-gapped support | AMT in 2.1, designed for it | Manual mirroring of every image |
| Support boundary | Inside VCF support | Community plus your own runbooks |
Choose which, when
Go built-in when your goal is to serve curated models to applications and agents with governance and support, which describes the majority of enterprise Private AI deployments. Do not go pure DIY just because the open-source stack is fashionable; that decision only pays off when you have a platform team that already lives in Kubernetes and needs the flexibility. Validate one assumption first either way: confirm your target models and formats are supported by the Model Runtime before you commit, because that single fact decides whether the built-in path even applies.
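The decision rule above is small enough to write down. The sketch below is only a restatement of this section's reasoning as code; the function name, the four boolean inputs, and the returned labels are made up for illustration.

```python
def choose_path(runtime_supports_model, has_platform_team,
                needs_runtime_flexibility, needs_experiment_tracking):
    """Return 'built-in', 'built-in+mlflow', or 'diy' for one deployment."""
    # If the Model Runtime cannot serve the model or format,
    # the built-in path does not apply at all.
    if not runtime_supports_model:
        return "diy"
    # DIY pays off only with a Kubernetes-native platform team
    # AND a flexibility requirement the runtime cannot meet.
    if has_platform_team and needs_runtime_flexibility:
        return "diy"
    # Otherwise stay built-in, adding MLflow only for experiment tracking.
    if needs_experiment_tracking:
        return "built-in+mlflow"
    return "built-in"


if __name__ == "__main__":
    print(choose_path(True, False, False, True))
```

Note the order of the checks: runtime support is tested first because it is the one fact that can rule out the built-in path regardless of everything else.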
The lifecycle loop both stacks must run
Whichever path you pick, the same loop runs underneath: register, validate, stage, promote, serve, monitor, retire. The built-in services automate most of it; the DIY stack makes you wire each hop. The steps teams skip are validation and retirement, and those are exactly the two that cause incidents later: an unvalidated model that hallucinates in production, and a forgotten endpoint that quietly holds a GPU for months.
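One way to stop teams skipping hops is to model the loop as a small state machine that refuses out-of-order transitions. This is a minimal sketch, not either stack's implementation: the `ModelLifecycle` class is hypothetical, and allowing retirement from any stage is an assumption of mine.

```python
from enum import Enum


class Stage(Enum):
    REGISTERED = "register"
    VALIDATED = "validate"
    STAGED = "stage"
    PROMOTED = "promote"
    SERVING = "serve"
    MONITORED = "monitor"
    RETIRED = "retire"


ORDER = list(Stage)  # Enum iteration follows definition order


class ModelLifecycle:
    """Tracks one model through the loop and rejects skipped steps."""

    def __init__(self, name):
        self.name = name
        self.stage = Stage.REGISTERED
        self.history = [Stage.REGISTERED]

    def advance(self, target):
        """Move to `target`; only the next stage, or retirement, is allowed."""
        if self.stage is Stage.RETIRED:
            raise ValueError(f"{self.name} is retired")
        is_next = ORDER.index(target) == ORDER.index(self.stage) + 1
        if not (is_next or target is Stage.RETIRED):
            raise ValueError(
                f"cannot go from {self.stage.value} to {target.value}"
            )
        self.stage = target
        self.history.append(target)


if __name__ == "__main__":
    model = ModelLifecycle("llama-3-1-8b")
    model.advance(Stage.VALIDATED)
    try:
        model.advance(Stage.SERVING)  # skips stage and promote
    except ValueError as err:
        print(err)
```

Jumping from validation straight to serving raises an error, which is the point: the shortcut that causes the incident becomes impossible to take silently.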
In practice, the monitoring hop is where the built-in observability framework earns its keep, because it correlates GPU utilisation, request latency, and knowledge base indexing in one place. On the DIY stack you reach the same view only after wiring a GPU exporter, a KServe metrics scrape, and a dashboard, and most teams discover the gap during an incident rather than before one. For the signals worth watching, see the approach in GPU monitoring with VCF Operations, and if your serving layer leans on NVIDIA microservices, the NIM microservices design covers the runtime side.
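The forgotten endpoint is the easiest of these failures to catch with a script, provided your metrics can tell you when each endpoint last served a request. The sketch below assumes you can export that as a mapping of endpoint name to timestamp; the function name, the input shape, and the 30-day default are all hypothetical.

```python
from datetime import datetime, timedelta


def find_idle_endpoints(last_request_at, now, max_idle=timedelta(days=30)):
    """Return sorted names of endpoints idle for longer than `max_idle`.

    `last_request_at` maps endpoint name -> datetime of the last request,
    or None if the endpoint has never served one.
    """
    idle = []
    for name, seen in last_request_at.items():
        if seen is None or now - seen > max_idle:
            idle.append(name)
    return sorted(idle)


if __name__ == "__main__":
    now = datetime(2026, 1, 31)
    endpoints = {
        "chat-prod": datetime(2026, 1, 30),
        "rag-pilot": datetime(2025, 10, 1),
        "demo-old": None,
    }
    print(find_idle_endpoints(endpoints, now))
```

Treating a never-used endpoint as idle is deliberate, since those are the ones most likely to have been deployed for a demo and left holding a GPU.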
What I’d Do
Start built-in. For the large majority of VCF Private AI customers, the Model Store and Model Runtime give you a governed, supported, observable lifecycle in a fraction of the time, and they keep you inside one support boundary instead of owning a six-component matrix. Reach for MLflow only on the side it genuinely owns: experiment tracking and the training history that the platform does not surface. Go full KServe DIY in production only when you have a real ML platform team and a flexibility requirement the runtime cannot meet, and even then, run it on VKS with eyes open about the interop matrix you just inherited. The flexibility is real, but most teams pay for it and never use it. Which path are you running today, and did the support boundary or the flexibility decide it for you?