VMware Private AI MLOps: Built-In Model Lifecycle vs DIY MLflow and KServe (Private AI Series, Part 22)

Two ways to run model lifecycle on VMware Private AI: the built-in Model Store and Model Runtime, or a DIY MLflow and KServe stack on VKS. Here is when each one wins, and the verdict.

By Dr. Pranay Jha

June 15, 2026

VMware Private AI Series · Part 22 of 24

TL;DR · Key Takeaways

The built-in path (Model Store plus Model Runtime in Private AI Services 2.1) wins on time to value, governance, and staying inside the VCF support boundary.
The DIY path (MLflow registry, KServe serving, vLLM on VKS) wins on experiment tracking, custom model formats, and ecosystem freedom, but you own every upgrade and interop break.
Most teams should start built-in and graft MLflow on only for the training and experiment-tracking side the platform genuinely does not cover.
The lifecycle loop is identical either way: register, validate, stage, promote, serve, monitor, retire. The tooling only moves the work around.

Two architects can deploy the same 8B model on VMware Private AI and end up with completely different operational lives. One opens Private AI Services, registers the model in the Model Store, and serves it from the Model Runtime before lunch. The other stands up MLflow, wires it to a KServe InferenceService on a VKS cluster, and spends three days on storage initializers, S3 credentials, and namespace RBAC. Both serve tokens. The real question is which lifecycle you want to own for the next two years, and the honest answer depends on how mature your ML practice already is, not on which stack has the longer feature list.

The two paths, at a glance

MLOps on Private AI is really a choice about where the model registry, the promotion gate, and the serving runtime live. Path A keeps all three inside the VMware-managed control plane that ships with Private AI Services. Path B has you assemble them yourself from the open-source stack the wider Kubernetes world standardised on in 2026. Here is the split before we get into the trade-offs.

The built-in path hands you a managed lifecycle; the DIY path hands you a toolbox and the bill for assembly.

What the built-in lifecycle actually gives you

Private AI Services, introduced with Private AI Foundation 9.0 and matured in the 2.1 release, splits the lifecycle into two named services. The Model Store is the secure repository where models are curated, tested, and governed before anything reaches production, so only validated and policy-compliant models get an endpoint. The Model Runtime is the GenAI execution layer that handles active inference. The promotion gate between them is the part you would otherwise build by hand.
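One way to picture the promotion gate is as a predicate over a model record: nothing reaches serving until every check passes. The sketch below is illustrative only. `ModelRecord`, its fields, and the three checks are hypothetical names chosen for this example, not the Private AI Services API, and a real gate would carry whatever policies your organisation defines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelRecord:
    """Hypothetical registry entry; field names are illustrative only."""
    name: str
    version: str
    validated: bool = False          # passed evaluation and safety tests
    licence_approved: bool = False   # cleared by the governance policy
    scan_findings: List[str] = field(default_factory=list)  # open security findings


def gate_failures(model: ModelRecord) -> List[str]:
    """Return the reasons a model may not be promoted; empty means it passes."""
    failures = []
    if not model.validated:
        failures.append("not validated")
    if not model.licence_approved:
        failures.append("licence not approved")
    if model.scan_findings:
        failures.append(f"{len(model.scan_findings)} open scan finding(s)")
    return failures


def promote(model: ModelRecord) -> str:
    """Hand the model to serving only when the gate reports no failures."""
    failures = gate_failures(model)
    if failures:
        raise PermissionError(
            f"{model.name}:{model.version} blocked: {', '.join(failures)}"
        )
    return f"{model.name}:{model.version} promoted"
```

On the DIY path, this is roughly the logic you would end up encoding in CI around the registry. On the built-in path, the equivalent enforcement is what the Model Store provides for you.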

The quiet advantage is the observability framework. It reports health and performance across the whole AI footprint, from the inference engine and GPU utilisation down to knowledge base indexing and agent calls, without you deploying a separate metrics stack. Pair that with the 2.1 artifact mirroring tool (AMT) and the same lifecycle runs inside an air-gapped environment, which is the scenario where a DIY stack hurts most. The cost is a hard ceiling: you serve what the runtime supports, in the formats it supports, and you do not get a first-class experiment-tracking server or a training-run history out of the box.

What DIY MLflow plus KServe buys you, and what it costs

The DIY path is the consensus 2026 Kubernetes inference stack dropped onto a VKS workload cluster: MLflow for the model registry and experiment tracking, KServe for serving (it wraps vLLM, Triton, and others behind one InferenceService CRD), vLLM as the actual LLM engine, Kueue for GPU and batch scheduling, and the NVIDIA GPU Operator underneath. KServe’s storage initializer pulls the model artifact from S3 and hands it to the predictor. It is powerful and genuinely flexible.

Flexibility is real, but so is the integration surface you sign up to maintain.

The cost is interop. The single most common way this path breaks is a version skew between the GPU Operator, the vGPU guest driver, and the CUDA build inside the KServe runtime image, which surfaces as a pod stuck in CrashLoopBackOff with a CUDA driver mismatch in the logs. Nothing in the open-source stack validates that matrix for you. The built-in runtime does, because Broadcom and NVIDIA ship and test it together. Here is the same model in both worlds.
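If you do run the DIY path, a pre-deployment check on the driver and CUDA pairing can catch this skew before a pod ever crashes. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the minimum-driver table is illustrative and must be confirmed against NVIDIA's CUDA compatibility documentation, and it covers only the guest driver versus the image's CUDA build, not the GPU Operator version.

```python
from typing import Optional, Tuple

# Minimum Linux driver per CUDA major version. Illustrative values only:
# confirm them against NVIDIA's CUDA compatibility documentation.
MIN_DRIVER_FOR_CUDA = {
    11: (450, 80, 2),
    12: (525, 60, 13),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '535.154.05' into (535, 154, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_skew(guest_driver: str, image_cuda: str) -> Optional[str]:
    """Describe the mismatch, or return None when the pair looks compatible."""
    cuda_major = parse_version(image_cuda)[0]
    minimum = MIN_DRIVER_FOR_CUDA.get(cuda_major)
    if minimum is None:
        # An unknown CUDA major is reported rather than silently passed.
        return f"CUDA {image_cuda}: no entry in the matrix, treat as untested"
    if parse_version(guest_driver) < minimum:
        needed = ".".join(str(n) for n in minimum)
        return (
            f"driver {guest_driver} is older than {needed} "
            f"required by CUDA {image_cuda}"
        )
    return None
```

Running a check like this in CI against each runtime image tag is one way to turn "your problem, your test matrix" into something that fails a pipeline instead of a production pod.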

# DIY path: a KServe InferenceService backed by vLLM on a VKS cluster
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b
  namespace: genai-prod
spec:
  predictor:
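    model:
      # Assumes a ServingRuntime in the cluster advertises the "vllm" format;
      # stock KServe typically exposes vLLM through the "huggingface" format.
      modelFormat:
        name: vllm
      # S3 pulls need credentials, usually via spec.predictor.serviceAccountName.
      storageUri: s3://models/llama-3-1-8b/v3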
      resources:
        limits:
          nvidia.com/gpu: "1"
EOF
# You now own the registry, the S3 bucket, the runtime image, and the GPU matrix.

# Built-in path: the model is registered once in the Model Store,
# then served from the Model Runtime. No manifest to hand-maintain,
# no separate registry, and the GPU stack is pre-validated.
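On the DIY path, one way to reduce hand-maintained YAML is to render the InferenceService from the registry record, so a new model version changes data rather than a manifest. This is a rough sketch, not a prescribed workflow: the `registry_entry` dict and its keys (`name`, `version`, `bucket_uri`) are hypothetical, and the output simply mirrors the fields of the manifest above as JSON, which `kubectl apply -f -` also accepts.

```python
import json


def inference_service(name: str, namespace: str, storage_uri: str,
                      model_format: str = "vllm", gpus: int = 1) -> dict:
    """Build a KServe v1beta1 InferenceService as a plain dict."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                    # Kubernetes resource quantities are passed as strings here.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            }
        },
    }


def manifest_for(registry_entry: dict, namespace: str) -> str:
    """Render the manifest as JSON from a hypothetical registry record."""
    name = registry_entry["name"]
    uri = f"{registry_entry['bucket_uri']}/{name}/{registry_entry['version']}"
    return json.dumps(inference_service(name, namespace, uri), indent=2)
```

Calling `manifest_for({"name": "llama-3-1-8b", "version": "v3", "bucket_uri": "s3://models"}, "genai-prod")` yields the same storage URI and namespace as the hand-written example, which keeps the registry as the single source of truth for what is deployed.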

Built-in vs DIY: the comparison that matters

Dimension	Built-in (Model Store plus Runtime)	DIY (MLflow plus KServe)
Time to first endpoint	Hours, mostly UI driven	Days, manifests and plumbing
Governance and promotion gate	Native, policy enforced in the Store	You build it from MLflow stages plus CI
Experiment tracking and training history	Limited, serving focused	First class, MLflow's core strength
Model and format flexibility	What the runtime supports	Anything KServe can wrap
GPU and driver interop	Pre-validated by Broadcom and NVIDIA	Your problem, your test matrix
Observability	Built-in framework, GPU to agent	Assemble Prometheus, Grafana, exporters
Air-gapped support	AMT in 2.1, designed for it	Manual mirroring of every image
Support boundary	Inside VCF support	Community plus your own runbooks

Choose which, when

The fork is team maturity, not model size. A two-person team should not run six open-source components in production.

Go built-in when your goal is to serve curated models to applications and agents with governance and support, which describes the majority of enterprise Private AI deployments. Do not go pure DIY just because the open-source stack is fashionable; that decision only pays off when you have a platform team that already lives in Kubernetes and needs the flexibility. Validate one assumption first either way: confirm your target models and formats are supported by the Model Runtime before you commit, because that single fact decides whether the built-in path even applies.
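The rule of thumb above can be written down as a small decision function. This is one reading of the guidance, not an official sizing tool: the parameter names and the returned labels are hypothetical, and the "staff a platform team first" branch is an inference from the advice that DIY only pays off with such a team.

```python
def choose_path(runtime_supports_model: bool,
                has_platform_team: bool,
                needs_experiment_tracking: bool) -> str:
    """Encode the rule of thumb; the return labels are illustrative."""
    if not runtime_supports_model:
        # The built-in path does not apply, so DIY is the only route.
        if has_platform_team:
            return "diy"
        return "diy, but staff a platform team first"
    if needs_experiment_tracking:
        # Keep serving built-in and add MLflow only for the tracking side.
        return "built-in, plus MLflow for tracking"
    return "built-in"
```

Note that the first check is the runtime support question, which matches the order in which the decision should be validated.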

The lifecycle loop both stacks must run

Whichever path you pick, the same loop runs underneath. The built-in services automate most of it; the DIY stack makes you wire each hop. The steps teams skip are validation and retirement, and those are exactly the two that cause incidents later: an unvalidated model that hallucinates in production, and a forgotten endpoint that quietly holds a GPU for months.

The loop is identical on both stacks. Built-in automates hops 1 to 6; DIY makes you assemble each one.
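The loop can also be sketched as a strictly ordered state machine, which makes the point about skipped steps concrete: if the only way forward is the next hop, validation and retirement cannot be bypassed. The `Stage` enum and helper names below are hypothetical and exist only for illustration.

```python
from enum import Enum
from typing import List


class Stage(Enum):
    REGISTERED = "register"
    VALIDATED = "validate"
    STAGED = "stage"
    PROMOTED = "promote"
    SERVING = "serve"
    MONITORED = "monitor"
    RETIRED = "retire"


ORDER = list(Stage)  # Enum iteration preserves definition order


def advance(current: Stage) -> Stage:
    """Move to the next hop; there is no way to jump over one."""
    if current is Stage.RETIRED:
        raise ValueError("a retired model has no next stage")
    return ORDER[ORDER.index(current) + 1]


def walk(start: Stage = Stage.REGISTERED) -> List[Stage]:
    """Follow the lifecycle from start until the model is retired."""
    path = [start]
    while path[-1] is not Stage.RETIRED:
        path.append(advance(path[-1]))
    return path
```

A real pipeline would attach checks and approvals to each transition, but even this skeleton shows where the DIY stack needs explicit wiring for every hop.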

In practice, the monitoring hop is where the built-in observability framework earns its keep, because it correlates GPU utilisation, request latency, and knowledge base indexing in one place. On the DIY stack you reach the same view only after wiring a GPU exporter, a KServe metrics scrape, and a dashboard, and most teams discover the gap during an incident rather than before one. For the signals worth watching, see the approach in GPU monitoring with VCF Operations, and if your serving layer leans on NVIDIA microservices, the NIM microservices design covers the runtime side.
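Monitoring also feeds the retirement hop. As a rough sketch of catching the forgotten endpoint that holds a GPU, the function below flags endpoints with no recent traffic. It assumes you can obtain a last-request timestamp per endpoint from whatever metrics source you run; the function name, the input shape, and the 30-day default are all hypothetical choices for this example.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def idle_endpoints(last_request: Dict[str, datetime], now: datetime,
                   max_idle: timedelta = timedelta(days=30)) -> List[str]:
    """Return endpoint names whose last request is older than max_idle, sorted."""
    return sorted(
        name for name, seen in last_request.items() if now - seen > max_idle
    )
```

Feeding the result into a review queue, rather than deleting endpoints automatically, keeps a human in the loop for the retire decision.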

What I’d Do

Start built-in. For the large majority of VCF Private AI customers, the Model Store and Model Runtime give you a governed, supported, observable lifecycle in a fraction of the time, and they keep you inside one support boundary instead of owning a six-component matrix. Reach for MLflow only on the side it genuinely owns: experiment tracking and the training history that the platform does not surface. Go full KServe DIY in production only when you have a real ML platform team and a flexibility requirement the runtime cannot meet, and even then, run it on VKS with eyes open about the interop matrix you just inherited. The flexibility is real, but most teams pay for it and never use it. Which path are you running today, and did the support boundary or the flexibility decide it for you?

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
