Dr. Pranay Jha

VMware Private AI MLOps: Built-In Model Lifecycle vs DIY MLflow and KServe (Private AI Series, Part 22)

Two ways to run model lifecycle on VMware Private AI: the built-in Model Store and Model Runtime, or a DIY MLflow and KServe stack on VKS. Here is when each one wins, and the verdict.

VMware Private AI Series · Part 22 of 24

TL;DR · Key Takeaways

  • The built-in path (Model Store plus Model Runtime in Private AI Services 2.1) wins on time to value, governance, and staying inside the VCF support boundary.
  • The DIY path (MLflow registry, KServe serving, vLLM on VKS) wins on experiment tracking, custom model formats, and ecosystem freedom, but you own every upgrade and interop break.
  • Most teams should start built-in and graft MLflow on only for the training and experiment-tracking side the platform genuinely does not cover.
  • The lifecycle loop is identical either way: register, validate, stage, promote, serve, monitor, retire. The tooling only moves the work around.

Two architects can deploy the same 8B model on VMware Private AI and end up with completely different operational lives. One opens Private AI Services, registers the model in the Model Store, and serves it from the Model Runtime before lunch. The other stands up MLflow, wires it to a KServe InferenceService on a VKS cluster, and spends three days on storage initializers, S3 credentials, and namespace RBAC. Both serve tokens. The real question is which lifecycle you want to own for the next two years, and the honest answer depends on how mature your ML practice already is, not on which stack has the longer feature list.

The two paths, at a glance

MLOps on Private AI is really a choice about where the model registry, the promotion gate, and the serving runtime live. Path A keeps all three inside the VMware-managed control plane that ships with Private AI Services. With Path B you assemble them yourself from the open-source stack that the wider Kubernetes world had standardised on by 2026. Here is the split before we get into the trade-offs.

Figure: Two ways to run model lifecycle on Private AI. Same loop, very different ownership.

A. Built-in (Private AI Services)
  • Model Store curates and governs
  • Model Runtime serves inference
  • Observability framework built in
  • UI plus API, VCF support boundary
  • You own: model selection and policy; GPU sizing and namespaces

B. DIY (MLflow plus KServe)
  • MLflow registry and tracking
  • KServe serves, wraps vLLM
  • Prometheus and Grafana for signals
  • Runs on a VKS workload cluster
  • You own: every component and its upgrades; driver, operator and CRD interop

The built-in path hands you a managed lifecycle; the DIY path hands you a toolbox and the bill for assembly.

What the built-in lifecycle actually gives you

Private AI Services, introduced with Private AI Foundation 9.0 and matured in the 2.1 release, splits the lifecycle into two named services. The Model Store is the secure repository where models are curated, tested, and governed before anything reaches production, so only validated and policy-compliant models get an endpoint. The Model Runtime is the GenAI execution layer that handles active inference. The promotion gate between them is the part you would otherwise build by hand.
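To make the promotion gate concrete, here is a minimal Python sketch of the rule described above: a model becomes eligible for an endpoint only once it is validated and has no policy violations. The `ModelRecord` fields and the `can_promote` helper are hypothetical names used for illustration. They are not the Private AI Services API, which the platform exposes through its own UI and API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelRecord:
    """Hypothetical registry entry; the field names are illustrative only."""
    name: str
    version: str
    validated: bool = False
    policy_violations: List[str] = field(default_factory=list)


def can_promote(model: ModelRecord) -> bool:
    """Gate: only validated, policy-compliant models may get an endpoint."""
    return model.validated and not model.policy_violations


candidate = ModelRecord(name="llama-3-1-8b", version="v3")
print(can_promote(candidate))  # False: not validated yet

candidate.validated = True
print(can_promote(candidate))  # True: validated, no violations recorded
```

On the DIY path, a check of this shape is what you would have to place in CI between the registry and the serving manifest.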

The quiet advantage is the observability framework. It reports health and performance across the whole AI footprint, from the inference engine and GPU utilisation down to knowledge base indexing and agent calls, without you deploying a separate metrics stack. Pair that with the 2.1 artifact mirroring tool (AMT) and the same lifecycle runs inside an air-gapped environment, which is the scenario where a DIY stack hurts most. The trade-off is a hard ceiling: you serve what the runtime supports, in the formats it supports, and you do not get a first-class experiment-tracking server or a training-run history out of the box.

What DIY MLflow plus KServe buys you, and what it costs

The DIY path is the consensus 2026 Kubernetes inference stack dropped onto a VKS workload cluster: MLflow for the model registry and experiment tracking, KServe for serving (it wraps vLLM, Triton, and others behind one InferenceService CRD), vLLM as the actual LLM engine, Kueue for GPU and batch scheduling, and the NVIDIA GPU Operator underneath. KServe’s storage initializer pulls the model artifact from S3 and hands it to the predictor. It is powerful and genuinely flexible.

Figure: The DIY stack, layer by layer. Each box is a component you deploy, secure, and upgrade yourself.

  • MLflow registry and tracking server: versions, stages, experiment history
  • KServe InferenceService (control plane): routing, autoscaling, canary
  • vLLM engine
  • Kueue: GPU and batch scheduling
  • NVIDIA GPU Operator and vGPU drivers
  • VKS workload cluster on VCF

Six moving parts, six upgrade cadences, six chances for a CRD or driver break.
Flexibility is real, but so is the integration surface you sign up to maintain.

The cost is interop. The single most common way this path breaks is a version skew between the GPU Operator, the vGPU guest driver, and the CUDA build inside the KServe runtime image, which surfaces as a pod stuck in CrashLoopBackOff with a CUDA driver mismatch in the logs. Nothing in the open-source stack validates that matrix for you. The built-in runtime does, because Broadcom and NVIDIA ship and test it together. Here is the same model in both worlds.

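# DIY path: a KServe InferenceService backed by vLLM on a VKS cluster.
# Assumes a ServingRuntime in this cluster registers the "vllm" model format
# and that S3 credentials are already wired to the namespace's service account.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b
  namespace: genai-prod
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      storageUri: s3://models/llama-3-1-8b/v3   # pulled by the storage initializer
      resources:
        limits:
          nvidia.com/gpu: "1"                   # one GPU per predictor pod
EOF
# You now own the registry, the S3 bucket, the runtime image, and the GPU matrix.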

# Built-in path: the model is registered once in the Model Store,
# then served from the Model Runtime. No manifest to hand-maintain,
# no separate registry, and the GPU stack is pre-validated.
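Because nothing in the DIY stack validates the driver matrix, a small pre-flight check before rollout can catch the most obvious skew. The sketch below is a simplified rule of thumb in Python, not NVIDIA's full compatibility matrix: it compares the CUDA version the driver reports (the figure shown in the `nvidia-smi` header) with the CUDA version the runtime image was built against. The function names are hypothetical, and vGPU host and guest driver pairing is out of scope here.

```python
from typing import Tuple


def parse_cuda(version: str) -> Tuple[int, int]:
    """Turn a 'major.minor' string such as '12.4' into (12, 4)."""
    major, minor = version.strip().split(".")[:2]
    return int(major), int(minor)


def skew_report(driver_cuda: str, image_cuda: str) -> str:
    """Compare the driver's reported CUDA version with the image's CUDA build.

    Assumption: a driver runs images built for the same or an older CUDA
    version. A newer minor may still work, so it is only a warning.
    """
    driver = parse_cuda(driver_cuda)
    image = parse_cuda(image_cuda)
    if image[0] > driver[0]:
        return "mismatch: image needs a newer CUDA major than the driver supports"
    if image > driver:
        return "warning: image CUDA minor is newer than the driver reports"
    return "ok"


print(skew_report("12.4", "12.2"))  # driver is newer than the image
```

Running a check like this in CI against every new runtime image tag is cheaper than finding the mismatch in a CrashLoopBackOff log.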

Built-in vs DIY: the comparison that matters

| Dimension | Built-in (Model Store plus Runtime) | DIY (MLflow plus KServe) |
| --- | --- | --- |
| Time to first endpoint | Hours, mostly UI driven | Days, manifests and plumbing |
| Governance and promotion gate | Native, policy enforced in the Store | You build it from MLflow stages plus CI |
| Experiment tracking and training history | Limited, serving focused | First class, MLflow core strength |
| Model and format flexibility | What the runtime supports | Anything KServe can wrap |
| GPU and driver interop | Pre-validated by Broadcom and NVIDIA | Your problem, your test matrix |
| Observability | Built-in framework, GPU to agent | Assemble Prometheus, Grafana, exporters |
| Air-gapped support | AMT in 2.1, designed for it | Manual mirroring of every image |
| Support boundary | Inside VCF support | Community plus your own runbooks |

Choose which, when

Figure: Decision flow.

  • Questions: Do you have a dedicated ML platform team? Do you need experiment tracking or custom model formats?
  • NO to both: go built-in. Model Store plus Model Runtime. Fastest, governed, supported. Best for most enterprises.
  • YES to both: go hybrid. MLflow for training and tracking, runtime for governed serving. Pure DIY only if you must.

The fork is team maturity, not model size. A two-person team should not run six open-source components in production.

Go built-in when your goal is to serve curated models to applications and agents with governance and support, which describes the majority of enterprise Private AI deployments. Do not go pure DIY just because the open-source stack is fashionable; that decision only pays off when you have a platform team that already lives in Kubernetes and needs the flexibility. Validate one assumption first either way: confirm your target models and formats are supported by the Model Runtime before you commit, because that single fact decides whether the built-in path even applies.
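One way to read the guidance above is as a small function you could keep in a design review checklist. This Python sketch is an interpretation, not an official decision tool: `choose_path` and its parameter names are hypothetical, the default for mixed answers follows the start-built-in advice, and the `unresolved` outcome is a placeholder for a case the article does not address.

```python
def choose_path(has_platform_team: bool,
                needs_flexibility: bool,
                runtime_supports_models: bool) -> str:
    """Return 'built-in', 'hybrid', 'diy', or 'unresolved'.

    needs_flexibility means experiment tracking or custom model formats.
    """
    if not runtime_supports_models:
        # The built-in path does not apply. Pure DIY is only sensible with
        # a platform team; 'unresolved' marks the case left open here.
        return "diy" if has_platform_team else "unresolved"
    if has_platform_team and needs_flexibility:
        # MLflow for training and tracking, Model Runtime for serving.
        return "hybrid"
    # Mixed or negative answers default to the built-in path (assumption).
    return "built-in"


print(choose_path(has_platform_team=False,
                  needs_flexibility=False,
                  runtime_supports_models=True))  # built-in
```

Note that the runtime support check comes first, which mirrors the advice to validate that assumption before committing to either path.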

The lifecycle loop both stacks must run

Whichever path you pick, the same loop runs underneath. The built-in services automate most of it; the DIY stack makes you wire each hop. The steps teams skip are validation and retirement, and those are exactly the two that cause incidents later: an unvalidated model that hallucinates in production, and a forgotten endpoint that quietly holds a GPU for months.

Figure: The lifecycle loop.

1. Register  2. Validate  3. Stage  4. Promote  5. Serve  6. Monitor  7. Retire

Monitoring feeds the next validation: drift and regression send a model back to step 2.
The loop is identical on both stacks. Built-in automates hops 1 to 6; DIY makes you assemble each one.
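The loop can be modelled as a tiny state machine, which is a useful mental check that no hop gets skipped. The Python sketch below encodes only the seven steps and the monitor-to-validate feedback edge described here. The `ALLOWED` table and the `advance` helper are hypothetical names, not part of MLflow, KServe, or Private AI Services.

```python
# Allowed next steps for each lifecycle stage, taken from the seven-step loop.
ALLOWED = {
    "register": {"validate"},
    "validate": {"stage"},
    "stage": {"promote"},
    "promote": {"serve"},
    "serve": {"monitor"},
    "monitor": {"validate", "retire"},  # drift or regression goes back to validate
    "retire": set(),
}


def advance(current: str, target: str) -> str:
    """Move to the target stage, refusing any hop the loop does not define."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


stage = "register"
for nxt in ["validate", "stage", "promote", "serve", "monitor", "retire"]:
    stage = advance(stage, nxt)
print(stage)  # retire
```

Skipping validation, for example jumping from register straight to serve, raises an error in this model, which is exactly the shortcut the loop is meant to prevent.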

In practice, the monitoring hop is where the built-in observability framework earns its keep, because it correlates GPU utilisation, request latency, and knowledge base indexing in one place. On the DIY stack you reach the same view only after wiring a GPU exporter, a KServe metrics scrape, and a dashboard, and most teams discover the gap during an incident rather than before one. For the signals worth watching, see the approach in GPU monitoring with VCF Operations, and if your serving layer leans on NVIDIA microservices, the NIM microservices design covers the runtime side.


What I’d Do

Start built-in. For the large majority of VCF Private AI customers, the Model Store and Model Runtime give you a governed, supported, observable lifecycle in a fraction of the time, and they keep you inside one support boundary instead of owning a six-component matrix. Reach for MLflow only on the side it genuinely owns: experiment tracking and the training history that the platform does not surface. Go full KServe DIY in production only when you have a real ML platform team and a flexibility requirement the runtime cannot meet, and even then, run it on VKS with eyes open about the interop matrix you just inherited. The flexibility is real, but most teams pay for it and never use it. Which path are you running today, and did the support boundary or the flexibility decide it for you?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
