TL;DR · Key Takeaways
- VKS clusters can be monitored through VCF Operations, so Kubernetes clusters appear alongside the hosts, datastores and networks they depend on, not in a silo.
- Inside the cluster you run the usual open tooling: the metrics server, and a Prometheus/Grafana stack for deep application metrics. Nothing here is VKS-proprietary.
- Logs ship with a standard agent (Fluent Bit and friends) to whatever destination you operate. Centralize them, because nodes are ephemeral and autoscaled.
- The differentiator is correlation: a pod can look healthy while the ESX host under its node is struggling, and Kubernetes-only tooling cannot see that. VCF Operations sees both. VCF 9.1 adds IPFIX flow visibility across VMs and VKS pods.
You cannot operate what you cannot see, and Kubernetes is very good at hiding problems one layer down from wherever you happen to be looking. The VKS advantage is that its clusters are first-class citizens in VCF Operations, so a struggling pod can be traced to a hot host or a saturated datastore without jumping between three tools. This part covers the three signal types (metrics, logs and the unified VCF Operations view) and how they fit together. It complements my VCF 9 deep dive on VCF Operations monitoring.
Three signal types, and where each lives
At the cluster level you have standard Kubernetes signals: node CPU and memory, pod scheduling pressure, control-plane health, and the metrics server feeding kubectl top and the Horizontal Pod Autoscaler. For deeper application metrics most teams add Prometheus and Grafana, deployed into the cluster like any other workload, because VKS gives you conformant Kubernetes and does not stop you running the cloud-native tooling you already know. Log collection follows ordinary Kubernetes practice: a node-level agent (Fluent Bit is the common choice) tails container logs and ships them to whatever destination you operate, whether that is in-house Elastic, a SaaS platform, or VMware’s own log tooling. None of this is VKS-proprietary; that portability is deliberate. What VKS adds is the layer beneath the cluster.
The unified view, and why it is the differentiator
VCF Operations can monitor VKS clusters directly, surfacing cluster and node health next to the hosts, datastores and networks they depend on. This is the genuinely differentiated capability. When an application slows down, the question is almost always “is this the app, the cluster, or the infrastructure?”, and a single pane that correlates all three answers it far faster than stitching together a Kubernetes dashboard, a vCenter view and an infrastructure monitor by hand. A node that looks fine in Kubernetes terms can be sitting on an ESX host under pressure, and only a tool that sees both will tell you. VCF 9.1 deepens this with IPFIX flow visibility on the VDS and Antrea CNI, giving granular flow data across VMs and VKS pods, which is the kind of east-west visibility pure Kubernetes tooling struggles to provide.
| Question | Best tool | Why |
|---|---|---|
| Is my app slow? | Prometheus / Grafana | App-level metrics and SLOs |
| What did the pod log? | Centralized logs (Fluent Bit) | Survives node churn |
| Is the infra the problem? | VCF Operations | Correlates cluster with host/storage/net |
What VCF Operations surfaces for a VKS cluster
It is worth being specific about what the unified view actually gives you, because “single pane” is a phrase every vendor uses. For a VKS cluster, VCF Operations shows cluster and node health, but the value is the relationships it draws: this node VM runs on that ESX host, which sits on that datastore and that network, and here is the contention on each. So when a node reports memory pressure, you can immediately see whether the host behind it is itself overcommitted, whether the datastore is saturated, or whether it is genuinely a Kubernetes-level problem. That is the question pure Kubernetes tooling cannot answer, because it stops at the node boundary and has no idea what the node is standing on.
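The triage walk described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how VCF Operations works internally: the `Node`, `Host` and `Datastore` classes, their field names, and both thresholds are hypothetical and would come from your own inventory and limits.

```python
from dataclasses import dataclass


@dataclass
class Datastore:
    name: str
    latency_ms: float  # observed datastore latency (hypothetical field)


@dataclass
class Host:
    name: str
    memory_used_pct: float  # host memory utilisation (hypothetical field)
    datastore: Datastore


@dataclass
class Node:
    name: str
    memory_pressure: bool  # what the Kubernetes node condition reports
    host: Host


# Illustrative thresholds only; tune them to your own environment.
HOST_MEMORY_LIMIT_PCT = 90.0
DATASTORE_LATENCY_LIMIT_MS = 20.0


def locate_layer(node: Node) -> str:
    """Walk node -> host -> datastore and name the first layer in trouble."""
    if not node.memory_pressure:
        return "healthy"
    if node.host.memory_used_pct >= HOST_MEMORY_LIMIT_PCT:
        return "host"
    if node.host.datastore.latency_ms >= DATASTORE_LATENCY_LIMIT_MS:
        return "datastore"
    # Nothing underneath is contended, so the problem sits in Kubernetes.
    return "kubernetes"
```

The point of the sketch is the order of the checks: a Kubernetes-only tool can evaluate the first condition and nothing after it, because everything else lives below the node boundary.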
This changes how you triage. Instead of a Kubernetes dashboard saying a pod is slow and a separate vCenter view saying a host is busy, with a human guessing whether they are related, the correlation is already drawn. For platform teams who run the rest of the estate in VCF Operations, the VKS clusters simply join that picture rather than becoming one more console to watch, and that consolidation is worth more day to day than any single metric.
Dashboards and the signals that matter
On the in-cluster side, the temptation is to build dashboards full of every metric Prometheus can scrape, which produces a wall nobody reads. Build for the signals that actually predict trouble. At the cluster level: API server latency and error rate, etcd write latency, scheduler pending pods, and node resource saturation. At the workload level: the golden signals for each service (latency, traffic, errors, saturation), tied to the SLO you actually promised. A dashboard that shows whether you are meeting your SLOs and, when you are not, points at which layer is responsible, is worth ten that display raw counters. Wire the node metrics to the VCF Operations infrastructure view so a saturating node leads you straight to the host, not to a dead end.
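To make "are we meeting the SLO" concrete, here is a minimal Python sketch that reduces one observation window to two of the golden signals (errors and latency) and compares them with SLO targets. The `Window` type, the function names and the nearest-rank percentile are illustrative assumptions; in practice these numbers would usually come from Prometheus queries rather than raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class Window:
    """One observation window for a service (hypothetical shape)."""
    requests: int
    errors: int
    latencies_ms: list[float]


def error_rate(window: Window) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return window.errors / window.requests if window.requests else 0.0


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is a whole number from 1 to 100."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(window: Window, max_error_rate: float, max_p99_ms: float) -> dict:
    """Compare the window against the error and latency targets."""
    rate = error_rate(window)
    p99 = percentile(window.latencies_ms, 99)
    return {
        "error_rate": rate,
        "p99_ms": p99,
        "errors_ok": rate <= max_error_rate,
        "latency_ok": p99 <= max_p99_ms,
    }
```

A panel built on a report like this answers the SLO question directly, which is the difference between a dashboard people read and a wall of raw counters.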
Keep the application dashboards owned by the teams that own the services, and the cluster and infrastructure dashboards owned by the platform team. That division mirrors the responsibility split and stops the platform team drowning in app-specific panels they cannot action, while still giving them the cluster-health view they need.
Alerting that catches real failures, not noise
Alerting is where most observability setups quietly fail, by paging on causes instead of symptoms. An alert that fires every time CPU crosses 80 percent will be muted within a week, and will then miss the real incident. Alert on user-visible symptoms (error rate breaching the SLO, latency past the budget, requests failing) and on leading indicators that genuinely precede an outage: etcd latency climbing, the autoscaler stuck at max with pending pods, a node pool unable to place VMs, certificates near expiry. Each alert should be actionable and rare. If an alert does not tell the responder what to do, it is noise, and noise trains people to ignore the dashboard exactly when it matters.
The leading indicators are where VKS specifics pay off. “Pods pending and autoscaler at maximum” is the early warning for the quota and capacity problems from the autoscaling and day-2 parts, and catching it as a warning, before it becomes a stalled deployment, is the difference between a quiet adjustment and an incident. Build a small number of these high-signal alerts, tie them to the symptom not the cause, and resist the urge to alert on everything the dashboards can show.
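The "pods pending and autoscaler at maximum" condition can be expressed as a small predicate. The Python sketch below is an assumption-laden illustration: the `ClusterSnapshot` fields and the `sustained` parameter are hypothetical, and a real setup would more likely encode the same logic as an alert rule in whatever alerting system you run.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """One periodic sample of cluster state (hypothetical shape)."""
    pending_pods: int
    current_nodes: int
    max_nodes: int  # the autoscaler's configured ceiling


def capacity_warning(history: list[ClusterSnapshot], sustained: int = 3) -> bool:
    """True when each of the last `sustained` samples shows pending pods
    while the autoscaler is already at its maximum node count."""
    if sustained < 1 or len(history) < sustained:
        return False
    recent = history[-sustained:]
    return all(
        s.pending_pods > 0 and s.current_nodes >= s.max_nodes
        for s in recent
    )
```

Requiring the condition to hold across several consecutive samples is what keeps this rare: a brief burst of pending pods during a normal scale-up does not fire, while a cluster pinned at its ceiling does.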
What I’d Do
I run Prometheus and Grafana for application depth, because that is what developers want and what defines SLOs, and I centralize logs with Fluent Bit so a post-mortem does not die with an autoscaled node. But I lean on VCF Operations for the one thing the cloud-native stack cannot give me: correlation with the infrastructure underneath, meaning host pressure, datastore latency and the network path. The mistake is picking a single altitude and going blind in the other direction. Wire up both, and the 2am question (app, cluster or infra?) has an answer in one place instead of three. Next time a cluster gets slow, can you see the ESX host it is running on in the same view as the pod, or are you alt-tabbing between tools and guessing?
References
- Broadcom TechDocs: Monitoring VKS Clusters Using VCF Operations
- VCF Blog: Enhanced Network Scale and Visibility with VCF 9.1
- VCF Operations in VCF 9: Monitoring and Observability Explained, drpranayjha.com









