Dr. Pranay Jha

Observability for VKS: Metrics, Logs and VCF Operations (VKS Series, Part 11)

Kubernetes-only tooling is blind below the node. Here is how metrics, logs and VCF Operations fit together so you can tell the app from the cluster from the infrastructure.

VKS Series · Part 11 of 17

TL;DR · Key Takeaways

  • VKS clusters can be monitored through VCF Operations, so Kubernetes clusters appear alongside the hosts, datastores and networks they depend on, not in a silo.
  • Inside the cluster you run the usual open tooling: the metrics server, and a Prometheus/Grafana stack for deep application metrics. Nothing here is VKS-proprietary.
  • Logs ship with a standard agent (Fluent Bit and friends) to whatever destination you operate. Centralize them, because nodes are ephemeral and autoscaled.
  • The differentiator is correlation: a healthy-looking pod on a struggling ESX host is invisible to Kubernetes-only tooling. VCF Operations sees both. VCF 9.1 adds IPFIX flow visibility across VMs and VKS pods.
Who this is for: platform and SRE teams who have to answer “is it the app, the cluster, or the infrastructure?” at 2am.
Prerequisites: running VKS clusters and access to VCF Operations.

You cannot operate what you cannot see, and Kubernetes is very good at hiding problems one layer down from wherever you happen to be looking. The VKS advantage is that its clusters are first-class citizens in VCF Operations, so a struggling pod can be traced to a hot host or a saturated datastore without jumping between three tools. This part covers the three signal types (metrics, logs and the unified VCF Operations view) and how they fit together. It complements my VCF 9 deep dive on VCF Operations monitoring.

Three signal types, and where each lives

At the cluster level you have standard Kubernetes signals: node CPU and memory, pod scheduling pressure, control-plane health, and the metrics server feeding kubectl top and the Horizontal Pod Autoscaler. For deeper application metrics most teams add Prometheus and Grafana, deployed into the cluster like any other workload, because VKS gives you conformant Kubernetes and does not stop you running the cloud-native tooling you already know. Log collection follows ordinary Kubernetes practice: a node-level agent (Fluent Bit is the common choice) tails container logs and ships them to whatever destination you operate, whether in-house Elastic, a SaaS platform, or VMware’s own log tooling. None of this is VKS-proprietary; that portability is deliberate. What VKS adds is the layer beneath the cluster.
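To make the log path concrete, here is a minimal sketch of an application writing one JSON object per line to stdout, which is the stream the container runtime captures and a node-level agent such as Fluent Bit typically tails. The field names (`ts`, `level`, `service`, `msg`) and the helpers `format_log_line` and `log` are my own illustration, not a schema that VKS or Fluent Bit requires.

```python
import json
import sys
from datetime import datetime, timezone


def format_log_line(level, service, msg, **fields):
    """Return one JSON log record as a single line (no trailing newline)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "msg": msg,
    }
    record.update(fields)  # extra context, e.g. request_id or latency_ms
    # sort_keys keeps the key order stable, which makes lines easier to diff
    return json.dumps(record, sort_keys=True)


def log(level, service, msg, **fields):
    """Write the record to stdout, where the container runtime captures it."""
    sys.stdout.write(format_log_line(level, service, msg, **fields) + "\n")
    sys.stdout.flush()


if __name__ == "__main__":
    # Hypothetical service name and fields, for illustration only.
    log("error", "checkout", "payment gateway timeout", latency_ms=5021)
```

One record per line keeps parsing on the collector side simple, and structured fields survive the trip to whichever log platform you ship to.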

[Figure: Above the node and below it]
  • Metrics (in-cluster): metrics server, Prometheus, Grafana dashboards; app-level depth
  • Logs (shipped out): Fluent Bit → your platform (Elastic, SaaS or VMware); centralize: nodes are ephemeral
  • VCF Operations: cluster + node health next to hosts, datastores, network; the correlation layer
Kubernetes-only tooling is blind below the node; VCF Operations is what sees the infrastructure underneath.
Open tooling gives application depth above the node; VCF Operations supplies the infrastructure correlation below it.

The unified view, and why it is the differentiator

VCF Operations can monitor VKS clusters directly, surfacing cluster and node health next to the hosts, datastores and networks they depend on. This is the genuinely differentiated capability. When an application slows down, the question is almost always “is this the app, the cluster, or the infrastructure?”, and a single pane that correlates all three answers it far faster than stitching together a Kubernetes dashboard, a vCenter view and an infrastructure monitor by hand. A node that looks fine in Kubernetes terms can be sitting on an ESX host under pressure, and only a tool that sees both will tell you. VCF 9.1 deepens this with IPFIX flow visibility on the VDS and Antrea CNI, giving granular flow data across VMs and VKS pods, which is the kind of east-west visibility pure Kubernetes tooling struggles to provide.

Question                  | Best tool                     | Why
Is my app slow?           | Prometheus / Grafana          | App-level metrics and SLOs
What did the pod log?     | Centralized logs (Fluent Bit) | Survives node churn
Is the infra the problem? | VCF Operations                | Correlates cluster with host/storage/net

Don’t pick only one altitude: Kubernetes-only observability is blind below the node, and infrastructure-only observability is blind above it. The teams that debug fastest run both and correlate, app metrics from Prometheus, infrastructure truth from VCF Operations.

What VCF Operations surfaces for a VKS cluster

It is worth being specific about what the unified view actually gives you, because “single pane” is a phrase every vendor uses. For a VKS cluster, VCF Operations shows cluster and node health, but the value is the relationships it draws: this node VM runs on that ESX host, which sits on that datastore and that network, and here is the contention on each. So when a node reports memory pressure, you can immediately see whether the host behind it is itself overcommitted, whether the datastore is saturated, or whether it is genuinely a Kubernetes-level problem. That is the question pure Kubernetes tooling cannot answer, because it stops at the node boundary and has no idea what the node is standing on.
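The reasoning above (node, then host, then datastore) can be sketched in a few lines of Python. This is not the VCF Operations API; the `Node` data model, the `triage` function, the metric dictionaries and the thresholds (90 percent host memory, 20 ms datastore latency) are all hypothetical, chosen only to show the shape of the correlation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    host: str              # ESX host the node VM runs on
    datastore: str         # datastore backing the node VM
    memory_pressure: bool  # what Kubernetes reports for the node


def triage(node, host_mem_used_pct, datastore_latency_ms,
           host_limit_pct=90.0, latency_limit_ms=20.0):
    """Return the layers that look responsible for a node under memory pressure."""
    if not node.memory_pressure:
        return []
    suspects = []
    if host_mem_used_pct.get(node.host, 0.0) >= host_limit_pct:
        suspects.append(f"host {node.host} memory overcommitted")
    if datastore_latency_ms.get(node.datastore, 0.0) >= latency_limit_ms:
        suspects.append(f"datastore {node.datastore} saturated")
    # Nothing wrong underneath: treat it as a Kubernetes-level problem.
    return suspects or ["kubernetes-level (infrastructure looks healthy)"]


if __name__ == "__main__":
    worker = Node("worker-1", "esx-01", "ds-01", memory_pressure=True)
    print(triage(worker, {"esx-01": 95.0}, {"ds-01": 5.0}))
```

The point of the sketch is the lookup from node to host to datastore: without that mapping, the last branch is the only answer a Kubernetes-only tool can ever give.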

This changes how you triage. Instead of a Kubernetes dashboard saying a pod is slow and a separate vCenter view saying a host is busy, with a human guessing whether they are related, the correlation is already drawn. For platform teams who run the rest of the estate in VCF Operations, the VKS clusters simply join that picture rather than becoming one more console to watch, and that consolidation is worth more day to day than any single metric.

Dashboards and the signals that matter

On the in-cluster side, the temptation is to build dashboards full of every metric Prometheus can scrape, which produces a wall nobody reads. Build for the signals that actually predict trouble. At the cluster level: API server latency and error rate, etcd write latency, scheduler pending pods, and node resource saturation. At the workload level: the golden signals for each service (latency, traffic, errors, saturation), tied to the SLO you actually promised. A dashboard that shows whether you are meeting your SLOs and, when you are not, points at which layer is responsible, is worth ten that display raw counters. Wire the node metrics to the VCF Operations infrastructure view so a saturating node leads you straight to the host, not to a dead end.
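As a rough illustration of “are we meeting the SLO?” rather than “here are the raw counters”, the sketch below reduces a window of request latencies and an error count to a pass/fail per SLO. The function names, the nearest-rank percentile method and the targets (p99 under 300 ms, 99.9 percent availability) are assumptions for the example, not values from any particular service.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(latencies_ms, error_count,
               latency_slo_ms=300.0, availability_slo=0.999):
    """Summarize one window of requests against latency and availability SLOs."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no requests in window")
    p99 = percentile(latencies_ms, 99)
    availability = 1.0 - error_count / total
    return {
        "p99_ms": p99,
        "availability": availability,
        "latency_ok": p99 <= latency_slo_ms,
        "availability_ok": availability >= availability_slo,
    }


if __name__ == "__main__":
    window = [100.0] * 98 + [500.0] * 2  # made-up sample data
    print(slo_report(window, error_count=0))
```

In practice you would compute this in Prometheus and Grafana rather than in application code; the sketch only shows what the dashboard panel should boil down to.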

Keep the application dashboards owned by the teams that own the services, and the cluster and infrastructure dashboards owned by the platform team. That division mirrors the responsibility split and stops the platform team drowning in app-specific panels they cannot action, while still giving them the cluster-health view they need.

Alerting that catches real failures, not noise

Alerting is where most observability setups quietly fail, by paging on causes instead of symptoms. An alert that fires every time CPU crosses 80 percent will be muted within a week and will then miss the real incident. Alert on user-visible symptoms (error rate breaching the SLO, latency past the budget, requests failing) and on leading indicators that genuinely precede an outage (etcd latency climbing, the autoscaler stuck at max with pending pods, a node pool unable to place VMs, certificates near expiry). Each alert should be actionable and rare. If an alert does not tell the responder what to do, it is noise, and noise trains people to ignore the dashboard exactly when it matters.
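Here is a minimal sketch of the “actionable and rare” rule, under my own assumptions: an alert rule must carry an action for the responder, and it pages only when the error rate has breached the threshold for several consecutive samples. The `AlertRule` fields and `should_page` function are hypothetical and do not correspond to any specific alerting product.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float   # maximum tolerated error rate, e.g. 0.01 for 1%
    for_samples: int   # consecutive breaching samples required before paging
    action: str        # what the responder should do; must not be empty


def should_page(rule, error_rates):
    """Page only if the most recent `for_samples` samples all breach the threshold."""
    if not rule.action.strip():
        raise ValueError(f"rule {rule.name!r} has no action; it would be noise")
    if rule.for_samples < 1:
        raise ValueError("for_samples must be at least 1")
    if len(error_rates) < rule.for_samples:
        return False
    recent = error_rates[-rule.for_samples:]
    return all(rate > rule.threshold for rate in recent)


if __name__ == "__main__":
    rule = AlertRule(
        name="checkout-error-rate",
        threshold=0.01,
        for_samples=3,
        action="Check recent deploys; roll back if errors began with a release",
    )
    print(should_page(rule, [0.0, 0.02, 0.03, 0.05]))
```

Requiring a sustained breach keeps a single bad scrape from paging anyone, and rejecting rules without an action forces the “what do I do now?” conversation to happen before the alert ships.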

The leading indicators are where VKS specifics pay off. “Pods pending and autoscaler at maximum” is the early warning for the quota and capacity problems from the autoscaling and day-2 parts, and catching it as a warning, before it becomes a stalled deployment, is the difference between a quiet adjustment and an incident. Build a small number of these high-signal alerts, tie them to the symptom not the cause, and resist the urge to alert on everything the dashboards can show.
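The “pods pending and autoscaler at maximum” condition is simple enough to state as code. This is a sketch with hypothetical inputs (`pending_pods`, `current_nodes`, `max_nodes`); where those numbers come from in your environment is left open.

```python
def capacity_warning(pending_pods, current_nodes, max_nodes):
    """Return a warning string when pods are pending and the pool cannot grow."""
    if pending_pods > 0 and current_nodes >= max_nodes:
        return (
            f"{pending_pods} pods pending and node pool at maximum "
            f"({current_nodes}/{max_nodes}): raise the limit or check quota"
        )
    # Either nothing is pending or the autoscaler still has headroom.
    return None


if __name__ == "__main__":
    print(capacity_warning(pending_pods=4, current_nodes=10, max_nodes=10))
```

Both halves of the condition matter: pending pods alone are normal while the autoscaler is still adding nodes, and a pool at maximum with nothing pending is just a full cluster.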

What I’d Do

I run Prometheus and Grafana for application depth, because that is what developers want and what defines SLOs, and I centralize logs with Fluent Bit so a post-mortem does not die with an autoscaled node. But I lean on VCF Operations for the one thing the cloud-native stack cannot give me: correlation with the infrastructure underneath (host pressure, datastore latency, the network path). The mistake is picking a single altitude and going blind in the other direction. Wire up both, and the 2am question (app, cluster or infra?) has an answer in one place instead of three. Next time a cluster gets slow, can you see the ESX host it is running on in the same view as the pod, or are you alt-tabbing between tools and guessing?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
