Dr. Pranay Jha

VCF 9 Troubleshooting: The Stuck Workflows, Locks and Log Trails That Actually Bite (VCF 9 Series, Part 34)

A field guide to VCF 9 troubleshooting: clearing stuck SDDC Manager workflows, releasing the password and certificate system lock, pulling SoS support bundles, and reading the logs that actually tell you what failed.

VCF 9 Series · Part 34 of 36

TL;DR · Key Takeaways

  • Most VCF 9 incidents are not mysterious bugs. They are a single failed or in-progress workflow that holds a system lock and blocks every password and certificate operation after it.
  • Triage in this order: read the failed task and its reference token, check for a held lock, then pull logs. Reading domainmanager.log first wastes the first hour.
  • Always collect the support bundle from the SoS CLI, not only the UI. The VCF Operations UI bundle sometimes completes with no data.
  • Async patch and LCM failures are usually depot or API reachability, not the bundle itself. Confirm the LCM service is up before you blame the patch.
  • Editing the SDDC database to clear a stuck task is a real fix, but it is the last step, not the first, and only with a backup and ideally a support case open.
Who this is for: VCF administrators and architects running a live VCF 9.0 or 9.1 fleet.
Prerequisites: SSH access to SDDC Manager, the vcf and root credentials, and access to the VCF Operations fleet UI.

Here is the pattern I see on almost every VCF 9 escalation. A password rotation or certificate replacement fails once, late at night, on one component. Nobody notices. The next morning a completely unrelated operation refuses to start with a message about the system lock being held by a Password Manager operation in progress, and the team spends two hours reading logs for the wrong thing. The original failure was the disease. Everything after it is a symptom. VCF 9 troubleshooting is mostly about finding that first failed workflow and releasing what it is holding, not about decoding stack traces.

This part of the series is the triage playbook I actually use: the order to work in, where the logs really are, how to pull a support bundle that is not empty, and the handful of failures that account for most real cases. It pairs with the certificate, identity and backup failures deep dive earlier in the series, which goes domain by domain. Here the focus is method.

Work the workflow before you touch a log

SDDC Manager and VCF Operations drive almost everything through workflows: bring-up, domain creation, password rotation, certificate replacement, patching. When something fails, the UI shows you the failed workflow and, critically, a reference token. That token is the thread you pull on. Expand the workflow, find the exact subtask that failed, and copy the reference token from the error. Do not start at the bottom of a log file. Start at the task that knows its own ID.

This matters because of a design behaviour that surprises people: password and certificate operations take a global system lock. If a workflow ends in a FAILED or stuck IN_PROGRESS state, that lock is never released, and every later password or certificate task refuses to run with a message that the operation is not allowed because a lock is held. The fix is not to retry harder. It is to resolve or clear the original stuck task so the lock releases. I have watched teams open a Sev 2 over a backup failure that was really a service-account password rotation that died three days earlier and was quietly blocking everything.

[Figure: Why one stuck task blocks everything. Password and certificate operations take a global system lock. Stuck workflow (password or cert task: FAILED / IN_PROGRESS) -> Global system lock (never released) -> Everything blocked (later password/cert ops: "lock held"). Fix: resolve or clear the original stuck task (restart services first) to release the lock; do not just retry.]
Treat a failed password or cert task as urgent; it is silently holding the lock for the whole instance.
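
Because the oldest unfinished task is the one most likely to be holding the lock, it helps to find it mechanically rather than by scrolling. The sketch below assumes you have exported the task list to a tab-separated file with four columns (ISO-8601 timestamp, task id, task type, status). That export format and the function name first_stuck_task are illustrative assumptions, not SDDC Manager output; only the FAILED and IN_PROGRESS states come from the behaviour described above.

```shell
#!/usr/bin/env bash
# first_stuck_task: print the oldest task whose status is FAILED or IN_PROGRESS.
# ASSUMPTION: input is a hypothetical tab-separated export, one task per line:
#   <ISO-8601 timestamp> <task id> <task type> <status>
first_stuck_task() {
    local task_file="$1"
    # ISO-8601 timestamps in one timezone sort correctly as plain text.
    LC_ALL=C sort -t $'\t' -k1,1 "$task_file" |
        awk -F '\t' '$4 == "FAILED" || $4 == "IN_PROGRESS" { print; exit }'
}
```

Called as first_stuck_task tasks.tsv, it prints the single earliest unfinished task, which is the one to expand in the UI for its reference token before looking at anything that failed later.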

Where the logs actually live, and how to pull them fast

On the SDDC Manager appliance, the VCF logs sit under /var/log/vmware/vcf. The domain manager log at /var/log/vmware/vcf/domainmanager/domainmanager.log is where most orchestration failures land. But do not grep blind. Use the SoS (Supportability and Serviceability) tool, and where you have the reference token, scope the collection to that workflow so you get the relevant logs across components rather than one file on one node.

# Pull logs for one failed workflow by its reference token
vcfsos logs <reference_token_id>

# Full diagnostic bundle (all components), zipped
/opt/vmware/sddc-support/sos --collect-all-logs --zip

# Just SDDC Manager logs when the failure is clearly local
/opt/vmware/sddc-support/sos --sddc-manager-logs --zip

# Output lands here by default:
# /var/log/vmware/vcf/sddc-support/
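
For a runbook, the two collection steps can be wrapped so that the token-scoped pull always comes first. This is a minimal sketch that uses only the commands shown above; the wrapper names run and collect_for_token and the RUN switch are assumptions introduced here. By default it prints the commands instead of executing them, so it can be reviewed safely off the appliance.

```shell
#!/usr/bin/env bash
# run: execute the command when RUN=1, otherwise just print it (dry run).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# collect_for_token: token-scoped logs first, then the full SoS bundle.
# Hypothetical wrapper; the underlying commands are the ones from this article.
collect_for_token() {
    local token="$1"
    local sos=/opt/vmware/sddc-support/sos
    if [ -z "$token" ]; then
        echo "usage: collect_for_token <reference_token_id>" >&2
        return 2
    fi
    run vcfsos logs "$token" || return 1
    run "$sos" --collect-all-logs --zip
}
```

Run collect_for_token <reference_token_id> to preview the sequence, and prefix it with RUN=1 on the SDDC Manager appliance once you are happy with what it will do.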

For the management and operations layer, collect the VCF Operations support bundle from Administration so you capture the fleet nodes, and lean on Log Assist when you need to hand a bundle to Broadcom for an open case. One caveat worth knowing: the VCF Operations for Logs bundle occasionally completes through the UI with no data inside. If you open the ZIP and it is empty, do not assume the system is fine. Re-collect from the command line. An empty bundle is a collection failure, not a clean bill of health. For ongoing signal rather than point-in-time capture, wire this into the monitoring you set up in VCF Operations monitoring and observability.
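
Checking a bundle for content is easy to script so nobody has to open the ZIP by hand. The helper below is a sketch, not part of any VCF tooling: bundle_has_data is a made-up name, and the check relies only on the fact that a ZIP with no entries is a bare 22-byte end-of-central-directory record, plus an optional integrity test when unzip happens to be installed.

```shell
#!/usr/bin/env bash
# bundle_has_data: return 0 only if the ZIP exists and holds at least some data.
# Hypothetical helper for illustration; not shipped with VCF.
bundle_has_data() {
    local bundle="$1"
    local size

    # A missing or unreadable file is a collection failure, not a pass.
    if [ ! -r "$bundle" ]; then
        echo "missing: $bundle" >&2
        return 1
    fi

    # A ZIP with zero entries is exactly 22 bytes (end-of-central-directory only).
    size=$(( $(wc -c < "$bundle") ))
    if [ "$size" -le 22 ]; then
        echo "empty archive: $bundle ($size bytes)" >&2
        return 1
    fi

    # If unzip is available, also confirm the archive passes its integrity test.
    if command -v unzip >/dev/null 2>&1; then
        if ! unzip -tqq "$bundle" >/dev/null 2>&1; then
            echo "archive failed integrity test: $bundle" >&2
            return 1
        fi
    fi
    return 0
}
```

A typical use is bundle_has_data bundle.zip || echo "re-collect from the CLI" immediately after the download, before the file is attached to a case.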

[Figure: VCF 9 Troubleshooting Triage Flow. From a failed task to root cause, in the order that saves time.]
  1. Read the failed task, copy its reference token. Expand the workflow in SDDC Manager / VCF Operations.
  2. Check for a held system lock. A stuck password or cert task blocks all later operations.
  3. Pull logs by token, then the SoS bundle: vcfsos logs <token>, then sos --collect-all-logs.
  4. Match symptom to cause: depot / API, lock, creds, cert, or component version skew.
  5. Remediate, release the lock, re-run. Restart services if needed; clear stale task only as last resort.
Key commands: vcfsos logs <token>; sos --collect-all-logs; sos --sddc-manager-logs; sddcmanager_restart_services.sh; lcm-bundle-transfer-util --setUpOffline…
Log path: /var/log/vmware/vcf/domainmanager/. Bundle out: …/sddc-support/
VCF 9 troubleshooting triage: resolve the workflow and its lock first, read logs second.

The failures that fill real support cases

Across VCF 9.0 and 9.1, a small set of failure families covers the majority of what I see. Here is the symptom, the cause that is usually behind it, and the first move that resolves it. Match against the table before you go deep.

Symptom | Likely cause | First fix
Password or cert task fails with "operation not allowed, lock held" | An earlier password or certificate workflow is stuck FAILED or IN_PROGRESS | Find and resolve the original stuck task to release the lock; restart SDDC Manager services if it will not clear
NSX Manager service-account password rotation fails | SDDC cannot get NSX user details; stale credential or connectivity to NSX (a known 9.x issue) | Validate NSX reachability and credentials, then retry; check the matching Broadcom KB for your build
Nightly backup job suddenly failing | A service-account password rotation failed and left credentials out of sync | Fix the rotation first, then re-run the backup; do not treat the backup as the root cause
Patching shows "No patches available" or async download fails | Depot or token mismatch, or SDDC_SERVICE_API_CALL_FAILED with LCM not reachable | Confirm LCM service is up and its API is reachable; for offline, re-stage with the offline depot utility
Upgrade precheck blocks the jump to 9.0.1 | Skip-upgrade precheck wants an intermediate bundle that is not staged | Stage the required intermediate bundle in the depot, then re-run the precheck
VCF Operations support bundle downloads empty | Known UI collection edge case completing with no data | Re-collect from the command line; do not assume the absence of data means a healthy node

Two of these deserve a sharper opinion. First, patching: when a download fails, the instinct is to re-download the bundle. In practice the bundle is rarely the problem. The error is almost always reachability: either the LCM service is not healthy, or the depot or its token is wrong, and SDDC_SERVICE_API_CALL_FAILED is telling you exactly that. Check the service and the API path first and you will skip a lot of wasted bandwidth. Second, the async patch offline depot tooling has a sharp edge: the transfer utility historically fetches only the bundles you select, not everything, so an offline depot can look complete while still missing patches. Stage with the explicit offline-depot and async-patch options and verify the contents before you trust it.
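
Verifying an offline depot can be as simple as comparing what you expected to stage with what is actually on disk. The sketch below assumes a plain-text manifest with one expected bundle file name per line and matches by file name anywhere under the depot directory. Both the manifest and that matching rule are assumptions for illustration, as is the name missing_bundles; the real depot layout for your build may differ.

```shell
#!/usr/bin/env bash
# missing_bundles: print every expected file name not found under the depot dir.
# ASSUMPTION: manifest is a hypothetical text file, one bundle file name per line.
missing_bundles() {
    local manifest="$1" depot_dir="$2" name
    # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue   # skip blank lines
        if ! find "$depot_dir" -type f -name "$name" | grep -q .; then
            printf '%s\n' "$name"
        fi
    done < "$manifest"
}
```

An empty result from missing_bundles manifest.txt /path/to/depot means every listed file is present; anything it prints is a bundle to stage again before pointing SDDC Manager at the depot.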

For anything that began at deployment rather than day-2, work back to the install logs. The bring-up sequence and where its logs live are covered in the management domain bring-up walkthrough, and a surprising number of “day-2” failures trace back to a shortcut taken during bring-up.


Clearing a stuck task: the right way and the wrong way

Sometimes a workflow will not move and will not clear from the UI, and the lock stays held. There is a documented path to update the status of the failed operation in the operations database so the lock releases, and restarting SDDC Manager services with the bundled restart script is often enough on its own. Both are legitimate. But editing the SDDC database directly is a controlled action, not a casual one. Order matters: try a service restart first, escalate to clearing the stale task only if the restart does not free the lock, take a backup before any database change, and ideally have a support case open so the change is sanctioned for your build.

# Least invasive first: restart SDDC Manager services
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh

# Re-check whether the lock has released and the task cleared
# (retry the original operation; only then consider a DB-level fix)

# Offline depot: stage ALL async patch bundles, not just selected ones
lcm-bundle-transfer-util --setUpOfflineDepot --asyncPatches
Disclaimer: Restarting services and especially editing the SDDC database are production-impacting actions. Take a current SDDC Manager backup, confirm the exact procedure against the Broadcom KB for your precise build, and run it in a maintenance window. Where a database change is involved, open a support case first so the change is sanctioned. Validate against your environment before applying anything here.
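
The "backup before any database change" rule is easier to enforce when a script refuses to continue without one. As a minimal sketch, the guard below succeeds only if a given backup file exists and was modified recently. The function name backup_is_fresh, the 24-hour default and the idea of pointing it at a single backup file are assumptions made here, not part of SDDC Manager.

```shell
#!/usr/bin/env bash
# backup_is_fresh: succeed only if the file exists and was modified within
# the last max_age_min minutes (default 1440, i.e. 24 hours).
# Hypothetical guard for a runbook; adjust the window to your change policy.
backup_is_fresh() {
    local backup="$1" max_age_min="${2:-1440}"
    [ -f "$backup" ] || return 1
    # find prints the path only when its mtime is newer than the window.
    [ -n "$(find "$backup" -mmin "-$max_age_min")" ]
}
```

In a runbook it reads naturally as a gate: backup_is_fresh "$BACKUP_FILE" 120 || { echo "take a backup first" >&2; exit 1; }, where BACKUP_FILE is whatever path your backup job writes to.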

What I'd Do

Build the muscle memory before you need it. The single highest-value habit in VCF 9 operations is treating a failed password or certificate task as a Sev 3 the moment it happens, not when something downstream breaks days later, because that one failure is silently holding a lock. Standardise your triage on the five steps above, keep the SoS commands in a runbook your whole team can reach at 2am, and always verify a support bundle has data in it before you send it. If you run mixed component versions, watch the newer VCF 9.1 diagnostics tooling closely, since version skew between fleet components is becoming one of the more common reasons a workflow behaves strangely. What is the failure that has cost your team the most hours on VCF 9 so far?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.