TL;DR · Key Takeaways
- Most VCF 9 incidents are not mysterious bugs. They are a single failed or in-progress workflow that holds a system lock and blocks every password and certificate operation after it.
- Triage in this order: read the failed task and its reference token, check for a held lock, then pull logs. Reading domainmanager.log first wastes the first hour.
- Always collect the support bundle from the SoS CLI, not only the UI. The VCF Operations UI bundle sometimes completes with no data.
- Async patch and LCM failures are usually depot or API reachability, not the bundle itself. Confirm the LCM service is up before you blame the patch.
- Editing the SDDC database to clear a stuck task is a real fix, but it is the last step, not the first, and only with a backup and ideally a support case open.
Here is the pattern I see on almost every VCF 9 escalation. A password rotation or certificate replace fails once, late at night, on one component. Nobody notices. The next morning a completely unrelated operation refuses to start with a message about the system lock being held by a Password Manager operation in progress, and the team spends two hours reading logs for the wrong thing. The original failure was the disease. Everything after it is a symptom. VCF 9 troubleshooting is mostly about finding that first failed workflow and releasing what it is holding, not about decoding stack traces.
This part of the series is the triage playbook I actually use: the order to work in, where the logs really are, how to pull a support bundle that is not empty, and the handful of failures that account for most real cases. It pairs with the certificate, identity and backup failures deep dive earlier in the series, which goes domain by domain. Here the focus is method.
Work the workflow before you touch a log
SDDC Manager and VCF Operations drive almost everything through workflows: bring-up, domain creation, password rotation, certificate replacement, patching. When something fails, the UI shows you the failed workflow and, critically, a reference token. That token is the thread you pull on. Expand the workflow, find the exact subtask that failed, and copy the reference token from the error. Do not start at the bottom of a log file. Start at the task that knows its own ID.
This matters because of a design behaviour that surprises people: password and certificate operations take a global system lock. If a workflow ends in a FAILED or stuck IN_PROGRESS state, that lock is never released, and every later password or certificate task refuses to run with a message that the operation is not allowed because a lock is held. The fix is not to retry harder. It is to resolve or clear the original stuck task so the lock releases. I have watched teams open a Sev 2 over a backup failure that was really a service-account password rotation that died three days earlier and was quietly blocking everything.
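The "find the original failure" step can be sketched in a few lines of shell. This is a minimal illustration, not an SDDC Manager tool: it assumes you have copied the task list by hand into a tab-separated file with the columns start time (ISO 8601, same time zone), task ID, task type, and status. The file format, the type names, and the function name `first_stuck_task` are all hypothetical.

```shell
# Hypothetical input: one task per line, tab-separated:
#   <ISO-8601 start time> <task id> <task type> <status>
# This is a hand-made export, NOT a format SDDC Manager produces.

# Print the earliest password or certificate task that is still FAILED or
# IN_PROGRESS. That task is the most likely holder of the system lock.
first_stuck_task() {
  export_file="$1"
  awk -F'\t' '
    ($3 ~ /PASSWORD|CERTIFICATE/) && ($4 == "FAILED" || $4 == "IN_PROGRESS")
  ' "$export_file" | LC_ALL=C sort | head -n 1
}
```

Because each line starts with the timestamp, a plain byte-order sort puts the oldest stuck task first. A failed backup three days later never shows up in the output, which is the point: it is a symptom, not the lock holder.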
Where the logs actually live, and how to pull them fast
On the SDDC Manager appliance, the VCF logs sit under /var/log/vmware/vcf. The domain manager log at /var/log/vmware/vcf/domainmanager/domainmanager.log is where most orchestration failures land. But do not grep blind. Use the SoS (Supportability and Serviceability) tool, and where you have the reference token, scope the collection to that workflow so you get the relevant logs across components rather than one file on one node.
# Pull logs for one failed workflow by its reference token
vcfsos logs <reference_token_id>
# Full diagnostic bundle (all components), zipped
/opt/vmware/sddc-support/sos --collect-all-logs --zip
# Just SDDC Manager logs when the failure is clearly local
/opt/vmware/sddc-support/sos --sddc-manager-logs --zip
# Output lands here by default:
# /var/log/vmware/vcf/sddc-support/
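If you want a quick look before a full collection finishes, a token-scoped search is the opposite of grepping blind. The sketch below uses only standard `grep`, `awk`, and `sort`; the function name `logs_for_token` is my own, and the default root is simply the log path mentioned above. It does not replace the SoS collection.

```shell
# List every log file under a directory that mentions a reference token,
# with a per-file count of matching lines, busiest file first.
# Usage: logs_for_token <reference_token_id> [log_root]
logs_for_token() {
  token="$1"
  root="${2:-/var/log/vmware/vcf}"
  # -r recurse, -c count matching lines per file, -F treat the token literally
  grep -rcF -- "$token" "$root" 2>/dev/null |
    awk -F: '$NF > 0 { n = $NF; sub(/:[0-9]+$/, ""); print n "\t" $0 }' |
    sort -rn
}
```

The output is one `count<TAB>path` line per file that contains the token, so you can see at a glance which component logs the workflow actually touched.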
For the management and operations layer, collect the VCF Operations support bundle from Administration so you capture the fleet nodes, and lean on Log Assist when you need to hand a bundle to Broadcom for an open case. One consideration worth knowing: the VCF Operations for Logs bundle occasionally completes through the UI with no data inside. If you open the ZIP and it is empty, do not assume the system is fine. Re-collect from the command line. An empty bundle is a collection failure, not a clean bill of health. For ongoing signal rather than point-in-time capture, wire this into the monitoring you set up in VCF Operations monitoring and observability.
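The "is this bundle actually empty?" check is easy to script so nobody ships a hollow archive to support. A minimal sketch, assuming you have already extracted the bundle into a directory; the function name `bundle_has_data` is hypothetical and nothing here is specific to the VCF bundle layout.

```shell
# Succeed only if the extracted bundle contains at least one non-empty file.
# Usage: bundle_has_data <extracted_bundle_dir>
bundle_has_data() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 2
  fi
  # -size +0c matches regular files larger than zero bytes
  count=$(find "$dir" -type f -size +0c | wc -l | tr -d ' ')
  echo "non-empty files: $count"
  [ "$count" -gt 0 ]
}
```

A directory full of zero-byte files fails the check just like an empty one, which matches the rule above: no data means the collection failed, not that the node is healthy.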
The failures that fill real support cases
Across VCF 9.0 and 9.1, a small set of failure families covers the majority of what I see. Here is the symptom, the cause that is usually behind it, and the first move that resolves it. Match against the table before you go deep.
| Symptom | Likely cause | First fix |
|---|---|---|
| Password or cert task fails with "operation not allowed, lock held" | An earlier password or certificate workflow is stuck FAILED or IN_PROGRESS | Find and resolve the original stuck task to release the lock; restart SDDC Manager services if it will not clear |
| NSX Manager service-account password rotation fails | SDDC cannot get NSX user details; stale credential or connectivity to NSX (a known 9.x issue) | Validate NSX reachability and credentials, then retry; check the matching Broadcom KB for your build |
| Nightly backup job suddenly failing | A service-account password rotation failed and left credentials out of sync | Fix the rotation first, then re-run the backup; do not treat the backup as the root cause |
| Patching shows "No patches available" or async download fails | Depot or token mismatch, or SDDC_SERVICE_API_CALL_FAILED with LCM not reachable | Confirm LCM service is up and its API is reachable; for offline, re-stage with the offline depot utility |
| Upgrade precheck blocks the jump to 9.0.1 | Skip-upgrade precheck wants an intermediate bundle that is not staged | Stage the required intermediate bundle in the depot, then re-run the precheck |
| VCF Operations support bundle downloads empty | Known UI collection edge case completing with no data | Re-collect from the command line; do not assume the absence of data means a healthy node |
Two of these deserve a sharper opinion. First, patching: when a download fails, the instinct is to re-download the bundle. In practice the bundle is rarely the problem. The error is almost always reachability: either the LCM service is not healthy, or the depot or its token is wrong, and SDDC_SERVICE_API_CALL_FAILED is telling you exactly that. Check the service and the API path first and you will save a lot of bandwidth. Second, the async patch offline depot tooling has a sharp edge: the transfer utility has historically fetched only the bundles you select, not everything, so an offline depot can look complete while missing patches. Stage with the explicit offline-depot and async-patch options and verify the contents before you trust it.
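"Verify the contents" can be as simple as comparing the depot against a list of the bundles you expected. The sketch below assumes a plain text file with one expected bundle file name per line, which you would write yourself from the patch notes. It makes no assumption about the depot's internal directory layout and just searches the whole tree. The function name `missing_bundles` and the list format are hypothetical.

```shell
# Print every expected bundle file name that is NOT present under the depot.
# Usage: missing_bundles <expected_list.txt> <offline_depot_dir>
missing_bundles() {
  expected_list="$1"
  depot_dir="$2"
  while IFS= read -r name || [ -n "$name" ]; do
    [ -n "$name" ] || continue          # skip blank lines
    # Search the whole tree; stop at the first hit for this name
    hit=$(find "$depot_dir" -type f -name "$name" -print | head -n 1)
    [ -n "$hit" ] || echo "$name"
  done < "$expected_list"
}
```

Empty output means every listed bundle was found. Any name printed is a patch the depot is missing, even if the staging run reported success.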
For anything that began at deployment rather than day-2, work back to the install logs. The bring-up sequence and where its logs live are covered in the management domain bring-up walkthrough, and a surprising number of “day-2” failures trace back to a shortcut taken during bring-up.
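The symptom table earlier in this section also works as a runbook lookup. Here is a rough sketch that maps an error message to the first move from that table. The matching strings are taken from the symptoms described in this article, not from an authoritative list of VCF error messages, and the function name `first_move` is hypothetical.

```shell
# Map an error message to the first move from the symptom table.
# Matching is case-insensitive; order matters, most specific first.
first_move() {
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *lock*held*|*"operation not allowed"*)
      echo "Find and resolve the original stuck password or certificate task" ;;
    *sddc_service_api_call_failed*|*"no patches available"*)
      echo "Confirm the LCM service is up and its API is reachable" ;;
    *backup*)
      echo "Check for a failed service-account password rotation first" ;;
    *)
      echo "No match: start from the failed task and its reference token" ;;
  esac
}
```

For example, `first_move 'SDDC_SERVICE_API_CALL_FAILED'` points you at the LCM service rather than the bundle, and anything unrecognised falls back to the task-first method.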
Clearing a stuck task: the right way and the wrong way
Sometimes a workflow will not move and will not clear from the UI, and the lock stays held. There is a documented path to update the status of the failed operation in the operations database so the lock releases, and restarting SDDC Manager services with the bundled restart script is often enough on its own. Both are legitimate. But editing the SDDC database directly is a controlled action, not a casual one. Order matters: try a service restart first, escalate to clearing the stale task only if the restart does not free the lock, take a backup before any database change, and ideally have a support case open so the change is sanctioned for your build.
# Least invasive first: restart SDDC Manager services
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
# Re-check whether the lock has released and the task cleared
# (retry the original operation; only then consider a DB-level fix)
# Offline depot: stage ALL async patch bundles, not just selected ones
lcm-bundle-transfer-util --setUpOfflineDepot --asyncPatches
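The escalation order for a stuck task (restart first, backup before any database change, support case ideally open) is worth encoding as a gate in your runbook. This is only a sketch of that idea: the inputs are things a human records by hand, nothing here talks to SDDC Manager, and the function name `db_fix_allowed` is hypothetical.

```shell
# Gate for the last-resort database change.
# Usage: db_fix_allowed <restart_tried: yes|no> <backup_file> <support_case_id>
db_fix_allowed() {
  restart_tried="$1"
  backup_file="$2"
  case_id="$3"
  if [ "$restart_tried" != "yes" ]; then
    echo "BLOCK: restart SDDC Manager services and re-check the lock first"
    return 1
  fi
  # -s is true only if the file exists and is larger than zero bytes
  if [ ! -s "$backup_file" ]; then
    echo "BLOCK: no non-empty backup found at '$backup_file'"
    return 1
  fi
  [ -n "$case_id" ] || echo "WARN: no support case recorded for this change"
  echo "OK: preconditions for a database-level fix are met"
}
```

The missing support case is a warning rather than a block, mirroring the "ideally" in the text. The restart and the backup are hard stops.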
What I'd Do
Build the muscle memory before you need it. The single highest-value habit in VCF 9 operations is treating a failed password or certificate task as a Sev 3 the moment it happens, not when something downstream breaks days later, because that one failure is silently holding a lock. Standardise your triage on the order above (failed task and reference token, held lock, then logs), keep the SoS commands in a runbook your whole team can reach at 2am, and always verify that a support bundle has data in it before you send it. If you run mixed component versions, watch the newer VCF 9.1 diagnostics tooling closely, since version skew between fleet components is becoming one of the more common reasons a workflow behaves strangely. What is the failure that has cost your team the most hours on VCF 9 so far?
References
- Broadcom TechDocs: Troubleshooting VCF and vSphere Foundation deployments
- Broadcom TechDocs: Create a VCF Operations support bundle
- Broadcom KB: VCF Operations for Logs bundle completes with no data
- Broadcom KB: NSX Manager service-account password rotation fails in VCF 9.x
- VCF Blog: Diagnostics for VCF 9.1 with old versions of VCF components



