Dr. Pranay Jha


Troubleshooting NSX 9 DFW and Security Policy: Applied-To, Realized Rules and the Default-Rule Trap (NSX Series, Part 30)

The most common DFW outage is a published rule with the wrong Applied-To. Here is how to see what is really realized on a vNIC, why a rule visible in the UI may not be applied, the default-rule trap, and the symptom-to-fix path. The NSX Series finale.

NSX Series · Part 30 of 30

TL;DR · Key Takeaways

  • The most common DFW outage is a published rule with the wrong Applied-To. A bad Applied-To either fails to realize the rule on the VM you meant, or pushes it far wider than you intended.
  • A rule visible in NSX Manager is not proof the rule is enforced on a VM. You have to check what is actually realized on the vNIC filter, on the host, with vsipioctl.
  • If a rule has an Applied-To error it will not realize on the VM and will not even show in the host CLI. A rule in the GUI that is absent on the vNIC is the classic Applied-To symptom.
  • Packets briefly hitting the default rule often mean realization was not complete, for example the VM was mid-restart, so the full config had not landed yet. Check VM and realization state before blaming the rule.
  • Troubleshoot from the vNIC up: find the filter with summarize-dvfilter, dump the realized rules with vsipioctl getrules, and compare to intent. The host truth beats the Manager view every time.

Who this is for: NSX and security admins debugging distributed firewall behavior, dropped or allowed traffic that should not be, and rules that do not seem to take.

Prerequisites: DFW fundamentals and Applied-To from Part 12, groups and tags from Part 13, and host CLI access.

Here is a claim I will defend: the single most common distributed firewall outage is not a missing rule or a bad signature. It is a correct rule with the wrong Applied-To. The Applied-To field decides which vNICs a rule is actually enforced on, and it is the easiest thing in NSX to get subtly wrong. Set it too narrow and your carefully written rule never lands on the workload you meant to protect. Set it to the wrong group and you can publish a deny across a whole segment you never intended to touch. I have watched a single mis-scoped Applied-To take out an application in seconds. So to close this series, let me show you how to find the truth quickly when the firewall is not doing what the UI says it should.

The UI shows intent, the vNIC shows truth

The mistake that wastes the most time is trusting NSX Manager as the source of truth for what is enforced. The Manager shows your intent: the rules you wrote and published. What is actually enforced lives on the host, in the firewall filter attached to each VM’s vNIC. Between intent and enforcement sits realization, and realization can fail or lag. So when behavior disagrees with the rule table, you do not argue with the rule table. You go to the host and read what the vNIC filter is really enforcing. If the realized rules on the vNIC do not match the Manager, you have found the gap, and the gap is almost always realization, usually driven by Applied-To.

[Diagram: Intent, realization, enforcement. NSX Manager (your intent, the rule table) → Applied-To (scopes realization) → vNIC filter on the host (what is enforced, the only truth). Applied-To decides which vNICs the rule actually lands on; a wrong Applied-To breaks the middle arrow, so the rule never reaches the vNIC you meant.]
Diagram 1: The rule in the UI is intent. Applied-To scopes realization. The vNIC filter is the only place that tells you what is truly enforced.

Read the realized rules on the host

The workflow is short and it is the single highest-value DFW troubleshooting skill. On the ESX host where the VM runs, find the firewall filter attached to its vNIC with summarize-dvfilter, then dump the rules that filter is actually enforcing with vsipioctl. If the rule you expect is not in that output, it is not enforced on that VM, no matter what the Manager shows. The generation number and realization timestamp from the firewall config tell you whether what you are looking at is current.

# 1. Find the VM's vNIC filter name on the ESX host
summarize-dvfilter | grep -A 4 <vm-name>

# 2. Dump the rules actually realized on that filter
vsipioctl getrules -f <nic-filter-name>

# 3. Inspect realized config: generation number + realization time
vsipioctl getfwconfig -f <nic-filter-name>

# If the expected rule is absent here, it is NOT enforced on this VM
# (most often an Applied-To error: it never realized, and never shows)

In practice: when a rule is in the GUI but missing from vsipioctl getrules, stop looking at the rule and look at its Applied-To. An Applied-To error means the rule never realizes on the VM and does not even appear in the host CLI, which is exactly why people stare at a perfect-looking rule for an hour wondering why it does nothing.
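
The comparison between intent and the host can be made mechanical. What follows is a minimal sketch in Python, not an NSX tool: it assumes you have already written down two lists of rule IDs by hand, one for the rules you published and scoped to this VM, and one read off the getrules output for its filter. The function name and the sample IDs are hypothetical.

```python
def realization_gap(intended_ids, realized_ids):
    """Compare rule IDs you expect on a vNIC with the IDs realized there.

    Both arguments are plain collections of rule IDs gathered by hand;
    nothing here talks to NSX Manager or to the host.
    """
    intended = set(intended_ids)
    realized = set(realized_ids)
    return {
        # Published and scoped to this VM, but absent on the filter:
        # the classic Applied-To or realization symptom.
        "missing_on_vnic": sorted(intended - realized),
        # On the filter but not in your expectation: possibly a rule
        # whose Applied-To is wider than you intended.
        "unexpected_on_vnic": sorted(realized - intended),
    }


# Hypothetical rule IDs, for illustration only.
gap = realization_gap(intended_ids=[1001, 1002, 1003], realized_ids=[1001, 1003, 2])
print(gap)
```

Anything under missing_on_vnic sends you to Applied-To; anything under unexpected_on_vnic is worth a second look for a scope that is too wide.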

The default-rule trap and group membership

Two more failure modes account for most of the rest. The first is the default-rule trap: traffic you expected a specific rule to handle is briefly hitting the default rule instead. Before you assume the rule is wrong, check whether the VM was up and fully realized at the time. If a VM was restarting, the full DFW configuration may not have landed yet, so traffic falls through to the default for a window. That is a realization timing issue, not a rule logic issue, and chasing it as the latter wastes hours.
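
The timing question reduces to comparing two timestamps. Here is a small Python sketch of that check, under the assumption that you have noted by hand when the packet hit the default rule (from your logs) and when the filter's configuration was realized. The function name and the timestamps are hypothetical.

```python
from datetime import datetime


def default_hit_explained_by_timing(hit_time, realized_time):
    """Return True when a default-rule hit happened before the vNIC
    filter's configuration was realized, which points at timing
    rather than at rule logic.
    """
    return hit_time < realized_time


# Hypothetical timestamps, for illustration only.
realized = datetime(2025, 1, 10, 9, 0, 30)
during_restart = datetime(2025, 1, 10, 9, 0, 12)
well_after = datetime(2025, 1, 10, 9, 5, 0)

print(default_hit_explained_by_timing(during_restart, realized))
print(default_hit_explained_by_timing(well_after, realized))
```

A hit that lands before realization is a timing artifact. A hit that lands well after it is not, and that is when the rule itself deserves scrutiny.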

The second is empty or wrong group membership. A rule that references a dynamic group is only as good as that group’s effective members. If the tag expression matches nothing, or a workload lost its tag, the group has no effective member IPs and the rule matches no traffic. An allow rule then silently does nothing, and the traffic it should have permitted falls through to whatever drop sits below it. This ties straight back to the tag taxonomy discipline from Part 13: when a rule misbehaves, check the group’s effective members before you touch the rule itself.
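
To make the membership point concrete, here is a toy model in Python of a tag-based dynamic group. It is a sketch under simplifying assumptions, not the NSX grouping engine: a group is reduced to a single required tag, and the inventory, VM names, and tag strings are all hypothetical.

```python
def effective_members(vm_tags, required_tag):
    """Return the VMs whose tag set contains required_tag.

    vm_tags is a dict of VM name -> set of tags.
    """
    return sorted(vm for vm, tags in vm_tags.items() if required_tag in tags)


# Hypothetical inventory, for illustration only.
inventory = {
    "web-01": {"app:shop", "tier:web"},
    "web-02": {"app:shop"},              # lost its tier tag
    "db-01": {"app:shop", "tier:db"},
}

print(effective_members(inventory, "tier:web"))  # web-02 is silently missing
print(effective_members(inventory, "tier:wbe"))  # a typo matches nothing
```

Both failure modes show up as a shorter member list than you expected, which is why the member list is the first thing to read.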

[Diagram: A fast triage path. Is the rule realized on the vNIC, that is, does it appear in vsipioctl getrules? The answer splits the whole problem. No: not realized, so suspect an Applied-To error or realization lag. Yes: realized but wrong result, so check group members, rule order, and default-rule timing.]
Diagram 2: One question splits DFW problems in two. Not realized points at Applied-To; realized-but-wrong points at members, order, or timing.

Symptom | Likely cause | Fix
Rule in GUI, absent in getrules | Applied-To error, not realized | Fix Applied-To scope, republish, re-check
Whole segment suddenly blocked | Applied-To too broad on a deny | Narrow Applied-To to the intended group
Traffic briefly hits default rule | Realization incomplete (VM restart) | Confirm VM up and config realized
Rule allows/drops nothing | Group has no effective members | Check tags and group membership
Right rule, wrong precedence | A broader rule above it matches first | Check category and rule order
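
The triage split and the table above can be written down as a small decision function. This is only a Python sketch of the reasoning, with three observations you collect yourself as inputs. The function name, parameter names, and returned strings are my own shorthand, not anything NSX produces.

```python
def triage(in_getrules, group_has_members=True, vm_was_restarting=False):
    """Map three observations to the most likely cause.

    in_getrules        -- is the expected rule in the vNIC's realized rules?
    group_has_members  -- does the rule's group have effective members?
    vm_was_restarting  -- was the VM restarting when the traffic was seen?
    """
    if not in_getrules:
        if vm_was_restarting:
            return "realization lag: confirm VM is up and config is realized"
        return "Applied-To error: fix scope, republish, re-check"
    if not group_has_members:
        return "empty group: check tags and group membership"
    return "rule order: check category and rule precedence"


print(triage(in_getrules=False))
print(triage(in_getrules=False, vm_was_restarting=True))
print(triage(in_getrules=True, group_has_members=False))
print(triage(in_getrules=True))
```

The first branch is the one that matters: everything else only makes sense once you know whether the rule is on the filter at all.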

Worked example

A team publishes a new allow rule for an app tier and traffic still drops. The rule looks perfect in the UI. On the host, vsipioctl getrules on the VM’s filter does not list it at all. The Applied-To was set to a group whose membership expression had a typo, so it matched zero VMs, so the rule realized nowhere and never appeared in the CLI. Fix the group expression, the members populate, the rule realizes on the vNIC, traffic flows. The lesson the whole series keeps returning to: the host tells the truth, and Applied-To plus group membership is where DFW problems hide.
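
The chain in this example, from empty group to rule realized nowhere, can be modeled in a few lines. The Python below is a toy model of Applied-To scoping under the assumption that a rule lands on every vNIC of every member of its Applied-To group. The group name, VM names, and vNIC names are hypothetical.

```python
def realized_on(applied_to_group, groups, vnics_by_vm):
    """Return the vNICs a rule would land on, given its Applied-To group.

    groups       -- dict of group name -> list of member VM names
    vnics_by_vm  -- dict of VM name -> list of vNIC names
    """
    members = groups.get(applied_to_group, [])
    return [vnic for vm in members for vnic in vnics_by_vm.get(vm, [])]


vnics_by_vm = {"app-01": ["app-01.eth0"], "app-02": ["app-02.eth0"]}

# Before the fix: the membership expression had a typo, so the group is empty.
before = realized_on("app-tier", {"app-tier": []}, vnics_by_vm)

# After the fix: members populate and the rule lands on their vNICs.
after = realized_on("app-tier", {"app-tier": ["app-01", "app-02"]}, vnics_by_vm)

print(before)
print(after)
```

An empty result is exactly what the host showed: no vNIC, no realized rule, nothing in the CLI.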

Disclaimer: host CLI commands shown here are representative for DFW troubleshooting and may vary by NSX version and platform. Validate exact syntax against current Broadcom NSX documentation, run read-only diagnostics first, and make rule or group changes through your change process, testing the effect on a non-critical workload before production.

Final Thoughts

When the distributed firewall misbehaves, do not debate the rule table. Go to the vNIC, read what is realized with summarize-dvfilter and vsipioctl, and let the host settle the argument. A rule that is in the GUI but missing on the filter is an Applied-To problem nine times out of ten. A rule that is realized but does the wrong thing is usually group membership, rule order, or a realization timing window after a VM restart. Internalize that one split, not-realized versus realized-but-wrong, and you will cut most DFW investigations from hours to minutes.

That brings the NSX Series to a close: thirty parts, running from what NSX 9 is through architecture, deployment, routing, security, multi-tenancy, operations, and the failure modes that actually bite. If there is one thread running through all of it, it is this: NSX rewards architects who design deliberately and verify against reality rather than trusting the console. Size for the failure case, scope your policy precisely, check realized state, and treat the network as code. Do that, and NSX 9 is one of the most capable platforms you can run. Thanks for reading the whole way through. Now go check your Applied-To fields before something else does.

[Diagram: First match wins, top to bottom. Evaluation order: Ethernet, Emergency, Infrastructure, Environment, Application, then Default. If a rule seems ignored, something above it matched first. Check precedence.]
Diagram 3: Categories evaluate top to bottom and the first match wins. A broad allow high up will mask a specific rule below it.

Rule order and category precedence

When a rule seems to do nothing, the cause is often that another rule matched first. The distributed firewall evaluates top to bottom, both across categories and within them, and the first matching rule decides the verdict. A broad allow sitting high in the order, perhaps left over from an early permissive phase, will match traffic before your carefully written specific rule ever gets a look, so the specific rule appears ignored when it is simply unreachable. The category structure exists precisely to impose a sane order on this, with infrastructure and environment rules above application rules, but you can still defeat it by placing an overly broad rule where it shadows everything below.

So when a rule does not behave, check precedence before you rewrite the rule. Walk the categories from the top and ask what could match this traffic before my rule does. Shadowed and overly permissive rules are the two findings that explain most of these cases, and the firewall rule analysis tooling will surface them for you. The fix is usually not a new rule but moving or tightening an existing one, because the rule you wrote was correct, it was just standing behind a rule that answered first.
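
Shadowing is easy to see in a toy evaluator. The Python sketch below walks rules category by category and returns the first match. It is a simplified model of the behavior described here, not the DFW datapath: rules are reduced to a name, a category, a sequence number, and a match function, and every rule and packet in it is hypothetical.

```python
CATEGORY_ORDER = ["Ethernet", "Emergency", "Infrastructure", "Environment", "Application"]


def first_match(rules, packet):
    """Return the name of the first rule that matches the packet.

    Rules are walked category by category, then by sequence number
    inside a category. If nothing matches, the default rule answers.
    """
    ordered = sorted(
        rules,
        key=lambda r: (CATEGORY_ORDER.index(r["category"]), r["seq"]),
    )
    for rule in ordered:
        if rule["match"](packet):
            return rule["name"]
    return "default"


# Hypothetical rules, for illustration only.
legacy_allow = {
    "name": "legacy-allow-any",
    "category": "Infrastructure",
    "seq": 10,
    "match": lambda p: True,                   # any to any
}
block_db = {
    "name": "block-db-from-web",
    "category": "Application",
    "seq": 10,
    "match": lambda p: p["dst_port"] == 3306,  # the specific rule
}

packet = {"src": "10.0.1.5", "dst": "10.0.2.9", "dst_port": 3306}

print(first_match([block_db, legacy_allow], packet))  # the broad rule answers first
print(first_match([block_db], packet))                # with it gone, the specific rule is reached
```

Nothing about block-db-from-web changed between the two calls. Only what stood in front of it did.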

Connection state and the established-flow gotcha

The distributed firewall is stateful, and that statefulness produces a test result people misread constantly. When you publish a new deny rule, it governs new connections, but an already-established session may keep flowing under the state table entry it created when it was first allowed, until that connection actually ends. So an engineer adds a deny, tests with an existing SSH session or a long-lived database connection, sees traffic still passing, and concludes the rule did not work. The rule worked perfectly; it just does not retroactively tear down a connection that was permitted when it began.

Knowing this saves a lot of false alarms. When you test enforcement changes, test with new connections, not with sessions that were open before the change, and if you genuinely need to cut existing flows you have to account for connection state explicitly rather than assuming a rule change is instantaneous for everything. This is normal stateful-firewall behaviour, but it surprises people who expect a published rule to act on every packet immediately. Expect the established-flow lag, test accordingly, and you stop chasing a rule that is doing exactly what it should.
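
If the established-flow behavior still feels abstract, this toy stateful filter in Python shows it. It is a deliberately simplified model and an assumption-laden one: rules are reduced to a set of denied destination ports, a flow is a (src, dst, dst_port) tuple, and the class and method names are hypothetical rather than anything in NSX.

```python
class StatefulFirewall:
    """Toy stateful filter: rules decide new flows, the state table carries existing ones."""

    def __init__(self):
        self.denied_ports = set()
        self.state_table = set()  # flows admitted while they were allowed

    def deny_port(self, port):
        self.denied_ports.add(port)

    def packet(self, flow, new_connection):
        """Return 'ALLOW' or 'DROP' for a packet belonging to flow."""
        if not new_connection:
            # Mid-stream packet: only the state table is consulted.
            return "ALLOW" if flow in self.state_table else "DROP"
        if flow[2] in self.denied_ports:
            return "DROP"
        self.state_table.add(flow)
        return "ALLOW"

    def close(self, flow):
        self.state_table.discard(flow)


fw = StatefulFirewall()
ssh = ("10.0.0.5", "10.0.0.9", 22)

fw.packet(ssh, new_connection=True)          # session opens while SSH is allowed
fw.deny_port(22)                             # new deny rule is published

print(fw.packet(ssh, new_connection=False))  # existing session keeps flowing
print(fw.packet(("10.0.0.6", "10.0.0.9", 22), new_connection=True))  # new session is dropped
fw.close(ssh)
print(fw.packet(ssh, new_connection=True))   # after it ends, a reconnect is dropped too
```

The deny was effective from the moment it was published. It simply had nothing to say about a session that was already in the table.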

Exclusion lists and where rules simply do not apply

One more reason a rule can appear to do nothing: the workload it should govern is on the distributed firewall exclusion list. Certain management and infrastructure VMs are deliberately excluded from DFW enforcement so that a bad rule cannot lock out the very things you need to fix it, and that is a sensible safety mechanism. It is also a trap when you forget about it, because a rule that looks correct will have no effect on an excluded VM, and you can spend a long time debugging the rule before you think to check whether the target is even being enforced.

So when a specific workload stubbornly ignores a rule that clearly should apply, check the exclusion list before you touch the rule. If the VM is excluded, the rule is irrelevant to it by design, and the question becomes whether that exclusion is intentional rather than what is wrong with the rule. This is the same lesson this whole Part keeps returning to: confirm where enforcement actually applies before you debate the policy, because the host, and the exclusion list, settle questions that the rule table never can.
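
The order of those checks matters, so here it is as a short Python sketch. It assumes you have the exclusion list as a plain set of VM names and already know whether the expected rule is realized on the vNIC. The function name, VM names, and returned strings are hypothetical.

```python
def where_to_look(vm, excluded_vms, expected_rule_realized):
    """Decide what to examine first for a VM that ignores a rule."""
    if vm in excluded_vms:
        return "excluded: rule is irrelevant by design, review the exclusion"
    if not expected_rule_realized:
        return "not realized: check Applied-To and realization state"
    return "enforced: check members, order, and connection state"


# Hypothetical exclusion list, for illustration only.
excluded = {"mgmt-01", "backup-proxy-01"}

print(where_to_look("mgmt-01", excluded, expected_rule_realized=True))
print(where_to_look("app-01", excluded, expected_rule_realized=False))
print(where_to_look("app-01", excluded, expected_rule_realized=True))
```

The exclusion check comes first because it makes every later question about the rule moot.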

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
