Dr. Pranay Jha

How to Benchmark LLM Inference on VMware Private AI with genai-perf (Private AI Series, Part 21)

A practical runbook for benchmarking NIM inference on VMware Private AI Foundation: the metrics that matter, the concurrency sweep that exposes the real latency-throughput curve, and how to pick an operating point you can defend.

VMware Private AI Series · Part 21 of 24

TL;DR · Key Takeaways

  • A single tokens-per-second number is close to useless. Benchmark the latency-versus-throughput curve by sweeping concurrency, then read off the throughput your TTFT and ITL budget actually allows.
  • Use NVIDIA genai-perf (now shipping as aiperf) against the OpenAI-compatible endpoint your NIM exposes. Set input and output sequence length to match your real workload, not the 128/128 default.
  • The metrics that matter: TTFT, ITL (time per output token), total TPS per system, TPS per user, and RPS. They trade against each other, so name your SLO before you run anything.
  • Broadcom and NVIDIA measured vGPU on Private AI Foundation within roughly 1 to 2 percent of bare metal for Llama 3.1 70B inference. The hypervisor is not where your throughput goes.
  • The throughput killers are real: wrong sequence-length mix, concurrency set to 1, an undersized KV cache, and a GPU profile that cannot hold the model plus its cache.
Who this is for: architects and platform engineers sizing or validating inference on VMware Private AI Foundation with NVIDIA.
Prerequisites: a running NIM endpoint on a GPU workload domain (PAIF 9.0 or 9.1), a client VM or pod that can reach it, and Python 3.10 or later.

Here is the number you will be tempted to quote in the design review: “the L40S does 3,200 tokens per second.” It is a real measurement and it is also misleading, because nobody told you it was taken at a concurrency of 64 with a time to first token of 4 seconds. At the concurrency and latency your chatbot actually needs, that same GPU might give you a third of it. Benchmarking inference is not about finding the biggest number. It is about finding the throughput you can sustain while staying inside a latency budget, and that requires a sweep, not a single run.

This part walks through how to benchmark NIM inference on Private AI Foundation with genai-perf, which metrics to trust, the parameter that quietly decides every result, and how to read the output so the answer survives contact with production.

The five metrics that actually describe an LLM endpoint

An LLM request has two phases with different costs, and most benchmarking confusion comes from collapsing them into one average. The prefill phase reads your whole prompt and builds the KV cache. The decode phase then emits tokens one at a time. Five metrics describe what the user and the system experience across those phases.

[Figure: Anatomy of one request, showing where each metric is measured along the timeline. Request sent, then TTFT (prefill + queue) up to the first token, then ITL during decode (one token at a time) up to the last token. End-to-end latency = TTFT + generation time.]
One request, two phases. TTFT covers prefill and queueing; ITL governs the decode stream.
  • Time to first token (TTFT): the wait from sending the query to seeing the first token. It includes request queuing, prefill compute, and network latency, and it grows with prompt length because the model must build the KV cache for the entire input before decoding starts. This is the number a user feels as “responsiveness”.
  • Inter-token latency (ITL), also called time per output token: the average gap between consecutive tokens during decode. genai-perf defines it as (end-to-end latency minus TTFT) divided by (output tokens minus 1), deliberately excluding the first token so it measures only the decode loop. ITL is what makes streaming text feel smooth or stuttery.
  • Total TPS per system: output tokens per second across all concurrent requests. It climbs as you add concurrency until the GPU saturates, then flattens or falls. This is the metric you size capacity against.
  • TPS per user: throughput from one request’s point of view, which asymptotically approaches 1 divided by ITL. As you pack in more concurrent users, total TPS rises while per-user TPS drops. That tension is the whole game.
  • Requests per second (RPS): completed requests per second. Useful when your unit of work is a whole short response rather than a long stream.
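The per-request definitions above can be sketched in a few lines of Python. This is an illustration of the arithmetic, not genai-perf's implementation: `RequestTrace` and its fields are hypothetical names, and the timestamps are made up.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Client-side timestamps for one streamed request, in seconds (hypothetical structure)."""
    sent: float               # when the request left the client
    token_times: list[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: queueing + prefill + network, as seen by the client."""
    return trace.token_times[0] - trace.sent


def e2e_latency(trace: RequestTrace) -> float:
    """End-to-end latency: from send to the last token."""
    return trace.token_times[-1] - trace.sent


def itl(trace: RequestTrace) -> float:
    """Inter-token latency: (e2e - TTFT) / (output tokens - 1), first token excluded."""
    n = len(trace.token_times)
    if n < 2:
        raise ValueError("ITL needs at least two output tokens")
    return (e2e_latency(trace) - ttft(trace)) / (n - 1)


def tps_per_user(trace: RequestTrace) -> float:
    """Per-user throughput: output tokens divided by end-to-end latency."""
    return len(trace.token_times) / e2e_latency(trace)


# Made-up request: first token after 0.5 s, then one token every 50 ms, 101 tokens total.
trace = RequestTrace(sent=0.0, token_times=[0.5 + 0.05 * i for i in range(101)])
print(f"TTFT {ttft(trace):.3f}s  ITL {itl(trace):.3f}s  "
      f"TPS/user {tps_per_user(trace):.1f}  1/ITL {1 / itl(trace):.1f}")
```

In this made-up trace, per-user TPS comes out a little below 1/ITL because TTFT is still part of the end-to-end time; the gap shrinks as the output gets longer.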

One trap worth naming up front: different tools compute these differently. genai-perf excludes TTFT from ITL; the older LLMPerf includes it. genai-perf measures TPS between the first request and the last response using a sliding window that discards warm-up and cool-down; LLMPerf divides by the entire benchmark wall clock, which in a single-concurrency run can fold in up to a third of overhead that has nothing to do with the GPU. Pick one tool and stay on it. Comparing a genai-perf number to an LLMPerf number is comparing two different definitions.
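To see how far two definitions can drift apart on the same data, here is a small sketch. The "including TTFT" formula is a simplified stand-in for any tool that folds the first-token wait into its per-token figure, not a copy of LLMPerf's code, and all numbers are invented.

```python
def itl_excluding_ttft(e2e_s: float, ttft_s: float, output_tokens: int) -> float:
    """Decode-only definition: (e2e - TTFT) / (output tokens - 1)."""
    return (e2e_s - ttft_s) / (output_tokens - 1)


def itl_including_ttft(e2e_s: float, output_tokens: int) -> float:
    """Simplified stand-in for a definition that spreads TTFT across every token."""
    return e2e_s / output_tokens


def tps(total_output_tokens: int, seconds: float) -> float:
    """Tokens per second over whichever time span the tool chooses to divide by."""
    return total_output_tokens / seconds


# Same decode speed (5.0 s for 101 tokens), two different prefill costs.
for ttft_s in (0.5, 3.0):
    e2e_s = ttft_s + 5.0
    print(f"TTFT {ttft_s:.1f}s  "
          f"excl {itl_excluding_ttft(e2e_s, ttft_s, 101) * 1000:.1f} ms  "
          f"incl {itl_including_ttft(e2e_s, 101) * 1000:.1f} ms")

# Same 10,000 tokens: a 60 s measurement window versus a 90 s wall clock
# that also counts 30 s of start-up and teardown.
print(f"windowed {tps(10_000, 60):.0f} tok/s  wall-clock {tps(10_000, 90):.0f} tok/s")
```

The decode-only figure stays put when the prompt gets longer, while the TTFT-inclusive figure moves with it, which is why numbers from two tools should not share a chart.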

Stand up the benchmark rig against your NIM

NIM exposes an OpenAI-compatible API, and genai-perf speaks exactly that. The benchmark rig is just a client process that you point at the endpoint your NIM microservice already serves. Run it on a separate VM or pod, not on the GPU host, so you are not stealing CPU from the thing you are measuring. Keep it on the same workload-domain network segment so you are testing the model, not the WAN.

[Figure: Where the load comes from. A client VM running genai-perf / aiperf (concurrency sweep) sends HTTP requests to the NIM pod (OpenAI-compatible /v1/chat/completions), which drives the vGPU on the ESXi host (C-series profile) through CUDA.]
Keep the load generator off the GPU host and on the same segment as the endpoint.

Install the tool and confirm it can reach the endpoint. NVIDIA renamed the project: genai-perf is now distributed as aiperf, and the flags are compatible. A first smoke test:

# installs the genai-perf CLI used below; the successor package (pip install aiperf) is invoked as "aiperf profile"
pip install genai-perf

# point at the NIM endpoint and confirm it answers
genai-perf profile \
  --model meta/llama-3.1-70b-instruct \
  --url http://nim-llama-70b.private-ai.svc:8000 \
  --endpoint-type chat \
  --streaming \
  --synthetic-input-tokens-mean 1024 \
  --output-tokens-mean 256 \
  --concurrency 1 \
  --request-count 50

That concurrency-of-1 run is a sanity check, not a result. It tells you the model is loaded, streaming works, and your TTFT at idle is sane. Treat any throughput number from it as a lower bound and nothing more.
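Before the smoke test, it can help to confirm from the client VM that the endpoint answers and serves the model name you intend to pass to `--model`. A minimal sketch using only the standard library, assuming the endpoint exposes the OpenAI-style `/v1/models` route; the function names are hypothetical and the URL in the usage note is the one from the example above.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the OpenAI-style model listing URL from the endpoint base URL."""
    return base_url.rstrip("/") + "/v1/models"


def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style {"data": [{"id": ...}]} response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query the endpoint and return the model names it reports (raises on network errors)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Usage from the client VM: `list_served_models("http://nim-llama-70b.private-ai.svc:8000")`. If the call times out or the expected model name is missing, fix that before reading anything into a benchmark number.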

The parameter that decides every result: sequence length and concurrency

Input sequence length (ISL) and output sequence length (OSL) shape the entire benchmark, and the default 128/128 represents almost no real workload. A RAG answer stuffs thousands of retrieved tokens into the prompt and emits a few hundred, so it is prefill-heavy and TTFT-bound. A code or essay generator takes a short prompt and emits a long completion, so it is decode-heavy and ITL-bound. These two profiles stress completely different parts of the GPU. Benchmarking one when you will run the other gives you a confident, precise, wrong answer.
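If you already log token counts per request, deriving the benchmark shape is a few lines of arithmetic. A sketch, assuming you can export prompt and completion token counts from your gateway or application logs; the sample numbers are invented, and the flag names are the ones used in the commands in this article.

```python
from statistics import mean, stdev


def sequence_shape(input_tokens: list[int], output_tokens: list[int]) -> dict[str, int]:
    """Turn observed token counts into ISL/OSL mean and standard deviation settings."""
    return {
        "--synthetic-input-tokens-mean": round(mean(input_tokens)),
        "--synthetic-input-tokens-stddev": round(stdev(input_tokens)),
        "--output-tokens-mean": round(mean(output_tokens)),
        "--output-tokens-stddev": round(stdev(output_tokens)),
    }


# Invented sample of a RAG-style workload: long prompts, short answers.
prompt_tokens = [2800, 3100, 3300, 2900, 2900]
completion_tokens = [180, 220, 200, 190, 210]

for flag, value in sequence_shape(prompt_tokens, completion_tokens).items():
    print(flag, value)
```

A real sample should cover at least a full business cycle of traffic; five requests are only here to keep the arithmetic visible.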

Concurrency is the second lever, and it is the one people forget to sweep. Total throughput rises with concurrency because NIM batches in-flight requests, but TTFT and ITL degrade as the batch fills and requests queue. The deliverable from a benchmark is not a point. It is the curve below, and the operating point you can defend lives on it where your latency SLO intersects.

[Figure: The latency-throughput curve. Horizontal axis: total throughput (tokens/sec); vertical axis: per-request latency. Each dot is a concurrency level (c=1, c=8, c=32, c=64). A latency SLO line separates the usable operating point from the peak. Pick the point under your latency budget, not the peak.]
Past the SLO line you are buying throughput with latency your users will not accept.

Notice the knee. Up to concurrency 32 in this shape, throughput climbs steeply while latency stays under budget. Past it, each extra unit of concurrency buys a little throughput for a lot of latency. The defensible operating point is at or just below where the curve crosses your SLO line, not at the top right where the marketing number lives.
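One way to locate the knee numerically is to compare how much throughput each concurrency step buys per unit of added latency. The sketch below is a simple heuristic of my own choosing, not something genai-perf reports; the 25 percent cut-off and the sample points are assumptions made for illustration.

```python
def find_knee(points: list[tuple[int, float, float]], min_fraction: float = 0.25) -> int:
    """Return the last concurrency before marginal gains collapse.

    points: (concurrency, total_tps, latency_seconds), sorted by concurrency.
    A step counts as "collapsed" when its throughput gain per second of added
    latency falls below min_fraction of the best gain seen so far.
    """
    best_gain = 0.0
    knee = points[0][0]
    for (_, tps0, lat0), (c1, tps1, lat1) in zip(points, points[1:]):
        added_latency = lat1 - lat0
        if added_latency <= 0:
            # Latency did not rise: accept the step only if throughput improved.
            if tps1 <= tps0:
                break
            knee = c1
            continue
        gain = (tps1 - tps0) / added_latency
        best_gain = max(best_gain, gain)
        if gain < min_fraction * best_gain:
            break
        knee = c1
    return knee


# Invented curve shaped like the figure: steep climb to c=32, then a flat, slow tail.
curve = [(1, 60.0, 1.0), (8, 420.0, 1.3), (32, 1300.0, 2.0), (64, 1500.0, 4.5)]
print("knee at concurrency", find_knee(curve))
```

The knee is a property of the curve; the operating point is still decided by where the curve crosses your SLO, which may sit below the knee.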

Run the sweep

Disclaimer: a saturating sweep drives the GPU to full load and can affect co-located tenants. Run it against a non-production endpoint or a maintenance window, confirm the vGPU profile and NIM version match your target deployment, and record the exact model, ISL, OSL, and driver build alongside every result so the numbers are reproducible.

Fix ISL and OSL to your real workload, then step concurrency across a range that brackets the knee. Give each level enough requests that the sliding window has stable data, and let NIM warm up first so the first batch does not pollute the average.

# RAG-shaped profile: long prompt, short answer
# Each concurrency level writes its results to its own directory.
for C in 1 4 8 16 32 64 128; do
  genai-perf profile \
    --model meta/llama-3.1-70b-instruct \
    --url http://nim-llama-70b.private-ai.svc:8000 \
    --endpoint-type chat --streaming \
    --synthetic-input-tokens-mean 3000 \
    --synthetic-input-tokens-stddev 300 \
    --output-tokens-mean 200 \
    --output-tokens-stddev 30 \
    --concurrency "$C" \
    --request-count $((C * 20)) \
    --warmup-request-count 10 \
    --artifact-dir "./results/c$C"
done
  1. Set --synthetic-input-tokens-mean and --output-tokens-mean to your measured workload, with a realistic standard deviation so the batch mix is not artificially uniform.
  2. Scale --request-count with concurrency so higher levels still collect enough completed requests for a stable window.
  3. Keep --warmup-request-count non-zero. The first requests pay one-time warm-up costs and skew TTFT badly.
  4. Write each level to its own --artifact-dir so you can plot the curve afterward.
  5. Watch the GPU from the other side at the same time so you can correlate the knee with real saturation. See GPU monitoring with VCF Operations for the signals worth watching.
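Once the sweep finishes, the per-level directories can be folded into one curve. The sketch below assumes each `./results/c<N>` directory contains a JSON summary named `profile_export_genai_perf.json` with `time_to_first_token`, `inter_token_latency`, and `output_token_throughput` entries; the file name and key names vary between tool versions, so treat them as assumptions and check them against your own export before relying on this.

```python
import json
from pathlib import Path


def load_level(artifact_dir: Path, filename: str = "profile_export_genai_perf.json") -> dict:
    """Load the summary JSON for one concurrency level (file name is an assumption)."""
    matches = sorted(artifact_dir.rglob(filename))
    if not matches:
        raise FileNotFoundError(f"no {filename} under {artifact_dir}")
    with matches[0].open() as fh:
        return json.load(fh)


def summarize(export: dict) -> dict:
    """Pull the headline numbers out of one summary (key names are assumptions)."""
    return {
        "ttft_p95_ms": export["time_to_first_token"]["p95"],
        "itl_p95_ms": export["inter_token_latency"]["p95"],
        "total_tps": export["output_token_throughput"]["avg"],
    }


def build_curve(results_root: Path) -> list[dict]:
    """Collect one row per c<N> directory, sorted by concurrency."""
    rows = []
    for level_dir in results_root.glob("c*"):
        suffix = level_dir.name[1:]
        if not level_dir.is_dir() or not suffix.isdigit():
            continue
        row = {"concurrency": int(suffix)}
        row.update(summarize(load_level(level_dir)))
        rows.append(row)
    return sorted(rows, key=lambda r: r["concurrency"])
```

Calling `build_curve(Path("./results"))` after the loop above returns rows you can plot or pass to a spreadsheet.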

Read the results and pick an operating point

For each concurrency level genai-perf reports TTFT, ITL, total TPS, per-user TPS, and RPS, with percentiles. Use percentiles, not the mean: a p50 TTFT of 300 ms with a p99 of 6 seconds is a bad endpoint wearing a good average. The table below is the matrix I hand a client, mapping each metric to what it tells you and the trap that hides behind it.

| Metric | What it tells you | Common trap |
|---|---|---|
| TTFT (p95) | Perceived responsiveness; prefill plus queue cost | Measured at short ISL, then blows up on real RAG prompts |
| ITL / TPOT (p95) | Streaming smoothness; decode efficiency | Hides behind a healthy mean while p99 stutters |
| Total TPS per system | Capacity you can sell; sizing input | Quoted at a concurrency no SLO would allow |
| TPS per user | What one user feels; tracks 1/ITL | Ignored, so per-user experience collapses under load |
| RPS | Throughput for short-response workloads | Misleading for long streaming generations |
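The point about percentiles is easy to demonstrate. The sketch below uses a plain nearest-rank percentile on invented TTFT samples; genai-perf computes its own percentiles, so this only illustrates why the mean and median hide the tail.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented TTFT samples in seconds: 95 fast requests and 5 that queued badly.
ttft_samples = [0.3] * 95 + [6.0] * 5

print(f"mean {mean(ttft_samples):.3f}s  "
      f"p50 {percentile(ttft_samples, 50):.1f}s  "
      f"p95 {percentile(ttft_samples, 95):.1f}s  "
      f"p99 {percentile(ttft_samples, 99):.1f}s")
```

Here the median and even p95 look healthy while p99 is twenty times worse, which is the "good average" disguise described above.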

The decision is mechanical once you have the curve and an SLO. Walk it like this.

[Figure: Choosing the operating point. State the TTFT + ITL SLO. Find the highest concurrency that still meets the SLO. Check that the GPU is below saturation and the KV cache is not full. Then ask whether throughput is enough: if yes, lock the operating point and size to it; if no, scale out replicas or move to a bigger profile.]
SLO first, then the highest concurrency that fits, then size to that point.

The virtualization question, answered

Someone in the room will ask whether the hypervisor is costing you throughput. It is a fair question and the answer is now measured rather than hand-waved. In the joint Broadcom and NVIDIA reference architecture for Private AI Foundation on HGX servers, genai-perf compared vGPU against bare metal for Llama 3.1 70B inference. Virtual GPUs delivered roughly 1 to 2 percent higher throughput than bare metal in some scenarios, and TTFT landed within about 1 to 2 percent either way depending on concurrency. In the same test only 24 of 208 logical cores and 256 GB of 2 TB of host memory were consumed by inference, leaving the rest of the host for other tenants.

My take: stop arguing about the 1 to 2 percent. It is inside the noise of your ISL and OSL choices, and you give up live migration, DRS placement, GPU pooling, and instant cloning of preloaded models to chase it. The throughput you are actually losing is somewhere else entirely: a vGPU profile too small to hold the model plus its KV cache, a sequence-length mix that does not match production, or a concurrency setting frozen at 1. Those cost you tens of percent. The hypervisor costs you almost nothing. If you want to push real performance, revisit your GPU memory math and sizing before you blame virtualization.
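A back-of-the-envelope version of that memory math looks like this. It is a simplified sketch: it counts only weights plus KV cache, assumes FP16 for both, and ignores how the serving runtime actually preallocates and pages cache. The layer count, KV head count, and head dimension are those of the published Llama 3.1 70B architecture; the parameter count is rounded to 70 billion, and the concurrency and token counts are taken from the RAG-shaped sweep as an example.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def required_gib(params: float, bytes_per_param: int, layers: int, kv_heads: int,
                 head_dim: int, concurrency: int, tokens_per_request: int,
                 overhead_gib: float = 0.0) -> float:
    """Weights + KV cache for all in-flight requests, plus a caller-supplied overhead."""
    weights = params * bytes_per_param
    kv_cache = kv_bytes_per_token(layers, kv_heads, head_dim) * concurrency * tokens_per_request
    return (weights + kv_cache) / GIB + overhead_gib


# Llama 3.1 70B shape: 80 layers, 8 KV heads (grouped-query attention), head dim 128.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
total = required_gib(params=70e9, bytes_per_param=2, layers=80, kv_heads=8, head_dim=128,
                     concurrency=32, tokens_per_request=3200)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"weights + KV cache at c=32, 3,200 tokens each: {total:.1f} GiB")
```

Compare the total against the combined framebuffer of the vGPU profiles you plan to assign. If it does not fit with headroom, the runtime has to shrink the cache, and that shows up as a lower knee in the sweep.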


What I’d Do

Write down the SLO before you install anything: a TTFT target, an ITL target, and the request shape that produces them. Benchmark with that ISL and OSL, sweep concurrency through the knee, and report the curve plus one operating point, not a headline tokens-per-second figure. Save the model name, driver build, vGPU profile, and NIM version next to every result so the run is reproducible six months from now when someone disputes it. Do that and your capacity plan rests on a number you can defend, which is the only kind worth having.

What request shape does your top workload actually have, and have you ever measured it? If you have not, that is the first benchmark to run.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
