TL;DR · Key Takeaways
- A single tokens-per-second number is close to useless. Benchmark the latency-versus-throughput curve by sweeping concurrency, then read off the throughput your TTFT and ITL budget actually allows.
- Use NVIDIA genai-perf (now shipping as aiperf) against the OpenAI-compatible endpoint your NIM exposes. Set input and output sequence length to match your real workload, not the 128/128 default.
- The metrics that matter: TTFT, ITL (time per output token), total TPS per system, TPS per user, and RPS. They trade against each other, so name your SLO before you run anything.
- Broadcom and NVIDIA measured vGPU on Private AI Foundation within roughly 1 to 2 percent of bare metal for Llama 3.1 70B inference. The hypervisor is not where your throughput goes.
- The throughput killers are real: wrong sequence-length mix, concurrency set to 1, an undersized KV cache, and a GPU profile that cannot hold the model plus its cache.
Here is the number you will be tempted to quote in the design review: “the L40S does 3,200 tokens per second.” It is a real measurement and it is also misleading, because nobody told you it was taken at a concurrency of 64 with a time to first token of 4 seconds. At the concurrency and latency your chatbot actually needs, that same GPU might give you a third of it. Benchmarking inference is not about finding the biggest number. It is about finding the throughput you can sustain while staying inside a latency budget, and that requires a sweep, not a single run.
This part walks through how to benchmark NIM inference on Private AI Foundation with genai-perf, which metrics to trust, the parameter that quietly decides every result, and how to read the output so the answer survives contact with production.
The five metrics that actually describe an LLM endpoint
An LLM request has two phases with different costs, and most benchmarking confusion comes from collapsing them into one average. The prefill phase reads your whole prompt and builds the KV cache. The decode phase then emits tokens one at a time. Five metrics describe what the user and the system experience across those phases.
- Time to first token (TTFT): the wait from sending the query to seeing the first token. It includes request queuing, prefill compute, and network latency, and it grows with prompt length because the model must build the KV cache for the entire input before decoding starts. This is the number a user feels as “responsiveness”.
- Inter-token latency (ITL), also called time per output token: the average gap between consecutive tokens during decode. genai-perf defines it as (end-to-end latency minus TTFT) divided by (output tokens minus 1), deliberately excluding the first token so it measures only the decode loop. ITL is what makes streaming text feel smooth or stuttery.
- Total TPS per system: output tokens per second across all concurrent requests. It climbs as you add concurrency until the GPU saturates, then flattens or falls. This is the metric you size capacity against.
- TPS per user: throughput from one request’s point of view, which asymptotically approaches 1 divided by ITL. As you pack in more concurrent users, total TPS rises while per-user TPS drops. That tension is the whole game.
- Requests per second (RPS): completed requests per second. Useful when your unit of work is a whole short response rather than a long stream.
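The definitions above reduce to a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not genai-perf's implementation: it assumes you have recorded a request's send time and the arrival time of each streamed token, and the `request_metrics` helper and its field names are hypothetical. It applies the ITL formula quoted above and computes per-user TPS as output tokens divided by end-to-end latency.

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    ttft_s: float        # time to first token, in seconds
    itl_s: float         # mean inter-token latency, in seconds
    tps_per_user: float  # output tokens per second seen by this one request


def request_metrics(sent_at: float, token_times: list[float]) -> RequestMetrics:
    """Compute per-request metrics from a send time and token arrival times (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two output tokens to compute ITL")
    ttft = token_times[0] - sent_at
    e2e = token_times[-1] - sent_at
    n_out = len(token_times)
    # ITL = (end-to-end latency - TTFT) / (output tokens - 1); the first token is excluded
    itl = (e2e - ttft) / (n_out - 1)
    # Per-user throughput over the whole request; approaches 1/ITL as outputs get longer
    tps_user = n_out / e2e
    return RequestMetrics(ttft, itl, tps_user)


# Hypothetical request: sent at t=0, first token at 0.5 s, then one token every 25 ms
times = [0.5 + 0.025 * i for i in range(201)]
m = request_metrics(0.0, times)
print(f"TTFT={m.ttft_s:.3f}s  ITL={m.itl_s * 1000:.1f}ms  TPS/user={m.tps_per_user:.1f}")
```

In this made-up request, per-user TPS comes out below 1/ITL because the half-second TTFT is part of the end-to-end latency. The gap narrows as the output gets longer.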
One trap worth naming up front: different tools compute these differently. genai-perf excludes TTFT from ITL; the older LLMPerf includes it. genai-perf measures TPS between the first request and the last response using a sliding window that discards warm-up and cool-down; LLMPerf divides by the entire benchmark wall clock, which in a single-concurrency run can fold in up to a third of overhead that has nothing to do with the GPU. Pick one tool and stay on it. Comparing a genai-perf number to an LLMPerf number is comparing two different definitions.
Stand up the benchmark rig against your NIM
NIM exposes an OpenAI-compatible API, and genai-perf speaks exactly that. The benchmark rig is just a client process that you point at the endpoint your NIM microservice already serves. Run it on a separate VM or pod, not on the GPU host, so you are not stealing CPU from the thing you are measuring. Keep it on the same workload-domain network segment so you are testing the model, not the WAN.
Install the tool and confirm it can reach the endpoint. NVIDIA renamed the project: genai-perf is now distributed as aiperf, and the flags are compatible. A first smoke test:
pip install aiperf
# Point at the NIM endpoint and confirm it answers.
# If your install provides only the aiperf command, substitute it for genai-perf below.
genai-perf profile \
  --model meta/llama-3.1-70b-instruct \
  --url http://nim-llama-70b.private-ai.svc:8000 \
  --endpoint-type chat \
  --streaming \
  --synthetic-input-tokens-mean 1024 \
  --output-tokens-mean 256 \
  --concurrency 1 \
  --request-count 50
That concurrency-of-1 run is a sanity check, not a result. It tells you the model is loaded, streaming works, and your TTFT at idle is sane. Treat any throughput number from it as a lower bound and nothing more.
The parameter that decides every result: sequence length and concurrency
Input sequence length (ISL) and output sequence length (OSL) shape the entire benchmark, and the default 128/128 represents almost no real workload. A RAG answer stuffs thousands of retrieved tokens into the prompt and emits a few hundred, so it is prefill-heavy and TTFT-bound. A code or essay generator takes a short prompt and emits a long completion, so it is decode-heavy and ITL-bound. These two profiles stress completely different parts of the GPU. Benchmarking one when you will run the other gives you a confident, precise, wrong answer.
Concurrency is the second lever, and it is the one people forget to sweep. Total throughput rises with concurrency because NIM batches in-flight requests, but TTFT and ITL degrade as the batch fills and requests queue. The deliverable from a benchmark is not a point. It is the curve below, and the operating point you can defend sits where that curve crosses your latency SLO.
Notice the knee. Up to concurrency 32 in this shape, throughput climbs steeply while latency stays under budget. Past it, each extra unit of concurrency buys a little throughput for a lot of latency. The defensible operating point is at or just below where the curve crosses your SLO line, not at the top right where the marketing number lives.
Run the sweep
Fix ISL and OSL to your real workload, then step concurrency across a range that brackets the knee. Give each level enough requests that the sliding window has stable data, and let NIM warm up first so the first batch does not pollute the average.
# RAG-shaped profile: long prompt, short answer
for C in 1 4 8 16 32 64 128; do
  genai-perf profile \
    --model meta/llama-3.1-70b-instruct \
    --url http://nim-llama-70b.private-ai.svc:8000 \
    --endpoint-type chat --streaming \
    --synthetic-input-tokens-mean 3000 \
    --synthetic-input-tokens-stddev 300 \
    --output-tokens-mean 200 \
    --output-tokens-stddev 30 \
    --concurrency "$C" \
    --request-count "$((C * 20))" \
    --warmup-request-count 10 \
    --artifact-dir "./results/c$C"
done
- Set --synthetic-input-tokens-mean and --output-tokens-mean to your measured workload, with a realistic standard deviation so the batch mix is not artificially uniform.
- Scale --request-count with concurrency so higher levels still collect enough completed requests for a stable window.
- Keep --warmup-request-count non-zero. The first requests load weights into cache and skew TTFT badly.
- Write each level to its own --artifact-dir so you can plot the curve afterward.
- Watch the GPU from the other side at the same time so you can correlate the knee with real saturation. See GPU monitoring with VCF Operations for the signals worth watching.
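Once the sweep finishes, the per-level results can be collected into one table for plotting. The sketch below rests on assumptions to verify against your tool version: that each run writes a JSON export named `profile_export_genai_perf.json` somewhere under its artifact directory, and that the export carries `time_to_first_token`, `inter_token_latency`, and `output_token_throughput` entries with `p95` and `avg` fields. The `load_sweep` helper itself is hypothetical.

```python
import json
from pathlib import Path

# Assumed export file name and JSON keys; check both against your tool version.
EXPORT_NAME = "profile_export_genai_perf.json"


def load_sweep(results_root: str, export_name: str = EXPORT_NAME) -> list[dict]:
    """Return one row per concurrency level found under results_root/c<N>/."""
    rows = []
    for run_dir in Path(results_root).glob("c*"):
        suffix = run_dir.name[1:]
        if not (run_dir.is_dir() and suffix.isdigit()):
            continue  # not a c<N> run directory
        # Search recursively in case the tool nests the export in a subdirectory
        export = next(run_dir.rglob(export_name), None)
        if export is None:
            continue  # run did not finish or wrote a different file name
        data = json.loads(export.read_text())
        rows.append({
            "concurrency": int(suffix),
            "ttft_p95_ms": data["time_to_first_token"]["p95"],
            "itl_p95_ms": data["inter_token_latency"]["p95"],
            "total_tps": data["output_token_throughput"]["avg"],
        })
    return sorted(rows, key=lambda r: r["concurrency"])
```

Calling `load_sweep("./results")` after the loop above would return the rows in concurrency order, ready to plot as total TPS against p95 latency.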
Read the results and pick an operating point
For each concurrency level genai-perf reports TTFT, ITL, total TPS, per-user TPS, and RPS, with percentiles. Use percentiles, not the mean: a p50 TTFT of 300 ms with a p99 of 6 seconds is a bad endpoint wearing a good average. The table below is the matrix I hand a client, mapping each metric to what it tells you and the trap that hides behind it.
| Metric | What it tells you | Common trap |
|---|---|---|
| TTFT (p95) | Perceived responsiveness; prefill plus queue cost | Measured at short ISL, then blows up on real RAG prompts |
| ITL / TPOT (p95) | Streaming smoothness; decode efficiency | Hides behind a healthy mean while p99 stutters |
| Total TPS per system | Capacity you can sell; sizing input | Quoted at a concurrency no SLO would allow |
| TPS per user | What one user feels; tracks 1/ITL | Ignored, so per-user experience collapses under load |
| RPS | Throughput for short-response workloads | Misleading for long streaming generations |
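To see why the percentile you choose matters, here is a small sketch with a made-up TTFT sample. The `percentile` helper is a plain nearest-rank implementation written for this example. It is not how genai-perf computes its percentiles, and the numbers are illustrative only.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values or not 1 <= p <= 100:
        raise ValueError("need a non-empty sample and 1 <= p <= 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical TTFT sample in ms: 95 fast requests and 5 that queued badly
ttft_ms = [300.0] * 95 + [6000.0] * 5
mean = sum(ttft_ms) / len(ttft_ms)
print(f"mean={mean:.0f} p50={percentile(ttft_ms, 50):.0f} "
      f"p95={percentile(ttft_ms, 95):.0f} p99={percentile(ttft_ms, 99):.0f}")
# prints: mean=585 p50=300 p95=300 p99=6000
```

In this sample even p95 reads as healthy, because exactly 5 percent of requests are slow. Pick the percentile that matches the fraction of users your SLO is allowed to disappoint.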
The decision is mechanical once you have the curve and an SLO. Walk it like this.
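As a minimal sketch of that walk, assume the sweep has been reduced to one row per concurrency level with p95 TTFT, p95 ITL, and total TPS. The `pick_operating_point` helper and the row keys are hypothetical, and the numbers are illustrative only, not measurements.

```python
def pick_operating_point(rows, ttft_slo_ms, itl_slo_ms):
    """Return the row with the highest total TPS that meets both latency SLOs, or None."""
    within_slo = [
        r for r in rows
        if r["ttft_p95_ms"] <= ttft_slo_ms and r["itl_p95_ms"] <= itl_slo_ms
    ]
    return max(within_slo, key=lambda r: r["total_tps"], default=None)


# Illustrative sweep results only; substitute your own measured rows
sweep = [
    {"concurrency": 1,   "ttft_p95_ms": 250,  "itl_p95_ms": 22, "total_tps": 45},
    {"concurrency": 8,   "ttft_p95_ms": 400,  "itl_p95_ms": 25, "total_tps": 310},
    {"concurrency": 16,  "ttft_p95_ms": 600,  "itl_p95_ms": 29, "total_tps": 540},
    {"concurrency": 32,  "ttft_p95_ms": 950,  "itl_p95_ms": 36, "total_tps": 880},
    {"concurrency": 64,  "ttft_p95_ms": 2400, "itl_p95_ms": 55, "total_tps": 1150},
    {"concurrency": 128, "ttft_p95_ms": 6100, "itl_p95_ms": 98, "total_tps": 1230},
]

best = pick_operating_point(sweep, ttft_slo_ms=1000, itl_slo_ms=40)
print(best)
```

With a 1-second TTFT budget and a 40 ms ITL budget, this made-up sweep settles on concurrency 32. The two higher levels offer more raw throughput but break the latency budget. A `None` result means no level met the SLO, which is a sizing finding in its own right.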
The virtualization question, answered
Someone in the room will ask whether the hypervisor is costing you throughput. It is a fair question and the answer is now measured rather than hand-waved. In the joint Broadcom and NVIDIA reference architecture for Private AI Foundation on HGX servers, genai-perf compared vGPU against bare metal for Llama 3.1 70B inference. Virtual GPUs delivered roughly 1 to 2 percent higher throughput than bare metal in some scenarios, and TTFT landed within about 1 to 2 percent either way depending on concurrency. In the same test only 24 of 208 logical cores and 256 GB of 2 TB of host memory were consumed by inference, leaving the rest of the host for other tenants.
My take: stop arguing about the 1 to 2 percent. It is inside the noise of your ISL and OSL choices, and you give up live migration, DRS placement, GPU pooling, and instant cloning of preloaded models to chase it. The throughput you are actually losing is somewhere else entirely: a vGPU profile too small to hold the model plus its KV cache, a sequence-length mix that does not match production, or a concurrency setting frozen at 1. Those cost you tens of percent. The hypervisor costs you almost nothing. If you want to push real performance, revisit your GPU memory math and sizing before you blame virtualization.
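A rough sketch of the KV cache side of that memory math follows. It assumes an FP16 cache and uses the published Llama 3.1 70B architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128). The helper names are hypothetical. Treat the result as a worst-case estimate in which every in-flight request holds its full context at once. A real server pages and may quantize the cache, so measured usage will differ.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_cache_gib(concurrency: int, tokens_per_request: int, bytes_per_token: int) -> float:
    """Worst-case KV cache in GiB if every in-flight request holds its full context."""
    return concurrency * tokens_per_request * bytes_per_token / 2**30


# Llama 3.1 70B: 80 layers, 8 KV heads, head dim 128; 2 bytes per value (FP16, assumed)
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2)
# RAG-shaped request from the sweep: about 3000 input + 200 output tokens, concurrency 32
need = kv_cache_gib(concurrency=32, tokens_per_request=3200, bytes_per_token=per_token)
print(f"{per_token // 1024} KiB per token, {need:.2f} GiB of KV cache")
# prints: 320 KiB per token, 31.25 GiB of KV cache
```

That figure sits on top of the model weights, which is why a profile that barely fits the weights leaves no room to batch.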
What I’d Do
Write down the SLO before you install anything: a TTFT target, an ITL target, and the request shape that produces them. Benchmark with that ISL and OSL, sweep concurrency through the knee, and report the curve plus one operating point, not a headline tokens-per-second figure. Save the model name, driver build, vGPU profile, and NIM version next to every result so the run is reproducible six months from now when someone disputes it. Do that and your capacity plan rests on a number you can defend, which is the only kind worth having.
What request shape does your top workload actually have, and have you ever measured it? If you have not, that is the first benchmark to run.
References
- VMware Cloud Foundation Blog: Private AI Foundation with NVIDIA on HGX Servers for Inference
- NVIDIA NIM LLMs Benchmarking: Metrics (TTFT, ITL, TPS, RPS)
- NVIDIA Developer Blog: Measuring NIM Performance with GenAI-Perf
- Broadcom TechDocs: VMware Private AI Foundation with NVIDIA 9.1