Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

Tag: TensorRT

AI Stack, AI/ML

TensorRT and TensorRT-LLM: Optimization, Quantization, and Engine Building (NVIDIA AI Series, Part 18)

Dr. Pranay Jha

June 22, 2026

What TensorRT does at build time versus what TensorRT-LLM adds at runtime: kernel fusion, paged KV cache, in-flight batching, and quantization choices from FP8 to NVFP4. Also covers when to hand-build engines instead of relying on a NIM.


About the Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
