⚡Compare LLM Serving Frameworks: vLLM, TensorRT-LLM, SGLang, llama.cpp
Stop picking vLLM because Twitter said so. You'll learn to read a deployment's shape — concurrency, prefix overlap, hardware, lifetime — and narrow the four frameworks to one defensible choice in four questions.
Phase 1: Four Frameworks, Four Bets About the Workload
Meet the four frameworks and what's uniquely theirs
Pick the workload first, then the framework (6 min)
vLLM optimizes for the crowded chat room (7 min)
TensorRT-LLM is a compiler, not a server (7 min)
SGLang shares prefixes; llama.cpp ships everywhere (7 min)
Phase 2: Predict the Winner Before You Benchmark
Predict the winner for three real workloads
High-concurrency chat: vLLM wins by design (6 min)
RAG with reused system prompts: SGLang wins on prefix overlap (7 min)
Single-user laptop: llama.cpp is the only credible call (6 min)
Fixed model, fixed GPU, long lifetime: TensorRT-LLM compounds (7 min)
Predict first, benchmark second, and never the reverse (6 min)
Phase 3: Each Speedup Comes From One Specific Technique
Trace each speedup back to a specific technique
Your slow request shouldn't block fast ones: that's continuous batching (6 min)
PagedAttention treats KV cache like virtual memory (7 min)
RadixAttention is prefix caching that handles branches (7 min)
Compiled engines and GGUF: optimization at opposite ends (7 min)
Phase 4: The Four-Question Picking Framework
Write the four questions that pick the framework
Write the four questions that pick the framework (20 min)
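As a rough illustration of where the path is heading, the four workload dimensions named in the introduction (hardware, lifetime, prefix overlap, concurrency) can be sketched as a small decision function. This is only a sketch under stated assumptions: the field names, the order of the questions, the 0.5 overlap threshold, and the single-user fallback are all hypothetical and are not taken from the lessons themselves. Only the four workload-to-framework pairings come from the Phase 2 lesson titles.

```python
from dataclasses import dataclass

# Hypothetical cutoff: share of prompt tokens that repeat across requests.
PREFIX_OVERLAP_THRESHOLD = 0.5


@dataclass
class Workload:
    has_datacenter_gpu: bool      # hardware: server-class GPU vs. laptop/CPU
    fixed_model_and_gpu: bool     # lifetime: model and GPU pinned for a long time
    prefix_overlap: float         # prefix overlap: 0.0 (none) to 1.0 (all shared)
    concurrent_users: int         # concurrency: simultaneous requests expected


def pick_framework(w: Workload) -> str:
    """Return one framework name by asking four questions in an assumed order."""
    # Q1, hardware: without a server GPU, only llama.cpp is a credible call.
    if not w.has_datacenter_gpu:
        return "llama.cpp"
    # Q2, lifetime: a fixed model on a fixed GPU rewards a compiled engine.
    if w.fixed_model_and_gpu:
        return "TensorRT-LLM"
    # Q3, prefix overlap: heavily reused prompts favor prefix sharing.
    if w.prefix_overlap >= PREFIX_OVERLAP_THRESHOLD:
        return "SGLang"
    # Q4, concurrency: many simultaneous users favor vLLM's batching.
    if w.concurrent_users > 1:
        return "vLLM"
    # Assumption: a lone user gains little from batching, so fall back.
    return "llama.cpp"


if __name__ == "__main__":
    chat = Workload(has_datacenter_gpu=True, fixed_model_and_gpu=False,
                    prefix_overlap=0.1, concurrent_users=200)
    print(pick_framework(chat))
```

A real deployment would treat the output as a prediction to check against a benchmark, in line with the "predict first, benchmark second" lesson, rather than as a final answer.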
Frequently asked questions
- When does TensorRT-LLM actually beat vLLM in throughput?
- Why does SGLang win on RAG and multi-turn chat workloads?
- Can vLLM run on a MacBook or do I need llama.cpp?
- What is PagedAttention and which frameworks have it?
- How does RadixAttention differ from KV cache reuse in vLLM?
Each of these is covered in the “Compare LLM Serving Frameworks: vLLM, TensorRT-LLM, SGLang, llama.cpp” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.