
🧮 Understand vLLM PagedAttention and KV Cache Memory

Reuse the virtual-memory analogy you already know to demystify vLLM. By the end, you can sketch a block table, explain prefix sharing, and estimate how many 8k-context sequences fit on your GPU.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: The Hidden Memory Bug in LLM Serving

See where 60-80% of KV cache memory vanishes

4 drops
  1. PagedAttention is paging, not attention math

    5 min

  2. Pre-vLLM systems waste 60-80% of KV cache memory

    6 min

  3. The KV cache is a heap; paging is a malloc rewrite

    6 min

  4. A block is 16 tokens of K and V, fixed size

    5 min
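The Phase 1 drops rest on some simple arithmetic, which can be sketched as follows. This is a minimal illustration, not vLLM code: the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements), the 2048-token reservation, and the 600-token sequence are all assumed numbers chosen for the example, and the helper names are hypothetical.

```python
BLOCK_SIZE = 16  # tokens per KV block, as in the drop title


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K plus V stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def blocks_needed(seq_len, block_size=BLOCK_SIZE):
    """Fixed-size blocks required to hold seq_len tokens (ceiling division)."""
    return -(-seq_len // block_size)


def wasted_slots_paged(seq_len, block_size=BLOCK_SIZE):
    """Unused token slots with paging: only the last block can be partly empty."""
    return blocks_needed(seq_len, block_size) * block_size - seq_len


def wasted_fraction_preallocated(seq_len, reserved_len):
    """Unused share of a contiguous buffer reserved up front for reserved_len tokens."""
    return (reserved_len - seq_len) / reserved_len


# Assumed model shape, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
per_block = per_token * BLOCK_SIZE

print(per_token // 1024, "KiB per token")                  # 512 KiB per token
print(per_block // 2**20, "MiB per block")                 # 8 MiB per block
print(wasted_slots_paged(600), "slots wasted with paging")  # 8 slots
print(round(wasted_fraction_preallocated(600, 2048), 2))    # 0.71
```

With these assumed numbers, a 600-token sequence in a 2048-token reservation leaves about 71% of the buffer unused, while paging leaves at most one partly filled block.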

Phase 2: Walking Through the Block Table

Build the block table and trace copy-on-write

5 drops
  1. The block table maps logical to physical, one row per sequence

    6 min

  2. Sequences grow one block at a time, not in chunks

    6 min

  3. Two requests with the same system prompt share blocks

    7 min

  4. Divergence triggers copy-on-write at block granularity

    7 min

  5. When VRAM fills up, vLLM swaps blocks to CPU

    7 min
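The block-table ideas in this phase can be traced with a small toy model. The sketch below is an assumption-laden illustration, not vLLM's actual classes: `ToyBlockManager` and its methods are hypothetical names, sharing is modeled with an explicit `fork` rather than any real prefix-matching logic, and swapping to CPU is left out (the toy simply raises when it runs out of blocks).

```python
BLOCK_SIZE = 16  # tokens per block


class ToyBlockManager:
    """Toy paged KV cache: one block table per sequence plus per-block refcounts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.refcount = {}                   # physical block id -> number of owners
        self.tables = {}                     # seq id -> list of physical block ids
        self.lengths = {}                    # seq id -> tokens stored so far

    def _alloc(self):
        if not self.free:
            raise MemoryError("no free KV blocks (a real server would preempt or swap)")
        block = self.free.pop(0)
        self.refcount[block] = 1
        return block

    def add_sequence(self, seq_id, num_tokens):
        """Allocate just enough blocks for the prompt; no up-front reservation."""
        n_blocks = -(-num_tokens // BLOCK_SIZE)
        self.tables[seq_id] = [self._alloc() for _ in range(n_blocks)]
        self.lengths[seq_id] = num_tokens

    def fork(self, parent, child):
        """Share every block of parent with child by copying the table row only."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for block in self.tables[child]:
            self.refcount[block] += 1

    def append_token(self, seq_id):
        table = self.tables[seq_id]
        if self.lengths[seq_id] % BLOCK_SIZE == 0:
            # Last block is full: grow by exactly one new block.
            table.append(self._alloc())
        elif self.refcount[table[-1]] > 1:
            # Last block is shared and about to be written: copy-on-write.
            self.refcount[table[-1]] -= 1
            table[-1] = self._alloc()
        self.lengths[seq_id] += 1


mgr = ToyBlockManager(num_blocks=8)
mgr.add_sequence("A", num_tokens=40)  # 3 blocks, the last one holds 8 tokens
mgr.fork("A", "B")                    # B shares all three blocks with A
mgr.append_token("B")                 # B diverges: only the last block is copied
mgr.append_token("A")                 # A now owns its last block alone, writes in place

print(mgr.tables["A"])  # [0, 1, 2]
print(mgr.tables["B"])  # [0, 1, 3]
```

After the divergence, the two full prefix blocks (0 and 1) are still shared, and only the partly filled last block was duplicated. That is the block-granularity copy-on-write the drop title describes.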

Phase 3: Paging Meets Batching and Prefix Caching

Connect paging to batching and prefix caching

4 drops
  1. Your latency dashboard shows variance, not improvement

    7 min

  2. Your chatbot's TTFT mysteriously dropped overnight

    8 min

  3. Why a 128k model still chokes at 100 users

    8 min

  4. INT8 KV cache halves your memory bill, not your block count

    8 min
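The last two drops in this phase come down to back-of-the-envelope numbers, sketched below. Everything here is an assumption for illustration: the model shape (32 layers, 8 KV heads, head dimension 128), the 16-token block size, and the reading that a 1-byte KV cache shrinks each block without changing how many blocks a sequence needs. Any extra storage for quantization scales is ignored.

```python
GIB = 1024**3
BLOCK_SIZE = 16            # tokens per block (assumed)
CONTEXT_LEN = 128 * 1024   # a full 128k-token context
USERS = 100

# Hypothetical model shape with grouped-query attention.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)


def kv_gib_per_sequence(context_len, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """GiB of K plus V needed to hold context_len tokens for one sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return context_len * per_token / GIB


fp16_gib = kv_gib_per_sequence(CONTEXT_LEN, dtype_bytes=2, **SHAPE)
int8_gib = kv_gib_per_sequence(CONTEXT_LEN, dtype_bytes=1, **SHAPE)
blocks_per_sequence = CONTEXT_LEN // BLOCK_SIZE  # independent of dtype

print(fp16_gib, "GiB per full-context sequence at 2 bytes/element")  # 16.0
print(int8_gib, "GiB per full-context sequence at 1 byte/element")   # 8.0
print(fp16_gib * USERS, "GiB if all 100 users fill the context")     # 1600.0
print(blocks_per_sequence, "blocks per sequence either way")         # 8192
```

Under these assumptions, 100 users who each fill a 128k context would need far more KV memory than a single GPU holds, and halving the element size halves the bytes per block while the per-sequence block count stays the same.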

Phase 4: Sizing Your GPU for Real Workloads

Estimate concurrent sequences for your real GPU

1 drop
  1. Estimate concurrent 8k-context sequences for your GPU

    18 min
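A rough version of this estimate can be written in a few lines. The sketch below assumes worst-case sequences that each fill the full 8k context, and all inputs are hypothetical example values: an 80 GiB GPU, 13 GiB of weights, 2 GiB of other overhead, and 512 KiB of KV cache per token. The `mem_fraction` parameter plays a role similar to vLLM's `gpu_memory_utilization` setting, but the function itself is an illustration, not how vLLM computes its block pool.

```python
import math

GIB = 1024**3


def max_concurrent_sequences(gpu_mem_gib, weights_gib, kv_bytes_per_token,
                             context_len, block_size=16,
                             mem_fraction=0.9, overhead_gib=0.0):
    """Worst-case count of full-length sequences that fit in the KV block pool."""
    # Memory left for KV blocks after weights and other overhead.
    kv_budget_gib = gpu_mem_gib * mem_fraction - weights_gib - overhead_gib
    if kv_budget_gib <= 0:
        return 0
    block_bytes = kv_bytes_per_token * block_size
    total_blocks = int(kv_budget_gib * GIB) // block_bytes
    blocks_per_sequence = math.ceil(context_len / block_size)
    return total_blocks // blocks_per_sequence


# Example inputs; substitute the numbers for your own GPU and model.
n = max_concurrent_sequences(
    gpu_mem_gib=80,
    weights_gib=13,
    kv_bytes_per_token=512 * 1024,
    context_len=8192,
    overhead_gib=2,
)
print(n)  # 14
```

Real workloads rarely fill every sequence to the full context length, so the number of sequences served in practice is usually higher than this worst-case figure; prefix sharing raises it further.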

Frequently asked questions

What is PagedAttention in vLLM and how is it different from FlashAttention?
This is covered in the “Understand vLLM PagedAttention and KV Cache Memory” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Why does the KV cache waste so much GPU memory without paging?
This is covered in the “Understand vLLM PagedAttention and KV Cache Memory” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How does vLLM share a system prompt across multiple requests?
This is covered in the “Understand vLLM PagedAttention and KV Cache Memory” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What is a block table in vLLM and how does it map logical to physical blocks?
This is covered in the “Understand vLLM PagedAttention and KV Cache Memory” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do I estimate the maximum concurrent sequences my GPU can serve?
This is covered in the “Understand vLLM PagedAttention and KV Cache Memory” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.