Understand FlashAttention and Tiling

Stop treating FlashAttention as a mystery flag. Learn the tiling, online softmax, and HBM-vs-SRAM tradeoff that turn the same attention math into 2–4× speedups. By the end you can estimate FA's win for any sequence length on graph paper, before touching CUDA.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why attention bleeds memory before it bleeds math

See why attention is memory-bound, not compute-bound

4 drops
  1. Attention isn't slow — its memory traffic is

    6 min

  2. SRAM is the scratchpad nobody used

    6 min

  3. The N×N matrix that never had to exist

    7 min

  4. Same math, completely different schedule

    7 min

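The Phase 1 claim that memory traffic, not math, is the bottleneck can be checked with simple arithmetic. The sketch below is a simplified model rather than a measurement. It assumes fp16 (2 bytes per element), a single head, and a standard implementation that writes the N×N scores to HBM, reads them for softmax, writes the probabilities, and reads them again for the multiply by V. The "fused" figure is an idealized lower bound that ignores K/V being re-read once per Q tile. The function name `attention_hbm_bytes` is hypothetical.

```python
def attention_hbm_bytes(n, d, bytes_per_el=2):
    """Rough HBM traffic for one attention head under a simplified model."""
    # Q, K, V are each read once and O is written once: 4 matrices of N x d.
    io = 4 * n * d * bytes_per_el
    # Standard attention: write S, read S, write P, read P (each N x N).
    scores = 4 * n * n * bytes_per_el
    return {"standard": io + scores, "fused_ideal": io}


for n in (1024, 4096, 16384):
    traffic = attention_hbm_bytes(n, d=64)
    ratio = traffic["standard"] / traffic["fused_ideal"]
    print(f"N={n}: standard moves {ratio:.0f}x the bytes of the fused ideal")
```

Under this model the ratio works out to 1 + N/d, so the gap widens linearly with sequence length while the arithmetic stays the same.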

Phase 2: Tiling and online softmax on graph paper

Run online softmax by hand across tiled blocks

5 drops
  1. Cut the matrix until it fits in your scratchpad

    7 min

  2. The running-max trick that streams softmax exactly

    8 min

  3. Walk one Q tile through every K tile on paper

    8 min

  4. The backward pass that never stored the matrix

    7 min

  5. Causal masking that doesn't waste tiles

    6 min

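The running-max trick from Phase 2 can be sketched in plain Python for a single query row. This is a minimal illustration, not the CUDA kernel: it streams over K/V tiles while keeping only a running max `m`, a running denominator `l`, and an unnormalized output `acc`, then compares the result with ordinary softmax attention. The function names and the tile size are illustrative assumptions.

```python
import math


def streamed_attention_row(q, keys, values, block=2):
    """Attention output for one query, visiting K/V one tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")                 # running max of the scores seen so far
    l = 0.0                           # running softmax denominator
    acc = [0.0] * len(values[0])      # running unnormalized output
    for start in range(0, len(keys), block):
        k_tile = keys[start:start + block]
        v_tile = values[start:start + block]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(s))
        correction = math.exp(m - m_new)   # rescales old stats to the new max
        p = [math.exp(x - m_new) for x in s]
        l = l * correction + sum(p)
        acc = [a * correction + sum(pj * v[i] for pj, v in zip(p, v_tile))
               for i, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]


def reference_attention_row(q, keys, values):
    """Ordinary softmax attention over all keys at once."""
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(s)
    p = [math.exp(x - top) for x in s]
    total = sum(p)
    return [sum(pj * v[i] for pj, v in zip(p, values)) / total
            for i in range(len(values[0]))]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
tiled = streamed_attention_row(q, keys, values)
full = reference_attention_row(q, keys, values)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(tiled, full)))
```

The two results agree up to floating-point rounding because the correction factor re-expresses every earlier exponential relative to the new maximum, so nothing is approximated.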

Phase 3: FA1 vs FA2 vs FA3 — same math, better schedule

Compare FA1, FA2, and FA3 as same-math reschedules

4 drops
  1. FA2 flipped the loop and doubled the speedup

    7 min

  2. FA3 stops waiting for memory and starts overlapping it

    8 min

  3. Your team turned on FA. Which version is running?

    7 min

  4. PyTorch SDPA, xFormers, and the FA family tree

    7 min

Phase 4: Estimate FA speedup from a roofline you draw

1 drop
  1. Draw the roofline. Predict FA's win for your sequence.

    8 min

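The graph-paper estimate in Phase 4 can also be written as a few lines of Python. The sketch below is an optimistic model under stated assumptions: it counts only the two matrix multiplies as FLOPs, uses an idealized traffic model (fp16, single head, no K/V re-reads in the fused case), and takes peak throughput and bandwidth as inputs. The hardware numbers in the usage lines are hypothetical round figures, not the specification of any particular GPU, and all function names are illustrative.

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline lower bound: the slower of compute time and memory time."""
    return max(flops / peak_flops, bytes_moved / peak_bw)


def estimate_fa_speedup(n, d, peak_flops, peak_bw, bytes_per_el=2):
    """Ratio of roofline time for standard attention vs. an idealized fused kernel."""
    flops = 4 * n * n * d                       # QK^T and PV: 2*N*N*d each
    io = 4 * n * d * bytes_per_el               # read Q, K, V; write O
    standard_bytes = io + 4 * n * n * bytes_per_el  # plus S and P round trips
    fused_bytes = io                            # N x N matrix never hits HBM
    t_std = roofline_time(flops, standard_bytes, peak_flops, peak_bw)
    t_fused = roofline_time(flops, fused_bytes, peak_flops, peak_bw)
    return t_std / t_fused


# Hypothetical accelerator: 300 TFLOP/s peak, 1.5 TB/s HBM bandwidth.
for n in (1024, 4096, 16384):
    speedup = estimate_fa_speedup(n, d=64, peak_flops=300e12, peak_bw=1.5e12)
    print(f"N={n}: estimated upper-bound speedup {speedup:.1f}x")
```

Treat the result as an upper bound. Real kernels also spend time on the softmax exponentials, tile re-reads, and launch overhead, so measured speedups are typically smaller than this model predicts.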

Frequently asked questions

Is FlashAttention an approximation of regular attention?
Why is standard attention memory-bound instead of compute-bound?
What is online softmax and why does it stay exact across tiles?
How does FlashAttention-2 differ from FlashAttention-1?
What changes in FlashAttention-3 on Hopper GPUs?

Each of these questions is covered in the “Understand FlashAttention and Tiling” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.