# LLM KV-Cache Experiment

## Overview

This experiment demonstrates space-time tradeoffs in Large Language Model (LLM) attention mechanisms. By varying the KV-cache size, we show how modern AI systems implement Williams' √n pattern through techniques like Flash Attention.

## Background

### The Attention Mechanism
In transformers, attention computes:
Attention(Q,K,V) = softmax(QK^T/√d)V
For each new token we generate, attention needs the keys and values of all previous tokens; storing them is the job of the KV-cache.
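To make that cost concrete, here is a minimal NumPy sketch (illustrative only, not the experiment script) of one decoding step: the new token's query attends over the cached keys and values of every previous token, so the cache grows linearly with context length.

```python
import numpy as np

def attend_one_token(q, K, V):
    """Attention for a single new token: softmax(q·K^T / sqrt(d)) V.

    q: (d,)   query vector for the token being generated
    K: (n, d) cached keys of the n previous tokens
    V: (n, d) cached values of the n previous tokens
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)              # (n,) similarity with every cached token
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # (d,) weighted mix of cached values

rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(1024, 64))              # this cache is what the strategies below vary
V = rng.normal(size=(1024, 64))
print(attend_one_token(q, K, V).shape)       # (64,)
```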
### KV-Cache Strategies

- Full Cache O(n): Store all past keys/values
  - Maximum memory usage
  - No recomputation needed
  - Used in standard implementations
- Flash Attention O(√n): Store the most recent √n tokens
  - Balanced memory/compute
  - Recompute older tokens as needed
  - Used in production LLMs
- Minimal Cache O(1): Store almost nothing
  - Minimum memory usage
  - Maximum recomputation
  - Used in extreme memory-constrained settings
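As a sketch of how these strategies trade memory for recomputation (a toy cost model, not the logic of `llm_kv_cache_experiment.py`), the following counts resident versus recomputed KV pairs over one generation pass; the √n window size is taken with `math.isqrt` and the policy names are illustrative.

```python
import math

def simulate(n_tokens, policy):
    """Count peak cached KV pairs and total recomputed KV pairs for one strategy.

    policy: "full" - keep every past token's K/V (O(n) memory, no recompute)
            "sqrt" - keep only the most recent sqrt(n) tokens, recompute older ones
            "none" - keep nothing, recompute everything (O(1) memory)
    """
    window = {"full": n_tokens, "sqrt": math.isqrt(n_tokens), "none": 0}[policy]
    stored_peak, recomputed = 0, 0
    for t in range(n_tokens):          # generating token t attends over tokens 0..t-1
        cached = min(t, window)        # K/V pairs already resident in the cache
        recomputed += t - cached       # older K/V pairs must be rebuilt on the fly
        stored_peak = max(stored_peak, cached)
    return stored_peak, recomputed

for policy in ("full", "sqrt", "none"):
    stored, recomputed = simulate(2048, policy)
    print(f"{policy:>5}: peak cache = {stored:5d} KV pairs, recomputed = {recomputed:9d}")
```

For 2048 tokens the full cache peaks at 2047 resident pairs with zero recomputation, while the √n policy never holds more than 45 pairs at once.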
## Running the Experiment

    python llm_kv_cache_experiment.py

The script simulates attention computation for sequences of 512, 1024, and 2048 tokens.
## Surprising Results
Our experiment revealed a counterintuitive finding:
| Cache Size | Memory | Tokens/sec | Speedup |
|---|---|---|---|
| O(n) Full | 12 MB | 197 | 1.0× |
| O(√n) | 1.1 MB | 1,349 | 6.8× |
| O(1) | 0.05 MB | 4,169 | 21.2× |
Smaller caches are FASTER! Why?
- Memory bandwidth bottleneck: Moving 12MB of data is slower than recomputing
- Cache locality: Small working sets fit in L2/L3 cache
- Modern CPUs: Computation is cheap, memory access is expensive
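A quick way to see the bandwidth effect on your own machine is to sum the same number of floating-point values either from one RAM-sized buffer or from a small buffer read many times over. The sketch below assumes NumPy; the buffer sizes are arbitrary and the exact ratio depends on your CPU's cache sizes.

```python
import time
import numpy as np

def avg_time(fn, repeats=10):
    fn()  # warm-up so allocation and first-touch costs are excluded
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats

# Same number of float64 elements summed either from one large buffer
# (streamed from RAM) or from one cache-resident buffer read 256 times.
N = 16 * 1024 * 1024                       # 128 MB: far larger than a typical L3 cache
big = np.random.rand(N)
small = np.random.rand(N // 256)           # ~512 KB: fits in L2/L3 on most CPUs

def sum_big():
    return big.sum()

def sum_small_repeatedly():
    total = 0.0
    for _ in range(256):                   # touches as many elements as sum_big does
        total += small.sum()
    return total

print(f"RAM-resident   : {avg_time(sum_big) * 1e3:6.1f} ms")
print(f"cache-resident : {avg_time(sum_small_repeatedly) * 1e3:6.1f} ms")
```

If the two timings come out similar, increase `N` until `big` clearly exceeds your last-level cache.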
## Real-World Impact
This pattern is used in:
- GPT-4: Flash Attention enables 32K+ context windows
- Claude: Efficient attention for 100K+ tokens
- Llama: Open models with extended context
- Mobile LLMs: Running models on phones with limited memory
## Key Insights
- Williams' bound assumes uniform memory access
- Real systems have memory hierarchies
- Sometimes recomputation is faster than memory access
- The √n pattern emerges naturally as optimal
## Production Techniques
- Flash Attention: Fuses operations to minimize memory transfers
- Paged Attention: Virtual memory for KV-cache
- Multi-Query Attention: Shares keys/values across heads
- Sliding Window: Fixed-size attention window
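Of these, sliding-window attention is the simplest to illustrate. The sketch below is a dense NumPy toy that materializes the full score matrix for clarity; it is not any particular library's kernel. In a real decoder the KV-cache would simply evict entries older than `window`, keeping memory at O(window) instead of O(n).

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=4):
    """Causal attention where token i only attends to tokens i-window+1 .. i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # (n, n) raw attention scores
    pos = np.arange(n)
    # allow position i to see positions j with i - window < j <= i
    mask = (pos[None, :] <= pos[:, None]) & (pos[None, :] > pos[:, None] - window)
    scores = np.where(mask, scores, -np.inf)       # blocked positions get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(8, 16))
print(sliding_window_attention(Q, K, V, window=4).shape)  # (8, 16)
```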
## Generated Files

- `llm_attention_tradeoff.png`: Performance visualization
- `llm_kv_cache_results.json`: Detailed metrics
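The schema of the results file is not documented here, so the quickest way to inspect it is to load and pretty-print whatever it contains:

```python
import json

with open("llm_kv_cache_results.json") as f:
    results = json.load(f)

# Show the top-level structure; adapt once you know the actual field names.
print(json.dumps(results, indent=2)[:1000])
```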