# LLM KV-Cache Experiment

## Overview

This experiment demonstrates space-time tradeoffs in Large Language Model (LLM) attention mechanisms. By varying the KV-cache size, we show how modern AI systems implement Williams' √n pattern through techniques like Flash Attention.

## Background

### The Attention Mechanism
In transformers, attention computes:
Attention(Q,K,V) = softmax(QK^T/√d)V
For each new token we generate, attention needs the keys and values of all previous tokens; storing them is the job of the KV-cache.
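To make that cost concrete, here is a minimal NumPy sketch (illustrative only, not the experiment script) of one decoding step: the new token's query attends over the cached keys and values of every previous token, so the cache grows linearly with context length.

```python
import numpy as np

def attend_one_token(q, K, V):
    """Attention for a single new token: softmax(q·K^T / sqrt(d)) V.

    q: (d,)   query vector for the token being generated
    K: (n, d) cached keys of the n previous tokens
    V: (n, d) cached values of the n previous tokens
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)              # (n,) similarity with every cached token
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # (d,) weighted mix of cached values

rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(1024, 64))              # this cache is what the strategies below vary
V = rng.normal(size=(1024, 64))
print(attend_one_token(q, K, V).shape)       # (64,)
```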
### KV-Cache Strategies

- Full Cache O(n): Store all past keys/values
  - Maximum memory usage
  - No recomputation needed
  - Used in standard implementations
- Flash Attention O(√n): Store the most recent √n tokens
  - Balanced memory/compute
  - Recompute older tokens as needed
  - Used in production LLMs
- Minimal Cache O(1): Store almost nothing
  - Minimum memory usage
  - Maximum recomputation
  - Used in extreme memory-constrained settings
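As a sketch of how these strategies trade memory for recomputation (a toy cost model, not the logic of `llm_kv_cache_experiment.py`), the following counts resident versus recomputed KV pairs over one generation pass; the √n window size is taken with `math.isqrt` and the policy names are illustrative.

```python
import math

def simulate(n_tokens, policy):
    """Count peak cached KV pairs and total recomputed KV pairs for one strategy.

    policy: "full" - keep every past token's K/V (O(n) memory, no recompute)
            "sqrt" - keep only the most recent sqrt(n) tokens, recompute older ones
            "none" - keep nothing, recompute everything (O(1) memory)
    """
    window = {"full": n_tokens, "sqrt": math.isqrt(n_tokens), "none": 0}[policy]
    stored_peak, recomputed = 0, 0
    for t in range(n_tokens):          # generating token t attends over tokens 0..t-1
        cached = min(t, window)        # K/V pairs already resident in the cache
        recomputed += t - cached       # older K/V pairs must be rebuilt on the fly
        stored_peak = max(stored_peak, cached)
    return stored_peak, recomputed

for policy in ("full", "sqrt", "none"):
    stored, recomputed = simulate(2048, policy)
    print(f"{policy:>5}: peak cache = {stored:5d} KV pairs, recomputed = {recomputed:9d}")
```

For 2048 tokens the full cache peaks at 2047 resident pairs with zero recomputation, while the √n policy never holds more than 45 pairs at once.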
## Running the Experiment

    python llm_kv_cache_experiment.py

The script simulates attention computation for sequences of 512, 1024, and 2048 tokens.
## Surprising Results
Our experiment revealed a counterintuitive finding:
| Cache Size | Memory | Tokens/sec | Speedup |
|---|---|---|---|
| O(n) Full | 12 MB | 197 | 1.0× |
| O(√n) | 1.1 MB | 1,349 | 6.8× |
| O(1) | 0.05 MB | 4,169 | 21.2× |
Smaller caches are FASTER! Why?
- Memory bandwidth bottleneck: Moving 12MB of data is slower than recomputing
- Cache locality: Small working sets fit in L2/L3 cache
- Modern CPUs: Computation is cheap, memory access is expensive
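A quick way to see the bandwidth effect on your own machine is to sum the same number of floating-point values either from one RAM-sized buffer or from a small buffer read many times over. The sketch below assumes NumPy; the buffer sizes are arbitrary and the exact ratio depends on your CPU's cache sizes.

```python
import time
import numpy as np

def avg_time(fn, repeats=10):
    fn()  # warm-up so allocation and first-touch costs are excluded
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats

# Same number of float64 elements summed either from one large buffer
# (streamed from RAM) or from one cache-resident buffer read 256 times.
N = 16 * 1024 * 1024                       # 128 MB: far larger than a typical L3 cache
big = np.random.rand(N)
small = np.random.rand(N // 256)           # ~512 KB: fits in L2/L3 on most CPUs

def sum_big():
    return big.sum()

def sum_small_repeatedly():
    total = 0.0
    for _ in range(256):                   # touches as many elements as sum_big does
        total += small.sum()
    return total

print(f"RAM-resident   : {avg_time(sum_big) * 1e3:6.1f} ms")
print(f"cache-resident : {avg_time(sum_small_repeatedly) * 1e3:6.1f} ms")
```

If the two timings come out similar, increase `N` until `big` clearly exceeds your last-level cache.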
## Real-World Impact
This pattern is used in:
- GPT-4: Flash Attention enables 32K+ context windows
- Claude: Efficient attention for 100K+ tokens
- Llama: Open models with extended context
- Mobile LLMs: Running models on phones with limited memory
## Key Insights
- Williams' bound assumes uniform memory access
- Real systems have memory hierarchies
- Sometimes recomputation is faster than memory access
- The √n pattern emerges naturally as optimal
## Production Techniques
- Flash Attention: Fuses operations to minimize memory transfers
- Paged Attention: Virtual memory for KV-cache
- Multi-Query Attention: Shares keys/values across heads
- Sliding Window: Fixed-size attention window
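Of these, sliding-window attention is the simplest to illustrate. The sketch below is a dense NumPy toy that materializes the full score matrix for clarity; it is not any particular library's kernel. In a real decoder the KV-cache would simply evict entries older than `window`, keeping memory at O(window) instead of O(n).

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=4):
    """Causal attention where token i only attends to tokens i-window+1 .. i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # (n, n) raw attention scores
    pos = np.arange(n)
    # allow position i to see positions j with i - window < j <= i
    mask = (pos[None, :] <= pos[:, None]) & (pos[None, :] > pos[:, None] - window)
    scores = np.where(mask, scores, -np.inf)       # blocked positions get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(8, 16))
print(sliding_window_attention(Q, K, V, window=4).shape)  # (8, 16)
```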
## Generated Files

- `llm_attention_tradeoff.png`: Performance visualization
- `llm_kv_cache_results.json`: Detailed metrics
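The schema of the results file is not documented here, so the quickest way to inspect it is to load and pretty-print whatever it contains:

```python
import json

with open("llm_kv_cache_results.json") as f:
    results = json.load(f)

# Show the top-level structure; adapt once you know the actual field names.
print(json.dumps(results, indent=2)[:1000])
```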