# Distributed Computing: Space-Time Tradeoffs at Scale
## Overview
Distributed systems make explicit decisions about replication (space) vs computation (time). Every major distributed framework embodies these tradeoffs.
## 1. MapReduce / Hadoop
### Shuffle Phase - The Classic Tradeoff
```java
// Map output: Written to local disk (space for fault tolerance)
map(key, value):
    for word in value.split():
        emit(word, 1)
// Shuffle: All-to-all communication
// Choice: Buffer in memory vs spill to disk
shuffle.memory.ratio = 0.7 // 70% of heap for shuffle
shuffle.spill.percent = 0.8 // Spill when 80% full
```
**Memory Settings Impact:**
- High memory: Fast shuffle, risk of OOM
- Low memory: Frequent spills, 10x slower
- Sweet spot: √(data_size) memory per node
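
One way to see why √(data_size) is a sweet spot is the external-sort structure of the shuffle: buffering roughly √n records per spill leaves roughly √n sorted runs, and a single merge pass can hold one read buffer per run. The sketch below is purely illustrative (the function name and record-count framing are assumptions, not Hadoop APIs):

```python
import math

def spill_plan(num_records: int) -> tuple[int, int]:
    """Buffer ~sqrt(n) records per spill: this leaves ~sqrt(n) sorted runs,
    which a single merge pass can combine (illustrative sketch only)."""
    run_size = math.isqrt(num_records)            # records buffered before each spill
    num_runs = math.ceil(num_records / run_size)  # spill files left to merge
    return run_size, num_runs

# 10^8 map-output records: runs of 10,000 records, ~10,000 spill files to merge
print(spill_plan(100_000_000))  # (10000, 10000)
```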
### Combiner Optimization
```java
// Without combiner: Send all data
map: (word, 1), (word, 1), (word, 1)...
// With combiner: Local aggregation (compute for space)
combine: (word, 3)
// Network transfer: 100x reduction
// CPU cost: Local sum computation
```
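
A rough Python analogue of the combiner effect, counting how many (word, count) pairs would cross the network with and without local aggregation (the sample text is made up; real combiners run inside the MapReduce framework):

```python
from collections import Counter

def map_without_combiner(text: str) -> list[tuple[str, int]]:
    # Emit one (word, 1) pair per token; every pair is shuffled over the network.
    return [(word, 1) for word in text.split()]

def map_with_combiner(text: str) -> list[tuple[str, int]]:
    # Sum counts locally first; only one pair per distinct word is shuffled.
    return list(Counter(text.split()).items())

doc = "to be or not to be " * 1000
print(len(map_without_combiner(doc)))  # 6000 pairs shipped
print(len(map_with_combiner(doc)))     # 4 pairs shipped (one per distinct word)
```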
## 2. Apache Spark
### RDD Persistence Levels
```scala
// MEMORY_ONLY: Fast but memory intensive
rdd.persist(StorageLevel.MEMORY_ONLY)
// Space: Full dataset in RAM
// Time: Instant access
// MEMORY_AND_DISK: Spill to disk when needed
rdd.persist(StorageLevel.MEMORY_AND_DISK)
// Space: Min(dataset, available_ram)
// Time: RAM-speed or disk-speed
// DISK_ONLY: Minimal memory
rdd.persist(StorageLevel.DISK_ONLY)
// Space: O(1) RAM
// Time: Always disk I/O
// MEMORY_ONLY_SER: Serialized in memory
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
// Space: 2-5x reduction via serialization
// Time: CPU cost to deserialize
```
### Broadcast Variables
```scala
// Without broadcast: Send to each task
val bigData = loadBigDataset() // 1GB
rdd.map(x => doSomething(x, bigData))
// Network: 1GB × num_tasks
// With broadcast: Send once per node
val bcData = sc.broadcast(bigData)
rdd.map(x => doSomething(x, bcData.value))
// Network: 1GB × num_nodes
// Memory: Extra copy per node
```
## 3. Distributed Key-Value Stores
### Redis Eviction Policies
```conf
# No eviction: Fail when full (pure space)
maxmemory-policy noeviction
# LRU: Recompute evicted data (time for space)
maxmemory-policy allkeys-lru
maxmemory 10gb
# LFU: Better hit rate, more CPU
maxmemory-policy allkeys-lfu
```
### Memcached Slab Allocation
- Fixed-size slabs: Internal fragmentation (waste space)
- Variable-size: External fragmentation (CPU to compact)
- Typical: √n slab classes for n object sizes
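
A hedged sketch of the √n slab-class idea: bucket n distinct object sizes into about √n classes, each sized to the largest object it must hold. The sizing rule here is an assumption for illustration; real memcached grows slab sizes by a configurable geometric factor instead:

```python
import math

def slab_classes(object_sizes: list[int]) -> list[int]:
    """Cover n distinct object sizes with ~sqrt(n) slab classes.
    Fewer classes means more internal fragmentation per object;
    more classes means more partially filled slabs to manage."""
    sizes = sorted(set(object_sizes))
    n = len(sizes)
    step = max(1, math.isqrt(n))  # ~sqrt(n) classes
    # Each class uses the largest size in its bucket so every object in the bucket fits.
    return [sizes[min(i + step - 1, n - 1)] for i in range(0, n, step)]

print(slab_classes(list(range(64, 164))))  # 100 distinct sizes -> 10 slab classes
```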
## 4. Kafka / Stream Processing
### Log Compaction
```properties
# Keep all messages (max space): no compaction, no time-based deletion
cleanup.policy=delete
retention.ms=-1
# Keep only latest per key (compute to save space)
cleanup.policy=compact
min.compaction.lag.ms=86400000
# Compression (CPU for space)
compression.type=lz4 # 4x space reduction
compression.type=zstd # 6x reduction, more CPU
```
### Consumer Groups
- Separate consumer groups: each group receives the full stream (replicated processing)
- Within one group: partition assignment ensures each message is processed by exactly one consumer
- Tradeoff: redundancy vs coordination overhead (see the sketch below)
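
A minimal sketch of the partition side of this tradeoff: round-robin assignment of partitions to the consumers of one group, so separate groups replicate the work while members of one group split it. Consumer and group names are invented; Kafka's real assignors are pluggable:

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin partitions across the consumers of a single group:
    each partition, and so each message, is owned by exactly one consumer."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# Two independent groups each see all 6 partitions (replicated processing);
# inside a group, the partitions are split so nothing is processed twice.
print(assign_partitions(list(range(6)), ["analytics-1", "analytics-2"]))
print(assign_partitions(list(range(6)), ["billing-1", "billing-2", "billing-3"]))
```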
## 5. Kubernetes / Container Orchestration
### Resource Requests vs Limits
```yaml
resources:
  requests:
    memory: "256Mi" # Guaranteed (space reservation)
    cpu: "250m"     # Guaranteed (time reservation)
  limits:
    memory: "512Mi" # Max before OOM
    cpu: "500m"     # Max before throttling
```
### Image Layer Caching
- Base images: Shared across containers (dedup space)
- Layer reuse: Fast container starts
- Tradeoff: Registry space vs pull time
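
A back-of-the-envelope sketch of the dedup saving: the registry stores each layer once, while the naive total counts shared base layers per image. Image names and layer sizes (in MB) are invented for illustration:

```python
images = {
    "api":    {"base-os": 80, "python-runtime": 120, "api-code": 15},
    "worker": {"base-os": 80, "python-runtime": 120, "worker-code": 10},
    "cron":   {"base-os": 80, "cron-code": 5},
}

unique_layers = {name: size for layers in images.values() for name, size in layers.items()}
naive = sum(sum(layers.values()) for layers in images.values())  # every image counted in full
deduped = sum(unique_layers.values())                            # each layer stored once

print(f"without sharing: {naive} MB, with layer dedup: {deduped} MB")
# without sharing: 510 MB, with layer dedup: 230 MB
```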
## 6. Distributed Consensus
### Raft Log Compaction
```go
// Snapshot periodically to bound log size
if logSize > maxLogSize {
    snapshot := createSnapshot(stateMachine)
    truncateLog(snapshot.index)
}
// Space: O(snapshot) instead of O(all_operations)
// Time: Recreate state from snapshot + recent ops
```
### Multi-Paxos vs Raft
- Multi-Paxos: Less memory, complex recovery
- Raft: More memory (full log), simple recovery
- Tradeoff: Space vs implementation complexity
## 7. Content Delivery Networks (CDNs)
### Edge Caching Strategy
```nginx
# Cache everything (max space)
proxy_cache_path /var/cache/nginx keys_zone=edge:100m max_size=100g;
proxy_cache_valid 200 30d;

# Cache popular objects only (min_uses makes admission depend on observed popularity)
proxy_cache_path /var/cache/nginx keys_zone=edge:100m max_size=10g;
proxy_cache_min_uses 3;
proxy_cache_valid 200 1h;
```
### Geographic Replication
- Full replication: Every edge has all content
- Lazy pull: Fetch on demand
- Predictive push: ML models predict demand
## 8. Batch Processing Frameworks
### Apache Flink Checkpointing
```java
// Checkpoint frequency (space vs recovery time)
env.enableCheckpointing(10000); // Every 10 seconds
// State backend choice
env.setStateBackend(new FsStateBackend("hdfs://..."));
// vs
env.setStateBackend(new RocksDBStateBackend("file://..."));
// RocksDBStateBackend: state on local disk, scales to large state but with slower access
// FsStateBackend: working state on the JVM heap, fast access but limited by memory
```
### Watermark Strategies
- Perfect watermarks: Buffer all late data (space)
- Heuristic watermarks: Drop some late data (accuracy for space)
- Allowed lateness: Bounded buffer
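
A toy sketch of the allowed-lateness middle ground from the list above: keep late events only while they fall within a fixed window behind the watermark, and drop the rest, so the buffer is bounded by lateness rather than by stream length. This is pure illustration, not the Flink API:

```python
def window_with_lateness(events, allowed_lateness_ms: int):
    """events: iterable of (event_time_ms, value), roughly time-ordered."""
    watermark = 0
    buffer: list[tuple[int, str]] = []  # bounded by lateness, not by stream length
    dropped = 0
    for event_time, value in events:
        watermark = max(watermark, event_time)
        if event_time >= watermark - allowed_lateness_ms:
            buffer.append((event_time, value))
        else:
            dropped += 1  # accuracy traded for bounded space
        # Evict buffered events that have fallen out of the lateness window.
        buffer = [(t, v) for t, v in buffer if t >= watermark - allowed_lateness_ms]
    return buffer, dropped

events = [(1000, "a"), (2000, "b"), (1500, "c"), (9000, "d"), (2500, "e")]
print(window_with_lateness(events, allowed_lateness_ms=1000))
# ([(9000, 'd')], 1): "e" arrives too late once the watermark jumps to 9000
```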
## 9. Real-World Examples
### Google's MapReduce (2004)
- Problem: Processing 20TB of web data
- Solution: Trade disk space for fault tolerance
- Impact: 1000 machines × 3 hours vs 1 machine × 3000 hours
### Facebook's TAO (2013)
- Problem: Social graph queries
- Solution: Replicate to every datacenter
- Tradeoff: Petabytes of RAM for microsecond latency
### Amazon's Dynamo (2007)
- Problem: Shopping cart availability
- Solution: Eventually consistent, multi-version
- Tradeoff: Space for conflict resolution
## 10. Optimization Patterns
### Hierarchical Aggregation
```python
import math

def chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]

# Naive: all-to-one, the coordinator holds all n partial results at once
def naive_aggregate(workers):
    results = [worker.compute() for worker in workers]
    return aggregate(results)  # Bottleneck!

# Tree aggregation: √n fan-out, so no node holds more than ~√n partial results
def tree_aggregate(workers):
    fan_out = max(1, math.isqrt(len(workers)))
    partials = [worker.compute() for worker in workers]
    level1 = [aggregate(chunk) for chunk in chunks(partials, fan_out)]
    level2 = [aggregate(chunk) for chunk in chunks(level1, fan_out)]
    return aggregate(level2)

# Space: O(√n) intermediate results per aggregator
# Time: levels run in parallel, so O(√n) vs O(n) at a single coordinator
```
### Bloom Filters in Distributed Joins
```java
// Broadcast join with Bloom filter
BloomFilter filter = createBloomFilter(smallTable);
broadcast(filter);
// Each node filters locally
bigTable.filter(row -> filter.mightContain(row.key))
.join(broadcastedSmallTable);
// Space: ~10 bits per small-table key (about 1% false positives)
// Network: only rows that might join are shuffled, up to ~99% fewer transfers
```
## Key Insights
1. **Every distributed system** trades replication for computation
2. **The √n pattern** appears in:
   - Shuffle buffer sizes
   - Checkpoint frequencies
   - Aggregation tree heights
   - Cache sizes
3. **Network is the new disk**:
   - Network transfer ≈ Disk I/O in cost
   - Same space-time tradeoffs apply
4. **Failures force space overhead**:
   - Replication for availability
   - Checkpointing for recovery
   - Logging for consistency
## Connection to Williams' Result
Distributed systems naturally implement √n algorithms:
- Shuffle phases: O(√n) memory per node optimal
- Aggregation trees: √n fan-out bounds per-node memory while keeping depth low
- Cache sizing: √(total_data) per node common
These patterns emerge independently across systems, validating the fundamental nature of the √(t log t) space bound for time-t computations.
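
For a sense of scale, a rough calculation of the bound with log taken base 2 (constant factors in the actual theorem are ignored):

```python
import math

def space_bound(t: float) -> float:
    """Williams-style bound: a time-t computation fits in ~sqrt(t * log t) space."""
    return math.sqrt(t * math.log2(t))

for t in (10**6, 10**9, 10**12):
    print(f"t = {t:.0e}  ->  sqrt(t log t) ~ {space_bound(t):.2e}")
# t = 1e+06  ->  sqrt(t log t) ~ 4.46e+03
# t = 1e+09  ->  sqrt(t log t) ~ 1.73e+05
# t = 1e+12  ->  sqrt(t log t) ~ 6.31e+06
```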