# Database Systems: Space-Time Tradeoffs in Practice
## Overview
Databases are perhaps the most prominent example of space-time tradeoffs in production systems. Every major database makes explicit decisions about trading memory for computation time.
## 1. Query Processing
### Hash Join vs Nested Loop Join
**Hash Join (More Memory)**
- Build hash table on the smaller relation: O(n) space
- Build + probe: O(n + m) time
- Used when: Sufficient memory available
```sql
-- PostgreSQL will choose hash join if work_mem is high enough
SET work_mem = '256MB';
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;
```
**Nested Loop Join (Less Memory)**
- Space: O(1)
- Time: O(n×m)
- Used when: Memory constrained
```sql
-- Force nested loop with low work_mem
SET work_mem = '64kB';
```
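The two strategies can be sketched in a few lines of Python (a toy illustration of the cost model, not PostgreSQL's executor):

```python
def hash_join(build, probe, build_key, probe_key):
    """O(n) space for the hash table; O(n + m) expected time."""
    table = {}
    for row in build:  # build phase over the (ideally smaller) relation
        table.setdefault(row[build_key], []).append(row)
    # probe phase: one hash lookup per probe row
    return [(b, p) for p in probe for b in table.get(p[probe_key], [])]

def nested_loop_join(outer, inner, outer_key, inner_key):
    """O(1) extra space, but O(n * m) comparisons."""
    return [(o, i) for o in outer for i in inner if o[outer_key] == i[inner_key]]

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 2}]

# Same matches either way; only the space/time cost differs.
matches = hash_join(customers, orders, "id", "customer_id")
```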
### Real PostgreSQL Example
```sql
-- Monitor actual memory usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM large_table JOIN huge_table USING (id);
-- Illustrative output:
-- Hash Join: 145MB memory, 2.3 seconds
-- Nested Loop: 64KB memory, 487 seconds
```
## 2. Indexing Strategies
### B-Tree vs Full Table Scan
- **B-Tree Index**: O(n) space, O(log n) lookup
- **No Index**: O(1) extra space, O(n) scan time
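A minimal sketch of the lookup-cost difference, using a sorted Python list as a stand-in for the B-tree's ordered structure:

```python
import bisect

# A sorted array of keys stands in for the B-tree's ordered pages:
# membership costs O(log n) probes instead of an O(n) scan.
keys = list(range(0, 1_000_000, 2))  # even keys only

def index_lookup(sorted_keys, key):
    i = bisect.bisect_left(sorted_keys, key)  # binary search: ~20 probes here
    return i < len(sorted_keys) and sorted_keys[i] == key

def full_scan(keys_on_disk, key):
    return any(k == key for k in keys_on_disk)  # O(n), but no index to store
```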
### Covering Indexes
Trading extra space for index-only scans (no heap fetches):
```sql
-- Regular index: must fetch row data
CREATE INDEX idx_user_email ON users(email);
-- Covering index: all data in index (more space)
CREATE INDEX idx_user_email_covering ON users(email) INCLUDE (name, created_at);
```
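The effect is observable in SQLite, used here as a lightweight stand-in (it lacks `INCLUDE`, so a composite index plays the covering role; `EXPLAIN QUERY PLAN` labels index-only access paths):

```python
import sqlite3

# SQLite stand-in: a composite index covers the query -- every column the
# SELECT needs lives in the index, so the table itself is never read.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT, created_at TEXT)")
con.execute("CREATE INDEX idx_user_email_covering ON users(email, name, created_at)")

# The plan reports a 'COVERING INDEX' when no heap access is needed.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT name, created_at FROM users WHERE email = ?",
    ("alice@example.com",),
).fetchall()
```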
## 3. Materialized Views
The ultimate space-for-time trade:
```sql
-- Compute once, store results
CREATE MATERIALIZED VIEW sales_summary AS
SELECT
    date_trunc('day', sale_date) AS day,
    product_id,
    SUM(amount) AS total_sales,
    COUNT(*) AS num_sales
FROM sales
GROUP BY 1, 2;
-- Instant queries vs recomputation
SELECT * FROM sales_summary WHERE day = '2024-01-15'; -- 1ms
-- vs
SELECT ... FROM sales GROUP BY ...; -- 30 seconds
```
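The same pattern works in SQLite, which has no `CREATE MATERIALIZED VIEW`; a plain table built once from the aggregate query gives the same space-for-time trade:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_date TEXT, product_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2024-01-15", 1, 10.0), ("2024-01-15", 1, 5.0), ("2024-01-16", 2, 7.5)],
)

# Compute once, store the results -- a hand-rolled materialized view.
con.execute(
    """CREATE TABLE sales_summary AS
       SELECT sale_date AS day, product_id,
              SUM(amount) AS total_sales, COUNT(*) AS num_sales
       FROM sales GROUP BY 1, 2"""
)
# Later queries read the precomputed rows instead of re-aggregating.
row = con.execute(
    "SELECT total_sales, num_sales FROM sales_summary WHERE day = '2024-01-15'"
).fetchone()
```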
## 4. Buffer Pool Management
### PostgreSQL's shared_buffers
```conf
# Low memory: more disk I/O
shared_buffers = 128MB # Frequent disk reads
# High memory: cache working set
shared_buffers = 8GB # Most data in RAM
```
Performance impact:
- 128MB: TPC-H query takes 45 minutes
- 8GB: Same query takes 3 minutes
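A toy model of why pool size matters: the buffer pool is essentially an LRU cache, and on a skewed workload a pool smaller than the hot set thrashes while a larger one absorbs almost every read (a hypothetical sketch, not PostgreSQL's clock-sweep algorithm):

```python
from collections import OrderedDict

class BufferPool:
    """Toy LRU buffer pool: hits are RAM reads, misses are disk I/O."""
    def __init__(self, capacity):
        self.capacity, self.pages = capacity, OrderedDict()
        self.hits, self.misses = 0, 0

    def read(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.pages[page_id] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used

# Skewed workload: pages 0-9 are hot, then 100 cold pages at the end.
workload = [i % 10 for i in range(900)] + [100 + i for i in range(100)]
small, large = BufferPool(4), BufferPool(64)
for p in workload:
    small.read(p)
    large.read(p)
# The small pool thrashes on the cyclic hot set; the large pool caches it.
```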
## 5. Query Planning
### Bitmap Heap Scan
A perfect example of √n-like behavior:
1. Build a bitmap of matching rows: roughly O(√n) space (PostgreSQL degrades to a lossy, page-level bitmap when `work_mem` runs out)
2. Scan the heap in physical page order: sequential I/O instead of random I/O
3. Falls between index scan and sequential scan
```sql
EXPLAIN SELECT * FROM orders WHERE status IN ('pending', 'processing');
-- Bitmap Heap Scan on orders
-- Recheck Cond: (status = ANY ('{pending,processing}'::text[]))
-- -> Bitmap Index Scan on idx_status
```
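The two steps can be sketched in Python (a toy model: the "index" is a predicate probe, the bitmap is page-granular, and the recheck mirrors PostgreSQL's Recheck Cond):

```python
# Toy bitmap heap scan: remember which *pages* contain matches (far smaller
# than the matching rows), then read the heap in physical order.
ROWS_PER_PAGE = 4
heap = [("pending" if i % 7 == 0 else "shipped", i) for i in range(40)]

# Step 1: probe the 'index' and set one bit per matching page.
bitmap = set()
for i, (status, _) in enumerate(heap):
    if status == "pending":
        bitmap.add(i // ROWS_PER_PAGE)

# Step 2: visit only marked pages, in sorted (physical) order, and recheck
# the condition on each row of the page.
result = []
for page in sorted(bitmap):
    start = page * ROWS_PER_PAGE
    for status, rowid in heap[start:start + ROWS_PER_PAGE]:
        if status == "pending":
            result.append(rowid)
```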
## 6. Write-Ahead Logging (WAL)
Trading write performance for durability:
- **Synchronous commit**: every transaction waits for the WAL flush to disk
- **Asynchronous commit**: commits are acknowledged before the flush, risking loss of the most recent transactions on a crash
```sql
-- Trade durability for speed
SET synchronous_commit = off; -- 10x faster inserts
```
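SQLite exposes an analogous knob, `PRAGMA synchronous`, which controls whether commits wait for the OS to flush to stable storage (shown here as an analogy, not PostgreSQL behavior):

```python
import sqlite3

# SQLite's analogue of synchronous_commit: with synchronous = OFF, commits
# return before data reaches stable storage -- fast, but a crash can lose them.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA synchronous = OFF")
mode = con.execute("PRAGMA synchronous").fetchone()[0]  # 0 means OFF
```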
## 7. Column Stores vs Row Stores
### Row Store (PostgreSQL, MySQL)
- Store complete rows together
- Good for OLTP, random access
- Space: Stores all columns even if not needed
### Column Store (ClickHouse, Vertica)
- Store each column separately
- Excellent compression (less space)
- Must reconstruct rows (more time for some queries)
Example compression ratios:
- Row store: 100GB table
- Column store: 15GB (85% space savings)
- But: Random row lookup 100x slower
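One type-specific trick behind those compression ratios is run-length encoding: values within a single column are homogeneous and form long runs, which a row store never sees because columns interleave. A minimal sketch:

```python
def rle_encode(values):
    """Collapse consecutive equal values into [value, count] runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    """Expand runs back into the original value sequence."""
    return [v for v, n in runs for _ in range(n)]

status_column = ["pending"] * 500 + ["shipped"] * 400 + ["returned"] * 100
encoded = rle_encode(status_column)  # 1000 values -> 3 [value, count] runs
```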
## 8. Real-World Configuration
### PostgreSQL Memory Settings
```conf
# Total system RAM: 64GB
# Aggressive caching (space for time)
shared_buffers = 16GB # 25% of RAM
work_mem = 256MB # Per operation
maintenance_work_mem = 2GB # For VACUUM, CREATE INDEX
# Conservative (time for space)
shared_buffers = 128MB # Minimal caching
work_mem = 4MB # Forces disk-based operations
```
### MySQL InnoDB Buffer Pool
```conf
# 75% of RAM for buffer pool
innodb_buffer_pool_size = 48G
# Adaptive hash index (space for time)
innodb_adaptive_hash_index = ON
```
## 9. Distributed Databases
### Replication vs Computation
- **Full replication**: n× space, instant reads
- **No replication**: 1× space, distributed queries
### Cassandra's Space Amplification
- Replication factor 3: 3× space
- Plus SSTables: another 2-3× transiently during compaction
- Total: roughly 6-10× the raw data size for high availability
## Key Insights
1. **Every join algorithm** is a space-time tradeoff
2. **Indexes** are precomputed results (space for time)
3. **Buffer pools** cache hot data (space for I/O time)
4. **Query planners** explicitly optimize these tradeoffs
5. **DBAs tune memory** to control space-time balance
## Connection to Williams' Result
Databases naturally implement √n-like algorithms:
- Bitmap indexes: O(√n) space for range queries
- Sort-merge joins: O(√n) memory for external sort
- Buffer pool: Typically sized at √(database size)
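The sort-merge point can be made concrete: with a working buffer of about √n rows, a two-pass external sort produces ~√n runs and merges them with a heap holding one entry per run (a toy sketch, not PostgreSQL's tuplesort):

```python
import heapq
import math

def external_sort(data, memory):
    # Pass 1: sort memory-sized chunks -- each sorted chunk is a run that a
    # real system would spill to disk.
    runs = [sorted(data[i:i + memory]) for i in range(0, len(data), memory)]
    # Pass 2: k-way merge; the heap holds one entry per run (~sqrt(n) of them).
    return list(heapq.merge(*runs))

data = [(i * 2654435761) % 1000 for i in range(1000)]  # pseudo-random keys
buffer_rows = math.isqrt(len(data))  # sqrt(n)-sized working memory
result = external_sort(data, buffer_rows)
```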
The ubiquity of these patterns in database internals validates Williams' theoretical insights about the fundamental nature of space-time tradeoffs in computation.