
Database Systems: Space-Time Tradeoffs in Practice

Overview

Databases are perhaps the most prominent example of space-time tradeoffs in production systems. Every major database makes explicit decisions about trading memory for computation time.

1. Query Processing

Hash Join vs Nested Loop Join

Hash Join (More Memory)

  • Build hash table on the smaller input: O(n) space
  • Probe with the larger input: O(n+m) total time
  • Used when: Sufficient memory available
-- PostgreSQL will choose hash join if work_mem is high enough
SET work_mem = '256MB';
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;

Nested Loop Join (Less Memory)

  • Space: O(1)
  • Time: O(n×m)
  • Used when: Memory constrained
-- Low work_mem discourages hash join; to force a nested loop outright:
SET work_mem = '64kB';
SET enable_hashjoin = off;
SET enable_mergejoin = off;

Real PostgreSQL Example

-- Monitor actual memory usage
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM large_table JOIN huge_table USING (id);

-- Output shows:
-- Hash Join: 145MB memory, 2.3 seconds
-- Nested Loop: 64KB memory, 487 seconds

2. Indexing Strategies

B-Tree vs Full Table Scan

  • B-Tree Index: O(n) space, O(log n) lookup
  • No Index: O(1) extra space, O(n) scan time

Covering Indexes

Trading extra index space for index-only scans that avoid touching the heap:

-- Regular index: must fetch row data
CREATE INDEX idx_user_email ON users(email);

-- Covering index: all data in index (more space)
CREATE INDEX idx_user_email_covering ON users(email) INCLUDE (name, created_at);
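
With the covering index in place, a query that touches only email, name, and created_at can be answered by an index-only scan (assuming the visibility map is reasonably current). A quick check against the users table above:

-- Should report "Index Only Scan using idx_user_email_covering":
-- every referenced column lives in the index, so the heap is never visited
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, created_at FROM users WHERE email = 'alice@example.com';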

3. Materialized Views

The ultimate space-for-time trade:

-- Compute once, store results
CREATE MATERIALIZED VIEW sales_summary AS
SELECT 
    date_trunc('day', sale_date) as day,
    product_id,
    SUM(amount) as total_sales,
    COUNT(*) as num_sales
FROM sales
GROUP BY 1, 2;

-- Instant queries vs recomputation
SELECT * FROM sales_summary WHERE day = '2024-01-15';  -- 1ms
-- vs
SELECT ... FROM sales GROUP BY ...;  -- 30 seconds
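
The stored results go stale as sales changes, so the read-time savings are paid back as periodic refresh work; a minimal sketch of that maintenance step:

-- Recompute the stored results (blocks readers while it runs)
REFRESH MATERIALIZED VIEW sales_summary;

-- Or avoid blocking readers; requires a unique index on the view
CREATE UNIQUE INDEX idx_sales_summary_key ON sales_summary (day, product_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary;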

4. Buffer Pool Management

PostgreSQL's shared_buffers

# Low memory: more disk I/O
shared_buffers = 128MB  # Frequent disk reads

# High memory: cache working set  
shared_buffers = 8GB    # Most data in RAM

Performance impact:

  • 128MB: TPC-H query takes 45 minutes
  • 8GB: Same query takes 3 minutes
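
Whether shared_buffers actually covers the working set is observable: pg_stat_database counts block requests served from shared buffers versus those passed down to the OS. A small diagnostic sketch:

-- Hit ratio near 1.0: the working set fits in shared_buffers.
-- A low ratio means the server is trading time (I/O) for space.
SELECT datname,
       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();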

5. Query Planning

Bitmap Heap Scan

A perfect example of √n-like behavior:

  1. Build bitmap of matching rows: O(√n) space
  2. Scan heap in physical order: Better than random I/O
  3. Falls between index scan and sequential scan
EXPLAIN SELECT * FROM orders WHERE status IN ('pending', 'processing');
-- Bitmap Heap Scan on orders
-- Recheck Cond: (status = ANY ('{pending,processing}'::text[]))
-- -> Bitmap Index Scan on idx_status
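
work_mem bounds the bitmap itself: if per-row bits would overflow it, PostgreSQL falls back to one bit per heap page ("lossy" blocks) and rechecks each fetched row, which is why the plan carries a Recheck Cond. Running the same query with ANALYZE exposes the split (figures below are illustrative):

EXPLAIN (ANALYZE) SELECT * FROM orders WHERE status IN ('pending', 'processing');
-- Heap Blocks: exact=1024 lossy=8192
-- exact: per-row bits kept; lossy: collapsed to per-page bits to save space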

6. Write-Ahead Logging (WAL)

Trading write performance for durability:

  • Synchronous commit: Every transaction waits for disk
  • Asynchronous commit: WAL writes are buffered; a crash can lose the last few commits (but not corrupt data)
-- Trade durability for speed
SET synchronous_commit = off;  -- 10x faster inserts
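
The trade need not be global: synchronous_commit can be relaxed per transaction, keeping full durability for critical writes. A sketch (audit_log is a hypothetical table):

BEGIN;
SET LOCAL synchronous_commit = off;  -- only this transaction risks losing its commit
INSERT INTO audit_log (message) VALUES ('bulk import row');  -- hypothetical table
COMMIT;  -- returns before the WAL record reaches disk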

7. Column Stores vs Row Stores

Row Store (PostgreSQL, MySQL)

  • Store complete rows together
  • Good for OLTP, random access
  • Cost: every row read drags in all of its columns, needed or not

Column Store (ClickHouse, Vertica)

  • Store each column separately
  • Excellent compression (less space)
  • Must reconstruct rows (more time for some queries)

Example compression ratios:

  • Row store: 100GB table
  • Column store: 15GB (85% space savings)
  • But: Random row lookup 100x slower
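
The asymmetry shows up directly at the query level; the sketch below assumes a wide sales table stored column-wise:

-- Reads only the compressed amount column: the fast path
SELECT SUM(amount) FROM sales;

-- Must reassemble the row from every column file: the slow path
SELECT * FROM sales WHERE id = 12345;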

8. Real-World Configuration

PostgreSQL Memory Settings

# Total system RAM: 64GB

# Aggressive caching (space for time)
shared_buffers = 16GB          # 25% of RAM
work_mem = 256MB               # Per operation
maintenance_work_mem = 2GB     # For VACUUM, CREATE INDEX

# Conservative (time for space)  
shared_buffers = 128MB         # Minimal caching
work_mem = 4MB                 # Forces disk-based operations

MySQL InnoDB Buffer Pool

# 75% of RAM for buffer pool
innodb_buffer_pool_size = 48G

# Adaptive hash index (space for time)
innodb_adaptive_hash_index = ON

9. Distributed Databases

Replication vs Computation

  • Full replication: n× space, every node can answer reads locally
  • No replication: 1× space, reads become distributed queries

Cassandra's Space Amplification

  • Replication factor 3: 3× space
  • Plus SSTables: Another 2-3× during compaction
  • Total: ~10× space for high availability
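
In Cassandra the replication multiplier is declared up front when the keyspace is created; a minimal CQL sketch:

-- Every partition lives on 3 nodes: 3× space, reads survive two node failures
CREATE KEYSPACE shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};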

Key Insights

  1. Every join algorithm is a space-time tradeoff
  2. Indexes are precomputed results (space for time)
  3. Buffer pools cache hot data (space for I/O time)
  4. Query planners explicitly optimize these tradeoffs
  5. DBAs tune memory to control space-time balance

Connection to Williams' Result

Databases naturally implement √n-like algorithms:

  • Bitmap indexes: O(√n) space for range queries
  • Sort-merge joins: O(√n) memory for external sort (see the arithmetic below)
  • Buffer pool: Typically sized at √(database size)
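
The sort-merge figure has a classic back-of-envelope justification: an external merge sort with M pages of memory writes initial runs of M pages each, then merges roughly M runs in a single pass, so two passes suffice whenever n ≤ M², i.e. M ≈ √n. Sorting a 100GB table (~12.5 million 8kB pages) thus needs only about 3,500 pages, under 30MB, of sort memory.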

The ubiquity of these patterns in database internals validates Williams' theoretical insights about the fundamental nature of space-time tradeoffs in computation.