sqrtspace-tools/db_optimizer/README.md

Memory-Aware Query Optimizer

Database query optimizer that explicitly considers memory hierarchies and space-time tradeoffs based on Williams' theoretical bounds.

Features

  • Cost Model: Incorporates L3/RAM/SSD boundaries in cost calculations
  • Algorithm Selection: Chooses between hash/sort/nested-loop joins based on true memory costs
  • Buffer Sizing: Automatically sizes buffers to √(data_size) for optimal tradeoffs
  • Spill Planning: Optimizes when and how to spill to disk
  • Memory Hierarchy Awareness: Tracks which level (L1-L3/RAM/Disk) operations will use
  • AI Explanations: Clear reasoning for all optimization decisions
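The √(data_size) buffer rule can be sketched in a few lines. This is a minimal illustration of the principle; `sqrt_buffer_size` is a hypothetical helper, not part of the tool's API:

```python
import math

def sqrt_buffer_size(data_size_bytes: int, page_size: int = 4096) -> int:
    """Size a buffer at roughly sqrt(data_size), rounded down to whole pages.

    A sqrt(n)-sized buffer lets an n-byte input be processed in about
    sqrt(n) passes while keeping the buffer small enough to stay in cache,
    which is the space-time tradeoff the optimizer exploits.
    """
    raw = math.isqrt(data_size_bytes)
    pages = max(1, raw // page_size)
    return pages * page_size

# e.g. a 100MB table gets a buffer of sqrt(100MB) ~ 10KB, page-aligned
print(sqrt_buffer_size(100 * 1024 * 1024))
```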

Installation

# From sqrtspace-tools root directory
pip install -r requirements-minimal.txt

Quick Start

from db_optimizer.memory_aware_optimizer import MemoryAwareOptimizer
import sqlite3

# Connect to database
conn = sqlite3.connect('mydb.db')

# Create optimizer with 10MB memory limit
optimizer = MemoryAwareOptimizer(conn, memory_limit=10*1024*1024)

# Optimize a query
sql = """
SELECT c.name, SUM(o.total) 
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.name
ORDER BY SUM(o.total) DESC
"""

result = optimizer.optimize_query(sql)
print(result.explanation)
# "Optimized query plan reduces memory usage by 87.3% with 2.1x estimated speedup.
#  Changed join from nested_loop to hash_join saving 9216KB.
#  Allocated 4 buffers totaling 2048KB for optimal performance."

Join Algorithm Selection

The optimizer intelligently selects join algorithms based on memory constraints:

1. Hash Join

  • When: Smaller table fits in memory
  • Memory: O(min(n,m))
  • Time: O(n+m)
  • Best for: Equi-joins with one small table

2. Sort-Merge Join

  • When: Both tables fit in memory for sorting
  • Memory: O(n+m)
  • Time: O(n log n + m log m)
  • Best for: Pre-sorted data or when output needs ordering

3. Block Nested Loop

  • When: Memory is limited; the outer table is read in √n-sized blocks
  • Memory: O(√n)
  • Time: O(n*m/√n) = O(m*√n)
  • Best for: Memory-constrained environments

4. Nested Loop

  • When: Extreme memory constraints
  • Memory: O(1)
  • Time: O(n*m)
  • Last resort: When memory is critically limited
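The selection order above can be sketched roughly as follows. This is a simplified illustration; `choose_join` and its parameters are hypothetical, and the real optimizer also weighs full cost estimates and hierarchy levels:

```python
import math

def choose_join(n_rows, m_rows, row_bytes, memory_limit, need_sorted=False):
    """Pick one of the four join algorithms above based on memory fit."""
    small = min(n_rows, m_rows) * row_bytes   # smaller table's footprint
    total = (n_rows + m_rows) * row_bytes     # both tables together
    if need_sorted and total <= memory_limit:
        return 'sort_merge_join'    # O(n+m) memory, output comes out ordered
    if small <= memory_limit:
        return 'hash_join'          # O(min(n,m)) memory, O(n+m) time
    if math.isqrt(n_rows * row_bytes) <= memory_limit:
        return 'block_nested_loop'  # O(sqrt(n)) memory
    return 'nested_loop'            # O(1) memory, last resort
```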

Buffer Management

The optimizer automatically calculates optimal buffer sizes:

# Get buffer recommendations
result = optimizer.optimize_query(query)
for buffer_name, size in result.buffer_sizes.items():
    print(f"{buffer_name}: {size / 1024:.1f}KB")

# Output:
# scan_buffer: 316.2KB      # √n sized for sequential scan
# join_buffer: 1024.0KB     # Optimal for hash table
# sort_buffer: 447.2KB      # √n sized for external sort

Spill Strategies

When memory is exceeded, the optimizer plans spilling:

# Check spill strategy
if result.spill_strategy:
    for operation, strategy in result.spill_strategy.items():
        print(f"{operation}: {strategy}")

# Output:
# JOIN_0: grace_hash_join              # Partition both inputs
# SORT_0: multi_pass_external_sort     # Multiple merge passes
# AGGREGATE_0: spill_partial_aggregates # Write intermediate results
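The partitioning step behind `grace_hash_join` can be sketched as follows. This uses a hypothetical `grace_partition` helper; the real strategy spills partitions to temporary files rather than keeping them in in-memory lists:

```python
import math

def grace_partition(rows, key, memory_limit, row_bytes):
    """Split one join input into hash partitions small enough to fit in memory.

    Each partition can then be joined independently with the matching
    partition of the other input, so no single hash table ever exceeds
    the memory limit.
    """
    total = len(rows) * row_bytes
    n_parts = max(1, math.ceil(total / memory_limit))
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(key(row)) % n_parts].append(row)
    return parts
```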

Query Plan Visualization

# View query execution plan
print(optimizer.explain_plan(result.optimized_plan))

# Output:
# AGGREGATE (hash_aggregate)
#   Rows: 100
#   Size: 9.8KB
#   Memory: 14.6KB (L3)
#   Cost: 15234
#   SORT (external_sort)
#     Rows: 1,000
#     Size: 97.7KB
#     Memory: 9.9KB (L3)
#     Cost: 14234
#     JOIN (hash_join)
#       Rows: 1,000
#       Size: 97.7KB
#       Memory: 73.2KB (L3)
#       Cost: 3234
#       SCAN customers (sequential)
#         Rows: 100
#         Size: 9.8KB
#         Memory: 9.8KB (L2)
#         Cost: 98
#       SCAN orders (sequential)
#         Rows: 1,000
#         Size: 48.8KB
#         Memory: 48.8KB (L3)
#         Cost: 488

Optimizer Hints

Apply hints to SQL queries:

# Optimize for minimal memory usage
hinted_sql = optimizer.apply_hints(
    sql, 
    target='memory',
    memory_limit='1MB'
)
# /* SpaceTime Optimizer: Using block nested loop with √n memory ... */
# SELECT ...

# Optimize for speed
hinted_sql = optimizer.apply_hints(
    sql,
    target='latency'
)
# /* SpaceTime Optimizer: Using hash join for minimal latency ... */
# SELECT ...

Real-World Examples

1. Large Table Join with Memory Limit

# 1GB tables, 100MB memory limit
sql = """
SELECT l.*, r.details
FROM large_table l
JOIN reference_table r ON l.ref_id = r.id
WHERE l.status = 'active'
"""

result = optimizer.optimize_query(sql)
# Chooses: Block nested loop with 10MB blocks
# Memory: 10MB (fits in L3 cache)
# Speedup: 10x over naive nested loop

2. Multi-Way Join

sql = """
SELECT *
FROM a
JOIN b ON a.id = b.a_id
JOIN c ON b.id = c.b_id
JOIN d ON c.id = d.c_id
"""

result = optimizer.optimize_query(sql)
# Optimizes join order based on sizes
# Uses different algorithms for each join
# Allocates buffers to minimize spilling

3. Aggregation with Sorting

sql = """
SELECT category, COUNT(*), AVG(price)
FROM products
GROUP BY category
ORDER BY COUNT(*) DESC
"""

result = optimizer.optimize_query(sql)
# Hash aggregation with √n memory
# External sort for final ordering
# Explains tradeoffs clearly

Performance Characteristics

Memory Savings

  • Typical: 50-95% reduction vs naive approach
  • Best case: 99% reduction (large self-joins)
  • Worst case: 10% reduction (already optimal)

Speed Impact

  • Hash or block nested loop (vs naive nested loop): 2-10x speedup
  • External sort: 20-50% overhead vs fully in-memory sort
  • Overall: usually faster despite using less memory

Memory Hierarchy Benefits

  • L3 vs RAM: 8-10x latency improvement
  • RAM vs SSD: 100-1000x latency improvement
  • Optimizer goal: keep hot data in the faster levels

Integration

SQLite

conn = sqlite3.connect('mydb.db')
optimizer = MemoryAwareOptimizer(conn)

PostgreSQL (via psycopg2)

# Use explain analyze to get statistics
# Apply recommendations via SET commands
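One way the SET-command approach could look, building session-level commands from buffer recommendations. This is a sketch: `to_set_commands` and the buffer-name-to-setting mapping are assumptions, not part of the tool's API:

```python
def to_set_commands(buffer_sizes: dict) -> list:
    """Map buffer recommendations (name -> bytes) to PostgreSQL SET commands.

    Only sort and join buffers map naturally onto work_mem; other buffers
    are skipped. PostgreSQL enforces a 64kB floor on work_mem.
    """
    gucs = {'sort_buffer': 'work_mem', 'join_buffer': 'work_mem'}
    cmds = []
    for name, size in buffer_sizes.items():
        guc = gucs.get(name)
        if guc:
            cmds.append(f"SET {guc} = '{max(64, size // 1024)}kB'")
    return cmds
```

The resulting strings can be executed over any DB-API connection (e.g. psycopg2) before running the optimized query.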

MySQL (planned)

# Similar approach with optimizer hints

How It Works

  1. Statistics Collection: Gathers table sizes, indexes, cardinalities
  2. Query Analysis: Parses SQL to extract operations
  3. Cost Modeling: Estimates cost with memory hierarchy awareness
  4. Algorithm Selection: Chooses optimal algorithms for each operation
  5. Buffer Allocation: Sizes buffers using √n principle
  6. Spill Planning: Determines graceful degradation strategy
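Step 3's hierarchy awareness amounts to classifying each operation's working set against cache and RAM capacities, as sketched below. Capacities and relative latencies here are illustrative ballpark figures, not values measured by the tool:

```python
def memory_level(working_set_bytes: int):
    """Return (level, relative latency) for a working set of the given size.

    Relative latencies roughly follow the ratios quoted above:
    RAM ~10x slower than L3, disk ~100x slower than RAM.
    """
    levels = [
        ('L1',   32 * 1024,        1),
        ('L2',  256 * 1024,        4),
        ('L3',    8 * 1024 ** 2,  10),
        ('RAM',  32 * 1024 ** 3, 100),
    ]
    for name, capacity, latency in levels:
        if working_set_bytes <= capacity:
            return name, latency
    return 'Disk', 10_000
```

In the plan visualization above, this is what produces annotations like `Memory: 14.6KB (L3)`.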

Limitations

  • Simplified cardinality estimation
  • SQLite-focused (PostgreSQL support planned)
  • No runtime adaptation yet
  • Requires accurate statistics

Future Enhancements

  • Runtime plan adjustment
  • Learned cost models
  • PostgreSQL native integration
  • Distributed query optimization
  • GPU memory hierarchy support

See Also