sqrtspace-tools/db_optimizer/README.md

Memory-Aware Query Optimizer

Database query optimizer that explicitly considers memory hierarchies and space-time tradeoffs based on Williams' theoretical bounds.

Features

  • Cost Model: Incorporates L3/RAM/SSD boundaries in cost calculations
  • Algorithm Selection: Chooses between hash/sort/nested-loop joins based on true memory costs
  • Buffer Sizing: Automatically sizes buffers to √(data_size) for optimal tradeoffs
  • Spill Planning: Optimizes when and how to spill to disk
  • Memory Hierarchy Awareness: Tracks which level (L1-L3/RAM/Disk) operations will use
  • AI Explanations: Clear reasoning for all optimization decisions
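The √(data_size) buffer rule can be sketched in a few lines. This is a minimal illustration of the principle; `sqrt_buffer_size` is a hypothetical helper, not part of the tool's API:

```python
import math

def sqrt_buffer_size(data_size_bytes: int, page_size: int = 4096) -> int:
    """Size a buffer at roughly sqrt(data_size), rounded down to whole pages.

    A sqrt(n)-sized buffer lets an n-byte input be processed in about
    sqrt(n) passes while keeping the buffer small enough to stay in cache,
    which is the space-time tradeoff the optimizer exploits.
    """
    raw = math.isqrt(data_size_bytes)
    pages = max(1, raw // page_size)
    return pages * page_size

# e.g. a 100MB table gets a buffer of sqrt(100MB) ~ 10KB, page-aligned
print(sqrt_buffer_size(100 * 1024 * 1024))
```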

Installation

# From sqrtspace-tools root directory
pip install -r requirements-minimal.txt

Quick Start

from db_optimizer.memory_aware_optimizer import MemoryAwareOptimizer
import sqlite3

# Connect to database
conn = sqlite3.connect('mydb.db')

# Create optimizer with 10MB memory limit
optimizer = MemoryAwareOptimizer(conn, memory_limit=10*1024*1024)

# Optimize a query
sql = """
SELECT c.name, SUM(o.total) 
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.name
ORDER BY SUM(o.total) DESC
"""

result = optimizer.optimize_query(sql)
print(result.explanation)
# "Optimized query plan reduces memory usage by 87.3% with 2.1x estimated speedup.
#  Changed join from nested_loop to hash_join saving 9216KB.
#  Allocated 4 buffers totaling 2048KB for optimal performance."

Join Algorithm Selection

The optimizer intelligently selects join algorithms based on memory constraints:

1. Hash Join

  • When: Smaller table fits in memory
  • Memory: O(min(n,m))
  • Time: O(n+m)
  • Best for: Equi-joins with one small table

2. Sort-Merge Join

  • When: Both tables fit in memory for sorting
  • Memory: O(n+m)
  • Time: O(n log n + m log m)
  • Best for: Pre-sorted data or when output needs ordering

3. Block Nested Loop

  • When: Memory is limited; the outer table is read in √n-sized blocks
  • Memory: O(√n)
  • Time: O(n*m/√n) = O(m*√n)
  • Best for: Memory-constrained environments

4. Nested Loop

  • When: Extreme memory constraints
  • Memory: O(1)
  • Time: O(n*m)
  • Last resort: When memory is critically limited
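The selection order above can be sketched roughly as follows. This is a simplified illustration; `choose_join` and its parameters are hypothetical, and the real optimizer also weighs full cost estimates and hierarchy levels:

```python
import math

def choose_join(n_rows, m_rows, row_bytes, memory_limit, need_sorted=False):
    """Pick one of the four join algorithms above based on memory fit."""
    small = min(n_rows, m_rows) * row_bytes   # smaller table's footprint
    total = (n_rows + m_rows) * row_bytes     # both tables together
    if need_sorted and total <= memory_limit:
        return 'sort_merge_join'    # O(n+m) memory, output comes out ordered
    if small <= memory_limit:
        return 'hash_join'          # O(min(n,m)) memory, O(n+m) time
    if math.isqrt(n_rows * row_bytes) <= memory_limit:
        return 'block_nested_loop'  # O(sqrt(n)) memory
    return 'nested_loop'            # O(1) memory, last resort
```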

Buffer Management

The optimizer automatically calculates optimal buffer sizes:

# Get buffer recommendations
result = optimizer.optimize_query(query)
for buffer_name, size in result.buffer_sizes.items():
    print(f"{buffer_name}: {size / 1024:.1f}KB")

# Output:
# scan_buffer: 316.2KB      # √n sized for sequential scan
# join_buffer: 1024.0KB     # Optimal for hash table
# sort_buffer: 447.2KB      # √n sized for external sort

Spill Strategies

When memory is exceeded, the optimizer plans spilling:

# Check spill strategy
if result.spill_strategy:
    for operation, strategy in result.spill_strategy.items():
        print(f"{operation}: {strategy}")

# Output:
# JOIN_0: grace_hash_join              # Partition both inputs
# SORT_0: multi_pass_external_sort     # Multiple merge passes
# AGGREGATE_0: spill_partial_aggregates # Write intermediate results
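The partitioning step behind `grace_hash_join` can be sketched as follows. This uses a hypothetical `grace_partition` helper; the real strategy spills partitions to temporary files rather than keeping them in in-memory lists:

```python
import math

def grace_partition(rows, key, memory_limit, row_bytes):
    """Split one join input into hash partitions small enough to fit in memory.

    Each partition can then be joined independently with the matching
    partition of the other input, so no single hash table ever exceeds
    the memory limit.
    """
    total = len(rows) * row_bytes
    n_parts = max(1, math.ceil(total / memory_limit))
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(key(row)) % n_parts].append(row)
    return parts
```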

Query Plan Visualization

# View query execution plan
print(optimizer.explain_plan(result.optimized_plan))

# Output:
# AGGREGATE (hash_aggregate)
#   Rows: 100
#   Size: 9.8KB
#   Memory: 14.6KB (L3)
#   Cost: 15234
#   SORT (external_sort)
#     Rows: 1,000
#     Size: 97.7KB
#     Memory: 9.9KB (L3)
#     Cost: 14234
#     JOIN (hash_join)
#       Rows: 1,000
#       Size: 97.7KB
#       Memory: 73.2KB (L3)
#       Cost: 3234
#       SCAN customers (sequential)
#         Rows: 100
#         Size: 9.8KB
#         Memory: 9.8KB (L2)
#         Cost: 98
#       SCAN orders (sequential)
#         Rows: 1,000
#         Size: 48.8KB
#         Memory: 48.8KB (L3)
#         Cost: 488

Optimizer Hints

Apply hints to SQL queries:

# Optimize for minimal memory usage
hinted_sql = optimizer.apply_hints(
    sql, 
    target='memory',
    memory_limit='1MB'
)
# /* SpaceTime Optimizer: Using block nested loop with √n memory ... */
# SELECT ...

# Optimize for speed
hinted_sql = optimizer.apply_hints(
    sql,
    target='latency'
)
# /* SpaceTime Optimizer: Using hash join for minimal latency ... */
# SELECT ...

Real-World Examples

1. Large Table Join with Memory Limit

# 1GB tables, 100MB memory limit
sql = """
SELECT l.*, r.details
FROM large_table l
JOIN reference_table r ON l.ref_id = r.id
WHERE l.status = 'active'
"""

result = optimizer.optimize_query(sql)
# Chooses: Block nested loop with 10MB blocks
# Memory: 10MB (fits in L3 cache)
# Speedup: 10x over naive nested loop

2. Multi-Way Join

sql = """
SELECT *
FROM a
JOIN b ON a.id = b.a_id
JOIN c ON b.id = c.b_id
JOIN d ON c.id = d.c_id
"""

result = optimizer.optimize_query(sql)
# Optimizes join order based on sizes
# Uses different algorithms for each join
# Allocates buffers to minimize spilling

3. Aggregation with Sorting

sql = """
SELECT category, COUNT(*), AVG(price)
FROM products
GROUP BY category
ORDER BY COUNT(*) DESC
"""

result = optimizer.optimize_query(sql)
# Hash aggregation with √n memory
# External sort for final ordering
# Explains tradeoffs clearly

Performance Characteristics

Memory Savings

  • Typical: 50-95% reduction vs naive approach
  • Best case: 99% reduction (large self-joins)
  • Worst case: 10% reduction (already optimal)

Speed Impact

  • Hash or block nested loop (vs naive nested loop): 2-10x speedup
  • External sort: 20-50% overhead vs fully in-memory sort
  • Overall: usually faster despite using less memory

Memory Hierarchy Benefits

  • L3 vs RAM: 8-10x latency improvement
  • RAM vs SSD: 100-1000x latency improvement
  • Optimizer goal: keep hot data in the faster levels

Integration

SQLite

conn = sqlite3.connect('mydb.db')
optimizer = MemoryAwareOptimizer(conn)

PostgreSQL (via psycopg2)

# Use explain analyze to get statistics
# Apply recommendations via SET commands
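One way the SET-command approach could look, building session-level commands from buffer recommendations. This is a sketch: `to_set_commands` and the buffer-name-to-setting mapping are assumptions, not part of the tool's API:

```python
def to_set_commands(buffer_sizes: dict) -> list:
    """Map buffer recommendations (name -> bytes) to PostgreSQL SET commands.

    Only sort and join buffers map naturally onto work_mem; other buffers
    are skipped. PostgreSQL enforces a 64kB floor on work_mem.
    """
    gucs = {'sort_buffer': 'work_mem', 'join_buffer': 'work_mem'}
    cmds = []
    for name, size in buffer_sizes.items():
        guc = gucs.get(name)
        if guc:
            cmds.append(f"SET {guc} = '{max(64, size // 1024)}kB'")
    return cmds
```

The resulting strings can be executed over any DB-API connection (e.g. psycopg2) before running the optimized query.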

MySQL (planned)

# Similar approach with optimizer hints

How It Works

  1. Statistics Collection: Gathers table sizes, indexes, cardinalities
  2. Query Analysis: Parses SQL to extract operations
  3. Cost Modeling: Estimates cost with memory hierarchy awareness
  4. Algorithm Selection: Chooses optimal algorithms for each operation
  5. Buffer Allocation: Sizes buffers using √n principle
  6. Spill Planning: Determines graceful degradation strategy
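Step 3's hierarchy awareness amounts to classifying each operation's working set against cache and RAM capacities, as sketched below. Capacities and relative latencies here are illustrative ballpark figures, not values measured by the tool:

```python
def memory_level(working_set_bytes: int):
    """Return (level, relative latency) for a working set of the given size.

    Relative latencies roughly follow the ratios quoted above:
    RAM ~10x slower than L3, disk ~100x slower than RAM.
    """
    levels = [
        ('L1',   32 * 1024,        1),
        ('L2',  256 * 1024,        4),
        ('L3',    8 * 1024 ** 2,  10),
        ('RAM',  32 * 1024 ** 3, 100),
    ]
    for name, capacity, latency in levels:
        if working_set_bytes <= capacity:
            return name, latency
    return 'Disk', 10_000
```

In the plan visualization above, this is what produces annotations like `Memory: 14.6KB (L3)`.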

Limitations

  • Simplified cardinality estimation
  • SQLite-focused (PostgreSQL support planned)
  • No runtime adaptation yet
  • Requires accurate statistics

Future Enhancements

  • Runtime plan adjustment
  • Learned cost models
  • PostgreSQL native integration
  • Distributed query optimization
  • GPU memory hierarchy support

See Also