Files
sqrtspace-experiments/experiments/checkpointed_sorting/README.md
2025-07-20 03:56:21 -04:00

96 lines
3.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Checkpointed Sorting Experiment
## Overview
This experiment demonstrates how external merge sort with limited memory exhibits the space-time tradeoff predicted by Williams' 2025 result.
## Key Concepts
### Standard In-Memory Sort
- **Space**: O(n) - entire array in memory
- **Time**: O(n log n) - optimal comparison-based sorting
- **Example**: Python's built-in sort, quicksort
### Checkpointed External Sort
- **Space**: O(√n) - only √n elements in memory at once
- **Time**: O(n√n) - due to disk I/O and recomputation
- **Technique**: Sort chunks that fit in memory, merge with limited buffers
### Extreme Space-Limited Sort
- **Space**: O(log n) - minimal memory usage
- **Time**: O(n²) - extensive recomputation required
- **Technique**: Iterative merging with frequent checkpointing
## Running the Experiments
### Quick Test
```bash
python test_quick.py
```
Runs with small input sizes (100-1000) to verify correctness.
### Full Experiment
```bash
python run_final_experiment.py
```
Runs complete experiment with:
- Input sizes: 1000, 2000, 5000, 10000, 20000
- 10 trials per size for statistical significance
- RAM disk comparison to isolate I/O overhead
- Generates publication-quality plots
### Rigorous Analysis
```bash
python rigorous_experiment.py
```
Comprehensive experiment with:
- 20 trials per size
- Detailed memory profiling
- Environment logging
- Statistical analysis with confidence intervals
## Actual Results (Apple M3 Max, 64GB RAM)
| Input Size | In-Memory Time | Checkpointed Time | Slowdown | Memory Reduction |
|------------|----------------|-------------------|----------|------------------|
| 1,000 | 0.022 ms | 8.2 ms | 375× | 0.1× (overhead) |
| 5,000 | 0.045 ms | 23.4 ms | 516× | 0.2× |
| 10,000 | 0.091 ms | 40.5 ms | 444× | 0.2× |
| 20,000 | 0.191 ms | 71.4 ms | 375× | 0.2× |
Note: Memory shows algorithmic overhead due to Python's memory management.
## Key Findings
1. **Massive Constant Factors**: 375-627× slowdown instead of theoretical √n
2. **I/O Not Dominant**: Fast NVMe SSDs show only 1.0-1.1× I/O overhead
3. **Scaling Confirmed**: Power law fits show n^1.0 for in-memory, n^1.4 for checkpointed
## Real-World Applications
- **Database Systems**: External sorting for large datasets
- **MapReduce**: Shuffle phase with limited memory
- **Video Processing**: Frame-by-frame processing with checkpoints
- **Scientific Computing**: Out-of-core algorithms
## Visualization
The experiment generates:
1. `paper_sorting_figure.png` - Clean figure for publication
2. `rigorous_sorting_analysis.png` - Detailed analysis with error bars
3. `memory_usage_analysis.png` - Memory scaling comparison
4. `experiment_environment.json` - Hardware/software configuration
5. `final_experiment_results.json` - Raw experimental data
## Dependencies
```bash
pip install numpy scipy matplotlib psutil
```
## Reproducing Results
To reproduce our results exactly:
1. Ensure CPU frequency scaling is disabled
2. Close all other applications
3. Run on a machine with fast SSD (>3GB/s read)
4. Use Python 3.10+ with NumPy 2.0+