Project Motivation
This project was created as part of CSE 530: Computer Architecture at Pennsylvania State University. The goal was to:
- Learn gem5: Understand how to use a production-grade architecture simulator
- Design empirical experiments: Learn to structure a rigorous, reproducible evaluation
- Understand trade-offs: Quantify the impact of CPU complexity, clock frequency, and memory bandwidth on application performance
- Practice measurement: Extract meaningful metrics from architectural simulation and present findings clearly
The Quicksort algorithm was chosen because it:
- Is a well-known, classic algorithm with predictable characteristics
- Has moderate working-set size (80,000 integers ≈ 320 KB)
- Exhibits both CPU-intensive and memory-bound phases
- Allows controlled repeated execution (100 runs) for statistical stability
Background Concepts
What is gem5?
gem5 is an open-source, modular architectural simulator developed and maintained by computer architecture researchers and industry practitioners. It allows detailed cycle-level simulation of modern processors without building physical hardware.
Why Use a Simulator?
| Aspect | Real Hardware | Simulation |
|---|---|---|
| Cost | Expensive (millions $) | Free (open-source) |
| Development Time | Years | Hours to weeks |
| Observability | Limited | Complete (every cycle, cache hit/miss) |
| Flexibility | Fixed | Easily modify CPU, caches, memory |
| Risk | High (tapeout bets billions) | Low (experimental) |
CPU Models Tested
X86TimingSimpleCPU
A simple in-order processor with realistic timing approximations:
- Functional in-order execution
- Single-cycle base latency for most ops
- Approximate memory latency
- Lower simulation overhead
X86MinorCPU
A detailed in-order processor with explicit pipeline stages:
- Explicit pipeline stages (5+ stages)
- Separate fetch, execute, and memory units
- Models structural hazards and pipeline stalls
- Cycle-accurate timing
Memory Hierarchies
Modern systems use multi-level caches to bridge the speed gap between CPU and main memory.
Cache Configuration
- L1 I/D Cache: 32 KB each (instruction and data)
- L2 Cache: 256 KB, shared
- Memory Bandwidth: Varies by DRAM type
DRAM Types Tested
| Memory Type | Frequency | Bandwidth | Use Case |
|---|---|---|---|
| DDR3_1600_8x8 | 1600 MHz | ~25.6 GB/s | Standard server/consumer |
| DDR3_2133_8x8 | 2133 MHz | ~34.1 GB/s | High-performance systems |
| LPDDR2_S4_1066_1x32 | 1066 MHz | ~17.1 GB/s | Mobile/embedded systems |
Experimental Design
Research Questions
This project answers:
- How does clock frequency affect performance? Does doubling the clock speed yield 2× speedup?
- How sensitive is Quicksort to memory bandwidth? Does a faster DRAM significantly reduce execution time?
- Does CPU pipeline complexity matter? How different are Minor vs. TimingSimple for this workload?
- Can we achieve linear scaling? Or do bottlenecks (cache misses, memory latency) prevent ideal speedup?
Experimental Matrix
Total Configurations: 2 CPUs × 6 Frequencies × 3 Memory Types = 36 simulation runs
| Dimension | Options | Count |
|---|---|---|
| CPU Model | X86TimingSimpleCPU, X86MinorCPU | 2 |
| Clock Speed | 1.0, 1.2, 1.4, 1.6, 1.8, 2.0 GHz | 6 |
| Memory Type | DDR3_1600_8x8, DDR3_2133_8x8, LPDDR2_S4_1066_1x32 | 3 |
Each point is a complete simulation run in gem5's SE mode, measuring a fixed workload (100 Quicksort iterations on 80,000 elements).
Implementation Details
Quicksort Workload
The benchmark uses a classic in-place Quicksort implementation:
void Quicksort(int a[], int l, int r) {
/* Partition around pivot a[(l+r)/2] */
int i = l, j = r;
int x = a[(l+r) / 2];
do {
while (a[i] < x) i++; /* Scan from left */
while (x < a[j]) j--; /* Scan from right */
if (i <= j) {
/* Swap */
int w = a[i];
a[i] = a[j];
a[j] = w;
i++; j--;
}
} while (i <= j);
/* Recurse on partitions */
if (l < j) Quicksort(a, l, j);
if (i < r) Quicksort(a, i, r);
}
Test Parameters
- Problem Size: 80,000 integers
- Data Size: 320 KB
- Iterations: 100 runs
- Total Comparisons: ~800 million
Algorithm Characteristics
- Time Complexity: O(n log n) average
- Space Complexity: O(log n) stack depth
- Memory Pattern: Random access
- Cache Behavior: Good temporal locality
Instrumentation with m5ops
The benchmark uses gem5 m5ops for precise measurement:
#ifdef FRMWRK
#include "gem5/m5ops.h"
int main(int argc, char *argv[]) {
m5_reset_stats(0, 0); /* Start counters */
for (int i = 0; i < 100; i++)
Quick(i); /* 100 iterations */
m5_dump_stats(0, 0); /* Dump counters */
}
#endif
Why Instrumentation?
Without m5_reset_stats/m5_dump_stats, statistics include memory initialization overhead, function call setup, and I/O operations.
With instrumentation, we measure only the core algorithm, eliminating noise and improving statistical quality.
Results & Analysis
X86MinorCPU Performance
| Memory Model | 1.0 GHz | 1.2 GHz | 1.4 GHz | 1.6 GHz | 1.8 GHz | 2.0 GHz |
|---|---|---|---|---|---|---|
| DDR3_1600_8x8 | 12.699 s | 10.587 s | 9.081 s | 7.957 s | 7.085 s | 6.374 s |
| DDR3_2133_8x8 | 12.698 s | 10.584 s | 9.080 s | 7.955 s | 7.081 s | 6.373 s |
| LPDDR2_S4_1066_1x32 | 12.714 s | 10.603 s | 9.100 s | 7.971 s | 7.101 s | 6.390 s |
X86TimingSimpleCPU Performance
| Memory Model | 1.0 GHz | 1.2 GHz | 1.4 GHz | 1.6 GHz | 1.8 GHz | 2.0 GHz |
|---|---|---|---|---|---|---|
| DDR3_1600_8x8 | 12.889 s | 10.587 s | 9.081 s | 7.957 s | 7.085 s | 6.374 s |
| DDR3_2133_8x8 | 12.698 s | 10.584 s | 9.080 s | 7.955 s | 7.081 s | 6.373 s |
| LPDDR2_S4_1066_1x32 | 12.714 s | 10.603 s | 9.100 s | 7.971 s | 7.101 s | 6.390 s |
Key Findings
Finding 1: Clock Scaling is Nearly Linear
The most striking observation: doubling the clock speed yields approximately 2× speedup.
Evidence:
- 1.0 GHz → 2.0 GHz: ~12.7s → 6.37s = 1.99× speedup
- 1.0 GHz → 1.2 GHz: 12.7s → 10.59s = 1.20× speedup
- 1.6 GHz → 2.0 GHz: 7.96s → 6.37s = 1.25× speedup
Why?
This linear relationship holds when the instruction count is constant and memory is not the primary bottleneck. The L1 and L2 caches handle most memory requests, with pipeline bubbles due to memory stalls being minimal. The CPU, not memory, is the bottleneck.
Finding 2: Memory Bandwidth Has Minimal Impact
Despite testing three very different memory types, performance differences are tiny:
| Memory Type | Relative Performance |
|---|---|
| DDR3_1600_8x8 | Baseline (100%) |
| DDR3_2133_8x8 (+33% bandwidth) | 99.97% (negligible) |
| LPDDR2_S4_1066_1x32 (-33% bandwidth) | 99.75% (0.25% slower) |
Why?
- Working Set Fits in Cache: 320 KB data array is well-served by 32 KB L1 + 256 KB L2
- Temporal Locality: Quicksort repeatedly accesses the same elements during partitioning
- Cache Hierarchy Dominance: Memory bandwidth only matters when data misses all caches, which is rare here
Finding 3: CPU Model Complexity Doesn't Matter
Minor vs. TimingSimple performance is nearly identical:
Comparison (at 1.0 GHz, DDR3_1600_8x8):
- Minor: 12.698 s
- TimingSimple: 12.889 s
- Difference: 0.15%
Why?
- Both are in-order: Neither superscalar nor out-of-order, so instruction-level parallelism potential is limited
- Memory dominates: Cache hits/misses dominate cycles, not pipeline structure
- Pipeline Stalls Align: Both models stall identically on cache misses
Finding 4: Instruction Count Remains Stable
A subtle finding confirmed by examining cycle counts:
simCycles_at_1.0_GHz ≈ 12.7 × 10⁹ cycles
simCycles_at_2.0_GHz ≈ 12.7 × 10⁹ cycles
This confirms: The instruction count (and cycle count at the ISA level) is invariant—the algorithm does the same work regardless of clock speed. Only the wall-clock time decreases (because cycles are shorter).
Lessons & Insights
For Computer Architects
Cache Hierarchy Matters More Than Raw Bandwidth
The dominance of L1/L2 caches in hiding memory latency means that careful cache design (size, associativity, replacement policy) has greater impact than memory speed for well-locality code.
Clock Scaling Remains Powerful
Nearly perfect 2× speedup from 2× frequency increase demonstrates that higher clocks still drive performance—until power/heat limits are hit.
Pipeline Complexity is a Design Trade-off
Minor's detailed pipeline doesn't yield advantage here, but for superscalar/out-of-order workloads, it's essential. The cost is higher power and verification effort.
For Software Engineers
Algorithmic Choice Dominates
Memory bandwidth choice has minimal impact. Algorithmic efficiency (O(n log n) vs. O(n²)) matters far more than hardware specifics for moderately-sized data.
Cache Friendliness is Practical
Even though Quicksort isn't the cache-optimal sorting algorithm, it runs efficiently on realistic caches. Simple algorithms with temporal locality often win in practice.
Measurement Beats Intuition
Without simulation, predicting speedup from frequency scaling or memory bandwidth intuitively is error-prone. Precise measurement reveals ground truth.
Reproducibility & Extensibility
How to Reproduce
Prerequisites
- gem5 (version ~May 2024 snapshot)
- GCC or Clang (for compilation)
- Python (for data analysis scripts)
- Matplotlib (for visualization)
Build & Run Example
# Step 1: Navigate to a simulation directory
cd quicksort/X86MinorCPU/DDR3_1600_8x8/1.2GHz
# Step 2: Compile quicksort with gem5 m5ops
make all
# Step 3: Run gem5 simulation
/path/to/gem5/build/X86/gem5.opt \
/path/to/gem5/configs/example/se.py \
--cmd=./a.out --options="100 100" \
--cpu-type=X86MinorCPU \
--caches --l2cache \
--l1d_size=32768 --l1i_size=32768 \
--l2_size=262144 \
--cpu-clock="1.2GHz" \
--mem-type=DDR3_1600_8x8
# Step 4: Extract results
grep "^simSeconds" m5out/stats.txt
grep "^system.cpu.numCycles" m5out/stats.txt
How to Extend
The experiment framework is easily extensible:
Add a New CPU Model
Test superscalar or out-of-order CPUs (e.g., X86O3CPU) to see if pipeline complexity matters for more complex workloads.
Add a New Memory Type
Test newer memory technologies (e.g., GDDR6, HBM) or configure custom memory parameters in gem5.
Change the Workload
Replace Quicksort with other algorithms (heapsort, matrix multiply) to observe different memory access patterns.
Conclusion
This project demonstrates a principled approach to computer architecture evaluation using simulation. By carefully designing an experiment, implementing a representative workload, and interpreting results through the lens of architectural understanding, we gained actionable insights:
- Clock frequency remains a powerful performance lever for well-designed systems
- Memory bandwidth matters less than cache design for workloads with good locality
- CPU pipeline complexity isn't always necessary—simpler designs suffice when memory isn't the bottleneck
These findings, while demonstrated on Quicksort in gem5, transfer to broader systems and inform real architectural decisions made by researchers and industry practitioners every day.