Computer Architecture

Analyzing CPU and Memory Trade-offs Using gem5

A Quicksort Performance Case Study

Author: Vedant Misra

Institution: Pennsylvania State University

Course: CSE 530 - Computer Architecture

Publication Date: November 2024

Project Overview

This project demonstrates how to quantitatively analyze architectural trade-offs using gem5, a powerful, modular architectural simulator. By running the classic Quicksort algorithm across different CPU models, clock frequencies, and memory configurations, I created a reproducible experiment that reveals how these architectural choices impact real-world performance.

36

Simulations

Complete configurations tested

~2×

Speedup

2× frequency → 2× performance

<0.3%

Bandwidth Impact

Memory variation negligible

800M

Comparisons

Total operations executed

Key Achievements

  • Factorial experiment design (2 CPUs × 6 frequencies × 3 memory types)
  • Nearly linear clock scaling validated empirically
  • Memory bandwidth impact quantified (<0.3% variation)
  • CPU pipeline complexity comparison (Minor vs TimingSimple)
  • gem5 m5ops instrumentation for precise measurement

Technologies Used

gem5 Simulator C (ISO 99) X86 Architecture Python Computer Architecture Performance Analysis POSIX Systems Matplotlib

Project Motivation

This project was created as part of CSE 530: Computer Architecture at Pennsylvania State University. The goal was to:

  1. Learn gem5: Understand how to use a production-grade architecture simulator
  2. Design empirical experiments: Learn to structure a rigorous, reproducible evaluation
  3. Understand trade-offs: Quantify the impact of CPU complexity, clock frequency, and memory bandwidth on application performance
  4. Practice measurement: Extract meaningful metrics from architectural simulation and present findings clearly

The Quicksort algorithm was chosen because it:

  • Is a well-known, classic algorithm with predictable characteristics
  • Has moderate working-set size (80,000 integers ≈ 320 KB)
  • Exhibits both CPU-intensive and memory-bound phases
  • Allows controlled repeated execution (100 runs) for statistical stability

Background Concepts

What is gem5?

gem5 is an open-source, modular architectural simulator developed and maintained by computer architecture researchers and industry practitioners. It allows detailed cycle-level simulation of modern processors without building physical hardware.

Why Use a Simulator?

Aspect Real Hardware Simulation
Cost Expensive (millions $) Free (open-source)
Development Time Years Hours to weeks
Observability Limited Complete (every cycle, cache hit/miss)
Flexibility Fixed Easily modify CPU, caches, memory
Risk High (tapeout bets billions) Low (experimental)

CPU Models Tested

X86TimingSimpleCPU

A simple in-order processor with realistic timing approximations:

  • Functional in-order execution
  • Single-cycle base latency for most ops
  • Approximate memory latency
  • Lower simulation overhead

X86MinorCPU

A detailed in-order processor with explicit pipeline stages:

  • Explicit pipeline stages (5+ stages)
  • Separate fetch, execute, and memory units
  • Models structural hazards and pipeline stalls
  • Cycle-accurate timing

Memory Hierarchies

Modern systems use multi-level caches to bridge the speed gap between CPU and main memory.

Cache Configuration

  • L1 I/D Cache: 32 KB each (instruction and data)
  • L2 Cache: 256 KB, shared
  • Memory Bandwidth: Varies by DRAM type

DRAM Types Tested

Memory Type Frequency Bandwidth Use Case
DDR3_1600_8x8 1600 MHz ~25.6 GB/s Standard server/consumer
DDR3_2133_8x8 2133 MHz ~34.1 GB/s High-performance systems
LPDDR2_S4_1066_1x32 1066 MHz ~17.1 GB/s Mobile/embedded systems

Experimental Design

Research Questions

This project answers:

  1. How does clock frequency affect performance? Does doubling the clock speed yield 2× speedup?
  2. How sensitive is Quicksort to memory bandwidth? Does a faster DRAM significantly reduce execution time?
  3. Does CPU pipeline complexity matter? How different are Minor vs. TimingSimple for this workload?
  4. Can we achieve linear scaling? Or do bottlenecks (cache misses, memory latency) prevent ideal speedup?

Experimental Matrix

Total Configurations: 2 CPUs × 6 Frequencies × 3 Memory Types = 36 simulation runs

Dimension Options Count
CPU Model X86TimingSimpleCPU, X86MinorCPU 2
Clock Speed 1.0, 1.2, 1.4, 1.6, 1.8, 2.0 GHz 6
Memory Type DDR3_1600_8x8, DDR3_2133_8x8, LPDDR2_S4_1066_1x32 3

Each point is a complete simulation run in gem5's SE mode, measuring a fixed workload (100 Quicksort iterations on 80,000 elements).

Implementation Details

Quicksort Workload

The benchmark uses a classic in-place Quicksort implementation:

void Quicksort(int a[], int l, int r) {
    /* Partition around pivot a[(l+r)/2] */
    int i = l, j = r;
    int x = a[(l+r) / 2];
    
    do {
        while (a[i] < x) i++;       /* Scan from left */
        while (x < a[j]) j--;       /* Scan from right */
        if (i <= j) {
            /* Swap */
            int w = a[i];
            a[i] = a[j];
            a[j] = w;
            i++; j--;
        }
    } while (i <= j);
    
    /* Recurse on partitions */
    if (l < j) Quicksort(a, l, j);
    if (i < r) Quicksort(a, i, r);
}

Test Parameters

  • Problem Size: 80,000 integers
  • Data Size: 320 KB
  • Iterations: 100 runs
  • Total Comparisons: ~800 million

Algorithm Characteristics

  • Time Complexity: O(n log n) average
  • Space Complexity: O(log n) stack depth
  • Memory Pattern: Random access
  • Cache Behavior: Good temporal locality

Instrumentation with m5ops

The benchmark uses gem5 m5ops for precise measurement:

#ifdef FRMWRK
#include "gem5/m5ops.h"

int main(int argc, char *argv[]) {
    m5_reset_stats(0, 0);           /* Start counters */
    
    for (int i = 0; i < 100; i++) 
        Quick(i);                   /* 100 iterations */
    
    m5_dump_stats(0, 0);            /* Dump counters */
}
#endif

Why Instrumentation?

Without m5_reset_stats/m5_dump_stats, statistics include memory initialization overhead, function call setup, and I/O operations.

With instrumentation, we measure only the core algorithm, eliminating noise and improving statistical quality.

Results & Analysis

X86MinorCPU Performance

Memory Model 1.0 GHz 1.2 GHz 1.4 GHz 1.6 GHz 1.8 GHz 2.0 GHz
DDR3_1600_8x8 12.699 s 10.587 s 9.081 s 7.957 s 7.085 s 6.374 s
DDR3_2133_8x8 12.698 s 10.584 s 9.080 s 7.955 s 7.081 s 6.373 s
LPDDR2_S4_1066_1x32 12.714 s 10.603 s 9.100 s 7.971 s 7.101 s 6.390 s

X86TimingSimpleCPU Performance

Memory Model 1.0 GHz 1.2 GHz 1.4 GHz 1.6 GHz 1.8 GHz 2.0 GHz
DDR3_1600_8x8 12.889 s 10.587 s 9.081 s 7.957 s 7.085 s 6.374 s
DDR3_2133_8x8 12.698 s 10.584 s 9.080 s 7.955 s 7.081 s 6.373 s
LPDDR2_S4_1066_1x32 12.714 s 10.603 s 9.100 s 7.971 s 7.101 s 6.390 s

Key Findings

Finding 1: Clock Scaling is Nearly Linear

The most striking observation: doubling the clock speed yields approximately 2× speedup.

Evidence:

  • 1.0 GHz → 2.0 GHz: ~12.7s → 6.37s = 1.99× speedup
  • 1.0 GHz → 1.2 GHz: 12.7s → 10.59s = 1.20× speedup
  • 1.6 GHz → 2.0 GHz: 7.96s → 6.37s = 1.25× speedup

Why?

This linear relationship holds when the instruction count is constant and memory is not the primary bottleneck. The L1 and L2 caches handle most memory requests, with pipeline bubbles due to memory stalls being minimal. The CPU, not memory, is the bottleneck.

Finding 2: Memory Bandwidth Has Minimal Impact

Despite testing three very different memory types, performance differences are tiny:

Memory Type Relative Performance
DDR3_1600_8x8 Baseline (100%)
DDR3_2133_8x8 (+33% bandwidth) 99.97% (negligible)
LPDDR2_S4_1066_1x32 (-33% bandwidth) 99.75% (0.25% slower)

Why?

  • Working Set Fits in Cache: 320 KB data array is well-served by 32 KB L1 + 256 KB L2
  • Temporal Locality: Quicksort repeatedly accesses the same elements during partitioning
  • Cache Hierarchy Dominance: Memory bandwidth only matters when data misses all caches, which is rare here

Finding 3: CPU Model Complexity Doesn't Matter

Minor vs. TimingSimple performance is nearly identical:

Comparison (at 1.0 GHz, DDR3_1600_8x8):

  • Minor: 12.698 s
  • TimingSimple: 12.889 s
  • Difference: 0.15%

Why?

  1. Both are in-order: Neither superscalar nor out-of-order, so instruction-level parallelism potential is limited
  2. Memory dominates: Cache hits/misses dominate cycles, not pipeline structure
  3. Pipeline Stalls Align: Both models stall identically on cache misses

Finding 4: Instruction Count Remains Stable

A subtle finding confirmed by examining cycle counts:

simCycles_at_1.0_GHz  ≈ 12.7 × 10⁹ cycles
simCycles_at_2.0_GHz  ≈ 12.7 × 10⁹ cycles

This confirms: The instruction count (and cycle count at the ISA level) is invariant—the algorithm does the same work regardless of clock speed. Only the wall-clock time decreases (because cycles are shorter).

Lessons & Insights

For Computer Architects

Cache Hierarchy Matters More Than Raw Bandwidth

The dominance of L1/L2 caches in hiding memory latency means that careful cache design (size, associativity, replacement policy) has greater impact than memory speed for well-locality code.

Clock Scaling Remains Powerful

Nearly perfect 2× speedup from 2× frequency increase demonstrates that higher clocks still drive performance—until power/heat limits are hit.

Pipeline Complexity is a Design Trade-off

Minor's detailed pipeline doesn't yield advantage here, but for superscalar/out-of-order workloads, it's essential. The cost is higher power and verification effort.

For Software Engineers

Algorithmic Choice Dominates

Memory bandwidth choice has minimal impact. Algorithmic efficiency (O(n log n) vs. O(n²)) matters far more than hardware specifics for moderately-sized data.

Cache Friendliness is Practical

Even though Quicksort isn't the cache-optimal sorting algorithm, it runs efficiently on realistic caches. Simple algorithms with temporal locality often win in practice.

Measurement Beats Intuition

Without simulation, predicting speedup from frequency scaling or memory bandwidth intuitively is error-prone. Precise measurement reveals ground truth.

Reproducibility & Extensibility

How to Reproduce

Prerequisites

  • gem5 (version ~May 2024 snapshot)
  • GCC or Clang (for compilation)
  • Python (for data analysis scripts)
  • Matplotlib (for visualization)

Build & Run Example

# Step 1: Navigate to a simulation directory
cd quicksort/X86MinorCPU/DDR3_1600_8x8/1.2GHz

# Step 2: Compile quicksort with gem5 m5ops
make all

# Step 3: Run gem5 simulation
/path/to/gem5/build/X86/gem5.opt \
  /path/to/gem5/configs/example/se.py \
  --cmd=./a.out --options="100 100" \
  --cpu-type=X86MinorCPU \
  --caches --l2cache \
  --l1d_size=32768 --l1i_size=32768 \
  --l2_size=262144 \
  --cpu-clock="1.2GHz" \
  --mem-type=DDR3_1600_8x8

# Step 4: Extract results
grep "^simSeconds" m5out/stats.txt
grep "^system.cpu.numCycles" m5out/stats.txt

How to Extend

The experiment framework is easily extensible:

Add a New CPU Model

Test superscalar or out-of-order CPUs (e.g., X86O3CPU) to see if pipeline complexity matters for more complex workloads.

Add a New Memory Type

Test newer memory technologies (e.g., GDDR6, HBM) or configure custom memory parameters in gem5.

Change the Workload

Replace Quicksort with other algorithms (heapsort, matrix multiply) to observe different memory access patterns.

Conclusion

This project demonstrates a principled approach to computer architecture evaluation using simulation. By carefully designing an experiment, implementing a representative workload, and interpreting results through the lens of architectural understanding, we gained actionable insights:

  • Clock frequency remains a powerful performance lever for well-designed systems
  • Memory bandwidth matters less than cache design for workloads with good locality
  • CPU pipeline complexity isn't always necessary—simpler designs suffice when memory isn't the bottleneck

These findings, while demonstrated on Quicksort in gem5, transfer to broader systems and inform real architectural decisions made by researchers and industry practitioners every day.