Analyzing CPU and Memory Trade-offs Using gem5

Project Motivation

This project was created as part of CSE 530: Computer Architecture at Pennsylvania State University. The goal was to:

Learn gem5: Understand how to use a production-grade architecture simulator
Design empirical experiments: Learn to structure a rigorous, reproducible evaluation
Understand trade-offs: Quantify the impact of CPU complexity, clock frequency, and memory bandwidth on application performance
Practice measurement: Extract meaningful metrics from architectural simulation and present findings clearly

The Quicksort algorithm was chosen because it:

Is a well-known, classic algorithm with predictable characteristics
Has moderate working-set size (80,000 integers ≈ 320 KB)
Exhibits both CPU-intensive and memory-bound phases
Allows controlled repeated execution (100 runs) for statistical stability

Background Concepts

What is gem5?

gem5 is an open-source, modular architectural simulator developed and maintained by computer architecture researchers and industry practitioners. It allows detailed cycle-level simulation of modern processors without building physical hardware.

Why Use a Simulator?

Aspect	Real Hardware	Simulation
Cost	Expensive (millions $)	Free (open-source)
Development Time	Years	Hours to weeks
Observability	Limited	Complete (every cycle, cache hit/miss)
Flexibility	Fixed	Easily modify CPU, caches, memory
Risk	High (tapeout bets billions)	Low (experimental)

CPU Models Tested

X86TimingSimpleCPU

A simple in-order processor with realistic timing approximations:

Functional in-order execution
Single-cycle base latency for most ops
Approximate memory latency
Lower simulation overhead

X86MinorCPU

A detailed in-order processor with explicit pipeline stages:

Explicit pipeline stages (5+ stages)
Separate fetch, execute, and memory units
Models structural hazards and pipeline stalls
Cycle-accurate timing

Memory Hierarchies

Modern systems use multi-level caches to bridge the speed gap between CPU and main memory.

Cache Configuration

L1 I/D Cache: 32 KB each (instruction and data)
L2 Cache: 256 KB, shared
Memory Bandwidth: Varies by DRAM type

DRAM Types Tested

Memory Type	Frequency	Bandwidth	Use Case
DDR3_1600_8x8	1600 MHz	~25.6 GB/s	Standard server/consumer
DDR3_2133_8x8	2133 MHz	~34.1 GB/s	High-performance systems
LPDDR2_S4_1066_1x32	1066 MHz	~17.1 GB/s	Mobile/embedded systems

Experimental Design

Research Questions

This project answers:

How does clock frequency affect performance? Does doubling the clock speed yield 2× speedup?
How sensitive is Quicksort to memory bandwidth? Does a faster DRAM significantly reduce execution time?
Does CPU pipeline complexity matter? How different are Minor vs. TimingSimple for this workload?
Can we achieve linear scaling? Or do bottlenecks (cache misses, memory latency) prevent ideal speedup?

Experimental Matrix

Total Configurations: 2 CPUs × 6 Frequencies × 3 Memory Types = 36 simulation runs

Dimension	Options	Count
CPU Model	X86TimingSimpleCPU, X86MinorCPU	2
Clock Speed	1.0, 1.2, 1.4, 1.6, 1.8, 2.0 GHz	6
Memory Type	DDR3_1600_8x8, DDR3_2133_8x8, LPDDR2_S4_1066_1x32	3

Each point is a complete simulation run in gem5's SE mode, measuring a fixed workload (100 Quicksort iterations on 80,000 elements).

Implementation Details

Quicksort Workload

The benchmark uses a classic in-place Quicksort implementation:

void Quicksort(int a[], int l, int r) {
    /* Partition around pivot a[(l+r)/2] */
    int i = l, j = r;
    int x = a[(l+r) / 2];
    
    do {
        while (a[i] < x) i++;       /* Scan from left */
        while (x < a[j]) j--;       /* Scan from right */
        if (i <= j) {
            /* Swap */
            int w = a[i];
            a[i] = a[j];
            a[j] = w;
            i++; j--;
        }
    } while (i <= j);
    
    /* Recurse on partitions */
    if (l < j) Quicksort(a, l, j);
    if (i < r) Quicksort(a, i, r);
}

Test Parameters

Problem Size: 80,000 integers
Data Size: 320 KB
Iterations: 100 runs
Total Comparisons: ~800 million

Algorithm Characteristics

Time Complexity: O(n log n) average
Space Complexity: O(log n) stack depth
Memory Pattern: Random access
Cache Behavior: Good temporal locality

Instrumentation with m5ops

The benchmark uses gem5 m5ops for precise measurement:

#ifdef FRMWRK
#include "gem5/m5ops.h"

int main(int argc, char *argv[]) {
    m5_reset_stats(0, 0);           /* Start counters */
    
    for (int i = 0; i < 100; i++) 
        Quick(i);                   /* 100 iterations */
    
    m5_dump_stats(0, 0);            /* Dump counters */
}
#endif

Why Instrumentation?

Without m5_reset_stats/m5_dump_stats, statistics include memory initialization overhead, function call setup, and I/O operations.

With instrumentation, we measure only the core algorithm, eliminating noise and improving statistical quality.

Results & Analysis

X86MinorCPU Performance

Memory Model	1.0 GHz	1.2 GHz	1.4 GHz	1.6 GHz	1.8 GHz	2.0 GHz
DDR3_1600_8x8	12.699 s	10.587 s	9.081 s	7.957 s	7.085 s	6.374 s
DDR3_2133_8x8	12.698 s	10.584 s	9.080 s	7.955 s	7.081 s	6.373 s
LPDDR2_S4_1066_1x32	12.714 s	10.603 s	9.100 s	7.971 s	7.101 s	6.390 s

X86TimingSimpleCPU Performance

Memory Model	1.0 GHz	1.2 GHz	1.4 GHz	1.6 GHz	1.8 GHz	2.0 GHz
DDR3_1600_8x8	12.889 s	10.587 s	9.081 s	7.957 s	7.085 s	6.374 s
DDR3_2133_8x8	12.698 s	10.584 s	9.080 s	7.955 s	7.081 s	6.373 s
LPDDR2_S4_1066_1x32	12.714 s	10.603 s	9.100 s	7.971 s	7.101 s	6.390 s

Key Findings

Finding 1: Clock Scaling is Nearly Linear

The most striking observation: doubling the clock speed yields approximately 2× speedup.

Evidence:

1.0 GHz → 2.0 GHz: ~12.7s → 6.37s = 1.99× speedup
1.0 GHz → 1.2 GHz: 12.7s → 10.59s = 1.20× speedup
1.6 GHz → 2.0 GHz: 7.96s → 6.37s = 1.25× speedup

Why?

This linear relationship holds when the instruction count is constant and memory is not the primary bottleneck. The L1 and L2 caches handle most memory requests, with pipeline bubbles due to memory stalls being minimal. The CPU, not memory, is the bottleneck.

Finding 2: Memory Bandwidth Has Minimal Impact

Despite testing three very different memory types, performance differences are tiny:

Memory Type	Relative Performance
DDR3_1600_8x8	Baseline (100%)
DDR3_2133_8x8 (+33% bandwidth)	99.97% (negligible)
LPDDR2_S4_1066_1x32 (-33% bandwidth)	99.75% (0.25% slower)

Why?

Working Set Fits in Cache: 320 KB data array is well-served by 32 KB L1 + 256 KB L2
Temporal Locality: Quicksort repeatedly accesses the same elements during partitioning
Cache Hierarchy Dominance: Memory bandwidth only matters when data misses all caches, which is rare here

Finding 3: CPU Model Complexity Doesn't Matter

Minor vs. TimingSimple performance is nearly identical:

Comparison (at 1.0 GHz, DDR3_1600_8x8):

Minor: 12.698 s
TimingSimple: 12.889 s
Difference: 0.15%

Why?

Both are in-order: Neither superscalar nor out-of-order, so instruction-level parallelism potential is limited
Memory dominates: Cache hits/misses dominate cycles, not pipeline structure
Pipeline Stalls Align: Both models stall identically on cache misses

Finding 4: Instruction Count Remains Stable

A subtle finding confirmed by examining cycle counts:

simCycles_at_1.0_GHz  ≈ 12.7 × 10⁹ cycles
simCycles_at_2.0_GHz  ≈ 12.7 × 10⁹ cycles

This confirms: The instruction count (and cycle count at the ISA level) is invariant—the algorithm does the same work regardless of clock speed. Only the wall-clock time decreases (because cycles are shorter).

Lessons & Insights

For Computer Architects

Cache Hierarchy Matters More Than Raw Bandwidth

The dominance of L1/L2 caches in hiding memory latency means that careful cache design (size, associativity, replacement policy) has greater impact than memory speed for well-locality code.

Clock Scaling Remains Powerful

Nearly perfect 2× speedup from 2× frequency increase demonstrates that higher clocks still drive performance—until power/heat limits are hit.

Pipeline Complexity is a Design Trade-off

Minor's detailed pipeline doesn't yield advantage here, but for superscalar/out-of-order workloads, it's essential. The cost is higher power and verification effort.

For Software Engineers

Algorithmic Choice Dominates

Memory bandwidth choice has minimal impact. Algorithmic efficiency (O(n log n) vs. O(n²)) matters far more than hardware specifics for moderately-sized data.

Cache Friendliness is Practical

Even though Quicksort isn't the cache-optimal sorting algorithm, it runs efficiently on realistic caches. Simple algorithms with temporal locality often win in practice.

Measurement Beats Intuition

Without simulation, predicting speedup from frequency scaling or memory bandwidth intuitively is error-prone. Precise measurement reveals ground truth.

Reproducibility & Extensibility

How to Reproduce

Prerequisites

gem5 (version ~May 2024 snapshot)
GCC or Clang (for compilation)
Python (for data analysis scripts)
Matplotlib (for visualization)

Build & Run Example

# Step 1: Navigate to a simulation directory
cd quicksort/X86MinorCPU/DDR3_1600_8x8/1.2GHz

# Step 2: Compile quicksort with gem5 m5ops
make all

# Step 3: Run gem5 simulation
/path/to/gem5/build/X86/gem5.opt \
  /path/to/gem5/configs/example/se.py \
  --cmd=./a.out --options="100 100" \
  --cpu-type=X86MinorCPU \
  --caches --l2cache \
  --l1d_size=32768 --l1i_size=32768 \
  --l2_size=262144 \
  --cpu-clock="1.2GHz" \
  --mem-type=DDR3_1600_8x8

# Step 4: Extract results
grep "^simSeconds" m5out/stats.txt
grep "^system.cpu.numCycles" m5out/stats.txt

How to Extend

The experiment framework is easily extensible:

Add a New CPU Model

Test superscalar or out-of-order CPUs (e.g., X86O3CPU) to see if pipeline complexity matters for more complex workloads.

Add a New Memory Type

Test newer memory technologies (e.g., GDDR6, HBM) or configure custom memory parameters in gem5.

Change the Workload

Replace Quicksort with other algorithms (heapsort, matrix multiply) to observe different memory access patterns.

Conclusion

This project demonstrates a principled approach to computer architecture evaluation using simulation. By carefully designing an experiment, implementing a representative workload, and interpreting results through the lens of architectural understanding, we gained actionable insights:

Clock frequency remains a powerful performance lever for well-designed systems
Memory bandwidth matters less than cache design for workloads with good locality
CPU pipeline complexity isn't always necessary—simpler designs suffice when memory isn't the bottleneck

These findings, while demonstrated on Quicksort in gem5, transfer to broader systems and inform real architectural decisions made by researchers and industry practitioners every day.

Analyzing CPU and Memory Trade-offs Using gem5

A Quicksort Performance Case Study

Project Overview

36

~2×

<0.3%

800M

Key Achievements

Technologies Used

Table of Contents

Project Motivation

Background Concepts

What is gem5?

Why Use a Simulator?

CPU Models Tested

X86TimingSimpleCPU

X86MinorCPU

Memory Hierarchies

Cache Configuration

DRAM Types Tested

Experimental Design

Research Questions

Experimental Matrix

Implementation Details

Quicksort Workload

Test Parameters

Algorithm Characteristics

Instrumentation with m5ops

Why Instrumentation?

Results & Analysis

X86MinorCPU Performance

X86TimingSimpleCPU Performance

Key Findings

Finding 1: Clock Scaling is Nearly Linear

Evidence:

Why?

Finding 2: Memory Bandwidth Has Minimal Impact

Why?

Finding 3: CPU Model Complexity Doesn't Matter

Comparison (at 1.0 GHz, DDR3_1600_8x8):

Why?

Finding 4: Instruction Count Remains Stable

Lessons & Insights

For Computer Architects

Cache Hierarchy Matters More Than Raw Bandwidth

Clock Scaling Remains Powerful

Pipeline Complexity is a Design Trade-off

For Software Engineers

Algorithmic Choice Dominates

Cache Friendliness is Practical

Measurement Beats Intuition

Reproducibility & Extensibility

How to Reproduce

Prerequisites

Build & Run Example

How to Extend

Add a New CPU Model

Add a New Memory Type

Change the Workload

Conclusion