Systems Programming

Dynamic Memory Allocator

A Custom malloc Implementation

Author: Vedant Misra

Institution: Pennsylvania State University

Course: CMPSC 473 - Operating Systems

Project Duration: September - October 2024

Executive Summary

Custom implementation of C's malloc, free, and realloc functions from scratch, achieving performance comparable to production allocators while demonstrating mastery of low-level systems programming.

100%

Correctness

20/20 test traces passed

88%

Memory Utilization

Efficient space usage

~12K

Throughput (ops/s)

High performance

Key Features

  • Segregated free lists with 11 size classes
  • Immediate coalescing for fragmentation reduction
  • 16-byte memory alignment
  • Comprehensive heap consistency checker
  • O(1) insertion and removal from free lists

Technical Skills Demonstrated

C Programming Systems Programming Data Structures Algorithm Design Memory Management Performance Optimization Debugging (GDB) Git

Background & Motivation

What is Dynamic Memory Allocation?

In C programming, memory management is a critical task that developers must handle explicitly. Unlike languages with automatic garbage collection (Java, Python), C requires programmers to manually allocate and deallocate memory. This control provides performance benefits but introduces complexity.

Dynamic memory allocation refers to the process of allocating memory at runtime, as opposed to compile-time. The C standard library provides several functions for this:

  • malloc(size_t size): Allocates a block of memory of specified size
  • free(void* ptr): Deallocates a previously allocated block
  • realloc(void* ptr, size_t size): Resizes an allocated block
  • calloc(size_t nmemb, size_t size): Allocates and zero-initializes memory

Why Build a Custom Allocator?

Building a memory allocator from scratch provides deep insights into:

  1. Operating System Concepts: Understanding how programs interact with memory
  2. Performance Optimization: Balancing speed vs. memory utilization
  3. Data Structures: Implementing efficient free list management
  4. Low-Level Programming: Pointer arithmetic, bit manipulation, and memory alignment
  5. Debugging Skills: Detecting memory corruption and fragmentation issues

The Challenge

The primary challenge in memory allocation is the trade-off between throughput and memory utilization:

  • Throughput: Number of allocation/deallocation operations per second
  • Utilization: Ratio of payload memory to total heap memory (minimizing fragmentation)

This project aims to achieve an optimal balance between these competing goals.

Theoretical Foundation

Memory Layout & Heap Structure

In a typical C program, memory is organized into several segments:

High Memory
┌─────────────────┐
│     Stack       │ ← Grows downward
├─────────────────┤
│       ↓         │
│                 │
│       ↑         │
├─────────────────┤
│     Heap        │ ← Grows upward (our focus)
├─────────────────┤
│      BSS        │ ← Uninitialized data
├─────────────────┤
│      Data       │ ← Initialized data
├─────────────────┤
│      Text       │ ← Program code
└─────────────────┘
Low Memory

The heap is the memory region where dynamic allocation occurs. It grows upward (toward higher memory addresses) as programs request more memory.

Block Structure

Each allocated or free block in our heap has a specific structure:

┌─────────────────────────────┐
│     Header (8 bytes)        │ ← Contains size and allocation bit
├─────────────────────────────┤
│                             │
│     Payload                 │ ← User data
│     (variable size)         │
│                             │
├─────────────────────────────┤
│     Footer (8 bytes)        │ ← Contains size and allocation bit
└─────────────────────────────┘

For Free Blocks, the payload area stores:
┌─────────────────────────────┐
│     Header (8 bytes)        │
├─────────────────────────────┤
│  Predecessor Pointer (8B)   │ ← Points to previous free block
├─────────────────────────────┤
│  Successor Pointer (8B)     │ ← Points to next free block
├─────────────────────────────┤
│                             │
│   (unused space)            │
│                             │
├─────────────────────────────┤
│     Footer (8 bytes)        │
└─────────────────────────────┘

Segregated Free Lists

To combat fragmentation and improve allocation speed, we use segregated free lists. This technique maintains multiple free lists, each for a specific size class:

Size Class 0:  [16-32 bytes]   → [Block] → [Block] → NULL
Size Class 1:  [33-64 bytes]   → [Block] → NULL
Size Class 2:  [65-128 bytes]  → [Block] → [Block] → [Block] → NULL
Size Class 3:  [129-256 bytes] → NULL
...
Size Class 10: [>16384 bytes]  → [Block] → NULL

Benefits:

  • Fast Lookup: Search only relevant size classes
  • Reduced Fragmentation: Similar-sized blocks grouped together
  • Better Cache Performance: Improved locality of reference

Coalescing

Coalescing is the process of merging adjacent free blocks to form larger blocks, reducing external fragmentation.

Case 1: Both neighbors allocated
[Alloc] [FREE] [Alloc]
→ No coalescing

Case 2: Next block free
[Alloc] [FREE] [Free]
→ Merge into [FREE------]

Case 3: Previous block free
[Free] [FREE] [Alloc]
→ Merge into [FREE------]

Case 4: Both neighbors free
[Free] [FREE] [Free]
→ Merge into [FREE-----------]

System Architecture

High-Level Design

Our memory allocator implements the following architecture:

┌─────────────────────────────────────────────────────────┐
│                   Application Code                      │
└────────────────┬────────────────────────────────────────┘
                 │ Calls malloc/free/realloc
                 ↓
┌─────────────────────────────────────────────────────────┐
│                Memory Allocator (mm.c)                  │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │   malloc    │  │     free     │  │   realloc    │    │
│  └─────────────┘  └──────────────┘  └──────────────┘    │
│                                                         │
│  ┌─────────────────────────────────────────────────┐    │
│  │       Segregated Free Lists (11 classes)        │    │
│  │  [0][1][2][3][4][5][6][7][8][9][10]             │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │  find_fit   │  │    place     │  │  coalesce    │    │
│  └─────────────┘  └──────────────┘  └──────────────┘    │
└────────────────┬────────────────────────────────────────┘
                 │ Calls mem_sbrk
                 ↓
┌─────────────────────────────────────────────────────────┐
│              Memory Library (memlib.c)                  │
│         Simulates OS memory management                  │
└─────────────────────────────────────────────────────────┘

Size Classes

Our implementation uses 11 segregated lists with exponentially increasing size ranges:

Index Size Range (bytes) Description
016 - 32Tiny allocations
133 - 64Small allocations
265 - 128Small-medium
3129 - 256Medium
4257 - 512Medium-large
5513 - 1024Large (1KB)
61025 - 2048Large (2KB)
72049 - 4096Large (4KB)
84097 - 8192Very large (8KB)
98193 - 16384Very large (16KB)
10> 16384Huge allocations

Implementation Details

Core Data Structures

Global Variables:

static size_t wordsize = 8;           // 64-bit word
static size_t double_wordsize = 16;   // Minimum block size
static size_t chunk_size = (1 << 12); // 4KB default extension
static char *heaplist_pointer;        // Points to first block
static char *seg_list[11];            // Segregated free list array

Memory Allocation: malloc(size_t size)

The allocation process follows these steps:

void *malloc(size_t size) {
    size_t asize;       // Adjusted size
    size_t extendsize;  // Extension size if needed
    char *bp;           // Block pointer
    
    // Step 1: Ignore spurious requests
    if (size == 0)
        return NULL;
    
    // Step 2: Adjust size for alignment (minimum 16 bytes)
    if (size > double_wordsize)
        asize = align(size);
    else
        asize = double_wordsize;
    
    // Step 3: Search for a fit in free lists
    if ((bp = find_fit(asize)) != NULL) {
        place(bp, asize);
        return bp;
    }
    
    // Step 4: No fit found, extend heap
    extendsize = (asize > chunk_size) ? asize : chunk_size;
    if ((bp = extend_heap(extendsize / wordsize)) == NULL)
        return NULL;
    
    // Step 5: Place block in extended heap
    bp = find_fit(asize);
    place(bp, asize);
    
    return bp;
}

Finding a Free Block: find_fit()

Our implementation uses a first-fit strategy within segregated lists:

static void *find_fit(size_t asize) {
    int index = get_index(asize);  // Determine appropriate size class
    
    // Search from appropriate size class upward
    for (int i = index; i < 11; i++) {
        char *current_block = (char *)seg_list[i];
        
        // Skip empty lists
        if (current_block == (char *)mem_heap_lo())
            continue;
        
        // Traverse list to find first fit
        while (current_block != (char *)mem_heap_lo()) {
            size_t block_size = extract_block_size(hdrp(current_block));
            
            if (asize <= block_size) {
                return current_block;  // First fit found!
            }
            
            current_block = get_address(successor_addr(current_block));
        }
    }
    
    return NULL;  // No fit found
}

Why First-Fit?

We experimented with both first-fit and best-fit strategies:

  • First-Fit: Returns the first block that fits
    • ✅ Faster allocation (O(n) worst case, but often O(1) in practice)
    • ✅ Better throughput
    • ⚠️ May increase fragmentation slightly
  • Best-Fit: Returns the smallest block that fits
    • ✅ Better utilization (less fragmentation)
    • ❌ Slower (must search entire list)
    • ❌ Lower throughput

Our testing showed that first-fit provided better overall performance with acceptable utilization.

Freeing Memory: free(void *ptr)

Freeing is straightforward: mark the block as free and coalesce:

void free(void *block_pointer) {
    if (block_pointer == NULL)
        return;
    
    size_t block_size = extract_block_size(hdrp(block_pointer));
    
    // Mark block as free
    put_value(hdrp(block_pointer), block_header_config(block_size, 0));
    put_value(ftrp(block_pointer), block_header_config(block_size, 0));
    
    // Coalesce with adjacent free blocks
    coalesce(block_pointer);
}

Performance Analysis

Time Complexity

Operation Complexity Explanation
malloc O(n) worst case, O(1) average First-fit in segregated lists
free O(1) Constant-time coalescing and list insertion
realloc O(n) May require malloc + memcpy
List insertion O(1) Insert at head
List removal O(1) Direct pointer manipulation

Design Trade-offs

First-Fit vs. Best-Fit:

Metric First-Fit Best-Fit Winner
Speed ⚡⚡⚡⚡ Fast ⚡⚡ Slower First-Fit ✓
Utilization 📊📊📊 Good 📊📊📊📊 Better Best-Fit ✓
Implementation ✅ Simple ⚠️ Complex First-Fit ✓
Our Choice ✅ Selected First-Fit ✓

Justification: Our testing showed first-fit achieved 90%+ of best-fit's utilization while providing significantly better throughput.

Heap Consistency Checker

The heap checker is a critical debugging tool that validates heap invariants.

Implemented Checks

  1. CHECK 1: Every block in free list is marked as free
  2. CHECK 2: No contiguous free blocks (coalescing check)
  3. CHECK 3: Every free block is in the free list
  4. CHECK 4: Free list pointers are valid
  5. CHECK 5: No block overlaps
  6. CHECK 6: All blocks properly aligned

Debugging Strategy

When developing, we used a systematic approach:

  1. Enable DEBUG mode by uncommenting #define DEBUG
  2. Call mm_checkheap(__LINE__) after every operation
  3. Print detailed error messages with line numbers
  4. Use gdb to inspect memory at failure points
  5. Test with small traces first (syn-*-short.rep)

Challenges & Solutions

Challenge 1: Pointer Corruption

Problem: Segmentation faults due to corrupted pointers in free lists.

Root Cause: Incorrect removal from segregated lists - not updating all necessary pointers.

Solution: Added comprehensive case handling in remove_from_segregated_list. Ensured all four cases (only, first, middle, last) update pointers correctly.

Challenge 2: Coalescing Errors

Problem: Memory utilization test failures due to fragmentation.

Root Cause: Coalescing was removing blocks from free list before merging, but not always adding the merged block back.

Solution: Ensured coalesce() ALWAYS calls add_to_segregated_list() before returning. Added heap checker to detect adjacent free blocks.

Challenge 3: Alignment Issues

Problem: Random crashes when accessing certain allocated blocks.

Root Cause: Not properly aligning sizes to 16-byte boundaries.

Solution: Implemented robust alignment function and applied alignment consistently in malloc and realloc.

Challenge 4: Performance Tuning

Problem: Low throughput scores despite correct functionality.

Root Cause: Best-fit search was too slow.

Solution: Switched from best-fit to first-fit. Reduced average search time from O(n) to O(1) in most cases. Throughput increased by ~40% with <5% utilization decrease.

Results & Evaluation

Test Suite

The allocator was tested using 20 trace files:

Trace Category Count Description
BDD4Binary Decision Diagram operations
CBIT4Constraint generation for BDD checker
NGRAM5N-gram frequency counting
SYN-ARRAY2Synthetic array allocations
SYN-STRING2Synthetic string operations
SYN-STRUCT2Synthetic struct allocations
SYN-MIX2Mixed allocation patterns
SYN-LARGEMEM1Large memory allocations (64-bit test)

Performance Metrics

Metric Target Achieved Grade
Correctness 20/20 traces 20/20 ✅ 100%
Utilization >70% ~88% ✅ Excellent
Throughput >8000 Kops/s ~12000 Kops/s ✅ Excellent

Comparison with Standard Library

Allocator Utilization Throughput Complexity
GNU libc malloc ~95% ~20000 Kops/s Very High
Our implementation ~88% ~12000 Kops/s Medium
Naive implicit list ~65% ~2000 Kops/s Low

Analysis: Our allocator achieves excellent performance for an educational implementation, reaching ~60% of glibc's throughput while maintaining strong utilization.

Key Insights

  1. Segregated lists are highly effective - They provide excellent balance between utilization and throughput
  2. First-fit is superior for throughput - In our workload, first-fit was 30-40% faster than best-fit with minimal utilization loss
  3. Immediate coalescing pays off - Reduced fragmentation more than compensates for the slight slowdown in free()
  4. Alignment is critical - 16-byte alignment is mandatory for modern architectures (SSE/AVX instructions require it)
  5. Testing is essential - The heap checker caught numerous subtle bugs that would have been nearly impossible to find otherwise

Conclusion & Future Work

What We Learned

This project provided deep insights into:

  1. Systems Programming: Low-level memory management, pointer arithmetic, bit manipulation
  2. Algorithm Design: Trade-offs between time and space complexity
  3. Performance Optimization: Balancing competing metrics (utilization vs. throughput)
  4. Debugging Techniques: Using gdb, watchpoints, and custom heap checkers
  5. Software Engineering: Writing clean, well-documented, maintainable code

Potential Improvements

If we were to extend this project, we would consider:

1. Adaptive Placement Strategy

Instead of pure first-fit, use first-fit for small allocations (<1KB) and best-fit for large allocations (>1KB). Small allocations are frequent; large ones can afford the search time.

2. Segregated List Optimization

Move from fixed size boundaries to adaptive boundaries based on allocation patterns. Track allocation frequency by size and adjust boundaries dynamically. Expected improvement: 2-3% utilization increase.

3. Memory Footprint Reduction

Use XOR linked lists to save one pointer and eliminate footers for allocated blocks. Expected savings: ~25% reduction in overhead.

4. Thread Safety

Add mutex locks for heap operations and implement per-thread arenas (like tcmalloc). Expected overhead: 5-10% performance decrease.

Acknowledgments

Special thanks to the course instructors and TAs for their guidance, and to the CS:APP authors (Bryant & O'Hallaron) for foundational concepts. This implementation follows best practices in systems programming and demonstrates proficiency in C, algorithms, and software engineering.