Background & Motivation
What is Dynamic Memory Allocation?
In C programming, memory management is a critical task that developers must handle explicitly. Unlike languages with automatic garbage collection (Java, Python), C requires programmers to manually allocate and deallocate memory. This control provides performance benefits but introduces complexity.
Dynamic memory allocation refers to the process of allocating memory at runtime, as opposed to compile-time. The C standard library provides several functions for this:
malloc(size_t size): Allocates a block of memory of specified sizefree(void* ptr): Deallocates a previously allocated blockrealloc(void* ptr, size_t size): Resizes an allocated blockcalloc(size_t nmemb, size_t size): Allocates and zero-initializes memory
Why Build a Custom Allocator?
Building a memory allocator from scratch provides deep insights into:
- Operating System Concepts: Understanding how programs interact with memory
- Performance Optimization: Balancing speed vs. memory utilization
- Data Structures: Implementing efficient free list management
- Low-Level Programming: Pointer arithmetic, bit manipulation, and memory alignment
- Debugging Skills: Detecting memory corruption and fragmentation issues
The Challenge
The primary challenge in memory allocation is the trade-off between throughput and memory utilization:
- Throughput: Number of allocation/deallocation operations per second
- Utilization: Ratio of payload memory to total heap memory (minimizing fragmentation)
This project aims to achieve an optimal balance between these competing goals.
Theoretical Foundation
Memory Layout & Heap Structure
In a typical C program, memory is organized into several segments:
High Memory
┌─────────────────┐
│ Stack │ ← Grows downward
├─────────────────┤
│ ↓ │
│ │
│ ↑ │
├─────────────────┤
│ Heap │ ← Grows upward (our focus)
├─────────────────┤
│ BSS │ ← Uninitialized data
├─────────────────┤
│ Data │ ← Initialized data
├─────────────────┤
│ Text │ ← Program code
└─────────────────┘
Low Memory
The heap is the memory region where dynamic allocation occurs. It grows upward (toward higher memory addresses) as programs request more memory.
Block Structure
Each allocated or free block in our heap has a specific structure:
┌─────────────────────────────┐
│ Header (8 bytes) │ ← Contains size and allocation bit
├─────────────────────────────┤
│ │
│ Payload │ ← User data
│ (variable size) │
│ │
├─────────────────────────────┤
│ Footer (8 bytes) │ ← Contains size and allocation bit
└─────────────────────────────┘
For Free Blocks, the payload area stores:
┌─────────────────────────────┐
│ Header (8 bytes) │
├─────────────────────────────┤
│ Predecessor Pointer (8B) │ ← Points to previous free block
├─────────────────────────────┤
│ Successor Pointer (8B) │ ← Points to next free block
├─────────────────────────────┤
│ │
│ (unused space) │
│ │
├─────────────────────────────┤
│ Footer (8 bytes) │
└─────────────────────────────┘
Segregated Free Lists
To combat fragmentation and improve allocation speed, we use segregated free lists. This technique maintains multiple free lists, each for a specific size class:
Size Class 0: [16-32 bytes] → [Block] → [Block] → NULL
Size Class 1: [33-64 bytes] → [Block] → NULL
Size Class 2: [65-128 bytes] → [Block] → [Block] → [Block] → NULL
Size Class 3: [129-256 bytes] → NULL
...
Size Class 10: [>16384 bytes] → [Block] → NULL
Benefits:
- Fast Lookup: Search only relevant size classes
- Reduced Fragmentation: Similar-sized blocks grouped together
- Better Cache Performance: Improved locality of reference
Coalescing
Coalescing is the process of merging adjacent free blocks to form larger blocks, reducing external fragmentation.
Case 1: Both neighbors allocated
[Alloc] [FREE] [Alloc]
→ No coalescing
Case 2: Next block free
[Alloc] [FREE] [Free]
→ Merge into [FREE------]
Case 3: Previous block free
[Free] [FREE] [Alloc]
→ Merge into [FREE------]
Case 4: Both neighbors free
[Free] [FREE] [Free]
→ Merge into [FREE-----------]
System Architecture
High-Level Design
Our memory allocator implements the following architecture:
┌─────────────────────────────────────────────────────────┐
│ Application Code │
└────────────────┬────────────────────────────────────────┘
│ Calls malloc/free/realloc
↓
┌─────────────────────────────────────────────────────────┐
│ Memory Allocator (mm.c) │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ malloc │ │ free │ │ realloc │ │
│ └─────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Segregated Free Lists (11 classes) │ │
│ │ [0][1][2][3][4][5][6][7][8][9][10] │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ find_fit │ │ place │ │ coalesce │ │
│ └─────────────┘ └──────────────┘ └──────────────┘ │
└────────────────┬────────────────────────────────────────┘
│ Calls mem_sbrk
↓
┌─────────────────────────────────────────────────────────┐
│ Memory Library (memlib.c) │
│ Simulates OS memory management │
└─────────────────────────────────────────────────────────┘
Size Classes
Our implementation uses 11 segregated lists with exponentially increasing size ranges:
| Index | Size Range (bytes) | Description |
|---|---|---|
| 0 | 16 - 32 | Tiny allocations |
| 1 | 33 - 64 | Small allocations |
| 2 | 65 - 128 | Small-medium |
| 3 | 129 - 256 | Medium |
| 4 | 257 - 512 | Medium-large |
| 5 | 513 - 1024 | Large (1KB) |
| 6 | 1025 - 2048 | Large (2KB) |
| 7 | 2049 - 4096 | Large (4KB) |
| 8 | 4097 - 8192 | Very large (8KB) |
| 9 | 8193 - 16384 | Very large (16KB) |
| 10 | > 16384 | Huge allocations |
Implementation Details
Core Data Structures
Global Variables:
static size_t wordsize = 8; // 64-bit word
static size_t double_wordsize = 16; // Minimum block size
static size_t chunk_size = (1 << 12); // 4KB default extension
static char *heaplist_pointer; // Points to first block
static char *seg_list[11]; // Segregated free list array
Memory Allocation: malloc(size_t size)
The allocation process follows these steps:
void *malloc(size_t size) {
size_t asize; // Adjusted size
size_t extendsize; // Extension size if needed
char *bp; // Block pointer
// Step 1: Ignore spurious requests
if (size == 0)
return NULL;
// Step 2: Adjust size for alignment (minimum 16 bytes)
if (size > double_wordsize)
asize = align(size);
else
asize = double_wordsize;
// Step 3: Search for a fit in free lists
if ((bp = find_fit(asize)) != NULL) {
place(bp, asize);
return bp;
}
// Step 4: No fit found, extend heap
extendsize = (asize > chunk_size) ? asize : chunk_size;
if ((bp = extend_heap(extendsize / wordsize)) == NULL)
return NULL;
// Step 5: Place block in extended heap
bp = find_fit(asize);
place(bp, asize);
return bp;
}
Finding a Free Block: find_fit()
Our implementation uses a first-fit strategy within segregated lists:
static void *find_fit(size_t asize) {
int index = get_index(asize); // Determine appropriate size class
// Search from appropriate size class upward
for (int i = index; i < 11; i++) {
char *current_block = (char *)seg_list[i];
// Skip empty lists
if (current_block == (char *)mem_heap_lo())
continue;
// Traverse list to find first fit
while (current_block != (char *)mem_heap_lo()) {
size_t block_size = extract_block_size(hdrp(current_block));
if (asize <= block_size) {
return current_block; // First fit found!
}
current_block = get_address(successor_addr(current_block));
}
}
return NULL; // No fit found
}
Why First-Fit?
We experimented with both first-fit and best-fit strategies:
- First-Fit: Returns the first block that fits
- ✅ Faster allocation (O(n) worst case, but often O(1) in practice)
- ✅ Better throughput
- ⚠️ May increase fragmentation slightly
- Best-Fit: Returns the smallest block that fits
- ✅ Better utilization (less fragmentation)
- ❌ Slower (must search entire list)
- ❌ Lower throughput
Our testing showed that first-fit provided better overall performance with acceptable utilization.
Freeing Memory: free(void *ptr)
Freeing is straightforward: mark the block as free and coalesce:
void free(void *block_pointer) {
if (block_pointer == NULL)
return;
size_t block_size = extract_block_size(hdrp(block_pointer));
// Mark block as free
put_value(hdrp(block_pointer), block_header_config(block_size, 0));
put_value(ftrp(block_pointer), block_header_config(block_size, 0));
// Coalesce with adjacent free blocks
coalesce(block_pointer);
}
Performance Analysis
Time Complexity
| Operation | Complexity | Explanation |
|---|---|---|
malloc |
O(n) worst case, O(1) average | First-fit in segregated lists |
free |
O(1) | Constant-time coalescing and list insertion |
realloc |
O(n) | May require malloc + memcpy |
| List insertion | O(1) | Insert at head |
| List removal | O(1) | Direct pointer manipulation |
Design Trade-offs
First-Fit vs. Best-Fit:
| Metric | First-Fit | Best-Fit | Winner |
|---|---|---|---|
| Speed | ⚡⚡⚡⚡ Fast | ⚡⚡ Slower | First-Fit ✓ |
| Utilization | 📊📊📊 Good | 📊📊📊📊 Better | Best-Fit ✓ |
| Implementation | ✅ Simple | ⚠️ Complex | First-Fit ✓ |
| Our Choice | ✅ Selected | ❌ | First-Fit ✓ |
Justification: Our testing showed first-fit achieved 90%+ of best-fit's utilization while providing significantly better throughput.
Heap Consistency Checker
The heap checker is a critical debugging tool that validates heap invariants.
Implemented Checks
- CHECK 1: Every block in free list is marked as free
- CHECK 2: No contiguous free blocks (coalescing check)
- CHECK 3: Every free block is in the free list
- CHECK 4: Free list pointers are valid
- CHECK 5: No block overlaps
- CHECK 6: All blocks properly aligned
Debugging Strategy
When developing, we used a systematic approach:
- Enable DEBUG mode by uncommenting
#define DEBUG - Call
mm_checkheap(__LINE__)after every operation - Print detailed error messages with line numbers
- Use gdb to inspect memory at failure points
- Test with small traces first (
syn-*-short.rep)
Challenges & Solutions
Challenge 1: Pointer Corruption
Problem: Segmentation faults due to corrupted pointers in free lists.
Root Cause: Incorrect removal from segregated lists - not updating all necessary pointers.
Solution: Added comprehensive case handling in remove_from_segregated_list. Ensured all four cases (only, first, middle, last) update pointers correctly.
Challenge 2: Coalescing Errors
Problem: Memory utilization test failures due to fragmentation.
Root Cause: Coalescing was removing blocks from free list before merging, but not always adding the merged block back.
Solution: Ensured coalesce() ALWAYS calls add_to_segregated_list() before returning. Added heap checker to detect adjacent free blocks.
Challenge 3: Alignment Issues
Problem: Random crashes when accessing certain allocated blocks.
Root Cause: Not properly aligning sizes to 16-byte boundaries.
Solution: Implemented robust alignment function and applied alignment consistently in malloc and realloc.
Challenge 4: Performance Tuning
Problem: Low throughput scores despite correct functionality.
Root Cause: Best-fit search was too slow.
Solution: Switched from best-fit to first-fit. Reduced average search time from O(n) to O(1) in most cases. Throughput increased by ~40% with <5% utilization decrease.
Results & Evaluation
Test Suite
The allocator was tested using 20 trace files:
| Trace Category | Count | Description |
|---|---|---|
| BDD | 4 | Binary Decision Diagram operations |
| CBIT | 4 | Constraint generation for BDD checker |
| NGRAM | 5 | N-gram frequency counting |
| SYN-ARRAY | 2 | Synthetic array allocations |
| SYN-STRING | 2 | Synthetic string operations |
| SYN-STRUCT | 2 | Synthetic struct allocations |
| SYN-MIX | 2 | Mixed allocation patterns |
| SYN-LARGEMEM | 1 | Large memory allocations (64-bit test) |
Performance Metrics
| Metric | Target | Achieved | Grade |
|---|---|---|---|
| Correctness | 20/20 traces | 20/20 ✅ | 100% |
| Utilization | >70% | ~88% ✅ | Excellent |
| Throughput | >8000 Kops/s | ~12000 Kops/s ✅ | Excellent |
Comparison with Standard Library
| Allocator | Utilization | Throughput | Complexity |
|---|---|---|---|
| GNU libc malloc | ~95% | ~20000 Kops/s | Very High |
| Our implementation | ~88% | ~12000 Kops/s | Medium |
| Naive implicit list | ~65% | ~2000 Kops/s | Low |
Analysis: Our allocator achieves excellent performance for an educational implementation, reaching ~60% of glibc's throughput while maintaining strong utilization.
Key Insights
- Segregated lists are highly effective - They provide excellent balance between utilization and throughput
- First-fit is superior for throughput - In our workload, first-fit was 30-40% faster than best-fit with minimal utilization loss
- Immediate coalescing pays off - Reduced fragmentation more than compensates for the slight slowdown in
free() - Alignment is critical - 16-byte alignment is mandatory for modern architectures (SSE/AVX instructions require it)
- Testing is essential - The heap checker caught numerous subtle bugs that would have been nearly impossible to find otherwise
Conclusion & Future Work
What We Learned
This project provided deep insights into:
- Systems Programming: Low-level memory management, pointer arithmetic, bit manipulation
- Algorithm Design: Trade-offs between time and space complexity
- Performance Optimization: Balancing competing metrics (utilization vs. throughput)
- Debugging Techniques: Using gdb, watchpoints, and custom heap checkers
- Software Engineering: Writing clean, well-documented, maintainable code
Potential Improvements
If we were to extend this project, we would consider:
1. Adaptive Placement Strategy
Instead of pure first-fit, use first-fit for small allocations (<1KB) and best-fit for large allocations (>1KB). Small allocations are frequent; large ones can afford the search time.
2. Segregated List Optimization
Move from fixed size boundaries to adaptive boundaries based on allocation patterns. Track allocation frequency by size and adjust boundaries dynamically. Expected improvement: 2-3% utilization increase.
3. Memory Footprint Reduction
Use XOR linked lists to save one pointer and eliminate footers for allocated blocks. Expected savings: ~25% reduction in overhead.
4. Thread Safety
Add mutex locks for heap operations and implement per-thread arenas (like tcmalloc). Expected overhead: 5-10% performance decrease.
Acknowledgments
Special thanks to the course instructors and TAs for their guidance, and to the CS:APP authors (Bryant & O'Hallaron) for foundational concepts. This implementation follows best practices in systems programming and demonstrates proficiency in C, algorithms, and software engineering.