Memory, Layouts and Allocators
This is my humble and hopefully-concise-enough take on the popular
and excellent book "What Every Programmer Should Know About Memory".
Find it for free
here.
I consider this to be "What Every Programmer Should
Actually Know About Memory".
This post covers how memory works at the OS level, how data layout affects performance, and which allocation strategies fit different use cases. After reading it you should have a good mental model that enables you to make informed decisions when writing programs, or when trying to understand how things generally work. I try to cover what I consider to be the most important stuff.
Virtual Memory Overview
The first thing to understand is the concept of virtual memory.
Address Spaces
Every pointer your program uses is a virtual address. The CPU and OS work together to translate these to physical addresses in actual RAM.
Each process has its own virtual address space, providing isolation.
By default, process A's pointer 0x7fff12340000 maps to completely different
physical memory than the same address in process B.
This lets every program behave as if it had the entire addressable
space to itself: process A typically does not need to care or worry
about other processes.
On x86-64, virtual addresses are 48 bits (with 16 bits of sign
extension to fill the 64-bit register), giving 2^48 bytes (256TB) of addressable space.
The address space is split: lower half (0x0 to 0x7fffffffffff) for userspace, upper half
for the kernel. Addresses outside these "canonical" ranges cause
faults (we will take a look at these later).
Page Tables
Memory is managed in fixed-size pages, typically 4KB on x86-64. The page table is a hierarchical data structure that maps virtual pages to physical frames. The atomic unit of memory manipulation is a byte, but when thinking about getting memory for our program, we should treat the page as the atomic unit. This means that if we want to allocate a single byte (i.e. a C-style char, or a boolean), we still need to put it inside a page, which is PAGE_SIZE bytes (typically 4096 bytes, aka 4KB, as previously discussed).
Virtual Address (48 bits on x86-64):
+----------+---------+---------+---------+---------------+
| PML4 | PDPT | PD | PT | Page Offset |
| 9 bits | 9 bits | 9 bits | 9 bits | 12 bits |
+----+-----+----+----+----+----+----+----+---------------+
| | | |
v v v v
+---+ +---+ +---+ +---+
| |----->| |---->| |---->| |----> Physical Page
+---+ +---+ +---+ +---+
PML4 PDPT PD PT
Table Table Table Table
Each level is a 512-entry table (9 bits = 2^9 = 512 entries). The 12-bit offset means pages are 4KB (2^12 = 4096 bytes). This 4-level walk is how a virtual address gets mapped to its actual physical address, and in principle it has to happen for every memory access (we'll see shortly how the TLB avoids most of that cost).
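To make the split concrete, here is a small sketch of how those index fields fall out of a 64-bit address. Userspace never does this itself (the MMU performs the walk in hardware); the code just illustrates where the 9/9/9/9/12 bits in the diagram come from.
#include <stdint.h>
#include <stdio.h>

// Illustrative only: split a canonical x86-64 virtual address into the
// 9/9/9/9/12-bit fields from the diagram above.
static void decompose(uint64_t vaddr) {
    uint64_t offset = vaddr & 0xFFF;          // low 12 bits: byte offset within the 4KB page
    uint64_t pt     = (vaddr >> 12) & 0x1FF;  // next 9 bits: index into the page table (PT)
    uint64_t pd     = (vaddr >> 21) & 0x1FF;  // index into the page directory (PD)
    uint64_t pdpt   = (vaddr >> 30) & 0x1FF;  // index into the PDPT
    uint64_t pml4   = (vaddr >> 39) & 0x1FF;  // index into the PML4
    printf("PML4=%lu PDPT=%lu PD=%lu PT=%lu offset=%lu\n",
           (unsigned long)pml4, (unsigned long)pdpt, (unsigned long)pd,
           (unsigned long)pt, (unsigned long)offset);
}

int main(void) {
    int x = 0;
    decompose((uint64_t)(uintptr_t)&x);  // e.g. the address of a stack variable
    return 0;
}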
Page table entries contain more than just the physical address. Key flags include:
- Present bit: Is this page actually mapped?
- Read/Write: Can this page be written?
- User/Supervisor: Can userspace access this?
- Accessed: Has this page been read? (set by CPU)
- Dirty: Has this page been written? (set by CPU)
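For reference, these flags live in the low bits of each 64-bit entry, next to the physical frame address. A rough sketch of the x86-64 bit positions (the macro names here are illustrative, not the kernel's own):
#include <stdbool.h>
#include <stdint.h>

// Rough sketch of where these flags sit in an x86-64 page table entry.
#define PTE_PRESENT  (1ULL << 0)  // page is mapped
#define PTE_WRITABLE (1ULL << 1)  // writes allowed
#define PTE_USER     (1ULL << 2)  // userspace may access
#define PTE_ACCESSED (1ULL << 5)  // set by the CPU when the page is read
#define PTE_DIRTY    (1ULL << 6)  // set by the CPU when the page is written

// The physical frame address lives in the middle bits of the same 64-bit entry.
bool pte_is_user_writable(uint64_t pte) {
    return (pte & PTE_PRESENT) && (pte & PTE_WRITABLE) && (pte & PTE_USER);
}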
Getting Memory and Page Faults
Programs request memory from the OS via two main mechanisms:
brk(), which extends the heap, and mmap(),
which maps regions of virtual address space. Most programs don't
interact with these directly: they use malloc(), which wraps
the primitives. We'll cover allocators in detail later.
When you request memory, the OS gives you virtual addresses but doesn't immediately allocate physical RAM. This is called demand paging or lazy allocation. The kernel simply updates its bookkeeping to say "this virtual range is valid" but doesn't populate the page tables with physical mappings yet. This is relatively fast: we pay the cost of the allocation only when we actually interact with that memory (which is why it's considered lazy). We sometimes refer to that first interaction as "touching" a page.
So what happens when you actually access that memory? The CPU attempts the virtual-to-physical translation, finds no mapping in the page table (since it's a new allocation), and triggers a page fault. This traps to the kernel, which then:
- Allocates a physical page
- Zeros it (security: you shouldn't see another process's data)
- Updates the page table to map your virtual address to this physical page
- Returns control to your program, which retries the access: now it succeeds
This means page faults are normal and even necessary: they're part of how memory allocation works. There are two kinds of page faults:
Soft faults (minor):
- No disk I/O needed
- Most common example for most programs: first access to malloc'd memory (e.g. you use a vector, hash map or any collection or data structure that "heap" allocates).
- Relatively cheap: kernel updates page table and returns
Hard faults (major):
- Data must be fetched from disk (swap or memory-mapped file)
- Orders of magnitude more expensive (disk I/O)
High-performance programs minimize faults by pre-allocating and
reusing memory. You can also "pre-fault" pages by touching them
before they're needed, moving the fault cost out of your hot path.
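On Linux, one way to pre-fault is to ask mmap() to populate the mapping up front; an explicit loop that touches one byte per page achieves the same thing. A minimal sketch (MAP_POPULATE is Linux-specific):
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

// Map `size` bytes and ask the kernel to fault every page in right away.
// The page faults are paid here, at setup time, instead of later in the
// hot path. A loop writing one byte per page would have the same effect.
void *prefaulted_buffer(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}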
As you can now understand, page faults occur when touching a
byte that is not yet "paged in". So if you have a vector/dynamic
array of booleans (one byte each) backed by fresh, page-aligned
memory, you should be able to measure a page fault when you do
my_arr[0] and another one when doing my_arr[4096] (you crossed
the boundary of a 4KB page). It therefore makes sense to allocate
in page-sized multiples if you want to get the most usable bytes
out of each fault and out of the space each page takes up.
# Measure page faults for a command
perf stat -e minor-faults,major-faults ./my_program
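As a sketch of the experiment described above, something like the following, run under that perf command, should report roughly one extra minor fault per page touched (the exact count also includes whatever libc touches at startup):
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 1024

int main(void) {
    // Fresh anonymous mapping: the virtual range is reserved, but no
    // physical pages back it yet (demand paging).
    unsigned char *arr = mmap(NULL, NUM_PAGES * PAGE_SIZE,
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arr == MAP_FAILED) return 1;

    // Touch one byte per page: each first touch lands in a page that isn't
    // mapped yet, so each iteration should cost one minor fault.
    for (int i = 0; i < NUM_PAGES; i++)
        arr[i * PAGE_SIZE] = 1;

    printf("touched %d pages\n", NUM_PAGES);
    return 0;
}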
The TLB (Translation Lookaside Buffer)
The TLB is a hardware cache for virtual-to-physical translations. Without it every memory access would require additional memory accesses just to walk the page tables (the process we discussed earlier).
It's a relatively small (typically no more than a few thousand entries), fast cache that stores recent virtual-to-physical translations. When you access an address, the CPU checks the TLB first. On a hit, the translation is essentially free. On a miss, the CPU does the page table walk and then caches the result.
This is also why memory locality matters: accessing memory in tight patterns keeps TLB entries hot. Jumping around a large address space thrashes the TLB: you need to remove an entry to allow caching a new one.
TLB miss vs Page fault: A TLB miss just means the translation isn't cached: the CPU does a page table walk in hardware to find it. A page fault occurs when the page table walk finds that the page isn't mapped at all (present bit = 0), which traps to the kernel.
Memory Access (virtual address)
│
▼
┌──────────┐
│ TLB │
└────┬─────┘
│
┌───────────────────┴───────────────────┐
│ │
Hit Miss
│ │
│ ▼
│ ┌─────────────────────┐
│ │ Page table walk │
│ │ (hardware) │
│ └──────────┬──────────┘
│ │
│ ┌───────────────────┴───────────────────┐
│ │ │
│ Present=1 Present=0
│ (page is mapped) (page not mapped)
│ │ │
│ ▼ ▼
│ Cache in TLB PAGE FAULT
│ │ (trap to kernel)
│ │ │
│ │ ┌───────────────┴───────────────┐
│ │ │ │
│ │ Valid access Invalid access
│ │ (demand paging) (bad pointer)
│ │ │ │
│ │ ▼ ▼
│ │ ┌───────────────────┐ SIGSEGV
│ │ │ 1. Allocate frame │ (crash)
│ │ │ 2. Zero the page │
│ │ │ 3. Update page │
│ │ │ table entry │
│ │ └─────────┬─────────┘
│ │ │
│ │ ▼
│ │ Return to userspace
│ │ │
│ │ ▼
│ │ Retry access
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────┐
│ Access memory │
└─────────────────────────────────────────────────────┘
Huge Pages
x86-64 supports larger page sizes: 2MB and 1GB. Huge pages reduce TLB pressure dramatically.
- 2MB pages: Skip one level of page tables, 512x fewer TLB entries needed
- 1GB pages: Skip two levels, for very large allocations
Trade-offs:
- Internal fragmentation (memory is wasted if you don't need the whole huge page)
- Allocation difficulty (the kernel needs contiguous physical memory, and the odds of finding a free 4KB page are much better than the odds of finding a free contiguous 1GB region!)
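A minimal, Linux-specific sketch of opting a large allocation into 2MB transparent huge pages (assuming THP is enabled on the system; the hugetlbfs route via MAP_HUGETLB additionally needs huge pages reserved by the admin):
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

// Map a large region and hint that the kernel should back it with 2MB
// transparent huge pages. madvise(MADV_HUGEPAGE) is a hint, not a guarantee:
// if THP is disabled the allocation simply stays on 4KB pages.
void *huge_alloc(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, size, MADV_HUGEPAGE);  // best effort
    return p;
}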
Copy-on-Write
Copy-on-write (COW) is an optimization where pages are shared until one process tries to write.
- fork() shares all pages initially, marks them read-only
- A write attempt triggers a fault
- Kernel copies the page, updates page tables
- Write completes to the private copy
COW is also used for mmap(MAP_PRIVATE) file mappings
and some allocator optimizations. Understanding COW explains why
fork() is fast even with large address spaces, and why writes to
"shared" data may suddenly consume memory. I won't go into more detail
here, but it's an important concept for programmers who want more
depth when it comes to memory and the performance of data-intensive
applications.
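A small sketch of the semantics: after fork() the pages are shared, and the copy happens lazily, page by page, only when the child writes.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    // 64MB buffer, touched so it's actually paged in before the fork.
    size_t size = 64 * 1024 * 1024;
    char *buf = malloc(size);
    memset(buf, 'A', size);

    // After fork() both processes "own" the buffer, but physically they
    // share the same pages, now marked read-only so any write will fault.
    pid_t pid = fork();
    if (pid == 0) {
        // Child: this write faults, the kernel copies just the affected
        // page, and the child gets a private copy. The parent is untouched.
        buf[0] = 'B';
        printf("child sees:  %c\n", buf[0]);  // B
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %c\n", buf[0]);      // still A
    free(buf);
    return 0;
}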
Allocation
Now that we understand how virtual memory works, let's look at how programs actually get memory. An allocator manages memory on behalf of your program: it requests large chunks from the OS and subdivides them for individual allocations.
The first thing to understand is when we actually want to allocate. This might seem obvious, but if you've only dealt with programming languages that allocate for you under the hood, you might not know.
- Size unknown at compile time: reading a file, user input, network data, collections that grow
- Data that outlives the current function: you can't return a pointer to a local variable (stack gets reused), so you allocate
- Large data: stack space is limited (typically 1-8MB); large arrays or buffers go on the heap
- Dynamic data structures: trees, linked lists, hash maps with nodes allocated individually
Now we can think about allocation in more detail.
Most programmers don't need custom allocators. Understanding how your language's standard containers work can get you decent performance with zero custom code in most cases. We'll start there, then cover specialized strategies, and finally touch on how general-purpose allocators like glibc's malloc (the default allocator on GNU/Linux distributions) hand out memory under the hood.
Dynamic Arrays in Practice
A dynamic array (vector, list, slice: different names in different languages) is an array that can grow. Fixed size arrays require knowing the size upfront; dynamic arrays let you keep adding elements.
The basic idea is simple: allocate some initial capacity, track how many elements you've actually used (size), and when you run out of room, allocate a bigger buffer and copy everything over.
Key concepts:
- Size: how many elements are currently stored
- Capacity: how many elements fit before reallocation
- Growth factor: when full, multiply capacity by this (typically 1.5x or 2x). Growing by a constant factor gives amortized O(1) append: you copy less and less often as the array grows.
If you were to implement one yourself in C: malloc an initial buffer, track size and capacity, and when size hits capacity, call realloc with double the capacity (realloc may extend in place if lucky, or allocate new memory and copy if not). That's essentially what every language's standard dynamic array does under the hood.
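A minimal sketch of that in C (illustrative only: no error handling, int elements, growth factor of 2):
#include <stdlib.h>

typedef struct {
    int   *data;
    size_t size;      // elements currently stored
    size_t capacity;  // elements that fit before we must grow
} Vec;

void vec_push(Vec *v, int value) {
    if (v->size == v->capacity) {
        // Out of room: grow geometrically (2x here) and let realloc copy
        // if it can't extend the block in place.
        size_t new_cap = v->capacity ? v->capacity * 2 : 8;
        v->data = realloc(v->data, new_cap * sizeof(int));
        v->capacity = new_cap;
    }
    v->data[v->size++] = value;
}
Real implementations also handle allocation failure, shrinking and arbitrary element types, but that growth logic is the whole trick.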
Getting Good Behavior
Pre-allocate when size is known or estimable:
- Avoids repeated reallocations during filling
- Single allocation instead of log(n) allocations
// Without pre-allocation:
let mut v = Vec::new(); // capacity = 0
for item in items {
v.push(item); // reallocates repeatedly as capacity is exceeded (geometric growth)
} // log(n) allocations total
// With pre-allocation:
let mut v = Vec::with_capacity(items.len()); // one allocation upfront
for item in items {
v.push(item); // no reallocation until we exceed capacity
} // single allocation (assuming items.len() doesn't grow)
Page-aligned capacities: Since the OS allocates memory in pages (typically 4KB), choosing capacities that are multiples of page size avoids wasting memory. If you allocate 4097 bytes, you're using 2 pages (8KB) anyway. For large allocations, round up to page boundaries.
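Rounding a size up to the next page multiple is a one-liner; a small sketch (queries the real page size rather than hard-coding 4096):
#include <stddef.h>
#include <unistd.h>

// Round n up to the next multiple of the page size. Works because the
// page size is a power of two.
size_t round_up_to_page(size_t n) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);  // typically 4096
    return (n + page - 1) & ~(page - 1);
}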
Reuse allocations:
- Clear and refill instead of destroy and recreate
- Keep buffers around for repeated operations
Batch allocations:
- Allocate one large array instead of many small objects
- Better cache locality, fewer allocator calls
Avoid allocation in hot loops:
- Hoist allocations out of loops
- Reuse buffers across iterations
Allocation Strategies
General-Purpose Allocators
Default choices: glibc malloc, jemalloc, tcmalloc, mimalloc. Use these for most cases.
Tip: You can often swap the allocator via LD_PRELOAD or linker flags, without any code changes, to compare the available allocators on your workload.
# Try jemalloc without recompiling
LD_PRELOAD=/usr/lib/libjemalloc.so ./my_program
Arena/Bump Allocators
The idea: grab a big block of memory upfront, then allocate from it by simply bumping a pointer forward. No bookkeeping per allocation. When you're done with everything, reset the pointer back to the start.
This is extremely fast: allocation is just "increment pointer, return old value". The tradeoff is you can't free individual objects. You free everything at once, or nothing.
This sounds limiting, but many workloads have batch-lifetime properties, where a whole group of allocations dies together:
- Request/response processing: allocate everything needed to handle a request, free it all when the response is sent
- Compiler passes: build an AST, transform it, free everything when the pass completes
- Game frames: per-frame scratch memory that resets every frame (particle effects, temporary buffers)
- Parsing: allocate nodes while parsing, free when done
If your allocation pattern fits this model, arenas are hard to beat: fast allocation, zero fragmentation, and excellent cache locality since allocations are contiguous.
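A minimal sketch of a bump allocator (illustrative names; a real arena would chain additional blocks instead of returning NULL, and align must be a power of two):
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;  // the big block grabbed upfront
    size_t   used;  // the "bump pointer", as an offset into base
    size_t   cap;   // total size of the block
} Arena;

Arena arena_create(size_t cap) {
    Arena a = { malloc(cap), 0, cap };
    return a;
}

// Allocation: round the offset up to the requested alignment, bump it,
// return the old position. No per-allocation bookkeeping at all.
void *arena_alloc(Arena *a, size_t size, size_t align) {
    size_t offset = (a->used + align - 1) & ~(align - 1);
    if (offset + size > a->cap) return NULL;  // out of space
    a->used = offset + size;
    return a->base + offset;
}

// "Free" everything at once by resetting the bump pointer.
void arena_reset(Arena *a) { a->used = 0; }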
Pool Allocators
The idea: pre-allocate a bunch of fixed-size blocks and maintain a list of which ones are free. Allocation pops a block from the free list; deallocation pushes it back. Both operations are O(1).
The constraint is that all objects must be the same size. But when that fits your use case, pools are very efficient: no fragmentation (every block is the same size), fast allocation and deallocation, and unlike arenas, you can free individual objects.
Pools work well when you have many objects of the same type with unpredictable lifetimes:
- Network connections: connections come and go, but each connection object is the same size
- Tree/graph nodes: nodes are allocated and freed as the structure changes
- Database records: fixed-size row buffers
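A minimal sketch of a pool built on an intrusive free list, where each free block stores the pointer to the next free block inside itself (illustrative; assumes block_size >= sizeof(FreeNode)):
#include <stdlib.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;

typedef struct {
    void     *memory;     // one slab of block_count * block_size bytes
    FreeNode *free_list;  // list threaded through the free blocks themselves
} Pool;

Pool pool_create(size_t block_size, size_t block_count) {
    Pool p = { malloc(block_size * block_count), NULL };
    // Initially every block is free: push them all onto the free list.
    for (size_t i = 0; i < block_count; i++) {
        FreeNode *node = (FreeNode *)((char *)p.memory + i * block_size);
        node->next = p.free_list;
        p.free_list = node;
    }
    return p;
}

// O(1): pop a block off the free list.
void *pool_alloc(Pool *p) {
    if (!p->free_list) return NULL;  // pool exhausted
    FreeNode *node = p->free_list;
    p->free_list = node->next;
    return node;
}

// O(1): push the block back onto the free list.
void pool_free(Pool *p, void *block) {
    FreeNode *node = block;
    node->next = p->free_list;
    p->free_list = node;
}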
Allocators and Pages
As we saw earlier, the OS gives out memory in pages (typically 4KB). Allocators like malloc are general-purpose allocators: they request pages from the OS and carve them up into smaller pieces for your individual allocations. When you malloc(32), you're not getting a page; you're getting 32 bytes from a page the allocator already has.
This is why allocation strategy matters: a general-purpose allocator must handle any size, any lifetime, any pattern. Arena and pool allocators trade generality for speed by constraining what patterns they support. Pick the right tool for your access pattern. You might benefit from a specific allocator alongside huge pages.
Memory Alignment and Cache Lines
Now that we know how to get memory, let's understand how the CPU actually accesses it. This knowledge informs how we should organize our data.
Cache Line Basics
Modern x86 CPUs use 64-byte cache lines. Memory is fetched in cache-line units, not individual bytes. Accessing one byte loads the entire 64-byte line into cache.
This has implications: accessing arr[0] also loads arr[1] through arr[15] (assuming 4-byte elements, i.e. 32-bit integers). Sequential access is essentially free after the first fetch, which is also partly why arrays (contiguous memory) are so powerful.
Alignment Implications
Aligned access:
- Data fits within a single cache line
- One memory transaction
Misaligned/straddling access:
- Data spans two cache lines: two memory transactions (i.e. loads) required
Struct Layout and Natural Alignment
CPUs access memory most efficiently when data is naturally aligned: placed at an address divisible by its size. A 4-byte int wants to be at an address divisible by 4; an 8-byte pointer wants an address divisible by 8. A 1-byte char can be anywhere (every address is divisible by 1).
CPUs read memory in fixed-size chunks. If a 4-byte int starts at address 5 instead of address 4 or 8, it spans two of these chunks: the CPU needs two loads instead of one.
Modern compilers mostly handle this for you: they insert padding between struct fields to keep each field properly aligned.
struct Example {
char a; // 1 byte
// 3 bytes padding
int b; // 4 bytes (aligned to 4)
char c; // 1 byte
// 7 bytes padding
long d; // 8 bytes (aligned to 8)
}; // Total: 24 bytes
// Reordered to minimize padding:
struct ExamplePacked {
long d; // 8 bytes
int b; // 4 bytes
char a; // 1 byte
char c; // 1 byte
// 2 bytes padding
}; // Total: 16 bytes
Cache-friendly struct design:
- Keep struct size ≤ 64 bytes if it's frequently accessed as a whole (and if that's not possible, group fields that are used together into the same cache-line-sized chunk)
- Keep hot fields in the first 64 bytes
- Order fields by size (largest first) to minimize padding
False Sharing
False sharing is a multi-threaded concern that occurs when two threads access variables that happen to share a cache line. Each write invalidates the cache line on other cores, causing constant cache line bouncing.
// BAD: counters likely share a cache line
struct Counters {
int counter_thread_0; // Thread 0 writes
int counter_thread_1; // Thread 1 writes
};
// GOOD: pad to separate cache lines
struct PaddedCounters {
alignas(64) int counter_thread_0;
alignas(64) int counter_thread_1;
};
Data Layouts: SoA vs AoS
With an understanding of cache lines, we can now make informed decisions about how to organize our data in memory.
Array of Structures (AoS)
The traditional OOP approach: objects stored contiguously, with each object's fields adjacent in memory.
struct Particle { float x, y, z, vx, vy, vz; };
Particle particles[10000];
// Memory layout: [x,y,z,vx,vy,vz][x,y,z,vx,vy,vz]...
Structure of Arrays (SoA)
Each field stored in its own contiguous array.
struct Particles {
float* x; // [x x x x x ...]
float* y; // [y y y y y ...]
float* z; // [z z z z z ...]
float* vx; // [vx vx vx ...]
float* vy; // [vy vy vy ...]
float* vz; // [vz vz vz ...]
size_t count;
};
Cache and Memory Analysis
Consider updating all x positions by their velocities:
// AoS: cache lines contain x,y,z,vx,vy,vz - we only need x and vx
for (int i = 0; i < n; i++)
particles[i].x += particles[i].vx * dt;
// SoA (p is a pointer to Particles): cache lines contain only x values, then only vx values
for (int i = 0; i < n; i++)
p->x[i] += p->vx[i] * dt;
With SoA, we load 16 floats per cache line, all useful. With AoS, each cache line gives us ~2-3 particles worth of x and vx, wasting bandwidth on y, z, vy, vz. SoA also enables SIMD: the compiler can vectorize the SoA loop trivially.
When to Use AoS
- Accessing multiple/all fields of an object together
- Object identity matters (pointer to whole object)
- Simpler code, better per-object locality
- Typical application code, business logic
When to Use SoA
- Hot loops processing single fields across many entities
- SIMD vectorization (compiler can auto-vectorize contiguous same-type data)
- Particle systems, physics simulations, game engines
- Data-heavy processing
Hybrid: AoSoA
Array of Structures of Arrays: chunk data into blocks (e.g., 64 elements), with SoA layout within each block.
// AoSoA with 64-element blocks
struct ParticleBlock {
float x[64];
float y[64];
float z[64];
float vx[64];
float vy[64];
float vz[64];
};
ParticleBlock blocks[num_blocks];
This balances SIMD-friendliness (contiguous same-type data within blocks) with reasonable object access (block index + local index).
Hot/Cold Splitting
Keep frequently accessed fields together by moving rarely accessed fields to a separate allocation.
// Hot data accessed every frame
struct EntityHot {
float x, y, z;
float vx, vy, vz;
};
// Cold data accessed occasionally
struct EntityCold {
char name[64];
int creation_time;
int flags;
};
EntityHot* hot_data; // Dense, cache-friendly iteration
EntityCold* cold_data; // Only touched when needed
This improves cache utilization without a full SoA conversion.
Closing Thoughts
Hopefully this helps clear up some of the fairly complex stuff behind memory! There's a lot we didn't touch on here (mmap specifics, NUMA, I/O strategies such as fread vs mmap vs read vs async APIs like io_uring...), but I believe this covers the essentials and should help people investigate more if they so wish.