Memory, Layouts and Allocators
This is my humble and hopefully-concise-enough take on the popular
and excellent book "What Every Programmer Should Know About Memory".
Find it for free
here.
I consider this to be "What Every Programmer Should
Actually Know About Memory".
This post covers how memory works at the OS level, how data layout affects performance, and which allocation strategies fit different use cases. After reading it you should have a good mental model that enables you to make informed decisions when writing programs, or when trying to understand how things generally work. I try to cover what I consider to be the most important stuff.
Virtual Memory Overview
The first thing to understand is the concept of virtual memory.
Address Spaces
Every pointer your program uses is a virtual address. The CPU and OS work together to translate these to physical addresses in actual RAM.
Each process has its own virtual address space, providing isolation.
By default, process A's pointer 0x7fff12340000 maps to completely different
physical memory than the same address in process B.
This lets every program behave as if it had the entire addressable
space to itself: process A typically does not need to care or worry
about other processes.
On x86-64, virtual addresses are 48 bits (with 16 bits of sign
extension to fill the 64-bit register), giving 2^48 bytes (256TB) of addressable space.
The address space is split: lower half (0x0 to 0x7fffffffffff) for userspace, upper half
for the kernel. Addresses outside these "canonical" ranges cause
faults (we will take a look at these later).
Page Tables
Memory is managed in fixed-size pages, typically 4KB on x86-64. The page table is a hierarchical data structure that maps virtual pages to physical frames. The atomic unit of memory manipulation is a byte, but when thinking about getting memory for our program, we should treat the page as the atomic unit. This means that if we want to allocate a single byte (i.e. a C-style char, or a boolean), we still need to put it inside a page, which is PAGE_SIZE bytes (typically 4096 bytes, aka 4KB, as previously discussed).
Virtual Address (48 bits on x86-64):
+----------+---------+---------+---------+---------------+
| PML4 | PDPT | PD | PT | Page Offset |
| 9 bits | 9 bits | 9 bits | 9 bits | 12 bits |
+----+-----+----+----+----+----+----+----+---------------+
| | | |
v v v v
+---+ +---+ +---+ +---+
| |----->| |---->| |---->| |----> Physical Page
+---+ +---+ +---+ +---+
PML4 PDPT PD PT
Table Table Table Table
Each level is a 512-entry table (9 bits = 2^9 = 512 entries). The 12-bit offset means pages are 4KB (2^12 = 4096 bytes). This 4-level walk is how a virtual address gets mapped to its actual physical address, and in principle it has to happen for every memory access (we'll see shortly how the TLB avoids most of that cost).
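To make the split concrete, here is a small sketch of how those index fields fall out of a 64-bit address. Userspace never does this itself (the MMU performs the walk in hardware); the code just illustrates where the 9/9/9/9/12 bits in the diagram come from.
#include <stdint.h>
#include <stdio.h>

// Illustrative only: split a canonical x86-64 virtual address into the
// 9/9/9/9/12-bit fields from the diagram above.
static void decompose(uint64_t vaddr) {
    uint64_t offset = vaddr & 0xFFF;          // low 12 bits: byte offset within the 4KB page
    uint64_t pt     = (vaddr >> 12) & 0x1FF;  // next 9 bits: index into the page table (PT)
    uint64_t pd     = (vaddr >> 21) & 0x1FF;  // index into the page directory (PD)
    uint64_t pdpt   = (vaddr >> 30) & 0x1FF;  // index into the PDPT
    uint64_t pml4   = (vaddr >> 39) & 0x1FF;  // index into the PML4
    printf("PML4=%lu PDPT=%lu PD=%lu PT=%lu offset=%lu\n",
           (unsigned long)pml4, (unsigned long)pdpt, (unsigned long)pd,
           (unsigned long)pt, (unsigned long)offset);
}

int main(void) {
    int x = 0;
    decompose((uint64_t)(uintptr_t)&x);  // e.g. the address of a stack variable
    return 0;
}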
Page table entries contain more than just the physical address. Key flags include:
- Present bit: Is this page actually mapped?
- Read/Write: Can this page be written?
- User/Supervisor: Can userspace access this?
- Accessed: Has this page been read? (set by CPU)
- Dirty: Has this page been written? (set by CPU)
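For reference, these flags live in the low bits of each 64-bit entry, next to the physical frame address. A rough sketch of the x86-64 bit positions (the macro names here are illustrative, not the kernel's own):
#include <stdbool.h>
#include <stdint.h>

// Rough sketch of where these flags sit in an x86-64 page table entry.
#define PTE_PRESENT  (1ULL << 0)  // page is mapped
#define PTE_WRITABLE (1ULL << 1)  // writes allowed
#define PTE_USER     (1ULL << 2)  // userspace may access
#define PTE_ACCESSED (1ULL << 5)  // set by the CPU when the page is read
#define PTE_DIRTY    (1ULL << 6)  // set by the CPU when the page is written

// The physical frame address lives in the middle bits of the same 64-bit entry.
bool pte_is_user_writable(uint64_t pte) {
    return (pte & PTE_PRESENT) && (pte & PTE_WRITABLE) && (pte & PTE_USER);
}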
Getting Memory and Page Faults
Programs request memory from the OS via two main mechanisms:
brk(), which extends the heap, and mmap(),
which maps regions of virtual address space. Most programs don't
interact with these directly: they use malloc(), which wraps
the primitives. We'll cover allocators in detail later.
When you request memory, the OS gives you virtual addresses but doesn't immediately allocate physical RAM. This is called demand paging or lazy allocation. The kernel simply updates its bookkeeping to say "this virtual range is valid" but doesn't populate the page tables with physical mappings yet. This is relatively fast: we pay the cost of the allocation only when we actually interact with that memory (which is why it's considered lazy). We sometimes refer to that first interaction as "touching" a page.
So what happens when you actually access that memory? The CPU attempts the virtual-to-physical translation, finds no mapping in the page table (since it's a new allocation), and triggers a page fault. This traps to the kernel, which then:
- Allocates a physical page
- Zeros it (security: you shouldn't see another process's data)
- Updates the page table to map your virtual address to this physical page
- Returns control to your program, which retries the access: now it succeeds
This means page faults are normal and even necessary: they're part of how memory allocation works. There are two kinds of page faults:
Soft faults (minor):
- No disk I/O needed
- Most common example for most programs: first access to malloc'd memory (e.g. you use a vector, hash map or any collection or data structure that "heap" allocates).
- Relatively cheap: kernel updates page table and returns
Hard faults (major):
- Data must be fetched from disk (swap or memory-mapped file)
- Orders of magnitude more expensive (disk I/O)
High-performance programs minimize faults by pre-allocating and
reusing memory. You can also "pre-fault" pages by touching them
before they're needed, moving the fault cost out of your hot path.
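On Linux, one way to pre-fault is to ask mmap() to populate the mapping up front; an explicit loop that touches one byte per page achieves the same thing. A minimal sketch (MAP_POPULATE is Linux-specific):
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

// Map `size` bytes and ask the kernel to fault every page in right away.
// The page faults are paid here, at setup time, instead of later in the
// hot path. A loop writing one byte per page would have the same effect.
void *prefaulted_buffer(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}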
As you can now understand, page faults occur when touching a
byte that is not yet "paged in". So if you have a vector/dynamic
array of booleans (one byte each) backed by fresh, page-aligned
memory, you should be able to measure a page fault when you do
my_arr[0] and another one when doing my_arr[4096] (you crossed
the boundary of a 4KB page). It therefore makes sense to allocate
in page-sized multiples if you want to get the most usable bytes
out of each fault and out of the space each page takes up.
# Measure page faults for a command
perf stat -e minor-faults,major-faults ./my_program
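As a sketch of the experiment described above, something like the following, run under that perf command, should report roughly one extra minor fault per page touched (the exact count also includes whatever libc touches at startup):
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 1024

int main(void) {
    // Fresh anonymous mapping: the virtual range is reserved, but no
    // physical pages back it yet (demand paging).
    unsigned char *arr = mmap(NULL, NUM_PAGES * PAGE_SIZE,
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arr == MAP_FAILED) return 1;

    // Touch one byte per page: each first touch lands in a page that isn't
    // mapped yet, so each iteration should cost one minor fault.
    for (int i = 0; i < NUM_PAGES; i++)
        arr[i * PAGE_SIZE] = 1;

    printf("touched %d pages\n", NUM_PAGES);
    return 0;
}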
The TLB (Translation Lookaside Buffer)
The TLB is a hardware cache for virtual-to-physical translations. Without it every memory access would require additional memory accesses just to walk the page tables (the process we discussed earlier).
It's a relatively small (typically no more than a few thousand entries), fast cache that stores recent virtual-to-physical translations. When you access an address, the CPU checks the TLB first. On a hit, the translation is essentially free. On a miss, the CPU does the page table walk and then caches the result.
This is also why memory locality matters: accessing memory in tight patterns keeps TLB entries hot. Jumping around a large address space thrashes the TLB: you need to remove an entry to allow caching a new one.
TLB miss vs Page fault: A TLB miss just means the translation isn't cached: the CPU does a page table walk in hardware to find it. A page fault occurs when the page table walk finds that the page isn't mapped at all (present bit = 0), which traps to the kernel.
Memory Access (virtual address)
│
▼
┌──────────┐
│ TLB │
└────┬─────┘
│
┌───────────────────┴───────────────────┐
│ │
Hit Miss
│ │
│ ▼
│ ┌─────────────────────┐
│ │ Page table walk │
│ │ (hardware) │
│ └──────────┬──────────┘
│ │
│ ┌───────────────────┴───────────────────┐
│ │ │
│ Present=1 Present=0
│ (page is mapped) (page not mapped)
│ │ │
│ ▼ ▼
│ Cache in TLB PAGE FAULT
│ │ (trap to kernel)
│ │ │
│ │ ┌───────────────┴───────────────┐
│ │ │ │
│ │ Valid access Invalid access
│ │ (demand paging) (bad pointer)
│ │ │ │
│ │ ▼ ▼
│ │ ┌───────────────────┐ SIGSEGV
│ │ │ 1. Allocate frame │ (crash)
│ │ │ 2. Zero the page │
│ │ │ 3. Update page │
│ │ │ table entry │
│ │ └─────────┬─────────┘
│ │ │
│ │ ▼
│ │ Return to userspace
│ │ │
│ │ ▼
│ │ Retry access
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────┐
│ Access memory │
└─────────────────────────────────────────────────────┘
Huge Pages
x86-64 supports larger page sizes: 2MB and 1GB. Huge pages reduce TLB pressure dramatically.
- 2MB pages: Skip one level of page tables, 512x fewer TLB entries needed
- 1GB pages: Skip two levels, for very large allocations
Trade-offs:
- Internal fragmentation (memory is wasted if you don't need the whole huge page)
- Allocation difficulty (the kernel needs contiguous physical memory, and the odds of finding a free 4KB page are much better than the odds of finding a free contiguous 1GB region!)
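A minimal, Linux-specific sketch of opting a large allocation into 2MB transparent huge pages (assuming THP is enabled on the system; the hugetlbfs route via MAP_HUGETLB additionally needs huge pages reserved by the admin):
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

// Map a large region and hint that the kernel should back it with 2MB
// transparent huge pages. madvise(MADV_HUGEPAGE) is a hint, not a guarantee:
// if THP is disabled the allocation simply stays on 4KB pages.
void *huge_alloc(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, size, MADV_HUGEPAGE);  // best effort
    return p;
}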
Copy-on-Write
Copy-on-write (COW) is an optimization where pages are shared until one process tries to write.
- fork() shares all pages initially, marks them read-only
- A write attempt triggers a fault
- Kernel copies the page, updates page tables
- Write completes to the private copy
COW is also used for mmap(MAP_PRIVATE) file mappings
and some allocator optimizations. Understanding COW explains why
fork() is fast even with large address spaces, and why writes to
"shared" data may suddenly consume memory. I won't go into more detail
here, but it's an important concept for programmers who want more
depth when it comes to memory and the performance of data-intensive
applications.
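A small sketch of the semantics: after fork() the pages are shared, and the copy happens lazily, page by page, only when the child writes.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    // 64MB buffer, touched so it's actually paged in before the fork.
    size_t size = 64 * 1024 * 1024;
    char *buf = malloc(size);
    memset(buf, 'A', size);

    // After fork() both processes "own" the buffer, but physically they
    // share the same pages, now marked read-only so any write will fault.
    pid_t pid = fork();
    if (pid == 0) {
        // Child: this write faults, the kernel copies just the affected
        // page, and the child gets a private copy. The parent is untouched.
        buf[0] = 'B';
        printf("child sees:  %c\n", buf[0]);  // B
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %c\n", buf[0]);      // still A
    free(buf);
    return 0;
}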
Allocation
Now that we understand how virtual memory works, let's look at how programs actually get memory. An allocator manages memory on behalf of your program: it requests large chunks from the OS and subdivides them for individual allocations.
The first thing to understand is when we actually want to allocate. This might seem obvious, but if you've only dealt with programming languages that allocate for you under the hood, you might not know.
- Size unknown at compile time: reading a file, user input, network data, collections that grow
- Data that outlives the current function: you can't return a pointer to a local variable (stack gets reused), so you allocate
- Large data: stack space is limited (typically 1-8MB); large arrays or buffers go on the heap
- Dynamic data structures: trees, linked lists, hash maps with nodes allocated individually
Now we can think about allocation in more detail.
Most programmers don't need custom allocators. Understanding how your language's standard containers work can get you decent performance with zero custom code in most cases. We'll start there, then cover specialized strategies, and finally touch on how general-purpose allocators like glibc's malloc (the default allocator on GNU/Linux distributions) hand out memory under the hood.
Dynamic Arrays in Practice
A dynamic array (vector, list, slice: different names in different languages) is an array that can grow. Fixed size arrays require knowing the size upfront; dynamic arrays let you keep adding elements.
The basic idea is simple: allocate some initial capacity, track how many elements you've actually used (size), and when you run out of room, allocate a bigger buffer and copy everything over.
Key concepts:
- Size: how many elements are currently stored
- Capacity: how many elements fit before reallocation
- Growth factor: when full, multiply capacity by this (typically 1.5x or 2x). Growing by a constant factor gives amortized O(1) append: you copy less and less often as the array grows.
If you were to implement one yourself in C: malloc an initial buffer, track size and capacity, and when size hits capacity, call realloc with double the capacity (realloc may extend in place if lucky, or allocate new memory and copy if not). That's essentially what every language's standard dynamic array does under the hood.
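A minimal sketch of that in C (illustrative only: no error handling, int elements, growth factor of 2):
#include <stdlib.h>

typedef struct {
    int   *data;
    size_t size;      // elements currently stored
    size_t capacity;  // elements that fit before we must grow
} Vec;

void vec_push(Vec *v, int value) {
    if (v->size == v->capacity) {
        // Out of room: grow geometrically (2x here) and let realloc copy
        // if it can't extend the block in place.
        size_t new_cap = v->capacity ? v->capacity * 2 : 8;
        v->data = realloc(v->data, new_cap * sizeof(int));
        v->capacity = new_cap;
    }
    v->data[v->size++] = value;
}
Real implementations also handle allocation failure, shrinking and arbitrary element types, but that growth logic is the whole trick.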
Getting Good Behavior
Pre-allocate when size is known or estimable:
- Avoids repeated reallocations during filling
- Single allocation instead of log(n) allocations
// Without pre-allocation:
let mut v = Vec::new(); // capacity = 0
for item in items {
v.push(item); // reallocates repeatedly as capacity is exceeded (geometric growth)
} // log(n) allocations total
// With pre-allocation:
let mut v = Vec::with_capacity(items.len()); // one allocation upfront
for item in items {
v.push(item); // no reallocation until we exceed capacity
} // single allocation (assuming items.len() doesn't grow)
Page-aligned capacities: Since the OS allocates memory in pages (typically 4KB), choosing capacities that are multiples of page size avoids wasting memory. If you allocate 4097 bytes, you're using 2 pages (8KB) anyway. For large allocations, round up to page boundaries.
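Rounding a size up to the next page multiple is a one-liner; a small sketch (queries the real page size rather than hard-coding 4096):
#include <stddef.h>
#include <unistd.h>

// Round n up to the next multiple of the page size. Works because the
// page size is a power of two.
size_t round_up_to_page(size_t n) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);  // typically 4096
    return (n + page - 1) & ~(page - 1);
}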
Reuse allocations:
- Clear and refill instead of destroy and recreate
- Keep buffers around for repeated operations
Batch allocations:
- Allocate one large array instead of many small objects
- Better cache locality, fewer allocator calls
Avoid allocation in hot loops:
- Hoist allocations out of loops
- Reuse buffers across iterations
Allocation Strategies
General-Purpose Allocators
Default choices: glibc malloc, jemalloc, tcmalloc, mimalloc. Use these for most cases.
Tip: You can often swap the allocator via LD_PRELOAD or linker flags, without any code changes, to compare the available allocators on your workload.
# Try jemalloc without recompiling
LD_PRELOAD=/usr/lib/libjemalloc.so ./my_program
Arena/Bump Allocators
The idea: grab a big block of memory upfront, then allocate from it by simply bumping a pointer forward. No bookkeeping per allocation. When you're done with everything, reset the pointer back to the start.
This is extremely fast: allocation is just "increment pointer, return old value". The tradeoff is you can't free individual objects. You free everything at once, or nothing.
This sounds limiting, but many workloads have batch-lifetime properties, where a whole group of allocations dies together:
- Request/response processing: allocate everything needed to handle a request, free it all when the response is sent
- Compiler passes: build an AST, transform it, free everything when the pass completes
- Game frames: per-frame scratch memory that resets every frame (particle effects, temporary buffers)
- Parsing: allocate nodes while parsing, free when done
If your allocation pattern fits this model, arenas are hard to beat: fast allocation, zero fragmentation, and excellent cache locality since allocations are contiguous.
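A minimal sketch of a bump allocator (illustrative names; a real arena would chain additional blocks instead of returning NULL, and align must be a power of two):
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;  // the big block grabbed upfront
    size_t   used;  // the "bump pointer", as an offset into base
    size_t   cap;   // total size of the block
} Arena;

Arena arena_create(size_t cap) {
    Arena a = { malloc(cap), 0, cap };
    return a;
}

// Allocation: round the offset up to the requested alignment, bump it,
// return the old position. No per-allocation bookkeeping at all.
void *arena_alloc(Arena *a, size_t size, size_t align) {
    size_t offset = (a->used + align - 1) & ~(align - 1);
    if (offset + size > a->cap) return NULL;  // out of space
    a->used = offset + size;
    return a->base + offset;
}

// "Free" everything at once by resetting the bump pointer.
void arena_reset(Arena *a) { a->used = 0; }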
Pool Allocators
The idea: pre-allocate a bunch of fixed-size blocks and maintain a list of which ones are free. Allocation pops a block from the free list; deallocation pushes it back. Both operations are O(1).
The constraint is that all objects must be the same size. But when that fits your use case, pools are very efficient: no fragmentation (every block is the same size), fast allocation and deallocation, and unlike arenas, you can free individual objects.
Pools work well when you have many objects of the same type with unpredictable lifetimes:
- Network connections: connections come and go, but each connection object is the same size
- Tree/graph nodes: nodes are allocated and freed as the structure changes
- Database records: fixed-size row buffers
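A minimal sketch of a pool built on an intrusive free list, where each free block stores the pointer to the next free block inside itself (illustrative; assumes block_size >= sizeof(FreeNode)):
#include <stdlib.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;

typedef struct {
    void     *memory;     // one slab of block_count * block_size bytes
    FreeNode *free_list;  // list threaded through the free blocks themselves
} Pool;

Pool pool_create(size_t block_size, size_t block_count) {
    Pool p = { malloc(block_size * block_count), NULL };
    // Initially every block is free: push them all onto the free list.
    for (size_t i = 0; i < block_count; i++) {
        FreeNode *node = (FreeNode *)((char *)p.memory + i * block_size);
        node->next = p.free_list;
        p.free_list = node;
    }
    return p;
}

// O(1): pop a block off the free list.
void *pool_alloc(Pool *p) {
    if (!p->free_list) return NULL;  // pool exhausted
    FreeNode *node = p->free_list;
    p->free_list = node->next;
    return node;
}

// O(1): push the block back onto the free list.
void pool_free(Pool *p, void *block) {
    FreeNode *node = block;
    node->next = p->free_list;
    p->free_list = node;
}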
Allocators and Pages
As we saw earlier, the OS gives out memory in pages (typically 4KB). Allocators like malloc are general-purpose allocators: they request pages from the OS and carve them up into smaller pieces for your individual allocations. When you malloc(32), you're not getting a page; you're getting 32 bytes from a page the allocator already has.
This is why allocation strategy matters: a general-purpose allocator must handle any size, any lifetime, any pattern. Arena and pool allocators trade generality for speed by constraining what patterns they support. Pick the right tool for your access pattern. You might benefit from a specific allocator alongside huge pages.
Memory Alignment and Cache Lines
Now that we know how to get memory, let's understand how the CPU actually accesses it. This knowledge informs how we should organize our data.
Cache Line Basics
Modern x86 CPUs use 64-byte cache lines. Memory is fetched in cache-line units, not individual bytes. Accessing one byte loads the entire 64-byte line into cache.
This has implications: accessing arr[0] also loads arr[1] through arr[15] (assuming 4-byte elements, i.e. 32-bit integers). Sequential access is essentially free after the first fetch, which is also partly why arrays (contiguous memory) are so powerful.
Alignment Implications
Aligned access:
- Data fits within a single cache line
- One memory transaction
Misaligned/straddling access:
- Data spans two cache lines: two memory transactions (i.e. loads) required
Struct Layout and Natural Alignment
CPUs access memory most efficiently when data is naturally aligned: placed at an address divisible by its size. A 4-byte int wants to be at an address divisible by 4; an 8-byte pointer wants an address divisible by 8. A 1-byte char can be anywhere (every address is divisible by 1).
CPUs read memory in fixed-size chunks. If a 4-byte int starts at address 5 instead of address 4 or 8, it spans two of these chunks: the CPU needs two loads instead of one.
Modern compilers mostly handle this for you: they insert padding between struct fields to keep each field properly aligned.
struct Example {
char a; // 1 byte
// 3 bytes padding
int b; // 4 bytes (aligned to 4)
char c; // 1 byte
// 7 bytes padding
long d; // 8 bytes (aligned to 8)
}; // Total: 24 bytes
// Reordered to minimize padding:
struct ExamplePacked {
long d; // 8 bytes
int b; // 4 bytes
char a; // 1 byte
char c; // 1 byte
// 2 bytes padding
}; // Total: 16 bytes
Cache-friendly struct design:
- Keep struct size ≤ 64 bytes if it's frequently accessed as a whole (and if that's not possible, group fields that are used together into the same cache-line-sized chunk)
- Keep hot fields in the first 64 bytes
- Order fields by size (largest first) to minimize padding
False Sharing
False sharing is a multi-threaded concern that occurs when two threads access variables that happen to share a cache line. Each write invalidates the cache line on other cores, causing constant cache line bouncing.
// BAD: counters likely share a cache line
struct Counters {
int counter_thread_0; // Thread 0 writes
int counter_thread_1; // Thread 1 writes
};
// GOOD: pad to separate cache lines
struct PaddedCounters {
alignas(64) int counter_thread_0;
alignas(64) int counter_thread_1;
};
Data Layouts: SoA vs AoS
With an understanding of cache lines, we can now make informed decisions about how to organize our data in memory.
Array of Structures (AoS)
The traditional OOP approach: objects stored contiguously, with each object's fields adjacent in memory.
struct Particle { float x, y, z, vx, vy, vz; };
Particle particles[10000];
// Memory layout: [x,y,z,vx,vy,vz][x,y,z,vx,vy,vz]...
Structure of Arrays (SoA)
Each field stored in its own contiguous array.
struct Particles {
float* x; // [x x x x x ...]
float* y; // [y y y y y ...]
float* z; // [z z z z z ...]
float* vx; // [vx vx vx ...]
float* vy; // [vy vy vy ...]
float* vz; // [vz vz vz ...]
size_t count;
};
Cache and Memory Analysis
Consider updating all x positions by their velocities:
// AoS: cache lines contain x,y,z,vx,vy,vz - we only need x and vx
for (int i = 0; i < n; i++)
particles[i].x += particles[i].vx * dt;
// SoA (p is a pointer to Particles): cache lines contain only x values, then only vx values
for (int i = 0; i < n; i++)
p->x[i] += p->vx[i] * dt;
With SoA, we load 16 floats per cache line, all useful. With AoS, each cache line gives us ~2-3 particles worth of x and vx, wasting bandwidth on y, z, vy, vz. SoA also enables SIMD: the compiler can vectorize the SoA loop trivially.
When to Use AoS
- Accessing multiple/all fields of an object together
- Object identity matters (pointer to whole object)
- Simpler code, better per-object locality
- Typical application code, business logic
When to Use SoA
- Hot loops processing single fields across many entities
- SIMD vectorization (compiler can auto-vectorize contiguous same-type data)
- Particle systems, physics simulations, game engines
- Data-heavy processing
Hybrid: AoSoA
Array of Structures of Arrays: chunk data into blocks (e.g., 64 elements), with SoA layout within each block.
// AoSoA with 64-element blocks
struct ParticleBlock {
float x[64];
float y[64];
float z[64];
float vx[64];
float vy[64];
float vz[64];
};
ParticleBlock blocks[num_blocks];
This balances SIMD-friendliness (contiguous same-type data within blocks) with reasonable object access (block index + local index).
Hot/Cold Splitting
Keep frequently accessed fields together by moving rarely accessed fields to a separate allocation.
// Hot data accessed every frame
struct EntityHot {
float x, y, z;
float vx, vy, vz;
};
// Cold data accessed occasionally
struct EntityCold {
char name[64];
int creation_time;
int flags;
};
EntityHot* hot_data; // Dense, cache-friendly iteration
EntityCold* cold_data; // Only touched when needed
This improves cache utilization without a full SoA conversion.
Closing Thoughts
Hopefully this helps clear up some of the fairly complex stuff behind memory! There's a lot we didn't touch on here (mmap specifics, NUMA, I/O strategies such as fread vs mmap vs read vs async APIs like io_uring...), but I believe this covers the essentials and should help people investigate more if they so wish.