Pker.xyz

Memory, Layouts and Allocators

systems, memory, linux


This is my humble and hopefully-concise-enough take on the popular and excellent book "What Every Programmer Should Know About Memory". Find it for free here.

I consider this to be "What Every Programmer Should Actually Know About Memory".

This post covers how memory works at the OS level, how data layout affects performance, and allocation strategies for different use cases. After reading this you should have a good mental model that enables you to make informed decisions when writing programs or when trying to understand how things generally work. I try to cover what I consider to be the most important stuff.


Virtual Memory Overview

The first thing to understand is the concept of virtual memory.

Address Spaces

Every pointer your program uses is a virtual address. The CPU and OS work together to translate these to physical addresses in actual RAM.

Each process has its own virtual address space, providing isolation. By default, process A's pointer 0x7fff12340000 maps to completely different physical memory than the same address in process B.
This lets every program behave as if it has the full addressable space to itself: process A typically does not need to care or worry about other processes.

On x86-64, virtual addresses are 48 bits (with 16 bits of sign extension to fill the 64-bit register), giving 2^48 bytes (256TB) of addressable space.

The address space is split: lower half (0x0 to 0x7fffffffffff) for userspace, upper half for the kernel. Addresses outside these "canonical" ranges cause faults (we will take a look at these later).

Page Tables

Memory is managed in fixed-size pages, typically 4KB on x86-64. The page table is a hierarchical data structure that maps virtual pages to physical frames. While the atomic unit of memory manipulation is a byte, when thinking about getting memory for our program we should treat the page as the atomic unit. This means that if we want to allocate even a single byte (a C-style char, or a boolean), it still has to live inside a page of PAGE_SIZE bytes (typically 4096 bytes, aka 4KB, as discussed above).

diagram
Virtual Address (48 bits on x86-64):
+----------+---------+---------+---------+---------------+
|  PML4    |  PDPT   |   PD    |   PT    |  Page Offset  |
|  9 bits  |  9 bits |  9 bits |  9 bits |   12 bits     |
+----+-----+----+----+----+----+----+----+---------------+
     |          |         |         |
     v          v         v         v
   +---+      +---+     +---+     +---+
   |   |----->|   |---->|   |---->|   |----> Physical Page
   +---+      +---+     +---+     +---+
   PML4       PDPT       PD        PT
   Table      Table     Table     Table

Each level is a 512-entry table (9 bits = 2^9 = 512 entries). The 12-bit offset means pages are 4KB (2^12 = 4096 bytes). This 4-level walk is how a virtual address gets mapped to its actual physical address.
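
To make the split concrete, here's a tiny illustration (not kernel code, just bit twiddling) that extracts the five fields from a virtual address using the 9/9/9/9/12 layout above:

c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t vaddr = 0x00007fff12345678ULL;      // some canonical userspace address

    uint64_t offset = vaddr & 0xfffULL;          // bits 0-11:  offset within the 4KB page
    uint64_t pt     = (vaddr >> 12) & 0x1ffULL;  // bits 12-20: page table index
    uint64_t pd     = (vaddr >> 21) & 0x1ffULL;  // bits 21-29: page directory index
    uint64_t pdpt   = (vaddr >> 30) & 0x1ffULL;  // bits 30-38: PDPT index
    uint64_t pml4   = (vaddr >> 39) & 0x1ffULL;  // bits 39-47: PML4 index

    printf("PML4=%" PRIu64 " PDPT=%" PRIu64 " PD=%" PRIu64 " PT=%" PRIu64 " offset=%" PRIu64 "\n",
           pml4, pdpt, pd, pt, offset);
    return 0;
}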

Page table entries contain more than just the physical address. Key flags include:

Present: is the page currently backed by a physical frame? If not, any access faults.
Writable: can the page be written, or is it read-only?
User/Supervisor: can userspace touch it, or only the kernel?
Accessed/Dirty: set by the hardware when the page is read or written; the kernel uses these for eviction decisions.
No-Execute (NX): can instructions be fetched from this page?

Getting Memory and Page Faults

Programs request memory from the OS via two main mechanisms: brk(), which extends the heap, and mmap(), which maps regions of virtual address space. Most programs don't interact with these directly: they use malloc(), which wraps these primitives. We'll cover allocators in detail later.
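
As a rough sketch of the lower-level path (most code never does this directly), here's what asking the kernel for an anonymous mapping with mmap() looks like; allocators like glibc's malloc use mmap() like this under the hood for large allocations:

c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1 << 20;  // ask for 1MB of anonymous, private memory

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(p, 0, len);     // actually touch the memory (more on why this matters below)
    munmap(p, len);
    return 0;
}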

When you request memory, the OS gives you virtual addresses but doesn't immediately allocate physical RAM. This is called demand paging or lazy allocation. The kernel simply updates its bookkeeping to say "this virtual range is valid" but doesn't populate the page tables with physical mappings yet. This is relatively fast; we pay the cost of the allocation when we actually interact with that memory (which is why it's considered lazy). We sometimes refer to this as "touching" a page.

So what happens when you actually access that memory? The CPU attempts the virtual-to-physical translation, finds no mapping in the page table (since it's a new allocation), and triggers a page fault. This traps to the kernel, which then:

1. Allocates a physical frame.
2. Zeroes the page (so you never see another process's old data).
3. Updates the page table entry and returns to userspace, where the faulting access is retried.

This means page faults are normal and even necessary: they're part of how memory allocation works. There are two kinds of page faults:

Soft faults (minor): no disk I/O is required. The kernel only has to fix up the page tables, e.g. to map a freshly zeroed page on first touch or to point at data already sitting in memory (page cache, shared pages). Relatively cheap.

Hard faults (major): the data has to be read from disk first, e.g. a swapped-out page or a file-backed page that isn't in the page cache yet. Orders of magnitude slower than a soft fault.

High-performance programs minimize faults by pre-allocating and reusing memory. You can also "pre-fault" pages by touching them before they're needed, moving the fault cost out of your hot path.

As you can now understand, page faults occur when touching a byte that is not yet "paged in". So if you have a vector/dynamic array of booleans (one byte each) allocated starting at a fresh, page-aligned address, you should be able to measure a page fault when you first touch my_arr[0] and then again at my_arr[4096] (you crossed the boundary of a 4KB page). It therefore makes sense to allocate in page-sized multiples: you pay one fault per page regardless, so you might as well use the whole page.

bash
# Measure page faults for a command
perf stat -e minor-faults,major-faults ./my_program
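
Here's a small sketch you could run under the perf command above: it allocates a bunch of pages and "pre-faults" them by touching one byte per page, so each first touch shows up as a minor fault (note that malloc may hand back memory that was already faulted in; an anonymous mmap gives a cleaner measurement):

c
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);            // typically 4096
    size_t npages = 1024;

    char *buf = malloc(npages * (size_t)page);
    if (!buf) return 1;

    // Pre-fault: touch one byte in every page so the faults happen here,
    // up front, instead of later in a hot path.
    for (size_t i = 0; i < npages; i++)
        buf[i * (size_t)page] = 1;

    free(buf);
    return 0;
}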

The TLB (Translation Lookaside Buffer)

The TLB is a hardware cache for virtual-to-physical translations. Without it, every memory access would require several additional memory accesses just to walk the page tables (the 4-level walk we discussed earlier).

It's a small, fast cache (typically on the order of a few thousand entries at most) that stores recent virtual-to-physical translations. When you access an address, the CPU checks the TLB first. Hit: the translation is essentially free. Miss: do the page table walk, then cache the result.

This is also why memory locality matters: accessing memory in tight patterns keeps TLB entries hot. Jumping around a large address space thrashes the TLB: each new translation evicts an entry you may still need.

TLB miss vs Page fault: A TLB miss just means the translation isn't cached: the CPU does a page table walk in hardware to find it. A page fault occurs when the page table walk finds that the page isn't mapped at all (present bit = 0), which traps to the kernel.

diagram
                        Memory Access (virtual address)
                                      │
                                      ▼
                                ┌──────────┐
                                │   TLB    │
                                └────┬─────┘
                                     │
                 ┌───────────────────┴───────────────────┐
                 │                                       │
                Hit                                    Miss
                 │                                       │
                 │                                       ▼
                 │                            ┌─────────────────────┐
                 │                            │   Page table walk   │
                 │                            │     (hardware)      │
                 │                            └──────────┬──────────┘
                 │                                       │
                 │                   ┌───────────────────┴───────────────────┐
                 │                   │                                       │
                 │               Present=1                               Present=0
                 │           (page is mapped)                        (page not mapped)
                 │                   │                                       │
                 │                   ▼                                       ▼
                 │             Cache in TLB                             PAGE FAULT
                 │                   │                               (trap to kernel)
                 │                   │                                       │
                 │                   │                       ┌───────────────┴───────────────┐
                 │                   │                       │                               │
                 │                   │                Valid access                    Invalid access
                 │                   │              (demand paging)                   (bad pointer)
                 │                   │                       │                               │
                 │                   │                       ▼                               ▼
                 │                   │             ┌───────────────────┐                 SIGSEGV
                 │                   │             │ 1. Allocate frame │                 (crash)
                 │                   │             │ 2. Zero the page  │
                 │                   │             │ 3. Update page    │
                 │                   │             │    table entry    │
                 │                   │             └─────────┬─────────┘
                 │                   │                       │
                 │                   │                       ▼
                 │                   │               Return to userspace
                 │                   │                       │
                 │                   │                       ▼
                 │                   │                 Retry access
                 │                   │                       │
                 ▼                   ▼                       ▼
            ┌─────────────────────────────────────────────────────┐
            │                    Access memory                    │
            └─────────────────────────────────────────────────────┘

Huge Pages

x86-64 supports larger page sizes: 2MB and 1GB. Huge pages reduce TLB pressure dramatically.

This is extremely powerful for programs that require large chunks of long-lived memory (hash tables, DB internal structures, ML weights...). It helps especially with non-sequential access: we are "touching" many parts of our region, which would cause lots of TLB misses (and initial page faults) if it were backed by thousands of 4KB pages instead of a few huge ones.

Trade-offs:

Memory waste: a 2MB page backing only a few KB of live data wastes the rest (internal fragmentation).
Availability: the kernel needs contiguous physical memory to build a huge page, which gets harder as memory fragments over time.
Latency: faulting in and zeroing a 2MB page takes longer than a 4KB one, and transparent huge pages (THP) can cause stalls while the kernel compacts memory in the background.
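
As one way to opt in on Linux, you can hint that a region should be backed by (transparent) huge pages with madvise(MADV_HUGEPAGE). This is only a sketch: the call is a hint, and it only helps if THP is enabled on the system:

c
#include <stddef.h>
#include <sys/mman.h>

// Map a large anonymous region and hint that it should use huge pages.
void *alloc_huge_hint(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, len, MADV_HUGEPAGE);  // a hint: the kernel may or may not honor it
    return p;
}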

Copy-on-Write

Copy-on-write (COW) is an optimization where pages are shared until one process tries to write. The classic example is fork(): the child initially shares all of the parent's pages, marked read-only; only when either process writes to a page does the kernel copy it and give each side its own version.

COW is also used for mmap(MAP_PRIVATE) file mappings and some allocator optimizations. Understanding COW explains why fork() is fast even with large address spaces, and why writes to "shared" data may suddenly consume memory. I won't go into more detail here, but it's an important concept for programmers who want more depth on memory and the performance of data-intensive applications.


Allocation

Now that we understand how virtual memory works, let's look at how programs actually get memory. An allocator manages memory on behalf of your program: it requests large chunks from the OS and subdivides them for individual allocations.

The first thing to understand is when we actually want to allocate at all. This might seem obvious, but if you've only dealt with languages that allocate for you under the hood, you might not know: local variables of fixed, known size live on the stack for free, while we reach for heap allocation when the size is only known at runtime, when the data must outlive the current function, or when it's simply too large for the stack.

Now we can think about allocation in more detail.

Most programmers don't need custom allocators. Understanding how your language's standard containers work can get you decent performance with zero custom code in most cases. We'll start there, then cover specialized strategies, and finally dive into how glibc's malloc (the default allocator on GNU/Linux distros) actually works for those who want the details.

Dynamic Arrays in Practice

A dynamic array (vector, list, slice: different names in different languages) is an array that can grow. Fixed size arrays require knowing the size upfront; dynamic arrays let you keep adding elements.

The basic idea is simple: allocate some initial capacity, track how many elements you've actually used (size), and when you run out of room, allocate a bigger buffer and copy everything over.

Key concepts:

Size: how many elements are actually in use.
Capacity: how many elements the current buffer can hold before it has to grow.
Growth factor: when full, the capacity is multiplied (commonly by 2), which makes appends amortized O(1).
Reallocation: growing may move the buffer to a new address, so pointers and iterators into the old buffer become invalid.

If you were to implement one yourself in C: malloc an initial buffer, track size and capacity, and when size hits capacity, call realloc with double the capacity (realloc may extend in place if lucky, or allocate new memory and copy if not). That's essentially what every language's standard dynamic array does under the hood.
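
A minimal sketch of that idea (the names IntVec/intvec_push are mine, and error handling is kept short):

c
#include <stdlib.h>

// Growable int array: size tracks used slots, capacity tracks allocated slots.
typedef struct {
    int    *data;
    size_t  size;
    size_t  capacity;
} IntVec;

int intvec_push(IntVec *v, int value) {
    if (v->size == v->capacity) {
        size_t new_cap = v->capacity ? v->capacity * 2 : 8;   // double (or start at 8)
        int *p = realloc(v->data, new_cap * sizeof *p);       // may move the buffer
        if (!p) return -1;
        v->data = p;
        v->capacity = new_cap;
    }
    v->data[v->size++] = value;
    return 0;
}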

Getting Good Behavior

Pre-allocate when size is known or estimable:

rust
// Without pre-allocation:
let mut v = Vec::new();       // capacity = 0
for item in items {
    v.push(item);             // reallocates at capacity 1, 2, 4, 8, 16...
}                             // log(n) allocations total

// With pre-allocation:
let mut v = Vec::with_capacity(items.len());  // one allocation upfront
for item in items {
    v.push(item);             // no reallocation until we exceed capacity
}                             // single allocation (assuming items.len() doesn't grow)

Page-aligned capacities: Since the OS allocates memory in pages (typically 4KB), choosing capacities that are multiples of page size avoids wasting memory. If you allocate 4097 bytes, you're using 2 pages (8KB) anyway. For large allocations, round up to page boundaries.
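
For example, rounding a requested size up to a whole number of pages could look like this (it relies on the page size being a power of two, which it is on x86-64):

c
#include <stddef.h>
#include <unistd.h>

// Round a requested allocation size up to a multiple of the page size.
size_t round_up_to_pages(size_t n) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);  // typically 4096
    return (n + page - 1) & ~(page - 1);          // works because page is a power of two
}
// round_up_to_pages(4097) -> 8192 on a 4KB-page system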

Reuse allocations: instead of dropping a buffer and allocating a fresh one every iteration, clear it and refill it; most dynamic arrays keep their capacity when cleared.

Batch allocations: one larger allocation carved into pieces beats many small ones, both in allocator overhead and in cache locality.

Avoid allocation in hot loops: allocate once up front and reuse that buffer inside the loop (see the sketch below).
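
A small sketch of what reuse looks like in practice; process_message() is just a hypothetical stand-in for whatever work fills the buffer:

c
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for real per-message work.
static void process_message(char *buf, size_t cap) {
    memset(buf, 0, cap);
}

void handle_messages(size_t n_messages) {
    size_t cap = 64 * 1024;
    char *buf = malloc(cap);   // one allocation, reused for every message
    if (!buf) return;

    for (size_t i = 0; i < n_messages; i++) {
        // No malloc/free inside the hot loop: the same buffer is reused.
        process_message(buf, cap);
    }

    free(buf);
}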

Allocation Strategies

General-Purpose Allocators

Default choices: glibc malloc, jemalloc, tcmalloc, mimalloc. Use these for most cases.

Tip: You can sometimes swap allocators via LD_PRELOAD or linker flags without code changes to compare available allocators.

bash
# Try jemalloc without recompiling
LD_PRELOAD=/usr/lib/libjemalloc.so ./my_program

Arena/Bump Allocators

The idea: grab a big block of memory upfront, then allocate from it by simply bumping a pointer forward. No bookkeeping per allocation. When you're done with everything, reset the pointer back to the start.

This is extremely fast: allocation is just "increment pointer, return old value". The tradeoff is you can't free individual objects. You free everything at once, or nothing.

This sounds limiting, but many workloads have a batch-lifetime property where a group of allocations all die together: everything allocated while rendering one frame of a game, while serving one request in a server, or while compiling one function or pass in a compiler.

If your allocation pattern fits this model, arenas are hard to beat: fast allocation, zero fragmentation, and excellent cache locality since allocations are contiguous.
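
A bump allocator really is just a few lines. Here's a sketch (alignment is simplified to rounding up to 8 bytes, and there's no growth or fallback):

c
#include <stddef.h>

typedef struct {
    char   *base;   // the big block grabbed upfront
    size_t  used;   // current bump offset
    size_t  cap;    // total size of the block
} Arena;

void *arena_alloc(Arena *a, size_t n) {
    size_t aligned = (a->used + 7) & ~(size_t)7;  // keep allocations 8-byte aligned
    if (aligned + n > a->cap) return NULL;        // out of space
    void *p = a->base + aligned;
    a->used = aligned + n;                        // allocation = bump the offset
    return p;
}

void arena_reset(Arena *a) {
    a->used = 0;   // "free" everything at once; individual frees don't exist
}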

Pool Allocators

The idea: pre-allocate a bunch of fixed-size blocks and maintain a list of which ones are free. Allocation pops a block from the free list; deallocation pushes it back. Both operations are O(1).

The constraint is that all objects must be the same size. But when that fits your use case, pools are very efficient: no fragmentation (every block is the same size), fast allocation and deallocation, and unlike arenas, you can free individual objects.

Pools work well when you have many objects of the same type with unpredictable lifetimes: network connections, game entities, nodes in a tree or graph, packets in a queue.
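
One possible sketch of a pool: free blocks store the pointer to the next free block inside themselves (an intrusive free list), so no extra bookkeeping memory is needed:

c
#include <stddef.h>
#include <stdlib.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;

typedef struct {
    char     *memory;     // one big slab holding all the blocks
    FreeNode *free_list;  // head of the list of free blocks
} Pool;

int pool_init(Pool *p, size_t block_size, size_t count) {
    if (block_size < sizeof(FreeNode)) block_size = sizeof(FreeNode);
    p->memory = malloc(block_size * count);
    if (!p->memory) return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {          // thread every block onto the free list
        FreeNode *n = (FreeNode *)(p->memory + i * block_size);
        n->next = p->free_list;
        p->free_list = n;
    }
    return 0;
}

void *pool_alloc(Pool *p) {                       // O(1): pop the head of the free list
    FreeNode *n = p->free_list;
    if (n) p->free_list = n->next;
    return n;
}

void pool_free(Pool *p, void *block) {            // O(1): push the block back onto the list
    FreeNode *n = block;
    n->next = p->free_list;
    p->free_list = n;
}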

Allocators and Pages

As we saw earlier, the OS gives out memory in pages (typically 4KB). Allocators like malloc are general-purpose allocators: they request pages from the OS and carve them up into smaller pieces for your individual allocations. When you malloc(32), you're not getting a page; you're getting 32 bytes from a page the allocator already has.

This is why allocation strategy matters: a general-purpose allocator must handle any size, any lifetime, any pattern. Arena and pool allocators trade generality for speed by constraining what patterns they support. Pick the right tool for your access pattern. You might benefit from a specific allocator alongside huge pages.


Memory Alignment and Cache Lines

Now that we know how to get memory, let's understand how the CPU actually accesses it. This knowledge informs how we should organize our data.

Cache Line Basics

Modern x86 CPUs use 64-byte cache lines. Memory is fetched in cache-line units, not individual bytes. Accessing one byte loads the entire 64-byte line into cache.

This has implications: accessing arr[0] also loads arr[1] through arr[15] (assuming 4-byte elements, i.e. 32-bit integers). Sequential access is essentially free after the first fetch. This is also partly why arrays (contiguous memory) are so powerful.

Alignment Implications

Aligned access: the value sits entirely within one cache line; the CPU fetches it with a single load.

Misaligned/straddling access: the value crosses a cache-line (or word) boundary, so the CPU may need two loads and two cache lines for a single value; atomic operations on straddling data are especially expensive or not supported at all.

Struct Layout and Natural Alignment

CPUs access memory most efficiently when data is naturally aligned: placed at an address divisible by its size. A 4-byte int wants to be at an address divisible by 4; an 8-byte pointer wants an address divisible by 8. A 1-byte char can be anywhere (every address is divisible by 1).

CPUs read memory in fixed-size chunks. If a 4-byte int starts at address 5 instead of address 4 or 8, it spans two of these chunks: the CPU needs two loads instead of one.

Modern compilers mostly handle this for you: they insert padding between struct fields to keep each field properly aligned.

c
struct Example {
    char a;      // 1 byte
                 // 3 bytes padding
    int b;       // 4 bytes (aligned to 4)
    char c;      // 1 byte
                 // 7 bytes padding
    long d;      // 8 bytes (aligned to 8)
};  // Total: 24 bytes

// Reordered to minimize padding:
struct ExamplePacked {
    long d;      // 8 bytes
    int b;       // 4 bytes
    char a;      // 1 byte
    char c;      // 1 byte
                 // 2 bytes padding
};  // Total: 16 bytes

Cache-friendly struct design: order fields from largest to smallest to minimize padding, keep the hot fields you access together near the start (ideally within a single cache line), and split rarely used fields out into a separate struct (more on that below).

False Sharing

False sharing is a multi-threaded concern which occurs when two threads write to different variables that happen to share a cache line. Each write invalidates the cache line on other cores, causing constant cache-line bouncing.

c
// BAD: counters likely share a cache line
struct Counters {
    int counter_thread_0;  // Thread 0 writes
    int counter_thread_1;  // Thread 1 writes
};

// GOOD: pad each counter onto its own cache line
// (alignas comes from <stdalign.h> in C11; it's a keyword in C++ and C23)
struct CountersPadded {
    alignas(64) int counter_thread_0;
    alignas(64) int counter_thread_1;
};

Data Layouts: SoA vs AoS

With an understanding of cache lines, we can now make informed decisions about how to organize our data in memory.

Array of Structures (AoS)

The traditional OOP approach: objects stored contiguously, with each object's fields adjacent in memory.

c
struct Particle { float x, y, z, vx, vy, vz; };
Particle particles[10000];

// Memory layout: [x,y,z,vx,vy,vz][x,y,z,vx,vy,vz]...

Structure of Arrays (SoA)

Each field stored in its own contiguous array.

c
struct Particles {
    float* x;   // [x x x x x ...]
    float* y;   // [y y y y y ...]
    float* z;   // [z z z z z ...]
    float* vx;  // [vx vx vx ...]
    float* vy;  // [vy vy vy ...]
    float* vz;  // [vz vz vz ...]
    size_t count;
};

Cache and Memory Analysis

Consider updating all x positions by their velocities:

c
// AoS: cache lines contain x,y,z,vx,vy,vz - we only need x and vx
for (int i = 0; i < n; i++)
    particles[i].x += particles[i].vx * dt;

// SoA: cache lines contain only x values, then only vx values
for (int i = 0; i < n; i++)
    p->x[i] += p->vx[i] * dt;

With SoA, we load 16 floats per cache line, all useful. With AoS, each cache line gives us ~2-3 particles worth of x and vx, wasting bandwidth on y, z, vy, vz. SoA also enables SIMD: the compiler can vectorize the SoA loop trivially.

When to Use AoS

Use AoS when you usually touch most or all fields of an object together, when access is per-object and fairly random, or when simplicity matters more than throughput: it's the natural layout in most languages.

When to Use SoA

Use SoA when you iterate over many objects but only need a few of their fields, when you want the compiler to auto-vectorize (SIMD), or when you process fields in bulk (physics updates, column scans, ECS-style game code).

Hybrid: AoSoA

Array of Structures of Arrays: chunk data into blocks (e.g., 64 elements), with SoA layout within each block.

c
// AoSoA with 64-element blocks
struct ParticleBlock {
    float x[64];
    float y[64];
    float z[64];
    float vx[64];
    float vy[64];
    float vz[64];
};
ParticleBlock blocks[num_blocks];

This balances SIMD-friendliness (contiguous same-type data within blocks) with reasonable object access (block index + local index).

Hot/Cold Splitting

Keep frequently accessed fields together by moving rarely accessed fields to a separate allocation.

c
// Hot data accessed every frame
struct EntityHot {
    float x, y, z;
    float vx, vy, vz;
};

// Cold data accessed occasionally
struct EntityCold {
    char name[64];
    int creation_time;
    int flags;
};

EntityHot* hot_data;    // Dense, cache-friendly iteration
EntityCold* cold_data;  // Only touched when needed

This improves cache utilization without a full SoA conversion.


Closing Thoughts

Hopefully this helps clear up some of the fairly complex stuff behind memory! There's a lot we didn't touch on here (mmap specifics, NUMA, I/O: fread vs mmap vs read vs async with io_uring and friends...), but I believe this covers the essentials and gives you a base to investigate further if you so wish.

Uploaded 31-01-2026