Pker.xyz

Performance Analysis and Improvements - part 0

rust, asm, perf

This is the first article in a short series about performance improvements/optimization and related workflow.

We will go through a relatively simple problem, taken from the first day of Advent of Code 2025. Note that this is done under Linux on x86_64 using Rust, but the techniques and general mental model/strategy apply to pretty much any software stack. The problem and language do not really matter: I went with something simple so that it's easy to follow along.


Setup & Approach

Environment

Dev/benchmark env:

Component Details
OS Linux (kernel 6.14.0-37-generic)
CPU 12th Gen Intel Core i9-12900K
Cores/Threads 16 cores, 24 threads
Max Frequency 5.2 GHz
Architecture x86_64

The Approach

The optimization cycle we follow:

  1. Benchmark - Establish a "truthful" baseline (how fast is our current implementation?).
  2. Profile - Find where and how time is spent.
  3. Analyze - Understand what's happening (which instructions are taking time, which code paths are slow and why).
  4. Optimize - Make a targeted change based on the analysis.
  5. Measure - Verify the improvement (via benchmarks/profiling), then repeat.

This article (part 0 of our series) covers the first three steps: establishing a baseline and deep-diving into where time goes.

Tools

We'll use these throughout the series: Intel VTune for profiling, cargo bench (backed by the criterion crate) for benchmarking, and cargo asm for inspecting generated assembly.

Some good alternatives for profiling/investigation/analysis: perf (often alongside flamegraph), eBPF-based tooling, DTrace, cargo-flamegraph, llvm-mca, Binary Ninja. For easy/quick benchmarking I also like hyperfine. If you are reading this and are interested in code analysis and optimization, I recommend you check out these tools. Debuggers are also often very useful (I use lldb/rust-lldb via DAP in Neovim at the moment). Claude Code and other LLMs can help make the profile-analyze-optimize loop faster; use with care.

For VTune profiling, we create a simple harness that runs the function 100,000 times to give the sampling profiler enough time to sample:

rust
fn main() {
    for _ in 0..100_000 {
        std::hint::black_box(whatever_we_are_profiling());
    }
}

The Example Problem

Don't spend too much time analyzing the problem; it's really just here as an excuse to go through the motions.

We have a circular dial with positions 0-99 (100 positions total). The dial starts pointing at position 50, and we process a sequence of rotation commands.

Each command is a line (terminated by \n) like L68 or R155: a direction (L for a left rotation, R for a right rotation) followed by a rotation amount.

Goal: Count how many times the dial crosses or lands on position 0.

The counting rules:

  1. Each full rotation (100 units) crosses zero once
  2. A partial rotation that wraps around zero counts as a crossing
  3. Landing exactly on zero also counts
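
To make the counting rules concrete, here is a worked single-command helper (step is a hypothetical name, not part of the actual solution; it just mirrors the three rules above):

```rust
/// Apply one rotation command to the dial; return (new_cursor, crossings).
/// Assumes `dir` is 'L' or 'R' and `amount >= 0`.
fn step(cursor: i32, dir: char, amount: i32) -> (i32, i32) {
    let quotient = amount / 100; // rule 1: each full rotation crosses zero once
    let rem = amount % 100;
    let (new_cursor, crosses_extra) = if dir == 'L' {
        let raw = cursor - rem;
        let new_cursor = if raw < 0 { raw + 100 } else { raw };
        // rule 2: partial rotation wrapped past zero
        (new_cursor, (cursor != 0 && cursor < rem) as i32)
    } else {
        let raw = cursor + rem;
        let new_cursor = if raw >= 100 { raw - 100 } else { raw };
        (new_cursor, (cursor != 0 && new_cursor != 0 && raw >= 100) as i32)
    };
    // rule 3: landing exactly on zero also counts
    let landed = (cursor != 0 && new_cursor == 0) as i32;
    (new_cursor, quotient + crosses_extra + landed)
}
```

For example, starting at 50, R155 does one full rotation (one crossing) plus a partial rotation from 50 through 0 to 5 (a second crossing).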

Our input is ~4,500 lines of these commands:

input
R20
R10
L11
R45
R13
R32
R46
L20
L1
L26
...

Base Implementation

Here's a fairly straightforward Rust approach:

rust
pub fn naive(input: &[u8]) -> i32 {
    // Convert bytes to string, split into lines, collect into Vec
    let input_str = unsafe { std::str::from_utf8_unchecked(input) };
    let lines: Vec<&str> = input_str.split('\n').collect();

    let mut cursor = 50;           // Current dial position
    let mut clock_crosses = 0;     // Total zero-crossings

    for l in lines {
        if l.is_empty() { continue; }

        // Parse "L68" into direction="L", amount=68
        let (dir, amount_str) = l.split_at(1);
        let amount = amount_str.parse::<i32>().expect("Failed to parse integer");

        // Full rotations always cross zero; remainder might cross once more
        let quotient = amount / 100;  // Number of full rotations
        let rem = amount % 100;       // Leftover after full rotations

        if dir == "L" {
            // --- LEFT ROTATION ---
            let raw = cursor - rem;
            // Wrap negative values: if raw < 0, add 100 (branchless)
            let new_cursor = raw + (100 & (raw >> 31));
            // Did we cross zero? (cursor was positive and we passed through zero)
            let crosses_extra = (cursor != 0 && cursor < rem) as i32;
            // Count: full rotations + partial crossing + landing on zero
            clock_crosses += quotient + crosses_extra + (cursor != 0 && new_cursor == 0) as i32;
            cursor = new_cursor;
        } else {
            // --- RIGHT ROTATION ---
            let raw = cursor + rem;
            let overflow = (raw >= 100) as i32;
            // Wrap values >= 100: subtract 100 if overflow
            let new_cursor = raw - 100 * overflow;
            // Did we cross zero? (wrapped around, but didn't start or end at zero)
            let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;
            clock_crosses += quotient + crosses_extra + (cursor != 0 && new_cursor == 0) as i32;
            cursor = new_cursor;
        }
    }
    clock_crosses
}

cargo bench (criterion) shows 43.6 us. This is our baseline wall time.
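
criterion does the statistical heavy lifting behind cargo bench; when you just want a rough number without a bench harness, a stdlib sketch like this works (time_avg is a hypothetical helper: no warmup, no outlier handling, nowhere near criterion's rigor):

```rust
use std::time::{Duration, Instant};

/// Quick-and-dirty average-time helper: run `f` `iters` times
/// and return the mean wall time per iteration.
fn time_avg(mut f: impl FnMut(), iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f(); // wrap the result in std::hint::black_box inside `f`
    }
    start.elapsed() / iters
}
```

Usage would look like `time_avg(|| { std::hint::black_box(naive(input)); }, 100_000)`.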

Let's find out where time actually goes.


Analysis

Where Time Goes

Note that when profiling we will typically check which lines/blocks of code take which % of CPU time, and map these to the relevant assembly. When we talk about time in us, ms or seconds while profiling, it's important to understand that these numbers only give a ballpark idea of cost relative to other code: we run the function or code snippet thousands of times so that the sampling profiler has enough information to form an accurate report. When we want to know how long our code actually takes to run, we benchmark. Profiling times are there to tell us how many cycles were spent relative to the rest of the computation.

A benchmark's objective is statistically accurate timing on our current system; profiling and analysis give deep insight into what's happening. They work as a pair: profile/analyze to understand behavior, benchmark to get a reliable measure of our program's speed. Our benchmark, done via the handy cargo bench command (internally we use criterion as a library to run benchmarks and produce stats/reports), showed that our first implementation takes 43.6 us on this system. We are now interested in profiling/analyzing to understand why and how.


Let's profile with Intel VTune, starting with a high level per-function hotspot analysis, to see where time is actually spent.

Function CPU Time % of CPU
Split<P>::next (iterator) 0.264s 52.4%
naive (our code) 0.178s 35.3%
__memcmp_avx2_movbe (libc) 0.051s 10.2%
alloc::raw_vec::finish_grow 0.002s 0.4%

Only 35% of CPU time is spent in "our" code; roughly half goes to the Split iterator produced by the split('\n') call, scanning the input for newlines.

The majority goes to:

  1. Split::next (52.4%) - scanning the input for newline delimiters
  2. __memcmp_avx2_movbe (10.2%) - the string comparison behind the dir == "L" check

Let's focus on the biggest offender first: the iterator.

Inside the Iterator

VTune's assembly-level profiling shows where cycles are spent inside Split::next:

Assembly What it does CPU Time
callq 0x3ed0a(%rip) Call into memchr to search for \n in the remaining slice 19.076ms
jmp 0x15c79 Jump for the iterator state machine 15.060ms
pushq %r13 Save register 14.056ms
cmp %r10, %r9 Compare position to end (bounds checking) 12.048ms
jnz 0x15d20 Conditional branch: "found delimiter" and "keep searching" 9.036ms
mov $0x8080..., %r13 SIMD mask - checks 8 bytes at once for the delimiter 7.028ms

The mov $0x8080... constant is interesting: the iterator uses a SIMD-style technique to scan 8 bytes at a time for the delimiter. This might be great for general-purpose parsing, but our input lines are only 2-4 characters (e.g. L68\n).

We call this ~4,500 times (once per line in the input). For our simple input line format, we might be able to do better.
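
For lines this short, a plain byte loop is a plausible alternative to the general-purpose iterator. A minimal sketch (for_each_line is a hypothetical helper; assumes \n-delimited ASCII input, skips empty lines):

```rust
/// Walk the buffer once, calling `f` with each non-empty line
/// as a raw byte slice. No UTF-8 validation, no iterator machinery.
fn for_each_line(input: &[u8], mut f: impl FnMut(&[u8])) {
    let mut start = 0;
    for (i, &b) in input.iter().enumerate() {
        if b == b'\n' {
            if i > start {
                f(&input[start..i]);
            }
            start = i + 1;
        }
    }
    if start < input.len() {
        f(&input[start..]); // trailing line without '\n'
    }
}
```

Whether this actually beats Split::next is exactly what we'll need to benchmark, not assume.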

Integer Parsing Overhead

What about amount_str.parse::<i32>()? Using cargo asm to look at the generated code for our function, we can see what the standard library's parsing does:

x86 asm
cmp r8d, 43          ; is first char '+' ?
je .handle_plus
cmp r8d, 45          ; is first char '-' ?
je .handle_minus
; ... digit loop ...
imul r8d, r8d, 10
seto bl              ; overflow check on every digit!

The parser handles cases we don't have:

  1. A leading '+' or '-' sign
  2. An overflow check (seto) on every digit

Our input is always [0-9]+ - positive integers, no signs, no overflow possible. These checks run on every digit of every number we parse.
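
A digit-only parser for this input format could look like the sketch below (parse_amount is a hypothetical helper; it assumes valid [0-9]+ input and small amounts, so no sign handling and no overflow checks):

```rust
/// Parse an unsigned decimal integer from ASCII digits.
/// No sign handling, no overflow checks: the input is trusted.
fn parse_amount(s: &[u8]) -> i32 {
    let mut n = 0;
    for &b in s {
        debug_assert!(b.is_ascii_digit());
        n = n * 10 + (b - b'0') as i32;
    }
    n
}
```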

Microarchitecture Analysis

VTune's microarchitecture exploration allows us to compare some key metrics of the iterator versus the part of our code that is just the algorithm:

Split::next Our Algorithm
CPU Time 313ms 141ms
Instructions 5.45 billion 3.92 billion
CPI (Cycles/Instr) 0.249 0.207
IPC (Instr/Cycle) ~4.0 ~4.8
Front-End Bound 40.8% 15.6%
Branch Mispredict 8.0% 11.1%

IPC/CPI is a key metric: IPC (Instructions Per Cycle) is the number of instructions that retire on average per cycle during sampling, and CPI (Cycles Per Instruction) is just its reciprocal. The theoretical best CPI for my chip (at least according to Intel) is 0.25, i.e. 4 instructions per cycle; the measured value can dip slightly below this in practice (e.g. via macro-op fusion), which is what we see here. A near-optimal CPI typically means there is no major bottleneck: instructions are being decoded, scheduled and executed at a good pace, and we are most likely not stalled by things like cache misses or long dependency chains between instructions that hit the same registers.
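
As a quick consistency check, the IPC and CPI rows in the table really are reciprocals of each other:

```rust
/// IPC and CPI are reciprocals: instructions/cycle vs cycles/instruction.
fn ipc_from_cpi(cpi: f64) -> f64 {
    1.0 / cpi
}
```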

Note that this does not mean we are doing efficient work: it just means the work we are doing is being executed efficiently. In our case we can definitely reduce the amount of work that is done.

Split::next is 40.8% Front-End Bound. The front end decodes instructions to feed the CPU's back end, which executes them via its execution ports. For the iterator code the front end can't keep up: we stall waiting for instructions to be decoded, not because we execute them too slowly (the near-optimal IPC/CPI shows execution is fine), which suggests the problem is simply how many instructions we're generating.

5.45 billion instructions over the profiling run just to split lines — more than our entire algorithm (3.92 billion). The iterator is general-purpose, safe and UTF-8-aware; we can probably do better for our specific input by focusing on reducing the number of instructions spent parsing it.

What About the Rest?

Now let's look at the remaining 35%: our dial simulation algorithm. Source-level breakdown from VTune:

Code CPU Time % Instr
let rem = amount % 100 21.084ms 3.8%
for l in lines { 9.036ms 0.3%
clock_crosses += ... (left branch) 7.028ms 1.1%
clock_crosses += ... (right branch) 7.028ms 0.9%
crosses_extra = (cursor != 0 && cursor < rem) 6.024ms 1.9%
new_cursor = raw - 100 * overflow 6.024ms 0.5%
if l.is_empty() { continue } 5.020ms 2.2%
if dir == "L" { 3.012ms 0.4%

The hottest line is the modulo operation (~4%). The compiler generates this for our math:

x86 asm
; Division by 100 via "magic multiplication"
imul $0x51eb851f, %r8, %rsi    ; multiply by magic constant
sar $0x25, %rsi                 ; arithmetic shift right by 37

; Branchless cursor wrapping
sar $0x1f, %r8d                ; sign extend (all 1s if negative)
and $0x64, %r8d                ; mask with 100
add %r8d, %edi                 ; add 100 if was negative

The imul $0x51eb851f is a classic: instead of using a slow div instruction, the compiler multiplies by a "magic constant" and shifts. This is quite standard for division by a constant.

Our algorithm is fine (could definitely be optimized but not the main issue now). The bottleneck is clearly the iterator and parsing overhead.
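
We can reproduce the magic-multiplication trick in Rust to convince ourselves it matches n / 100 (a sketch for non-negative n only; for negative inputs the compiler emits an extra sign fixup, which is what the sar/and/add sequence above handles):

```rust
/// Division by 100 via "magic multiplication":
/// multiply by ceil(2^37 / 100) = 0x51EB851F in 64-bit, shift right 37.
/// Exact for all non-negative i32 values.
fn div100_magic(n: i32) -> i32 {
    debug_assert!(n >= 0);
    (((n as i64) * 0x51EB851F) >> 37) as i32
}
```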


What We've Learned

Our naive implementation spends more time in split/split_at iterator code than on the actual computation:

Component Time What it does
Split::next 52.4% Walk through string finding newlines
Algo 35.3% The dial simulation
memcmp 10.2% String comparison for dir == "L"

In Part 1, we'll write a new implementation that attacks exactly this overhead: scanning the raw bytes for newlines ourselves and parsing the digits directly, then we'll benchmark and profile again.

Uploaded 11-01-2026