Performance Analysis and Improvements - part 0
This is the first article in a short series about
performance optimization and the workflow around it.
We will work through a relatively simple problem, taken from the first
day of Advent of Code 2025. Note that this is done under Linux on
x86_64 using Rust, but the techniques and general mental
model/strategy apply to pretty much any software stack. The
problem and language do not really matter: I went with
something simple so that it's easy to follow along.
Setup & Approach
Environment
Dev/benchmark env:
| Component | Details |
|---|---|
| OS | Linux (kernel 6.14.0-37-generic) |
| CPU | 12th Gen Intel Core i9-12900K |
| Cores/Threads | 16 cores, 24 threads |
| Max Frequency | 5.2 GHz |
| Architecture | x86_64 |
The Approach
The optimization cycle we follow:
- Benchmark - Establish a "truthful" baseline (how fast is our current implementation).
- Profile - Find where and how time is spent.
- Analyze - Understand what's happening (which instructions are taking time, which code paths are slow and why).
- Optimize - Make a targeted change based on the analysis.
- Measure - Verify the improvement (via benchmarks/profiling), then repeat.
This article (part 0 of our series) covers the first three steps: establishing a baseline and deep-diving into where time goes.
Tools
We'll use these throughout the series:
- Criterion - Statistical benchmarking framework for Rust. Reliable measurements with confidence intervals. Run via `cargo bench`.
- cargo-asm - View the assembly generated for given code/functions. Run via `cargo asm --lib --release "function_name"`.
- Intel VTune - CPU profiler for hotspot/microarchitecture analysis (and more). Deep dive into what our program actually does and how it behaves in relation to the hardware.
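For reference, a Criterion bench target looks roughly like this (a minimal sketch; the crate path, input file name and loading are placeholder assumptions for whatever your project uses):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
// Hypothetical crate path; `naive` is the implementation shown later
// in this article.
use day1::naive;

fn bench_naive(c: &mut Criterion) {
    // Hypothetical input location; load your real puzzle input here.
    let input = std::fs::read("input.txt").expect("failed to read input");
    c.bench_function("naive", |b| {
        b.iter(|| naive(std::hint::black_box(&input)))
    });
}

criterion_group!(benches, bench_naive);
criterion_main!(benches);
```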
For VTune profiling, we create a simple harness that runs the function 100,000 times to give the sampling profiler enough time to sample:
```rust
fn main() {
    for _ in 0..100_000 {
        std::hint::black_box(whatever_we_are_profiling());
    }
}
```
- `std::hint::black_box` - Prevents the compiler from optimizing away our function call.
- Release build - We profile optimized code (but keep debug symbols so our profiler of choice has more information): `cargo build --release --example our_sample_harness`.
The Example Problem
Don't spend too much time analyzing the problem, it's really just here as an excuse to go through the motions.
We have a circular dial with positions 0-99 (100 positions total). The dial starts pointing at position 50, and we process a sequence of rotation commands. Each command is a line (ending with `\n`) like `L68` or `R155`:
- L means rotate left (subtract)
- R means rotate right (add)
- The number is how many positions to rotate
Goal: Count how many times the dial crosses or lands on position 0.
The counting rules:
- Each full rotation (100 units) crosses zero once
- A partial rotation that wraps around zero counts as a crossing
- Landing exactly on zero also counts
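For example, starting from 50, `L68` has no full rotations and a remainder of 68: the dial moves from 50 back through 0 to land on 82, which is one crossing. Following up with `R155` adds one full rotation (one crossing) plus a remainder of 55 that carries 82 past 0 to 37 (a second crossing), bringing the running total to 3.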
Our input is ~4,500 lines of these commands:
```text
R20
R10
L11
R45
R13
R32
R46
L20
L1
L26
...
```
Base Implementation
Here's a fairly straightforward Rust approach:
```rust
pub fn naive(input: &[u8]) -> i32 {
    // Convert bytes to string, split into lines, collect into Vec
    let input_str = unsafe { std::str::from_utf8_unchecked(input) };
    let lines: Vec<&str> = input_str.split('\n').collect();

    let mut cursor = 50;       // Current dial position
    let mut clock_crosses = 0; // Total zero-crossings

    for l in lines {
        if l.is_empty() {
            continue;
        }

        // Parse "L68" into direction="L", amount=68
        let (dir, amount_str) = l.split_at(1);
        let amount = amount_str.parse::<i32>().expect("Failed to parse integer");

        // Full rotations always cross zero; remainder might cross once more
        let quotient = amount / 100; // Number of full rotations
        let rem = amount % 100;      // Leftover after full rotations

        if dir == "L" {
            // --- LEFT ROTATION ---
            let raw = cursor - rem;
            // Wrap negative values: if raw < 0, add 100 (branchless)
            let new_cursor = raw + (100 & (raw >> 31));
            // Did we cross zero? (cursor was positive and we passed through zero)
            let crosses_extra = (cursor != 0 && cursor < rem) as i32;
            // Count: full rotations + partial crossing + landing on zero
            clock_crosses += quotient + crosses_extra + (cursor != 0 && new_cursor == 0) as i32;
            cursor = new_cursor;
        } else {
            // --- RIGHT ROTATION ---
            let raw = cursor + rem;
            let overflow = (raw >= 100) as i32;
            // Wrap values >= 100: subtract 100 if overflow
            let new_cursor = raw - 100 * overflow;
            // Did we cross zero? (wrapped around, but didn't start or end at zero)
            let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;
            clock_crosses += quotient + crosses_extra + (cursor != 0 && new_cursor == 0) as i32;
            cursor = new_cursor;
        }
    }

    clock_crosses
}
```
Things to note:
- `split('\n').collect()` gathers all lines into a `Vec`
- `split_at(1)` separates the direction from the number
- `parse::<i32>()` converts the string to an integer
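As a quick sanity check of both the rules and the code, the worked example from earlier goes through `naive` as expected (a throwaway snippet, assuming `naive` is in scope):

```rust
fn main() {
    // L68 from 50 -> 82 (1 crossing); R155 from 82 -> 37 (2 more crossings).
    assert_eq!(naive(b"L68\nR155\n"), 3);
    println!("ok");
}
```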
`cargo bench` (Criterion) shows 43.6 us. This is our baseline wall time.
Let's find out where time actually goes.
Analysis
Where Time Goes
Note that when profiling, we typically check which lines/blocks of code account for which % of CPU time, and map those back to the relevant assembly. When we quote profiling time in us, ms or seconds, it's important to understand that these numbers are only there to give a ballpark of how expensive a piece of code is relative to the rest: we run the function or snippet thousands of times so that the sampling profiler has enough information to form an accurate report. When we want to know how long our code actually takes to run, we benchmark.

A benchmark's objective is statistically accurate timing on our current system; profiling and analysis give deep insight into what is actually happening. They work as a pair: profile/analyze to understand behavior, benchmark to get a solid number for our program's speed. Our benchmark (run via the handy `cargo bench` command; internally we use Criterion as a library to benchmark and produce stats/reports across runs) showed our first implementation taking 43.6 us on our system. We are now interested in profiling/analyzing to understand why and how.
Let's profile with Intel VTune, starting with a high-level per-function hotspot analysis to see where time is actually spent.
| Function | CPU Time | % of CPU |
|---|---|---|
| Split<P>::next (iterator) | 0.264s | 52.4% |
| naive (our code) | 0.178s | 35.3% |
| __memcmp_avx2_movbe (libc) | 0.051s | 10.2% |
| alloc::raw_vec::finish_grow | 0.002s | 0.4% |
Only 35% of CPU time is spent in "our" code (~50% is spent in iterator code from the `split_at` and `split` calls used to find newlines).

The majority goes to:
- 52.4% - The `Split` iterator walking through the string
- 10.2% - `memcmp`, called internally by the iterator for slice operations
Let's focus on the biggest offender first: the iterator.
Inside the Iterator
VTune's assembly-level profiling shows where cycles are spent inside Split::next:
| Assembly | What it does | CPU Time |
|---|---|---|
| `callq 0x3ed0a(%rip)` | Call into `memchr` to search for `\n` in the remaining slice | 19.076ms |
| `jmp 0x15c79` | Jump for the iterator state machine | 15.060ms |
| `pushq %r13` | Save register | 14.056ms |
| `cmp %r10, %r9` | Compare position to end (bounds checking) | 12.048ms |
| `jnz 0x15d20` | Conditional branch between "found delimiter" and "keep searching" | 9.036ms |
| `mov $0x8080..., %r13` | SIMD mask - checks 8 bytes at once for the delimiter | 7.028ms |
The `mov $0x8080...` constant is interesting: the iterator
uses a SIMD-style (SWAR) technique to scan 8 bytes at a time to find
newlines. This might be great for general-purpose parsing, but
our input lines are only 2-4 characters (e.g. `L68\n`).
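For context, this is the classic SWAR "find a byte in a word" trick (a minimal from-first-principles sketch; the standard library's actual memchr implementation differs in the details):

```rust
// SWAR scan: test 8 bytes at once for a target byte (here b'\n').
// XOR zeroes out every byte equal to the target; the classic
// (x - 0x01..01) & !x & 0x80..80 expression then sets the high bit
// of exactly the zero bytes.
fn word_has_newline(word: u64) -> bool {
    const LO: u64 = 0x0101_0101_0101_0101; // 0x01 in every byte
    const HI: u64 = 0x8080_8080_8080_8080; // 0x80 in every byte
    let x = word ^ (LO * b'\n' as u64); // bytes matching '\n' become 0
    (x.wrapping_sub(LO) & !x & HI) != 0
}

fn main() {
    assert!(word_has_newline(u64::from_le_bytes(*b"L68\nR155")));
    assert!(!word_has_newline(u64::from_le_bytes(*b"R155 L20")));
}
```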
We call this ~4,500 times (once per line in the input). For our simple input line format, we might be able to do better.
Integer Parsing Overhead
What about `amount_str.parse::<i32>()`? Using cargo-asm to look at the generated code for our function, we can see what the standard library's parsing does:
```asm
cmp  r8d, 43        ; is first char '+' ?
je   .handle_plus
cmp  r8d, 45        ; is first char '-' ?
je   .handle_minus
; ... digit loop ...
imul r8d, r8d, 10
seto bl             ; overflow check on every digit!
```
The parser handles cases we don't have:
- Optional `+` or `-` signs - we only have positive numbers
- Overflow detection (`seto`) - our numbers are small
- Error handling paths - our input is well-formed

Our input is always `[0-9]+`: positive integers, no signs, no overflow possible. Yet these checks run on every digit of every number we parse.
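As a preview of the kind of specialization Part 1 will apply, a digit loop that exploits those guarantees can be this small (a sketch that assumes well-formed `[0-9]+` input; no sign, error, or overflow handling):

```rust
// Parse an ASCII decimal integer, assuming well-formed input that is
// small enough not to overflow an i32. One multiply-add per digit,
// no sign handling, no `seto`-style overflow checks.
fn parse_unchecked(bytes: &[u8]) -> i32 {
    let mut acc = 0;
    for &b in bytes {
        acc = acc * 10 + (b - b'0') as i32;
    }
    acc
}

fn main() {
    assert_eq!(parse_unchecked(b"155"), 155);
}
```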
Microarchitecture Analysis
VTune's microarchitecture exploration allows us to compare some key metrics of the iterator versus the part of our code that is just the algorithm:
| Metric | `Split::next` | Our Algorithm |
|---|---|---|
| CPU Time | 313ms | 141ms |
| Instructions | 5.45 billion | 3.92 billion |
| CPI (Cycles/Instr) | 0.249 | 0.207 |
| IPC (Instr/Cycle) | ~4.0 | ~4.8 |
| Front-End Bound | 40.8% | 15.6% |
| Branch Mispredict | 8.0% | 11.1% |
IPC/CPI is a key metric: IPC (Instructions Per Cycle) is the
average number of instructions retired per cycle during
sampling, and CPI (Cycles Per Instruction) is simply its
inverse. The theoretical best CPI for my chip (at least
according to Intel) is 0.25, i.e. 4 instructions per cycle
(in practice things like SIMD can push the measured value a
bit below that, which is what we see here). A CPI this low
typically means there are no major bottlenecks: instructions
are being decoded, scheduled and executed at a good pace, and
we are most likely not stalled by things like cache misses or
lots of interdependencies between instructions that hit the
same registers.

Note that this
does not mean that we are doing efficient work: it just means
that the work we are doing is being done efficiently. In our
case we can definitely reduce the amount of work that is done.
5.45 billion instructions to split 4,500 lines comes out to ~1.2 million instructions per line (keep in mind the harness loops 100,000 times, so this is aggregated over every iteration). The iterator is general-purpose, safe and UTF-8-aware; for our specific input we can probably do better, and the first thing to focus on is reducing the number of instructions spent parsing.
What About the Rest?
Now let's look at the remaining 35%: our dial-simulation algorithm. Source-level breakdown from VTune:
| Code | CPU Time | % Instr |
|---|---|---|
| `let rem = amount % 100` | 21.084ms | 3.8% |
| `for l in lines {` | 9.036ms | 0.3% |
| `clock_crosses += ...` (left branch) | 7.028ms | 1.1% |
| `clock_crosses += ...` (right branch) | 7.028ms | 0.9% |
| `crosses_extra = (cursor != 0 && cursor < rem)` | 6.024ms | 1.9% |
| `new_cursor = raw - 100 * overflow` | 6.024ms | 0.5% |
| `if l.is_empty() { continue }` | 5.020ms | 2.2% |
| `if dir == "L" {` | 3.012ms | 0.4% |
The hottest line is the modulo operation (~4%). The compiler generates this for our math:
```asm
; Division by 100 via "magic multiplication"
imul $0x51eb851f, %r8, %rsi  ; multiply by magic constant
sar  $0x25, %rsi             ; arithmetic shift right by 37

; Branchless cursor wrapping
sar  $0x1f, %r8d             ; sign extend (all 1s if negative)
and  $0x64, %r8d             ; mask with 100 (0x64)
add  %r8d, %edi              ; add 100 if was negative
```
The `imul $0x51eb851f` is a classic: instead of using a
slow `div` instruction, the compiler multiplies by a
"magic constant" and shifts. This is standard practice for division by constants.
Our algorithm is fine (it could definitely be optimized, but it's not the main issue right now). The bottleneck is clearly the iterator and parsing overhead.
What We've Learned
Our naive implementation spends more time in `split_at`/`split` iterator code than in the actual computation:
| Component | Time | What it does |
|---|---|---|
| `Split::next` | 52.4% | Walks through the string finding newlines |
| Algo | 35.3% | The dial simulation |
| `memcmp` | 10.2% | String comparison for `dir == "L"` |
In Part 1, we'll write a new implementation that:
- Streams through the input in one pass (no `Vec` allocation)
- Parses the input bytes ourselves (the biggest issue at the moment)
- Parses integers without sign/overflow handling, since we know our input is well-formed (a rough sketch of this direction is below)
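To make that plan concrete, here is a rough sketch of the direction (my own sketch, under the assumption of well-formed `[LR]<digits>\n` lines with no blank lines; Part 1 will build and measure the actual implementation):

```rust
// Rough shape of the Part 1 direction: a single pass over the raw bytes,
// no Vec, no &str, hand-rolled digit parsing.
pub fn one_pass(input: &[u8]) -> i32 {
    let mut i = 0;
    let mut cursor = 50;
    let mut clock_crosses = 0;
    while i < input.len() {
        let left = input[i] == b'L'; // direction byte
        i += 1;
        // Unchecked digit loop: no sign, error or overflow handling.
        let mut amount: i32 = 0;
        while i < input.len() && input[i] != b'\n' {
            amount = amount * 10 + (input[i] - b'0') as i32;
            i += 1;
        }
        i += 1; // skip the '\n'
        // Same dial/crossing math as `naive`.
        let quotient = amount / 100;
        let rem = amount % 100;
        if left {
            let raw = cursor - rem;
            let new_cursor = raw + (100 & (raw >> 31)); // wrap negatives
            clock_crosses += quotient
                + (cursor != 0 && cursor < rem) as i32     // wrapped past 0
                + (cursor != 0 && new_cursor == 0) as i32; // landed on 0
            cursor = new_cursor;
        } else {
            let raw = cursor + rem;
            let overflow = (raw >= 100) as i32;
            let new_cursor = raw - 100 * overflow; // wrap values >= 100
            clock_crosses += quotient
                + (cursor != 0 && new_cursor != 0 && overflow != 0) as i32
                + (cursor != 0 && new_cursor == 0) as i32;
            cursor = new_cursor;
        }
    }
    clock_crosses
}
```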