Performance Analysis and Improvements [part 2]
In Part 1, we achieved a 5.1x speedup by replacing string-based parsing with a simple inlined byte parser tailored to our input. Now we continue the optimization cycle.
Recap: Where We Left Off
Our improved implementation processes ~4,500 dial rotation commands in 8.55 us, a 5.1x improvement over the naive implementation!
| Implementation | Time | Speedup |
|---|---|---|
| Naive (string-based) | 43.6 us | 1x (baseline) |
| Simple parser | 8.55 us | 5.1x |
Let's profile the newer implementation to see what's left to optimize.
Profiling the Simple Parser Implementation
VTune Hotspot Analysis
Let's see where clock cycles are spent:
| Code | CPU Time | % Instructions |
|---|---|---|
| `let rem = num % 100` | 376.5ms | 14.5% |
| `while idx < len {` (inner loop) | 215.9ms | 7.5% |
| `num = num * 10 + (b - 0x30) as i32` | 199.8ms | 8.0% |
| `while idx < len {` (outer loop) | 180.7ms | 3.6% |
| `let is_left = ... == b'L'` | 176.7ms | 1.9% |
| `if b < 0x30 { break; }` | 175.7ms | 7.5% |
| `let b = *input.get_unchecked(idx)` | 165.7ms | 3.8% |
| `crosses_extra = (cursor != 0 && new_cursor != 0 && ...)` | 112.5ms | 4.5% |
| `crosses_extra = (cursor != 0 && cursor < rem)` | 98.4ms | 6.3% |
Hmm, the modulo is expensive, and the loop bounds checks and crossing-detection logic (the core compute bits of our algorithm) also consume significant time. Let's dig deeper.
Assembly Hotspots
Hotspots grouped by code area:
Inner Parsing Loop (hottest)
```rust
while idx < len {
    let b = unsafe { *input.get_unchecked(idx) };
    if b < 0x30 { break; }
    num = num * 10 + (b - 0x30) as i32;
    idx += 1;
}
```
Key assembly (~15% combined):
```asm
jnz .LBB7   ; loop back — ~8% (hottest instruction)
jb  ...     ; newline check — ~7%
```
Per digit processed: bounds check, load byte, compare, accumulate, branch. The jnz dominates because it executes every digit iteration. This is from our "new and improved" parser, maybe we can do even better!
Modulo Computation
```rust
let rem = num % 100;
```
Key assembly (~14.5% combined):
```asm
imul $0x51eb851f    ; magic constant for division — ~7%
add  %r11d, %r9d    ; part of modulo sequence — ~8%
```
Division by a constant is compiled to a multiply-and-shift trick using a precomputed reciprocal instead of an actual `div`.
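To see why `imul $0x51eb851f` shows up here, we can reproduce the trick by hand. The sketch below (mine, not from the post's code) uses the constant from the profile with a shift of 37, which is one valid pair for 32-bit unsigned division by 100; the exact pair the compiler emits depends on the target and on signedness.

```rust
/// Division by 100 via multiply-and-shift, mirroring the 0x51eb851f
/// constant VTune showed. Shift 37 pairs with this constant for
/// 32-bit unsigned inputs (an illustrative choice, not the compiler's
/// exact codegen for the signed `num % 100` in the post).
fn div100(x: u32) -> u32 {
    ((x as u64 * 0x51EB851F) >> 37) as u32
}

fn rem100(x: u32) -> u32 {
    x - 100 * div100(x) // the remainder falls out of the quotient
}

fn main() {
    // Spot-check against real division.
    for x in [0u32, 1, 99, 100, 101, 999, 12345, u32::MAX] {
        assert_eq!(div100(x), x / 100);
        assert_eq!(rem100(x), x % 100);
    }
    println!("multiply-and-shift matches x / 100 and x % 100");
}
```

A multiply plus a shift is far cheaper than a hardware divide, which is why the compiler always makes this substitution for constant divisors.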
Crossing Detection
```rust
// Left
let crosses_extra = (cursor != 0 && cursor < rem) as i32;
// Right
let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;
```
Key assembly (~11% combined):
```asm
setnz ...   ; left crossing conditions — ~6%
setnz ...   ; right crossing conditions — ~5%
```
Each `&&` becomes a `test` + `setcc` + `and` sequence; multiple conditions generate multiple instruction chains. Maybe we can merge these conditions?
Microarchitecture Analysis
VTune's microarchitecture analysis shows baseline efficiency:
| Metric | Value | Meaning |
|---|---|---|
| CPI | 0.170 | ~6 instructions per cycle |
| Retiring | 94.4% | Almost no wasted work |
| Bad Speculation | 0.0% | Branches perfectly predicted |
| Front-End Bound | 3.0% | Instruction fetch is fine |
| Back-End Bound | 6.6% | Execution units rarely stall |
Key insight: The low CPI (0.170, ~6 instructions/cycle) combined with minimal front-end and back-end stalls indicates excellent instruction-level parallelism (ILP). The CPU's out-of-order execution engine can decode, schedule, and execute multiple independent instructions simultaneously. With 94.4% retiring and near-zero stalls, the instruction stream has few data dependencies blocking parallel execution. This is all great news: our algo is very CPU friendly. The bad news is that this implies it might be harder for us to find easy things to improve (but we'll manage).
Since we're already near the CPU's throughput ceiling, further optimization requires restructuring the algorithm.
Key Observations
- Loop overhead per input entry: each entry pays a bounds check, increment, and branch back
- Crossing detection has multiple conditions: `cursor != 0 && cursor < rem` requires two comparisons and an AND
- Each entry processes independently: no data dependencies between entries that would prevent parallelism
Analysis: What Could Be Improved?
Based on the profile data, we can group hotspots by what optimization would address them:
Target 1: Inner Parsing Loop (~25% combined)
```rust
while idx < len {
    let b = unsafe { *input.get_unchecked(idx) };
    if b < 0x30 { break; }
    num = num * 10 + (b - 0x30) as i32;
    idx += 1;
}
```
The inner loop runs once per digit. With ~4,500 entries averaging 2 digits each, that's ~9,000 iterations. Each iteration executes:
- Bounds check (`cmp`, `jae`)
- Load byte (`movzxb`)
- Newline check (`cmp`, `jb`)
- Accumulate (`lea` for `num * 10 + digit`)
- Increment + branch back (`inc`, `jnz`)
Solution: replace the loop with lookahead parsing: check byte positions directly instead of looping, since we know each input line is capped at a few bytes. Combined with inlining, this removes the inner loop's overhead entirely.
Target 2: Crossing Detection (~11% combined)
```rust
// Left crossing (6.3%)
let crosses_extra = (cursor != 0 && cursor < rem) as i32;
// Right crossing (4.5%)
let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;
```
Left crossing (5 operations):
```asm
test esi, esi   ; cursor != 0?
setne bl
cmp  esi, r11d  ; cursor < rem?
setl r11b
and  r11b, bl   ; combine
```
Right crossing (8 operations):
```asm
test esi, esi   ; cursor != 0?
setnz ...
test r8d, r8d   ; new_cursor != 0?
setnz ...
test eax, eax   ; overflow != 0?
setnz ...
and ...         ; combine first two
and ...         ; combine third
```
Solution: Collapse multiple conditions into single comparisons.
Target 3: Outer Loop Overhead (~4%)
```rust
while idx < len {
    // ... process one entry (one line of the input) ...
}
```
Each entry: bounds check + increment + branch back.
Solution: unroll the loop a bit (e.g. 4x, one bounds check per 4 entries). I don't expect big gains from this one; most of the improvement should come from the other changes, but this is about as close to a free lunch as we get here.
Target 4: Forced Inlining + Loop Unrolling
If we extract the per-entry logic into a function and force inlining, the compiler might:
- Better allocate registers across the unrolled sequence
- Eliminate redundant loads/stores between iterations
We include 4x loop unrolling because it's essentially free complexity: it amortizes loop overhead without making the code significantly harder to understand. The main gains come from lookahead parsing and algebraic simplification; unrolling is a small bonus.
The Optimized Implementation
The process_entry Function
```rust
#[inline(always)]
unsafe fn process_entry(
    input: &[u8],
    mut idx: usize,
    cursor: &mut i32,
    clock_crosses: &mut i32,
) -> usize {
    let is_left = unsafe { *input.get_unchecked(idx) } == b'L';
    idx += 1;

    // LOOKAHEAD PARSING
    // Instead of looping through digits, we check fixed positions.
    // Input format: "[LR][0-9]+\n" where numbers are 1-3 digits.
    // Safe because outer loop guarantees idx + 12 < len (room for 4 entries).
    let d0 = unsafe { (*input.get_unchecked(idx) - b'0') as i32 };
    // wrapping_sub: if next byte is '\n' (0x0A), result is 0x0A - 0x30 = 0xDA > 9
    let d1_raw = unsafe { (*input.get_unchecked(idx + 1)).wrapping_sub(b'0') };

    let (quotient, rem) = if d1_raw <= 9 {
        // 2+ digits: d1_raw is valid digit
        let d1 = d1_raw as i32;
        let b2 = unsafe { *input.get_unchecked(idx + 2) };
        if b2 < b'0' {
            // 2 digits: "NN\n" format
            idx += 3;
            (0, d0 * 10 + d1)
        } else {
            // 3 digits: "NNN\n" format
            idx += 4;
            let d2 = (b2 - b'0') as i32;
            (d0, d1 * 10 + d2) // hundreds digit becomes quotient
        }
    } else {
        // 1 digit: "N\n" format (d1_raw was '\n')
        idx += 2;
        (0, d0)
    };

    if is_left {
        let raw = *cursor - rem;
        let new_cursor = raw + (100 & (raw >> 31));
        // ALGEBRAIC SIMPLIFICATION
        // Original: cursor != 0 && cursor < rem
        // Trick: (x-1) as u32 wraps to u32::MAX when x==0
        // So (cursor-1) as u32 < (rem-1) as u32 is false when cursor==0
        // because u32::MAX is not < anything reasonable
        let crosses_extra = (((*cursor - 1) as u32) < ((rem - 1) as u32)) as i32;
        *clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
        *cursor = new_cursor;
    } else {
        let raw = *cursor + rem;
        let overflow = (raw >= 100) as i32;
        let new_cursor = raw - 100 * overflow;
        // ALGEBRAIC SIMPLIFICATION
        // Original: cursor != 0 && new_cursor != 0 && overflow != 0
        // All three true if cursor started high enough that adding rem
        // pushed past 100 but not to exactly 0.
        // Equivalent: cursor > 100 - rem
        let crosses_extra = (*cursor > 100 - rem) as i32;
        *clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
        *cursor = new_cursor;
    }
    idx
}
```
The Main Loop with 4x Unrolling
```rust
#[inline(never)]
fn parse_and_compute(input: &[u8]) -> i32 {
    let len = input.len();
    let mut idx = 0;
    let mut cursor: i32 = 50;
    let mut clock_crosses: i32 = 0;

    // Main loop: process 4 entries per iteration
    while idx + 12 < len {
        unsafe {
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
        }
    }

    // Handle remaining entries
    while idx < len {
        unsafe {
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
        }
    }

    clock_crosses
}
```
Key Changes Explained
1. Lookahead-Based Digit Parsing
Instead of a while loop for digits, we use lookahead:
- Check if the second byte is a digit (`d1_raw <= 9`)
- If yes, check whether the third byte is a digit
- This eliminates the inner parsing loop
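The core of the lookahead idea can be isolated into a small standalone sketch. `parse_at` is a hypothetical helper of mine (the post's real code is the `process_entry` above); it assumes the slice holds `[LR][0-9]{1,3}\n` entries with enough bytes remaining, and classifies the digit count purely by probing fixed offsets.

```rust
/// Simplified sketch of lookahead parsing: determine an entry's digit
/// count by probing fixed byte offsets instead of looping per digit.
/// Returns the parsed number and the index of the next entry.
fn parse_at(input: &[u8], idx: usize) -> (i32, usize) {
    let idx = idx + 1; // skip the 'L'/'R' byte
    let d0 = (input[idx] - b'0') as i32;
    // wrapping_sub turns '\n' (0x0A) into 0xDA, which fails the <= 9 test
    let d1 = input[idx + 1].wrapping_sub(b'0');
    if d1 <= 9 {
        let b2 = input[idx + 2];
        if b2 < b'0' {
            // two digits: "NN\n"
            (d0 * 10 + d1 as i32, idx + 3)
        } else {
            // three digits: "NNN\n"
            (d0 * 100 + d1 as i32 * 10 + (b2 - b'0') as i32, idx + 4)
        }
    } else {
        // one digit: "N\n"
        (d0, idx + 2)
    }
}

fn main() {
    let input = b"L7\nR42\nL100\n";
    let (n, next) = parse_at(input, 0);
    assert_eq!((n, next), (7, 3));
    let (n, next) = parse_at(input, next);
    assert_eq!((n, next), (42, 7));
    let (n, _) = parse_at(input, next);
    assert_eq!(n, 100);
    println!("lookahead parse ok");
}
```

No iteration count depends on the data, so there is no loop-back branch for the predictor to chew on; the three possible shapes become straight-line code.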
2. Simplified Left Crossing Detection
Original: `cursor != 0 && cursor < rem`
Optimized: `((*cursor - 1) as u32) < ((rem - 1) as u32)`
This works because:
- If `cursor == 0`, then `cursor - 1` wraps to `u32::MAX` (4,294,967,295)
- `u32::MAX < anything` is always false
- So the zero case is automatically excluded by unsigned wraparound
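We can convince ourselves the two forms agree with a quick exhaustive check (my own sketch, not from the post) over the dial's domain: `cursor` in `0..=99` and `rem` in `1..=99`. Note that for `rem == 0` the wrap on `rem - 1` would fire as well, so the trick relies on `rem` being at least 1 here.

```rust
// Exhaustively compare the original two-condition check against the
// unsigned-wraparound form over the values the dial can take.
fn main() {
    for cursor in 0i32..=99 {
        for rem in 1i32..=99 {
            let original = cursor != 0 && cursor < rem;
            let optimized = ((cursor - 1) as u32) < ((rem - 1) as u32);
            assert_eq!(original, optimized, "cursor={cursor} rem={rem}");
        }
    }
    println!("left-crossing forms agree on the full domain");
}
```

Exhaustive checks like this are cheap insurance whenever an optimization replaces a readable condition with a bit trick.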
3. Simplified Right Crossing Detection
Original: `cursor != 0 && new_cursor != 0 && overflow != 0`
Optimized: `*cursor > 100 - rem`
The three conditions combine to mean "cursor started nonzero and high enough that adding rem pushed it past 100 without landing exactly on 0". That collapses to a single comparison.
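The same exhaustive-check approach (again my own sketch) confirms the collapse for `cursor` and `rem` both in `0..=99`:

```rust
// Exhaustively compare the original three-condition check against the
// single comparison over the dial's domain.
fn main() {
    for cursor in 0i32..=99 {
        for rem in 0i32..=99 {
            let raw = cursor + rem;
            let overflow = (raw >= 100) as i32;
            let new_cursor = raw - 100 * overflow;
            let original = cursor != 0 && new_cursor != 0 && overflow != 0;
            let optimized = cursor > 100 - rem;
            assert_eq!(original, optimized, "cursor={cursor} rem={rem}");
        }
    }
    println!("right-crossing forms agree on the full domain");
}
```

The algebra: `overflow != 0` means `cursor + rem >= 100`, and `new_cursor != 0` excludes equality, so together they mean `cursor + rem > 100`, i.e. `cursor > 100 - rem`; since `rem <= 99`, any cursor satisfying that is automatically nonzero, making the first condition redundant.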
4. Unrolled main loop
Reduces loop overhead by 4x; it may also unlock some slight compiler optimizations across the unrolled sequence, as we'll see in the profile.
So will these changes be faster than our previous impl? And if so why? Let's bench and profile to see!
Results: Benchmarks
Criterion Results
| Implementation | Time | vs Naive | vs Simple parser |
|---|---|---|---|
| Naive | 43.6 us | 1x | — |
| Simple parser | 8.55 us | 5.1x | 1x |
| Improved | 4.93 us | 8.8x | 1.7x |
1.7x faster than simple parser, 8.8x faster than naive.
VTune Analysis After Optimization
High-Level Comparison
| Metric | Simple parser | Improved |
|---|---|---|
| CPI | 0.170 | 0.170 |
| Retiring | 94.4% | 100.0% |
| Back-End Bound | 6.6% | 0.5% |
| Hottest instruction share | dominates | evenly distributed |
Same CPI means individual instructions aren't faster. The speedup comes from:
- Eliminating bottlenecks: no single instruction dominates anymore; time is evenly distributed across instructions, as we will see.
- Better pipeline utilization: 100% retiring, near-zero backend stalls.
Assembly Hotspot Comparison
Simple parser — Dominant Bottleneck:
The inner loop `jnz` (branch back) dominates, with a few other hot instructions following.
| Assembly | What it does |
|---|---|
| `jnz .LBB7` | Inner loop branch back — hottest instruction |
| `add %r11d, %r9d` | Part of modulo computation |
| `cmp $0x490e, %rsi` | Outer loop bounds check |
Improved — Evenly Distributed:
No single instruction dominates. Time is spread across many `movzxb` loads.
| Assembly | What it does |
|---|---|
| `movzxb 0x1(%r8,%rcx,1), %esi` | Load digit byte |
| `movzxb 0x1(%r9,%rcx,1), %edi` | Load digit byte |
| `movzxb 0x2(%rsi,%rcx,1), %ebx` | Load digit byte |
The inner loop `jnz` from the digit parser we built in part 1 is gone entirely, replaced by sequential loads with lookahead.
Crossing Detection Assembly Comparison
Left Crossing Before:
```asm
test esi, esi   ; cursor != 0?
setne bl
cmp  esi, r11d  ; cursor < rem?
setl r11b
and  r11b, bl   ; combine
```
Left Crossing After:
```asm
dec %ebx    ; cursor - 1
dec %edi    ; rem - 1
cmp ...     ; unsigned compare
setb %dil   ; single result
```
Right Crossing Before:
```asm
test esi, esi   ; cursor != 0?
setnz ...
test r8d, r8d   ; new_cursor != 0?
setnz ...
test eax, eax   ; overflow != 0?
setnz ...
and ...         ; combine first two
and ...         ; combine third
```
Right Crossing After:
```asm
mov r10d, 100
sub r10d, edi   ; 100 - rem
cmp r8d, r10d   ; cursor > 100 - rem?
setnle %r10b    ; single result
```
Note the reduction in dependencies between instructions: the "before" assembly clearly shows chained `test`/`setnz`/`test`... patterns.
Why Same CPI But Faster?
- CPI measures cycles per instruction, i.e. overall throughput.
- Both versions have a CPI of 0.170 (~6 instructions/cycle), which is better than Intel's reported theoretical maximum for my CPU (but this is not surprising; we would need to dig into instruction latencies to understand why this makes perfect sense).
- The speedup comes from doing smarter work, not from executing instructions faster
- Lookahead parsing and some algebraic simplifications express the same logic with different instruction sequences
Understanding Why: Assembly Analysis
The adc Instruction Trick
The compiler now generates better code for accumulating crossings:
```asm
movzx edi, dil  ; zero-extend crosses_extra
cmp   esi, 1    ; is new_cursor == 0?
adc   edx, edi  ; crosses += crosses_extra + carry
```
`adc` (add with carry) adds both the operand AND the carry flag in one instruction. The `cmp esi, 1` sets carry if `esi == 0` (because `0 - 1` borrows). This computes `crosses += crosses_extra + (new_cursor == 0)` in just 2 instructions.
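The same carry semantics can be written in portable Rust with `overflowing_sub` (a sketch of mine, not the post's code): subtracting 1 from an unsigned value borrows exactly when the value is 0, which is the flag `adc` picks up.

```rust
/// Rust analogue of the cmp/adc pair: the borrow from `new_cursor - 1`
/// is true only when new_cursor == 0, like the carry after `cmp esi, 1`.
fn accumulate(crosses: i32, crosses_extra: i32, new_cursor: u32) -> i32 {
    let (_, borrow) = new_cursor.overflowing_sub(1);
    crosses + crosses_extra + borrow as i32
}

fn main() {
    assert_eq!(accumulate(5, 1, 0), 7); // extra crossing + landed exactly on 0
    assert_eq!(accumulate(5, 1, 3), 6); // extra crossing, didn't land on 0
    assert_eq!(accumulate(5, 0, 0), 6); // no extra crossing, landed on 0
    println!("carry-style accumulation matches the adc semantics");
}
```

You don't need to write it this way in source; the compiler derives the `adc` form from the plain `(new_cursor == 0) as i32` addition on its own.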
Simplified Right Crossing
```asm
mov r10d, 100
sub r10d, edi   ; r10 = 100 - rem
cmp r8d, r10d   ; cursor > 100 - rem?
setg dil        ; crosses_extra = (cursor > 100 - rem)
```
Just 4 instructions vs 6+ for the original three-condition check.
Unsigned Comparison Trick (Left)
```asm
dec r8d     ; cursor - 1
dec edi     ; rem - 1
cmp r8d, edi
setb dil    ; (cursor-1) < (rem-1) unsigned
```
The `setb` (set if below) uses unsigned comparison. When `cursor == 0`, decrementing wraps to `0xFFFFFFFF`, which is never "below" any reasonable `rem - 1` value.
Loop Overhead Amortization
The main loop check runs once per 4 entries:
```asm
.LBB16_1:
    ; ... process entry 1 ...
    ; ... process entry 2 ...
    ; ... process entry 3 ...
    ; ... process entry 4 ...
    cmp rsi, 18690  ; check bounds once per 4 instead of every time
    jb  .LBB16_1
```
Results
| Version | Time | Cumulative Speedup |
|---|---|---|
| Naive | 43.6 us | 1x |
| Simple parser | 8.55 us | 5.1x |
| Improved | 4.93 us | 8.8x |
Can we do even better? See you in part 3 to find out :)