Performance Analysis and Improvements [part 2]
In Part 1, we achieved a 5.1x speedup by replacing string-based parsing with a simple inlined byte parser tailored to our input. Now we continue the optimization cycle.
Recap: Where We Left Off
Our improved implementation processes ~4,500 dial rotation commands in 8.55 us, a 5.1x improvement over the naive implementation!
| Implementation | Time | Speedup |
|---|---|---|
| Naive (string-based) | 43.6 us | 1x (baseline) |
| Simple parser | 8.55 us | 5.1x |
Let's profile the newer implementation to see what's left to optimize.
Profiling the Simple Parser Implementation
VTune Hotspot Analysis
Let's see where clock cycles are spent:
| Code | CPU Time | % Instructions |
|---|---|---|
| `let rem = num % 100` | 376.5ms | 14.5% |
| `while idx < len {` (inner loop) | 215.9ms | 7.5% |
| `num = num * 10 + (b - 0x30) as i32` | 199.8ms | 8.0% |
| `while idx < len {` (outer loop) | 180.7ms | 3.6% |
| `let is_left = ... == b'L'` | 176.7ms | 1.9% |
| `if b < 0x30 { break; }` | 175.7ms | 7.5% |
| `let b = *input.get_unchecked(idx)` | 165.7ms | 3.8% |
| `crosses_extra = (cursor != 0 && new_cursor != 0 && ...)` | 112.5ms | 4.5% |
| `crosses_extra = (cursor != 0 && cursor < rem)` | 98.4ms | 6.3% |
Hmm, the modulo is expensive, and the loop bounds checks and crossing-detection logic (the core compute bits of our algorithm) also consume significant time. Let's dig deeper.
Assembly Hotspots
Hotspots grouped by code area:
Inner Parsing Loop (hottest)
```rust
while idx < len {
    let b = unsafe { *input.get_unchecked(idx) };
    if b < 0x30 { break; }
    num = num * 10 + (b - 0x30) as i32;
    idx += 1;
}
```
Key assembly (~15% combined):
```asm
jnz .LBB7   ; loop back — ~8% (hottest instruction)
jb  ...     ; newline check — ~7%
```
Per digit processed: bounds check, load byte, compare, accumulate, branch. The jnz dominates because it executes every digit iteration. This is from our "new and improved" parser, maybe we can do even better!
Modulo Computation
```rust
let rem = num % 100;
```
Key assembly (~14.5% combined):
```asm
imul $0x51eb851f    ; magic constant for division — ~7%
add  %r11d, %r9d    ; part of modulo sequence — ~8%
```
Division by a constant is compiled to a multiply-and-shift trick using a precomputed reciprocal instead of an actual `div`.
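To see why `imul $0x51eb851f` shows up here, we can reproduce the trick by hand. The sketch below (mine, not from the post's code) uses the constant from the profile with a shift of 37, which is one valid pair for 32-bit unsigned division by 100; the exact pair the compiler emits depends on the target and on signedness.

```rust
/// Division by 100 via multiply-and-shift, mirroring the 0x51eb851f
/// constant VTune showed. Shift 37 pairs with this constant for
/// 32-bit unsigned inputs (an illustrative choice, not the compiler's
/// exact codegen for the signed `num % 100` in the post).
fn div100(x: u32) -> u32 {
    ((x as u64 * 0x51EB851F) >> 37) as u32
}

fn rem100(x: u32) -> u32 {
    x - 100 * div100(x) // the remainder falls out of the quotient
}

fn main() {
    // Spot-check against real division.
    for x in [0u32, 1, 99, 100, 101, 999, 12345, u32::MAX] {
        assert_eq!(div100(x), x / 100);
        assert_eq!(rem100(x), x % 100);
    }
    println!("multiply-and-shift matches x / 100 and x % 100");
}
```

A multiply plus a shift is far cheaper than a hardware divide, which is why the compiler always makes this substitution for constant divisors.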
Crossing Detection
```rust
// Left
let crosses_extra = (cursor != 0 && cursor < rem) as i32;
// Right
let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;
```
Key assembly (~11% combined):
```asm
setnz ...   ; left crossing conditions — ~6%
setnz ...   ; right crossing conditions — ~5%
```
Each `&&` becomes a `test` + `setcc` + `and` sequence; multiple conditions generate multiple instruction chains. Maybe we can merge these conditions?
Microarchitecture Analysis
VTune's microarchitecture analysis shows baseline efficiency:
| Metric | Value | Meaning |
|---|---|---|
| CPI | 0.170 | ~6 instructions per cycle |
| Retiring | 94.4% | Almost no wasted work |
| Bad Speculation | 0.0% | Branches perfectly predicted |
| Front-End Bound | 3.0% | Instruction fetch is fine |
| Back-End Bound | 6.6% | Execution units rarely stall |
Key insight: The low CPI (0.170, ~6 instructions/cycle) combined with minimal front-end and back-end stalls indicates excellent instruction-level parallelism (ILP). The CPU's out-of-order execution engine can decode, schedule, and execute multiple independent instructions simultaneously. With 94.4% retiring and near-zero stalls, the instruction stream has few data dependencies blocking parallel execution. This is all great news: our algo is very CPU friendly. The bad news is that this implies it might be harder for us to find easy things to improve (but we'll manage).
Since we're already near the CPU's throughput ceiling, further optimization requires restructuring the algorithm.
Key Observations
- Loop overhead per input entry: each entry pays a bounds check, increment, and branch back
- Crossing detection has multiple conditions: `cursor != 0 && cursor < rem` requires two comparisons and an AND
- Each entry processes independently: no data dependencies between entries that would prevent parallelism
Analysis: What Could Be Improved?
Based on the profile data, we can group hotspots by what optimization would address them:
Target 1: Inner Parsing Loop (~25% combined)
```rust
while idx < len {
    let b = unsafe { *input.get_unchecked(idx) };
    if b < 0x30 { break; }
    num = num * 10 + (b - 0x30) as i32;
    idx += 1;
}
```
The inner loop runs once per digit. With ~4,500 entries averaging 2 digits each, that's ~9,000 iterations. Each iteration executes:
- Bounds check (`cmp`, `jae`)
- Load byte (`movzxb`)
- Newline check (`cmp`, `jb`)
- Accumulate (`lea` for `num * 10 + digit`)
- Increment + branch back (`inc`, `jnz`)
Solution: replace the loop with lookahead parsing: check byte positions directly instead of looping, since we know each input line is capped at a few bytes. Combined with inlining, this removes the inner loop's overhead entirely.
Target 2: Crossing Detection (~11% combined)
```rust
// Left crossing (6.3%)
let crosses_extra = (cursor != 0 && cursor < rem) as i32;
// Right crossing (4.5%)
let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;
```
Left crossing (5 operations):
```asm
test esi, esi   ; cursor != 0?
setne bl
cmp  esi, r11d  ; cursor < rem?
setl r11b
and  r11b, bl   ; combine
```
Right crossing (8 operations):
```asm
test esi, esi   ; cursor != 0?
setnz ...
test r8d, r8d   ; new_cursor != 0?
setnz ...
test eax, eax   ; overflow != 0?
setnz ...
and ...         ; combine first two
and ...         ; combine third
```
Solution: Collapse multiple conditions into single comparisons.
Target 3: Outer Loop Overhead (~4%)
```rust
while idx < len {
    // ... process one entry (one line of the input) ...
}
```
Each entry: bounds check + increment + branch back.
Solution: unroll the loop a bit (e.g. 4x, one bounds check per 4 entries). I don't expect big gains from this one; most of the improvement should come from the other changes, but this is about as close to a free lunch as we get here.
Target 4: Forced Inlining + Loop Unrolling
If we extract the per-entry logic into a function and force inlining, the compiler might:
- Better allocate registers across the unrolled sequence
- Eliminate redundant loads/stores between iterations
We include 4x loop unrolling because it's essentially free complexity: it amortizes loop overhead without making the code significantly harder to understand. The main gains come from lookahead parsing and algebraic simplification; unrolling is a small bonus.
The Optimized Implementation
The process_entry Function
```rust
#[inline(always)]
unsafe fn process_entry(
    input: &[u8],
    mut idx: usize,
    cursor: &mut i32,
    clock_crosses: &mut i32,
) -> usize {
    let is_left = unsafe { *input.get_unchecked(idx) } == b'L';
    idx += 1;

    // LOOKAHEAD PARSING
    // Instead of looping through digits, we check fixed positions.
    // Input format: "[LR][0-9]+\n" where numbers are 1-3 digits.
    // Safe because outer loop guarantees idx + 12 < len (room for 4 entries).
    let d0 = unsafe { (*input.get_unchecked(idx) - b'0') as i32 };
    // wrapping_sub: if next byte is '\n' (0x0A), result is 0x0A - 0x30 = 0xDA > 9
    let d1_raw = unsafe { (*input.get_unchecked(idx + 1)).wrapping_sub(b'0') };

    let (quotient, rem) = if d1_raw <= 9 {
        // 2+ digits: d1_raw is valid digit
        let d1 = d1_raw as i32;
        let b2 = unsafe { *input.get_unchecked(idx + 2) };
        if b2 < b'0' {
            // 2 digits: "NN\n" format
            idx += 3;
            (0, d0 * 10 + d1)
        } else {
            // 3 digits: "NNN\n" format
            idx += 4;
            let d2 = (b2 - b'0') as i32;
            (d0, d1 * 10 + d2) // hundreds digit becomes quotient
        }
    } else {
        // 1 digit: "N\n" format (d1_raw was '\n')
        idx += 2;
        (0, d0)
    };

    if is_left {
        let raw = *cursor - rem;
        let new_cursor = raw + (100 & (raw >> 31));
        // ALGEBRAIC SIMPLIFICATION
        // Original: cursor != 0 && cursor < rem
        // Trick: (x-1) as u32 wraps to u32::MAX when x==0
        // So (cursor-1) as u32 < (rem-1) as u32 is false when cursor==0
        // because u32::MAX is not < anything reasonable
        let crosses_extra = (((*cursor - 1) as u32) < ((rem - 1) as u32)) as i32;
        *clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
        *cursor = new_cursor;
    } else {
        let raw = *cursor + rem;
        let overflow = (raw >= 100) as i32;
        let new_cursor = raw - 100 * overflow;
        // ALGEBRAIC SIMPLIFICATION
        // Original: cursor != 0 && new_cursor != 0 && overflow != 0
        // All three true if cursor started high enough that adding rem
        // pushed past 100 but not to exactly 0.
        // Equivalent: cursor > 100 - rem
        let crosses_extra = (*cursor > 100 - rem) as i32;
        *clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
        *cursor = new_cursor;
    }
    idx
}
```
The Main Loop with 4x Unrolling
```rust
#[inline(never)]
fn parse_and_compute(input: &[u8]) -> i32 {
    let len = input.len();
    let mut idx = 0;
    let mut cursor: i32 = 50;
    let mut clock_crosses: i32 = 0;

    // Main loop: process 4 entries per iteration
    while idx + 12 < len {
        unsafe {
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
        }
    }

    // Handle remaining entries
    while idx < len {
        unsafe {
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
        }
    }

    clock_crosses
}
```
Key Changes Explained
1. Lookahead-Based Digit Parsing
Instead of a while loop for digits, we use lookahead:
- Check if the second byte is a digit (`d1_raw <= 9`)
- If yes, check whether the third byte is a digit
- This eliminates the inner parsing loop
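The core of the lookahead idea can be isolated into a small standalone sketch. `parse_at` is a hypothetical helper of mine (the post's real code is the `process_entry` above); it assumes the slice holds `[LR][0-9]{1,3}\n` entries with enough bytes remaining, and classifies the digit count purely by probing fixed offsets.

```rust
/// Simplified sketch of lookahead parsing: determine an entry's digit
/// count by probing fixed byte offsets instead of looping per digit.
/// Returns the parsed number and the index of the next entry.
fn parse_at(input: &[u8], idx: usize) -> (i32, usize) {
    let idx = idx + 1; // skip the 'L'/'R' byte
    let d0 = (input[idx] - b'0') as i32;
    // wrapping_sub turns '\n' (0x0A) into 0xDA, which fails the <= 9 test
    let d1 = input[idx + 1].wrapping_sub(b'0');
    if d1 <= 9 {
        let b2 = input[idx + 2];
        if b2 < b'0' {
            // two digits: "NN\n"
            (d0 * 10 + d1 as i32, idx + 3)
        } else {
            // three digits: "NNN\n"
            (d0 * 100 + d1 as i32 * 10 + (b2 - b'0') as i32, idx + 4)
        }
    } else {
        // one digit: "N\n"
        (d0, idx + 2)
    }
}

fn main() {
    let input = b"L7\nR42\nL100\n";
    let (n, next) = parse_at(input, 0);
    assert_eq!((n, next), (7, 3));
    let (n, next) = parse_at(input, next);
    assert_eq!((n, next), (42, 7));
    let (n, _) = parse_at(input, next);
    assert_eq!(n, 100);
    println!("lookahead parse ok");
}
```

No iteration count depends on the data, so there is no loop-back branch for the predictor to chew on; the three possible shapes become straight-line code.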
2. Simplified Left Crossing Detection
Original: `cursor != 0 && cursor < rem`
Optimized: `((*cursor - 1) as u32) < ((rem - 1) as u32)`
This works because:
- If `cursor == 0`, then `cursor - 1` wraps to `u32::MAX` (4,294,967,295)
- `u32::MAX < anything` is always false
- So the zero case is automatically excluded by unsigned wraparound
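We can convince ourselves the two forms agree with a quick exhaustive check (my own sketch, not from the post) over the dial's domain: `cursor` in `0..=99` and `rem` in `1..=99`. Note that for `rem == 0` the wrap on `rem - 1` would fire as well, so the trick relies on `rem` being at least 1 here.

```rust
// Exhaustively compare the original two-condition check against the
// unsigned-wraparound form over the values the dial can take.
fn main() {
    for cursor in 0i32..=99 {
        for rem in 1i32..=99 {
            let original = cursor != 0 && cursor < rem;
            let optimized = ((cursor - 1) as u32) < ((rem - 1) as u32);
            assert_eq!(original, optimized, "cursor={cursor} rem={rem}");
        }
    }
    println!("left-crossing forms agree on the full domain");
}
```

Exhaustive checks like this are cheap insurance whenever an optimization replaces a readable condition with a bit trick.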
3. Simplified Right Crossing Detection
Original: `cursor != 0 && new_cursor != 0 && overflow != 0`
Optimized: `*cursor > 100 - rem`
The three conditions combine to mean "cursor started nonzero and high enough that adding rem pushed it past 100 without landing exactly on 0". That collapses to a single comparison.
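The same exhaustive-check approach (again my own sketch) confirms the collapse for `cursor` and `rem` both in `0..=99`:

```rust
// Exhaustively compare the original three-condition check against the
// single comparison over the dial's domain.
fn main() {
    for cursor in 0i32..=99 {
        for rem in 0i32..=99 {
            let raw = cursor + rem;
            let overflow = (raw >= 100) as i32;
            let new_cursor = raw - 100 * overflow;
            let original = cursor != 0 && new_cursor != 0 && overflow != 0;
            let optimized = cursor > 100 - rem;
            assert_eq!(original, optimized, "cursor={cursor} rem={rem}");
        }
    }
    println!("right-crossing forms agree on the full domain");
}
```

The algebra: `overflow != 0` means `cursor + rem >= 100`, and `new_cursor != 0` excludes equality, so together they mean `cursor + rem > 100`, i.e. `cursor > 100 - rem`; since `rem <= 99`, any cursor satisfying that is automatically nonzero, making the first condition redundant.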
4. Unrolled main loop
Reduces loop overhead by 4x; it may also unlock some slight compiler optimizations across the unrolled sequence, as we'll see in the profile.
So will these changes be faster than our previous impl? And if so why? Let's bench and profile to see!
Results: Benchmarks
Criterion Results
| Implementation | Time | vs Naive | vs Simple parser |
|---|---|---|---|
| Naive | 43.6 us | 1x | — |
| Simple parser | 8.55 us | 5.1x | 1x |
| Improved | 4.93 us | 8.8x | 1.7x |
1.7x faster than simple parser, 8.8x faster than naive.
VTune Analysis After Optimization
High-Level Comparison
| Metric | Simple parser | Improved |
|---|---|---|
| CPI | 0.170 | 0.170 |
| Retiring | 94.4% | 100.0% |
| Back-End Bound | 6.6% | 0.5% |
| Hottest instruction share | dominates | evenly distributed |
Same CPI means individual instructions aren't faster. The speedup comes from:
- Eliminating bottlenecks: no single instruction dominates anymore; time is evenly distributed across instructions, as we will see.
- Better pipeline utilization: 100% retiring, near-zero backend stalls.
Assembly Hotspot Comparison
Simple parser — Dominant Bottleneck:
The inner loop `jnz` (branch back) dominates, with a few other hot instructions following.
| Assembly | What it does |
|---|---|
| `jnz .LBB7` | Inner loop branch back — hottest instruction |
| `add %r11d, %r9d` | Part of modulo computation |
| `cmp $0x490e, %rsi` | Outer loop bounds check |
Improved — Evenly Distributed:
No single instruction dominates. Time is spread across many `movzxb` loads.
| Assembly | What it does |
|---|---|
| `movzxb 0x1(%r8,%rcx,1), %esi` | Load digit byte |
| `movzxb 0x1(%r9,%rcx,1), %edi` | Load digit byte |
| `movzxb 0x2(%rsi,%rcx,1), %ebx` | Load digit byte |
The inner loop `jnz` from the digit parser we built in part 1 is gone entirely, replaced by sequential loads with lookahead.
Crossing Detection Assembly Comparison
Left Crossing Before:
```asm
test esi, esi   ; cursor != 0?
setne bl
cmp  esi, r11d  ; cursor < rem?
setl r11b
and  r11b, bl   ; combine
```
Left Crossing After:
```asm
dec %ebx    ; cursor - 1
dec %edi    ; rem - 1
cmp ...     ; unsigned compare
setb %dil   ; single result
```
Right Crossing Before:
```asm
test esi, esi   ; cursor != 0?
setnz ...
test r8d, r8d   ; new_cursor != 0?
setnz ...
test eax, eax   ; overflow != 0?
setnz ...
and ...         ; combine first two
and ...         ; combine third
```
Right Crossing After:
```asm
mov r10d, 100
sub r10d, edi   ; 100 - rem
cmp r8d, r10d   ; cursor > 100 - rem?
setnle %r10b    ; single result
```
Note the reduction in dependencies between instructions: the "before" assembly clearly shows chained `test`/`setnz`/`test`... patterns.
Why Same CPI But Faster?
- CPI measures cycles per instruction, i.e. overall throughput.
- Both versions have a CPI of 0.170 (~6 instructions/cycle), which is better than Intel's reported theoretical maximum for my CPU (but this is not surprising; we would need to dig into instruction latencies to understand why this makes perfect sense).
- The speedup comes from doing smarter work, not from executing instructions faster
- Lookahead parsing and some algebraic simplifications express the same logic with different instruction sequences
Understanding Why: Assembly Analysis
The adc Instruction Trick
The compiler now generates better code for accumulating crossings:
```asm
movzx edi, dil  ; zero-extend crosses_extra
cmp   esi, 1    ; is new_cursor == 0?
adc   edx, edi  ; crosses += crosses_extra + carry
```
`adc` (add with carry) adds both the operand AND the carry flag in one instruction. The `cmp esi, 1` sets carry if `esi == 0` (because `0 - 1` borrows). This computes `crosses += crosses_extra + (new_cursor == 0)` in just 2 instructions.
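The same carry semantics can be written in portable Rust with `overflowing_sub` (a sketch of mine, not the post's code): subtracting 1 from an unsigned value borrows exactly when the value is 0, which is the flag `adc` picks up.

```rust
/// Rust analogue of the cmp/adc pair: the borrow from `new_cursor - 1`
/// is true only when new_cursor == 0, like the carry after `cmp esi, 1`.
fn accumulate(crosses: i32, crosses_extra: i32, new_cursor: u32) -> i32 {
    let (_, borrow) = new_cursor.overflowing_sub(1);
    crosses + crosses_extra + borrow as i32
}

fn main() {
    assert_eq!(accumulate(5, 1, 0), 7); // extra crossing + landed exactly on 0
    assert_eq!(accumulate(5, 1, 3), 6); // extra crossing, didn't land on 0
    assert_eq!(accumulate(5, 0, 0), 6); // no extra crossing, landed on 0
    println!("carry-style accumulation matches the adc semantics");
}
```

You don't need to write it this way in source; the compiler derives the `adc` form from the plain `(new_cursor == 0) as i32` addition on its own.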
Simplified Right Crossing
```asm
mov r10d, 100
sub r10d, edi   ; r10 = 100 - rem
cmp r8d, r10d   ; cursor > 100 - rem?
setg dil        ; crosses_extra = (cursor > 100 - rem)
```
Just 4 instructions vs 6+ for the original three-condition check.
Unsigned Comparison Trick (Left)
```asm
dec r8d     ; cursor - 1
dec edi     ; rem - 1
cmp r8d, edi
setb dil    ; (cursor-1) < (rem-1) unsigned
```
The `setb` (set if below) uses unsigned comparison. When `cursor == 0`, decrementing wraps to `0xFFFFFFFF`, which is never "below" any reasonable `rem - 1` value.
Loop Overhead Amortization
The main loop check runs once per 4 entries:
```asm
.LBB16_1:
    ; ... process entry 1 ...
    ; ... process entry 2 ...
    ; ... process entry 3 ...
    ; ... process entry 4 ...
    cmp rsi, 18690  ; check bounds once per 4 instead of every time
    jb  .LBB16_1
```
Results
| Version | Time | Cumulative Speedup |
|---|---|---|
| Naive | 43.6 us | 1x |
| Simple parser | 8.55 us | 5.1x |
| Improved | 4.93 us | 8.8x |
Can we do even better? See you in part 3 to find out :)