Pker.xyz

Performance Analysis and Improvements [part 2]

rust, asm, perf

In Part 1, we achieved a 5.1x speedup by replacing string-based parsing with a simple inlined byte parser tailored to our input format. Now we continue the optimization cycle.


Recap: Where We Left Off

Our improved implementation processes ~4,500 dial rotation commands in 8.55 us, a 5.1x improvement over the naive implementation!

Implementation Time Speedup
Naive (string-based) 43.6 us 1x (baseline)
Simple parser 8.55 us 5.1x

Let's profile the newer implementation to see what's left to optimize.


Profiling the Simple Parser Implementation

VTune Hotspot Analysis

Let's see where clock cycles are spent:

Code CPU Time % Instructions
let rem = num % 100 376.5ms 14.5%
while idx < len { (inner loop) 215.9ms 7.5%
num = num * 10 + (b - 0x30) as i32 199.8ms 8.0%
while idx < len { (outer loop) 180.7ms 3.6%
let is_left = ... == b'L' 176.7ms 1.9%
if b < 0x30 { break; } 175.7ms 7.5%
let b = *input.get_unchecked(idx) 165.7ms 3.8%
crosses_extra = (cursor != 0 && new_cursor != 0 && ...) 112.5ms 4.5%
crosses_extra = (cursor != 0 && cursor < rem) 98.4ms 6.3%

Hmm, seems like modulo is expensive. The loop bounds checks and crossing-detection logic (the compute portion of our algorithm) also consume significant time. Let's dig deeper.

Assembly Hotspots

Hotspots grouped by code area:

Inner Parsing Loop (hottest)

rust
while idx < len {
    let b = unsafe { *input.get_unchecked(idx) };
    if b < 0x30 { break; }
    num = num * 10 + (b - 0x30) as i32;
    idx += 1;
}

Key assembly (~15% combined):

x86 asm
jnz .LBB7    ; loop back — ~8% (hottest instruction)
jb ...       ; newline check — ~7%

Per digit processed: bounds check, load byte, compare, accumulate, branch. The jnz dominates because it executes every digit iteration. This is from our "new and improved" parser; maybe we can do even better!

Modulo Computation

rust
let rem = num % 100;

Key assembly (~14.5% combined):

x86 asm
imul $0x51eb851f    ; magic constant for division — ~7%
add %r11d, %r9d     ; part of modulo sequence — ~8%

Division by a constant is compiled into a multiply-by-magic-constant sequence, avoiding a slow div instruction.
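A minimal sketch of the unsigned variant of this trick, using the same 0x51EB851F constant VTune shows (div100 is an illustrative name, not from the codebase; the signed version the compiler emits for i32 adds a couple of sign fix-up instructions):

```rust
// 0x51EB851F == ceil(2^37 / 100), so multiplying and shifting right by 37
// recovers the exact quotient for every u32 input — no div instruction needed.
fn div100(n: u32) -> u32 {
    ((n as u64 * 0x51EB851F) >> 37) as u32
}

fn main() {
    for n in [0u32, 1, 99, 100, 101, 12345, u32::MAX] {
        assert_eq!(div100(n), n / 100);
        // the remainder falls out as a multiply + subtract
        assert_eq!(n - div100(n) * 100, n % 100);
    }
}
```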

Crossing Detection

rust
// Left
let crosses_extra = (cursor != 0 && cursor < rem) as i32;

// Right
let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;

Key assembly (~11% combined):

x86 asm
setnz ...    ; left crossing conditions — ~6%
setnz ...    ; right crossing conditions — ~5%

Each && becomes test + setcc + and. Multiple conditions generate multiple instruction chains. Maybe we can merge these conditions?

Microarchitecture Analysis

VTune's microarchitecture analysis shows baseline efficiency:

Metric Value Meaning
CPI 0.170 ~6 instructions per cycle
Retiring 94.4% Almost no wasted work
Bad Speculation 0.0% Branches perfectly predicted
Front-End Bound 3.0% Instruction fetch is fine
Back-End Bound 6.6% Execution units rarely stall

Key insight: The low CPI (0.170, ~6 instructions/cycle) combined with minimal front-end and back-end stalls indicates excellent instruction-level parallelism (ILP). The CPU's out-of-order execution engine can decode, schedule, and execute multiple independent instructions simultaneously. With 94.4% retiring and near-zero stalls, the instruction stream has few data dependencies blocking parallel execution. This is all great news: our algo is very CPU friendly. The bad news is that this implies it might be harder for us to find easy things to improve (but we'll manage).

Since we're already near the CPU's throughput ceiling, further optimization requires restructuring the algorithm.

Key Observations

  1. Loop overhead per input entry: Each input entry pays for a loop iteration with a bounds check, increment, and branch back
  2. Crossing detection has multiple conditions: cursor != 0 && cursor < rem requires two comparisons and an AND
  3. Each entry processes independently: No data dependencies between entries that would prevent parallelism

Analysis: What Could Be Improved?

Based on the profile data, we can group hotspots by what optimization would address them:

Target 1: Inner Parsing Loop (~25% combined)

rust
while idx < len {
    let b = unsafe { *input.get_unchecked(idx) };
    if b < 0x30 { break; }
    num = num * 10 + (b - 0x30) as i32;
    idx += 1;
}

The inner loop runs once per digit. With ~4,500 entries averaging 2 digits each, that's ~9,000 iterations. Each iteration executes a bounds check, a byte load, a compare, a multiply-accumulate, and a branch back.

Solution: Inline the parser to reduce the loop overhead! We can do lookahead parsing: check byte positions directly instead of looping, since we know each input line is capped at a few bytes.
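A minimal sketch of the lookahead idea (bounds-checked indexing for clarity; parse_lookahead is an illustrative name, and the final version additionally splits out the hundreds digit as a quotient):

```rust
// Parse a 1-3 digit number terminated by a non-digit byte (here '\n'),
// reading fixed offsets instead of looping. wrapping_sub turns any
// non-digit byte into a value > 9, which doubles as the terminator check.
fn parse_lookahead(input: &[u8], idx: usize) -> (i32, usize) {
    let d0 = (input[idx] - b'0') as i32;
    let d1 = input[idx + 1].wrapping_sub(b'0');
    if d1 > 9 {
        return (d0, idx + 2); // 1 digit: "N\n"
    }
    let d2 = input[idx + 2].wrapping_sub(b'0');
    if d2 > 9 {
        return (d0 * 10 + d1 as i32, idx + 3); // 2 digits: "NN\n"
    }
    (d0 * 100 + (d1 as i32) * 10 + d2 as i32, idx + 4) // 3 digits: "NNN\n"
}

fn main() {
    // "L42\n": direction byte at offset 0, digits start at offset 1
    assert_eq!(parse_lookahead(b"L42\n", 1), (42, 4));
}
```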

Target 2: Crossing Detection (~11% combined)

rust
// Left crossing (6.3%)
let crosses_extra = (cursor != 0 && cursor < rem) as i32;

// Right crossing (4.5%)
let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;

Left crossing (5 operations):

x86 asm
test esi, esi      ; cursor != 0?
setne bl
cmp esi, r11d      ; cursor < rem?
setl r11b
and r11b, bl       ; combine

Right crossing (8 operations):

x86 asm
test esi, esi      ; cursor != 0?
setnz ...
test r8d, r8d      ; new_cursor != 0?
setnz ...
test eax, eax      ; overflow != 0?
setnz ...
and ...            ; combine first two
and ...            ; combine third

Solution: Collapse multiple conditions into single comparisons.

Target 3: Outer Loop Overhead (~4%)

rust
while idx < len {
    // ... process one entry (one line of the input)...
}

Each entry: bounds check + increment + branch back.

Solution: We can do a bit of unrolling (e.g. 4x unrolling, one bounds check per 4 entries). I don't expect big gains from this one; most of the improvement should come from the other changes, but this is about as close as we get to a free lunch here.

Target 4: Forced Inlining + Loop Unrolling

If we extract the per-entry logic into a function and force inlining, the compiler might keep cursor and clock_crosses in registers across all four unrolled calls and schedule instructions across entry boundaries.

We include 4x loop unrolling because it's essentially free complexity: it amortizes loop overhead without making the code significantly harder to understand. The main gains come from lookahead parsing and algebraic simplification; unrolling is a small bonus.


The Optimized Implementation

The process_entry Function

rust
#[inline(always)]
unsafe fn process_entry(
    input: &[u8],
    mut idx: usize,
    cursor: &mut i32,
    clock_crosses: &mut i32
) -> usize {
    let is_left = unsafe { *input.get_unchecked(idx) } == b'L';
    idx += 1;

    // LOOKAHEAD PARSING
    // Instead of looping through digits, we check fixed positions.
    // Input format: "[LR][0-9]+\n" where numbers are 1-3 digits.
    // Safe because outer loop guarantees idx + 12 < len (room for 4 entries).

    let d0 = unsafe { (*input.get_unchecked(idx) - b'0') as i32 };
    // wrapping_sub: if next byte is '\n' (0x0A), result is 0x0A - 0x30 = 0xDA > 9
    let d1_raw = unsafe { (*input.get_unchecked(idx + 1)).wrapping_sub(b'0') };

    let (quotient, rem) = if d1_raw <= 9 {
        // 2+ digits: d1_raw is valid digit
        let d1 = d1_raw as i32;
        let b2 = unsafe { *input.get_unchecked(idx + 2) };

        if b2 < b'0' {
            // 2 digits: "NN\n" format
            idx += 3;
            (0, d0 * 10 + d1)
        } else {
            // 3 digits: "NNN\n" format
            idx += 4;
            let d2 = (b2 - b'0') as i32;
            (d0, d1 * 10 + d2)  // hundreds digit becomes quotient
        }
    } else {
        // 1 digit: "N\n" format (d1_raw was '\n')
        idx += 2;
        (0, d0)
    };

    if is_left {
        let raw = *cursor - rem;
        let new_cursor = raw + (100 & (raw >> 31));

        // ALGEBRAIC SIMPLIFICATION
        // Original: cursor != 0 && cursor < rem
        // Trick: (x-1) as u32 wraps to u32::MAX when x==0
        // So (cursor-1) as u32 < (rem-1) as u32 is false when cursor==0
        // because u32::MAX is not < anything reasonable
        let crosses_extra = (((*cursor - 1) as u32) < ((rem - 1) as u32)) as i32;

        *clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
        *cursor = new_cursor;
    } else {
        let raw = *cursor + rem;
        let overflow = (raw >= 100) as i32;
        let new_cursor = raw - 100 * overflow;

        // ALGEBRAIC SIMPLIFICATION
        // Original: cursor != 0 && new_cursor != 0 && overflow != 0
        // All three true if cursor started high enough that adding rem
        // pushed past 100 but not to exactly 0.
        // Equivalent: cursor > 100 - rem
        let crosses_extra = (*cursor > 100 - rem) as i32;

        *clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
        *cursor = new_cursor;
    }

    idx
}

The Main Loop with 4x Unrolling

rust
#[inline(never)]
fn parse_and_compute(input: &[u8]) -> i32 {
    let len = input.len();
    let mut idx = 0;
    let mut cursor: i32 = 50;
    let mut clock_crosses: i32 = 0;

    // Main loop: process 4 entries per iteration
    while idx + 12 < len {
        unsafe {
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
        }
    }

    // Handle remaining entries
    while idx < len {
        unsafe {
            idx = process_entry(input, idx, &mut cursor, &mut clock_crosses);
        }
    }

    clock_crosses
}

Key Changes Explained

1. Lookahead-Based Digit Parsing

Instead of a while loop over the digits, we read bytes at fixed offsets: the byte after the direction letter is always a digit, and the next one or two bytes determine whether the number has 1, 2, or 3 digits (the d1_raw and b2 checks in process_entry).

2. Simplified Left Crossing Detection

Original: cursor != 0 && cursor < rem

Optimized: ((*cursor - 1) as u32) < ((rem - 1) as u32)

This works because when cursor == 0, (cursor - 1) as u32 wraps to u32::MAX, which is never below rem - 1; for any cursor >= 1, the unsigned comparison is exactly cursor < rem.

3. Simplified Right Crossing Detection

Original: cursor != 0 && new_cursor != 0 && overflow != 0

Optimized: *cursor > 100 - rem

The three conditions combine to mean "cursor started high enough that adding rem pushed us past 100, but not to exactly 0, and we didn't start from 0". This simplifies to a single comparison.
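As a sanity check (a verification sketch, separate from the hot path), both rewrites can be compared exhaustively against the original conditions for every dial state. Note the left-hand trick relies on rem >= 1; it differs from the original when rem == 0, so the loop below deliberately starts rem at 1:

```rust
fn main() {
    for cursor in 0i32..100 {
        for rem in 1i32..100 {
            // Left crossing: original two conditions vs collapsed unsigned compare
            let left_orig = cursor != 0 && cursor < rem;
            let left_trick = ((cursor - 1) as u32) < ((rem - 1) as u32);
            assert_eq!(left_orig, left_trick);

            // Right crossing: original three conditions vs single comparison
            let raw = cursor + rem;
            let overflow = (raw >= 100) as i32;
            let new_cursor = raw - 100 * overflow;
            let right_orig = cursor != 0 && new_cursor != 0 && overflow != 0;
            let right_trick = cursor > 100 - rem;
            assert_eq!(right_orig, right_trick);
        }
    }
}
```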

4. Unrolled main loop

Reduces loop overhead by 4x, and may also give the compiler some extra scheduling freedom across entries.


So will these changes be faster than our previous impl? And if so why? Let's bench and profile to see!

Results: Benchmarks

Criterion Results

Implementation Time vs Naive vs Simple parser
Naive 43.6 us 1x
Simple parser 8.55 us 5.1x 1x
Improved 4.93 us 8.8x 1.7x

1.7x faster than simple parser, 8.8x faster than naive.


VTune Analysis After Optimization

High-Level Comparison

Metric Simple parser Improved
CPI 0.170 0.170
Retiring 94.4% 100.0%
Back-End Bound 6.6% 0.5%
Hottest instruction share dominates evenly distributed

Same CPI means individual instructions aren't faster. The speedup comes from:

  1. Eliminating bottlenecks: No single instruction dominates anymore; time is evenly distributed across instructions, as we will see.
  2. Better pipeline utilization: 100% retiring, near-zero backend stalls.

Assembly Hotspot Comparison

Simple parser — Dominant Bottleneck:

The inner loop jnz (branch back) dominates, with a few other hot instructions following.

Assembly What it does
jnz .LBB7 Inner loop branch back — hottest instruction
add %r11d, %r9d Part of modulo computation
cmp $0x490e, %rsi Outer loop bounds check

Improved — Evenly Distributed:

No single instruction dominates. Time is spread across many movzxb loads.

Assembly What it does
movzxb 0x1(%r8,%rcx,1), %esi Load digit byte
movzxb 0x1(%r9,%rcx,1), %edi Load digit byte
movzxb 0x2(%rsi,%rcx,1), %ebx Load digit byte

The inner loop jnz from the digit parser we built in Part 1 is gone entirely, replaced by sequential loads with lookahead.

Crossing Detection Assembly Comparison

Left Crossing Before:

x86 asm
test esi, esi          ; cursor != 0?
setne bl
cmp esi, r11d          ; cursor < rem?
setl r11b
and r11b, bl           ; combine

Left Crossing After:

x86 asm
dec %ebx               ; cursor - 1
dec %edi               ; rem - 1
cmp ...                ; unsigned compare
setb %dil              ; single result

Right Crossing Before:

x86 asm
test esi, esi          ; cursor != 0?
setnz ...
test r8d, r8d          ; new_cursor != 0?
setnz ...
test eax, eax          ; overflow != 0?
setnz ...
and ...                ; combine first two
and ...                ; combine third

Right Crossing After:

x86 asm
mov r10d, 100
sub r10d, edi          ; 100 - rem
cmp r8d, r10d          ; cursor > 100 - rem?
setnle %r10b           ; single result

Note the reduction in dependencies between instructions. The "Before" listings clearly show repeating test/setnz/and chains, while the "After" versions compute each result with a single comparison.

Why Same CPI But Faster?

CPI measures throughput per instruction, not total work. Both versions retire ~6 instructions per cycle, but the improved version simply executes far fewer instructions: no per-digit loop, no modulo sequence (rem falls straight out of the digit split), and single-comparison crossing checks. Fewer instructions at the same throughput means fewer total cycles.


Understanding Why: Assembly Analysis

The adc Instruction Trick

The compiler now generates better code for accumulating crossings:

x86 asm
movzx edi, dil              ; zero-extend crosses_extra
cmp esi, 1                  ; is new_cursor == 0?
adc edx, edi                ; crosses += crosses_extra + carry

adc (add with carry) adds both the operand AND the carry flag in one instruction. The cmp esi, 1 sets carry if esi == 0 (because 0 - 1 borrows). This computes crosses += crosses_extra + (new_cursor == 0) in just 2 instructions.

Simplified Right Crossing

x86 asm
mov r10d, 100
sub r10d, edi                ; r10 = 100 - rem
cmp r8d, r10d                ; cursor > 100 - rem?
setg dil                     ; crosses_extra = (cursor > 100 - rem)

Just 4 instructions vs 8 for the original three-condition check.

Unsigned Comparison Trick (Left)

x86 asm
dec r8d                      ; cursor - 1
dec edi                      ; rem - 1
cmp r8d, edi
setb dil                     ; (cursor-1) < (rem-1) unsigned

The setb (set if below) uses unsigned comparison. When cursor == 0, decrementing wraps to 0xFFFFFFFF, which is never "below" any reasonable rem - 1 value.

Loop Overhead Amortization

The main loop check runs once per 4 entries:

x86 asm
.LBB16_1:
    ; ... process entry 1 ...
    ; ... process entry 2 ...
    ; ... process entry 3 ...
    ; ... process entry 4 ...
    cmp rsi, 18690  ; check bounds once per 4 instead of every time
    jb .LBB16_1

Results

Version Time Cumulative Speedup
Naive 43.6 us 1x
Simple parser 8.55 us 5.1x
Improved 4.93 us 8.8x

Can we do even better? See you in part 3 to find out :)

Uploaded 16-01-2026