Pker.xyz

Performance Analysis and Improvements [part 1]

rust, asm, perf | part 1

In Part 0, we profiled a naive implementation and found that 52% of CPU time went to iterator machinery, not our actual algorithm. Let's try to fix that.


Recap: Where We Left Off

Our naive implementation processed ~4,500 dial rotation commands in 43.6 us. VTune profiling revealed:

| Component | CPU Time % | Issue |
|---|---|---|
| Split::next | 52.4% | Iterator overhead (parsing using the std lib's split_at) |
| "Our" algorithm | 35.3% | Solving the problem after parsing the input |
| memcmp | 10.2% | Iterator internals |

Our first objective is to reduce the parsing time.


We know our input format exactly: [LR][0-9]+\n

Every line is:

  1. One character: L or R
  2. One or more digits
  3. A newline

Let's write a simple, specialized parser. We will then benchmark and profile again to see what changed, if anything.


The Implementation

rust
fn first_improvement(input: &[u8]) -> i32 {
    let mut idx = 0;
    let len = input.len();
    let mut cursor: i32 = 50;
    let mut clock_crosses: i32 = 0;

    while idx < len {
        // Direction: single byte comparison (if it's not Left, it's Right)
        let is_left = unsafe { *input.get_unchecked(idx) } == b'L';
        idx += 1;

        // Parse the first digit (we know there's always at least
        // one digit after the L/R byte): ASCII '0'-'9' maps to 0x30-0x39
        // Subtracting b'0' (0x30) converts ASCII digit to integer
        let mut num = unsafe { (*input.get_unchecked(idx) - b'0') as i32 };
        idx += 1;

        // Parse remaining digits until we hit a non-digit
        while idx < len {
            let b = unsafe { *input.get_unchecked(idx) };
            // Newline (0x0A) and other control chars are < 0x30 ('0')
            // This single comparison detects end-of-number
            if b < 0x30 {
                idx += 1;  // Skip the newline
                break;
            }
            // Standard decimal parsing: num = num * 10 + digit
            num = num * 10 + (b - 0x30) as i32;
            idx += 1;
        }

        // The rest is identical to the naive implementation
        let quotient = num / 100;
        let rem = num % 100;

        if is_left {
            let raw = cursor - rem;
            let new_cursor = raw + (100 & (raw >> 31));
            let crosses_extra = (cursor != 0 && cursor < rem) as i32;
            clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
            cursor = new_cursor;
        } else {
            let raw = cursor + rem;
            let overflow = (raw >= 100) as i32;
            let new_cursor = raw - 100 * overflow;
            let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;
            clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
            cursor = new_cursor;
        }
    }

    clock_crosses
}
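One non-obvious line is the branchless wrap-around in the left branch. A small standalone sketch (the name wrap_left is mine) shows how the `(100 & (raw >> 31))` mask works:

```rust
// Branchless wrap-around, as in the left-rotation branch above.
// `raw >> 31` is an arithmetic shift: it yields -1 (all ones) when raw < 0
// and 0 otherwise, so `100 & (raw >> 31)` adds 100 exactly when the
// subtraction went negative -- no branch, no modulo.
fn wrap_left(cursor: i32, rem: i32) -> i32 {
    let raw = cursor - rem;
    raw + (100 & (raw >> 31))
}

fn main() {
    assert_eq!(wrap_left(50, 10), 40); // no wrap: 50 - 10 = 40
    assert_eq!(wrap_left(10, 30), 80); // wrap: 10 - 30 = -20 -> -20 + 100
    assert_eq!(wrap_left(0, 99), 1);   // wrap: 0 - 99 = -99 -> -99 + 100
    println!("ok");
}
```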

Differences

| Aspect | Naive | first_improvement |
|---|---|---|
| UTF-8 validation | Yes | No |
| Heap allocation | Yes (Vec) | No |
| Iterator overhead | Yes (Split) | No |
| Bounds checking | Yes | No (explicit unsafe get_unchecked) |
| Integer parsing | Signs, overflow | Positive only |
| Newline detection | memchr | b < 0x30 (one comparison) |
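For contrast, a std-based parser of the kind profiled in Part 0 carries exactly the costs listed above. This is a hedged reconstruction of its shape, not the actual naive implementation:

```rust
// Sketch of a std-based parser (reconstructed for illustration; not the
// exact Part 0 code). str::split brings in iterator machinery, collect()
// heap-allocates a Vec, and parse::<i32>() validates signs and overflow
// that this input format never produces.
fn naive_parse(input: &str) -> Vec<(bool, i32)> {
    input
        .split('\n')
        .filter(|line| !line.is_empty())
        .map(|line| {
            let is_left = line.as_bytes()[0] == b'L';
            let num: i32 = line[1..].parse().unwrap();
            (is_left, num)
        })
        .collect()
}

fn main() {
    assert_eq!(naive_parse("L5\nR120\n"), vec![(true, 5), (false, 120)]);
    println!("ok");
}
```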

Results

Criterion Benchmarks

| Implementation | Time |
|---|---|
| naive | 43.511 us - 43.697 us |
| first_improvement | 8.536 us - 8.560 us |

5.1x faster. Noice.

VTune: Before and After

Naive - Time Distribution:

| Function | CPU Time | % |
|---|---|---|
| Split::next (iterator) | 0.264s | 52.4% |
| naive (our code) | 0.178s | 35.3% |
| memcmp (iterator internals) | 0.051s | 10.2% |

New impl - Time Distribution:

| Line | Code | CPU Time | % |
|---|---|---|---|
| 32 | num = num * 10 + (b - 0x30) as i32 | 10.040ms | 7.8% |
| 37 | let rem = num % 100 | 10.040ms | 17.0% |
| 28 | if b < 0x30 { break; } | 9.036ms | 7.2% |
| 27 | let b = *input.get_unchecked(idx) | 8.032ms | 4.7% |
| 26 | while idx < len { | 7.028ms | 7.4% |
| 20 | let is_left = ... == b'L' | 6.024ms | 1.9% |
| 44 | crosses_extra = (cursor != 0 && cursor < rem) | 98.4ms | 6.3% |
| 54 | crosses_extra = (cursor != 0 && new_cursor != 0 && ...) | 112.5ms | 4.5% |

Assembly: Where Cycles Actually Go

New code, Parsing Loop Hotspots:

| Assembly | CPU Time | Instructions |
|---|---|---|
| jb (newline check: b < 0x30) | 9.036ms | 83,200,000 |
| movzxb (load byte from input) | 8.032ms | 118,400,000 |
| jnz (loop continuation) | 7.028ms | 131,200,000 |
| imul (division by 100) | 5.020ms | 35,200,000 |
| cmpb (direction check: == 'L') | 6.024ms | 48,000,000 |

x86 asm
; Byte-level: specialized, no checks needed
lea r9d, [r9 + 4*r9]       ; num * 5
add r11b, -48              ; digit = byte - '0'
lea r9d, [r11 + 2*r9]      ; num = num*10 + digit

lea (load effective address) computes num * 5 and then num * 10 + digit as plain address arithmetic, with no overflow checking.
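We can mirror the compiler's lea sequence in plain arithmetic to convince ourselves it matches ordinary decimal accumulation (lea_style is a name of mine for illustration):

```rust
// The lea sequence above computes num * 10 + digit as
// ((num + 4*num) * 2) + digit: two lea instructions and no imul.
fn lea_style(num: i32, digit: i32) -> i32 {
    let t = num + 4 * num; // lea r9d, [r9 + 4*r9]   => num * 5
    digit + 2 * t          // lea r9d, [r11 + 2*r9]  => num * 10 + digit
}

fn main() {
    for num in 0..1000 {
        for digit in 0..10 {
            assert_eq!(lea_style(num, digit), num * 10 + digit);
        }
    }
    println!("ok");
}
```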


What's Next

We've achieved a 5.1x speedup: a good start.

We have now completed one full loop: we benchmarked the initial implementation, profiled it to understand where time was spent and what the main problems were, wrote an alternative implementation, and benchmarked again. We saw a massive speedup and profiled once more to understand why, and to build a better mental model of the new code.

In the next part we will go through the motions again and find new areas to explore and improve.

Uploaded 11-01-2026