Performance Analysis and Improvements [part 1]

rust, asm, perf part 1

In Part 0, we profiled a naive implementation and found that 52% of CPU time went to iterator machinery, not our actual algorithm. Let's try to fix that.

Recap: Where We Left Off

Our naive implementation processed ~4,500 dial rotation commands in 43.6 us. VTune profiling revealed:

Component	Time	Issue
`Split::next`	52.4%	Iterator overhead (parsing using std lib's split_at)
"Our" algorithm	35.3%	Solving the problem after parsing input
`memcmp`	10.2%	Iterator internals

Our first objective is to reduce the parsing time.

We know our input format exactly: [LR][0-9]\n

Every line is:

One character: L or R
One or more digits
A newline

Let's write simple specialized parser. We will then benchmark again and profile to see what changed if anything.

The Implementation

rust

fn first_improvement(input: &[u8]) -> i32 {
    let mut idx = 0;
    let len = input.len();
    let mut cursor: i32 = 50;
    let mut clock_crosses: i32 = 0;

    while idx < len {
        // Direction: single byte comparison (if it's not left then it's right..)
        let is_left = unsafe { *input.get_unchecked(idx) } == b'L';
        idx += 1;

        // Parse first digit(we know there's always at least
        // one digit after the first L/R byte): ASCII '0'-'9' maps to 0x30-0x39
        // Subtracting b'0' (0x30) converts ASCII digit to integer
        let mut num = unsafe { (*input.get_unchecked(idx) - b'0') as i32 };
        idx += 1;

        // Parse remaining digits until we hit a non-digit
        while idx < len {
            let b = unsafe { *input.get_unchecked(idx) };
            // Newline (0x0A) and other control chars are < 0x30 ('0')
            // This single comparison detects end-of-number
            if b < 0x30 {
                idx += 1;  // Skip the newline
                break;
            }
            // Standard decimal parsing: num = num * 10 + digit
            num = num * 10 + (b - 0x30) as i32;
            idx += 1;
        }

        // Same logic as the base implementation: the rest is identical
        let quotient = num / 100;
        let rem = num % 100;

        if is_left {
            let raw = cursor - rem;
            let new_cursor = raw + (100 & (raw >> 31));
            let crosses_extra = (cursor != 0 && cursor < rem) as i32;
            clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
            cursor = new_cursor;
        } else {
            let raw = cursor + rem;
            let overflow = (raw >= 100) as i32;
            let new_cursor = raw - 100 * overflow;
            let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32;
            clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32;
            cursor = new_cursor;
        }
    }

    clock_crosses
}

Differences

Aspect	Naive	first_improvement
UTF-8 validation	Yes	No
Heap allocation	Yes (Vec)	No
Iterator overhead	Yes (Split)	No
Bounds checking	Yes	No (we explicitly use unsafe unchecked)
Integer parsing	Signs, overflow	Positive only
Newline detection	memchr	`b < 0x30` (1 comparison)

Results

Criterion Benchmarks

Implementation	Time
naive	43.511 us - 43.697 us
first_improvement	8.536 us - 8.560 us

5.1x faster. Noice.

VTune: Before and After

Naive - Time Distribution:

Function	CPU Time	%
Split::next (iterator)	0.264s	52.4%
naive (our code)	0.178s	35.3%
memcmp (iterator internals)	0.051s	10.2%

New impl - Time Distribution:

Line	Code	CPU Time	%
32	`num = num * 10 + (b - 0x30) as i32`	10.040ms	7.8%
37	`let rem = num % 100`	10.040ms	17.0%
28	`if b < 0x30 { break; }`	9.036ms	7.2%
27	`let b = *input.get_unchecked(idx)`	8.032ms	4.7%
26	`while idx < len {`	7.028ms	7.4%
20	`let is_left = ... == b'L'`	6.024ms	1.9%
44	`crosses_extra = (cursor != 0 && cursor < rem)`	98.4ms	6.3%
54	`crosses_extra = (cursor != 0 && new_cursor != 0 && ...)`	112.5ms	4.5%

Assembly: Where Cycles Actually Go

New code, Parsing Loop Hotspots:

Assembly	CPU Time	Instructions
`jb (newline check: b < 0x30)`	9.036ms	83,200,000
`movzxb (load byte from input)`	8.032ms	118,400,000
`jnz (loop continuation)`	7.028ms	131,200,000
`imul (division by 100)`	5.020ms	35,200,000
`cmpb (direction check: == 'L')`	6.024ms	48,000,000

x86 asm

; Byte-level: specialized, no checks needed
lea r9d, [r9 + 4*r9]       ; num * 5
add r11b, -48              ; digit = byte - '0'
lea r9d, [r11 + 2*r9]      ; num = num*10 + digit

Lea (load effective address) computes num * 5 and num * 10 + digit without overflow checking.

What's Next

We've achieved a 5.1x speedup, good start.

We have now done one full loop: we benchmarked our initial implementation, we then profiled it to understand where time is being spent and what were the main problems. We came up with an alternative implementation and benchmarked again. We saw a massive speedup and profiled a little bit to see why that was the case and have a better understanding of our new code.

In the next part we will go through the motion again and find new areas to explore and improve.

Pker.xyz