Performance Analysis and Improvements [part 1]
In Part 0, we profiled a naive implementation and found that 52% of CPU time went to iterator machinery, not our actual algorithm. Let's try to fix that.
Recap: Where We Left Off
Our naive implementation processed ~4,500 dial rotation commands in 43.6 us. VTune profiling revealed:
| Component | Time | Issue |
|---|---|---|
Split::next |
52.4% | Iterator overhead (parsing using std lib's split_at) |
| "Our" algorithm | 35.3% | Solving the problem after parsing input |
memcmp |
10.2% | Iterator internals |
Our first objective is to reduce the parsing time.
We know our input format exactly: [LR][0-9]\n
Every line is:
- One character:
LorR - One or more digits
- A newline
Let's write simple specialized parser. We will then benchmark again and profile to see what changed if anything.
The Implementation
fn first_improvement(input: &[u8]) -> i32 { let mut idx = 0; let len = input.len(); let mut cursor: i32 = 50; let mut clock_crosses: i32 = 0; while idx < len { // Direction: single byte comparison (if it's not left then it's right..) let is_left = unsafe { *input.get_unchecked(idx) } == b'L'; idx += 1; // Parse first digit(we know there's always at least // one digit after the first L/R byte): ASCII '0'-'9' maps to 0x30-0x39 // Subtracting b'0' (0x30) converts ASCII digit to integer let mut num = unsafe { (*input.get_unchecked(idx) - b'0') as i32 }; idx += 1; // Parse remaining digits until we hit a non-digit while idx < len { let b = unsafe { *input.get_unchecked(idx) }; // Newline (0x0A) and other control chars are < 0x30 ('0') // This single comparison detects end-of-number if b < 0x30 { idx += 1; // Skip the newline break; } // Standard decimal parsing: num = num * 10 + digit num = num * 10 + (b - 0x30) as i32; idx += 1; } // Same logic as the base implementation: the rest is identical let quotient = num / 100; let rem = num % 100; if is_left { let raw = cursor - rem; let new_cursor = raw + (100 & (raw >> 31)); let crosses_extra = (cursor != 0 && cursor < rem) as i32; clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32; cursor = new_cursor; } else { let raw = cursor + rem; let overflow = (raw >= 100) as i32; let new_cursor = raw - 100 * overflow; let crosses_extra = (cursor != 0 && new_cursor != 0 && overflow != 0) as i32; clock_crosses += quotient + crosses_extra + (new_cursor == 0) as i32; cursor = new_cursor; } } clock_crosses }
Differences
| Aspect | Naive | first_improvement |
|---|---|---|
| UTF-8 validation | Yes | No |
| Heap allocation | Yes (Vec) | No |
| Iterator overhead | Yes (Split) | No |
| Bounds checking | Yes | No (we explicitly use unsafe unchecked) |
| Integer parsing | Signs, overflow | Positive only |
| Newline detection | memchr | b < 0x30 (1 comparison) |
Results
Criterion Benchmarks
| Implementation | Time |
|---|---|
| naive | 43.511 us - 43.697 us |
| first_improvement | 8.536 us - 8.560 us |
5.1x faster. Noice.
VTune: Before and After
Naive - Time Distribution:
| Function | CPU Time | % |
|---|---|---|
| Split::next (iterator) | 0.264s | 52.4% |
| naive (our code) | 0.178s | 35.3% |
| memcmp (iterator internals) | 0.051s | 10.2% |
New impl - Time Distribution:
| Line | Code | CPU Time | % |
|---|---|---|---|
| 32 | num = num * 10 + (b - 0x30) as i32 |
10.040ms | 7.8% |
| 37 | let rem = num % 100 |
10.040ms | 17.0% |
| 28 | if b < 0x30 { break; } |
9.036ms | 7.2% |
| 27 | let b = *input.get_unchecked(idx) |
8.032ms | 4.7% |
| 26 | while idx < len { |
7.028ms | 7.4% |
| 20 | let is_left = ... == b'L' |
6.024ms | 1.9% |
| 44 | crosses_extra = (cursor != 0 && cursor < rem) |
98.4ms | 6.3% |
| 54 | crosses_extra = (cursor != 0 && new_cursor != 0 && ...) |
112.5ms | 4.5% |
Assembly: Where Cycles Actually Go
New code, Parsing Loop Hotspots:
| Assembly | CPU Time | Instructions |
|---|---|---|
jb (newline check: b < 0x30) |
9.036ms | 83,200,000 |
movzxb (load byte from input) |
8.032ms | 118,400,000 |
jnz (loop continuation) |
7.028ms | 131,200,000 |
imul (division by 100) |
5.020ms | 35,200,000 |
cmpb (direction check: == 'L') |
6.024ms | 48,000,000 |
; Byte-level: specialized, no checks needed lea r9d, [r9 + 4*r9] ; num * 5 add r11b, -48 ; digit = byte - '0' lea r9d, [r11 + 2*r9] ; num = num*10 + digit
Lea (load effective address) computes num *
5 and num * 10 + digit without overflow
checking.
What's Next
We've achieved a 5.1x speedup, good start.
We have now done one full loop: we benchmarked our initial implementation, we then profiled it to understand where time is being spent and what were the main problems. We came up with an alternative implementation and benchmarked again. We saw a massive speedup and profiled a little bit to see why that was the case and have a better understanding of our new code.
In the next part we will go through the motion again and find new areas to explore and improve.