Pker.xyz

Performance Aware Programming

Performance awareness is a programming model in which the focus is placed on a deep understanding of programming, OS and hardware fundamentals to build robust and fast software.
The core idea is to focus on what programs actually need to do, and build software that does the minimum amount of work to achieve the desired goal. Sometimes this means limiting userspace-kernel copies. Sometimes it means using CPU intrinsics instead of userland algorithms. It could also mean using a collection that is more data-cache friendly, or reducing the dependencies between instructions to facilitate CPU-level parallelization.

This seems trivial, but it's actually not how most modern software is built. I believe the term was initially coined by Casey Muratori.

Fundamentals

At its core a program does two primary things: move data around and do computation on said data. These are therefore the two pillars when it comes to performance: we need to understand how the data moves and how the computing happens.

There are three main layers: userspace, where the code we write runs; the kernel, which interfaces with the hardware; and the hardware itself, where our code is actually stored and executed.

Moving Data

If you care about the performance characteristics of your program, you have no choice but to look at how data is moving. This is very much language agnostic: reading from a file is typically done via a read syscall (or via memory-mapped files), but you still have to choose how you read the file: whether buffered or unbuffered IO is better, how to limit page faults, and so on.

These are all decisions that greatly impact performance. Should I use fread()-style wrappers or libc's read() directly for my use case? What size should my chunks be if I do buffered IO? A big chunk that implies more page faults, or a smaller one that implies more kernel/userspace copies? Can we use large/huge pages, and is that something our OS supports? Do we need to read arbitrary, non-contiguous parts of the file? If so, could mmap help us?
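To make the buffered-IO trade-off concrete, here is a minimal sketch of a chunked read loop built directly on the read() syscall wrapper. The 64 KiB chunk size is an assumption for illustration; the right value depends on your access pattern and should be measured.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Count newlines in a file, reading it in fixed-size chunks.
 * chunk_size controls the trade-off discussed above: bigger chunks
 * mean fewer syscalls (fewer kernel/userspace copies), smaller
 * chunks mean a smaller working set. */
static long count_newlines(const char *path, size_t chunk_size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(chunk_size);
    if (!buf) {
        close(fd);
        return -1;
    }

    long lines = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk_size)) > 0) /* one syscall per chunk */
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;

    free(buf);
    close(fd);
    return lines;
}
```

Running the same function with chunk sizes of 4 KiB, 64 KiB, and 1 MiB over a large file is an easy way to see these trade-offs for yourself.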

Can I preallocate the memory I need instead of allocating on the spot? What actually happens when I call glibc's malloc via my language's wrapper? (There might be some surprises there...)
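A sketch of what preallocation can look like in practice: a growable array that takes its capacity up front, so the common path never touches the allocator. The initial capacity is whatever you know about your data; growing geometrically keeps reallocations rare when the guess is wrong.

```c
#include <stdlib.h>

/* A minimal growable int array with an upfront capacity. */
typedef struct {
    int *data;
    size_t len, cap;
} vec;

static int vec_init(vec *v, size_t cap)
{
    v->data = malloc(cap * sizeof *v->data); /* one upfront allocation */
    v->len = 0;
    v->cap = cap;
    return v->data ? 0 : -1;
}

static int vec_push(vec *v, int x)
{
    if (v->len == v->cap) { /* grow geometrically, so this is rare */
        size_t new_cap = v->cap * 2;
        int *p = realloc(v->data, new_cap * sizeof *p);
        if (!p)
            return -1;
        v->data = p;
        v->cap = new_cap;
    }
    v->data[v->len++] = x;
    return 0;
}
```

If you know the final size exactly, vec_init with that size means zero reallocations and zero copies during the fill loop.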

Can I use SIMD to increase bandwidth? What CPU/OS is my program going to run on, and what do they offer me, the programmer, to do what I want to do efficiently?
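The simplest portable taste of this idea is SWAR ("SIMD within a register"): processing 8 bytes per step with plain 64-bit loads, a stand-in here for real SIMD intrinsics like SSE2/AVX2, which widen the same idea to 16/32 bytes per step.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero if the buffer contains a zero byte.
 * Scans 8 bytes per iteration instead of 1. */
static int has_zero_byte(const unsigned char *p, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, 8); /* safe unaligned load */
        /* Classic bit trick: this expression is nonzero iff
         * some byte of w is zero. */
        if ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL)
            return 1;
    }
    for (; i < n; i++) /* scalar tail for the leftover bytes */
        if (p[i] == 0)
            return 1;
    return 0;
}
```

This is essentially how optimized strlen/memchr implementations work; the vectorized versions in libc just use wider registers.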

Computations

The questions to ask ourselves are very similar to the above section: what do our OS and hardware need to do for our program to do what we want it to? Would it be better to use multiple threads? Can we leverage existing CPU intrinsics, or pre-implemented OS-level mechanisms, to do the work for us?
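As a small example of leaning on the CPU instead of writing the loop yourself: counting set bits with a compiler builtin. On GCC/Clang, __builtin_popcountll compiles down to a single POPCNT instruction on CPUs that have it, replacing a whole hand-rolled bit loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Count set bits across an array of 64-bit words.
 * __builtin_popcountll is a GCC/Clang builtin; on x86 with -mpopcnt
 * it becomes one POPCNT instruction per word. */
static unsigned count_set_bits(const uint64_t *words, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += (unsigned)__builtin_popcountll(words[i]);
    return total;
}
```

The same pattern applies elsewhere: before writing a clever loop, check whether your compiler or OS already exposes the operation directly.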

Algorithmics is a pretty well-understood area: we have mathematical models, such as Big O notation, that help us understand the characteristics of our algorithms. The issue with these is that they are mathematical mental models: they have no direct mapping to what actually happens on our machine when we run code (which is what really matters to us).

These models are useful to give us a general idea of the algorithms we are dealing with, but in truth there are no shortcuts to writing performant software: we need to know the characteristics of the software we are writing, the OS, and the hardware. For performance-aware programming there is no need to be an OS or hardware expert; you simply need some knowledge of how things work and of what is going to happen when a certain piece of code is executed.

Know your Software

The easiest and probably best way to build software with decent performance is to know your software (the characteristics of the data that comes in and out and the operations/computations that need to be done on said data).

If you do, then it becomes much easier to write fast algorithms: need to compare strings, and know that the bytes that are coming in have certain characteristics (e.g. a maximum or minimum length, a certain encoding, etc.)? Write an algorithm that leverages these.
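For instance, a sketch under the assumption that your keys are always exactly 8 ASCII bytes: you can compare them as a single 64-bit integer instead of looping byte by byte with strcmp.

```c
#include <stdint.h>
#include <string.h>

/* Compare two keys known to be exactly 8 bytes.
 * One 64-bit comparison replaces a byte-by-byte loop. */
static int keys8_equal(const char a[8], const char b[8])
{
    uint64_t x, y;
    memcpy(&x, a, 8); /* memcpy compiles to a plain load */
    memcpy(&y, b, 8);
    return x == y;
}
```

This only works because of what we know about the input; a general-purpose string compare can't make that assumption.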

Parsing a simple HTTP request that you know does not carry arbitrary-length headers or a body? It might be sufficient to just parse the first n bytes read from the TCP socket and see if they match a pre-baked hardcoded path, instead of pulling down all of the bloated HTTP machinery (obviously not advocating this in all cases, but you get the gist...).
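A minimal sketch of that idea, with "/health" as a hypothetical hardcoded path: check whether the first bytes of the TCP payload match one pre-baked request line, and skip the parser entirely.

```c
#include <string.h>

/* Match the start of a raw request buffer against one known
 * request line; no header parsing, no allocation. */
static int is_health_check(const char *buf, size_t n)
{
    static const char want[] = "GET /health HTTP/1.1\r\n";
    const size_t len = sizeof want - 1; /* drop the trailing NUL */
    return n >= len && memcmp(buf, want, len) == 0;
}
```

Anything that doesn't match can still fall through to a full HTTP parser; the fast path only has to handle the case you designed for.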

When designing data structures and data schemas that are going to be shared between programs, think about how you could design them to minimize the work required by parsers and other software that will be receiving and interpreting this data.
Instead of JSON, could I use a simple custom binary format that will be much faster to parse and operate on?
If I generate a numerical id for every instance of an object, could some bytes of the id hold specific meaning to make sorting or other computations faster down the line? (e.g. similarly to how pointers work in virtual memory systems...)
Are my types full of padding or are they well packed?
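The padding question is easy to check concretely. A sketch: the same three fields, ordered differently. On typical 64-bit ABIs the first layout is 24 bytes and the second 16, purely because of alignment padding.

```c
#include <stdint.h>

/* uint64_t needs 8-byte alignment, so the compiler inserts
 * padding after `a` and after `c` to keep `b` aligned:
 * 1 + 7(pad) + 8 + 1 + 7(pad) = 24 bytes on typical 64-bit ABIs. */
struct padded {
    uint8_t a;
    uint64_t b;
    uint8_t c;
};

/* Largest member first packs the small fields together:
 * 8 + 1 + 1 + 6(pad) = 16 bytes. Same data, one third smaller,
 * so more of these fit in a cache line. */
struct packed_fields {
    uint64_t b;
    uint8_t a;
    uint8_t c;
};
```

Sorting struct members from largest to smallest alignment is a cheap habit that pays off every time the type ends up in an array.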

Make your software do the minimum amount of data movement and computation. That's the core idea. To do this you need to have some knowledge of your software and of the OS and hardware it will run on.


Uploaded 09-10-2024