Performance Aware Programming
Performance awareness is a programming model in
which the focus is placed on a deep understanding of programming,
OS and hardware fundamentals to build robust and fast software.
The core idea is to focus on what programs actually need to do, and
build software that does the minimum amount of work to achieve the
desired goal.
Sometimes this means limiting userspace-kernel copies. Sometimes it
means using CPU intrinsics instead of userland algorithms. It could
also mean using a collection that is more data-cache friendly,
or reducing the dependencies between instructions to
facilitate CPU-level parallelization.
This seems trivial but it's actually not how most modern software is built.
I believe the term was initially coined by Casey Muratori.
Fundamentals
At its core a program does two primary things: move data around and do computation on said data. These are therefore the two pillars when it comes to performance: we need to understand how the data moves and how the computing happens.
There are three main layers: userspace, where the code we write runs; the kernel, which interfaces with the hardware; and the hardware itself, where our code is actually stored and executed.
Moving Data
If you care about the performance characteristics of your program you
have no choice but to look at how data is moving. This is very much
language agnostic: reading from a file is typically done via a
read syscall (or via memory-mapped files), but choosing how you
read the file, whether buffered or unbuffered IO is better,
how to limit page faults...
These are all decisions that
greatly impact performance. Should I use fread() style wrappers or
libc's read() directly for my use case?
What size should my chunks be if I do buffered IO? A big chunk that
implies more page faults, or a smaller one that implies more
kernel/userspace copies? Can we use large/huge pages, and does
our OS support them? Do we need to read arbitrary,
non-contiguous parts of the file? If so, could mmap help us?
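As a concrete (and deliberately simple) sketch of the chunk-size question, here is what experimenting could look like in C. The function names are hypothetical, and real code would measure each bufsize with a profiler rather than guess:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sum every byte of a file with plain read() calls, using a
 * caller-chosen chunk size. Varying `bufsize` lets you measure the
 * syscall-count vs. copy-size trade-off discussed above. */
long sum_file_bytes(const char *path, size_t bufsize)
{
    unsigned char buf[1 << 16];               /* cap at 64 KiB for the sketch */
    if (bufsize > sizeof buf) bufsize = sizeof buf;

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    long sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, bufsize)) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
    close(fd);
    return n < 0 ? -1 : sum;
}

/* Demo helper (hypothetical): write n copies of byte b to path. */
int write_demo_file(const char *path, unsigned char b, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    for (size_t i = 0; i < n; i++) fputc(b, f);
    return fclose(f);
}
```

Timing `sum_file_bytes` with different `bufsize` values on your actual files, on the actual target machine, answers the chunk-size question far better than any rule of thumb.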
Can I preallocate the memory I need instead of allocating on the
spot? What actually happens when I call glibc's malloc via my
language's wrapper? (there might be some surprises there...)
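One way to preallocate is a bump (arena) allocator: one upfront allocation, then handing out slices is just pointer arithmetic, with no per-call trip into malloc. A minimal sketch (the `Arena` type and alignment choice here are illustrative, not a production allocator):

```c
#include <stddef.h>

/* A minimal bump allocator over a caller-supplied buffer. */
typedef struct {
    unsigned char *base;
    size_t cap;
    size_t used;
} Arena;

void *arena_alloc(Arena *a, size_t n)
{
    size_t aligned = (a->used + 15) & ~(size_t)15;  /* 16-byte align */
    if (aligned + n > a->cap) return NULL;          /* out of space */
    a->used = aligned + n;
    return a->base + aligned;
}
```

Freeing everything at once is just resetting `used` to zero, which is exactly the "minimum amount of work" idea applied to memory management.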
Can I use SIMD to increase bandwidth? What CPU/OS is my program going
to run on, and what do they offer me, the programmer, to do
what I want to do efficiently?
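As a small taste of SIMD, here is a sketch that counts occurrences of a byte 16 bytes at a time using GCC/Clang vector extensions (compiler-specific, but lowered to SSE/NEON where available; real code would also want a profiled comparison against the scalar version):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16 lanes of uint8_t; the compiler maps this onto SIMD registers. */
typedef uint8_t v16u8 __attribute__((vector_size(16)));

size_t count_byte(const uint8_t *buf, size_t len, uint8_t needle)
{
    v16u8 splat;
    for (int j = 0; j < 16; j++) splat[j] = needle;

    size_t count = 0, i = 0;
    for (; i + 16 <= len; i += 16) {
        v16u8 chunk;
        memcpy(&chunk, buf + i, 16);            /* unaligned load */
        v16u8 eq = (v16u8)(chunk == splat);     /* 0xFF where equal */
        for (int j = 0; j < 16; j++)
            count += eq[j] & 1;
    }
    for (; i < len; i++)                        /* scalar tail */
        count += (buf[i] == needle);
    return count;
}
```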
Computations
The questions to ask ourselves are very similar to those in the above
section: what do our OS and hardware need to do for our
program to do what we want it to? Would it be better to use multiple
threads? Can we leverage existing CPU intrinsics, or pre-implemented
OS-level mechanisms, to do the work for us?
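A classic example of leaning on an intrinsic: counting set bits. `__builtin_popcountll` (GCC/Clang builtin) compiles down to a single POPCNT instruction on CPUs that have one, replacing an entire hand-written bit-twiddling loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Count set bits across a buffer of 64-bit words. */
uint64_t count_set_bits(const uint64_t *words, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(words[i]);
    return total;
}
```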
Algorithmics is a pretty well understood area: we have mathematical models
that help us understand the characteristics of our
algorithms, such as Big O notation. The issue with these is
that they are mathematical mental models: they have no
direct mapping to what actually happens on our
machine when we run code (which is what really matters to us).
These models are useful to give us a general idea of the algorithms
we are dealing with, but in truth there are no shortcuts to
writing performant software: we need to know the
characteristics of the software we are writing, the OS and the hardware.
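A standard illustration of the gap between Big O and the machine: these two loops have identical complexity, O(N²), yet the first walks memory sequentially (cache friendly) while the second strides by a whole row per access and typically runs several times slower on large matrices:

```c
#include <stddef.h>

#define N 256

/* Sum a matrix in row-major order: each access is adjacent in memory. */
long sum_row_major(int m[N][N])
{
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Same work, column-major order: each access jumps N*sizeof(int) bytes. */
long sum_col_major(int m[N][N])
{
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

Both return the same answer; only a measurement on real hardware reveals the difference, which is precisely the point.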
For performance-aware programming there is no need to be an OS or
hardware expert, simply to have some knowledge of how things
work and of what is going to happen when a
certain piece of code is executed.
Know your Software
The easiest and probably best way to build software with decent
performance is to know your software (the characteristics of the
data that comes in and out and the operations/computations that
need to be done on said data).
If you do, then it becomes much easier to write
fast algorithms: need to compare strings, and know that the bytes
coming in have certain characteristics (e.g. a maximum or
minimum length, a certain encoding, etc.)? Write an
algorithm that leverages these.
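For instance, if we happen to know every key is exactly 8 bytes, one 64-bit compare replaces a byte-by-byte strcmp. A sketch under that assumption (the memcpy is the portable way to express the load; compilers turn it into a single mov):

```c
#include <stdint.h>
#include <string.h>

/* Assumes both keys are exactly 8 bytes: compare as one 64-bit integer. */
int keys_equal8(const char *a, const char *b)
{
    uint64_t x, y;
    memcpy(&x, a, 8);
    memcpy(&y, b, 8);
    return x == y;
}
```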
Parsing a simple HTTP request that
you know does not carry arbitrary-length headers or a body? It might be
sufficient to parse the first n bytes read from the TCP socket and
check whether they match a pre-baked, hardcoded path, instead of pulling in
all the bloated HTTP machinery (obviously
not advocating this in all cases, but you get the gist...).
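The fast path described above can be this small. The route and function name are hypothetical, and again, this only works because we know what the incoming requests look like:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fast path: if the only route we ever serve is
 * "GET /health", matching the first bytes of the request is enough:
 * no header parsing, no allocation. */
int is_health_check(const char *req, size_t len)
{
    static const char prefix[] = "GET /health ";
    size_t plen = sizeof prefix - 1;
    return len >= plen && memcmp(req, prefix, plen) == 0;
}
```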
When designing data structures and data schemas that are going to be
shared between programs, think about how you could design them to
minimize the work required by parsers and other software
that will be receiving and interpreting the data.
Instead of JSON,
could I use a simple custom binary format that will be much faster
to parse and operate on?
If I generate a numerical id for every instance of an object, could
some bytes of the id hold specific meaning, to make sorting or other
computations faster down the line? (e.g. similarly to how
pointers work in virtual memory systems...)
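One possible layout for such an id, sketched here with made-up field widths: the high 40 bits hold a millisecond timestamp (relative to some custom epoch, since 40 bits of Unix milliseconds would overflow), the low 24 bits a sequence number. Sorting ids numerically then sorts them by creation time for free:

```c
#include <stdint.h>

/* High 40 bits: ms since a custom epoch. Low 24 bits: sequence. */
uint64_t make_id(uint64_t ts_ms, uint32_t seq)
{
    return (ts_ms << 24) | (seq & 0xFFFFFFu);
}

uint64_t id_timestamp(uint64_t id) { return id >> 24; }
uint32_t id_sequence(uint64_t id)  { return (uint32_t)(id & 0xFFFFFFu); }
```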
Are my types full of padding or are they well packed?
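Padding is easy to see with sizeof. The two structs below hold the same three fields; on a typical 64-bit ABI the first layout is 24 bytes and the second 16, because ordering fields largest-first removes most of the alignment padding:

```c
#include <stdint.h>

/* 1 byte, 7 padding, 8 bytes, 1 byte, 7 padding = 24 bytes (typical). */
struct loose { uint8_t a; uint64_t b; uint8_t c; };

/* 8 bytes, 1 byte, 1 byte, 6 padding = 16 bytes (typical). */
struct tight { uint64_t b; uint8_t a; uint8_t c; };
```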
Make your software do the minimum amount of moving data around and
computation. That's the core idea. To do this you need to have some
knowledge of your software and of the OS and hardware it will run on.