Branch Prediction
All modern microarchitectures are pipelined. This means the CPU works on several instructions at once, each at a different stage of execution.
To keep the pipeline full when it reaches a branch (perhaps one produced by an if statement in your code), the CPU has to guess whether or not the branch will be taken before the condition has actually been computed. This is called branch prediction.
When a branch prediction is wrong, the entire pipeline has to be flushed. This can cost 30 or more cycles on a P4, or about half that on a Pentium M or Core 2 architecture. That is a big performance loss if you have a lot of mispredicted branches in an inner loop.
I recommend you visit my Wikipedia links that I cross-referenced if you haven't learned this in school, because my explanations are awful.
Mitigating Branch Misses
The easiest and best way to avoid branch mispredictions is to avoid branches.
Conway's Game of Life requires either MIN/MAX or some sort of equality test to calculate whether or not a cell will be populated. In standard x86 this requires conditional jumps, but I found a way to avoid them altogether using SSE instructions.
There's a byte-wise equality instruction (PCMPEQB in SSE2) that sets each result byte to 0xff if the corresponding compared bytes are equal, and to 0x00 if they are not.
Generally in Conway's Game of Life you would have something like this:
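As a minimal sketch of that comparison on its own, the function below uses the `_mm_cmpeq_epi8` intrinsic from `<emmintrin.h>` to compare 16 bytes at once. The name `equal_mask16` and the array-based interface are made up for illustration and are not taken from the project.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] becomes 0xFF where a[i] == b[i], 0x00 otherwise. */
void equal_mask16(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va   = _mm_loadu_si128((const __m128i *)a);  /* unaligned load of 16 bytes */
    __m128i vb   = _mm_loadu_si128((const __m128i *)b);
    __m128i mask = _mm_cmpeq_epi8(va, vb);               /* 16 byte-wise compares, no jumps */
    _mm_storeu_si128((__m128i *)out, mask);
}
```

The result is a mask rather than a flag, so it can be combined with other masks using AND/OR instructions instead of being tested with a conditional jump.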
if( sum_of_surrounding == 3 || ( cell_is_populated && sum_of_surrounding == 2 ) ) { alive = true; }
This would generate several conditional jumps in the assembly.
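For comparison, here is a rough scalar sketch of the same rule written with bitwise operators instead of the short-circuiting `||` and `&&`. The function names are hypothetical, and whether a given compiler actually emits jumps for either form depends on the compiler and its optimization settings, so treat this as an illustration of the idea rather than a guaranteed win.

```c
#include <stdbool.h>

/* The rule as written in the text: short-circuit operators may compile to jumps. */
bool next_state_branchy(bool cell_is_populated, int sum_of_surrounding)
{
    if (sum_of_surrounding == 3 || (cell_is_populated && sum_of_surrounding == 2)) {
        return true;
    }
    return false;
}

/* Same rule with bitwise | and &: both comparisons are always evaluated,
   so there is nothing that needs to be skipped with a jump. */
bool next_state_branchless(bool cell_is_populated, int sum_of_surrounding)
{
    return (sum_of_surrounding == 3)
         | (cell_is_populated & (sum_of_surrounding == 2));
}
```

Both functions return the same value for every input; the second simply gives the compiler no control flow to translate.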
This code, taken from trunk/block.c, demonstrates how I got around using branches with the _mm_cmpeq_ intrinsics. (It also calculates 16 cells at once instead of 1.)
sum1 = _mm_cmpeq_epi8( sum3, populate );
sum3 = _mm_add_epi8( unpack( self, j ), sum3 );
sum3 = _mm_cmpeq_epi8( sum3, populate );
sum3 = _mm_or_si128( sum1, sum3 );
temp = _mm_or_si128( temp, pack( sum3, j ) );
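The snippet above depends on the project's own `unpack` and `pack` helpers and on variables set up elsewhere in block.c. The following is a self-contained sketch of what appears to be the same trick, under two assumptions that are mine rather than the source's: that `populate` holds the value 3 in every byte, and that `sum3` starts out as the count of live neighbours excluding the cell itself. Under those assumptions, "neighbours == 3" covers birth and survival with three neighbours, and "self + neighbours == 3" covers a live cell with two neighbours. The function name `life_rule16` and its one-byte-per-cell interface are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical sketch: apply the Life rule to 16 cells at once.
   self[i]       : 1 if cell i is populated, 0 otherwise
   neighbours[i] : number of live neighbours of cell i (0..8), excluding itself
   next[i]       : receives 1 if cell i is alive in the next generation, else 0 */
void life_rule16(const uint8_t self[16], const uint8_t neighbours[16],
                 uint8_t next[16])
{
    const __m128i three = _mm_set1_epi8(3);   /* assumed role of `populate` */
    const __m128i one   = _mm_set1_epi8(1);

    __m128i s = _mm_loadu_si128((const __m128i *)self);
    __m128i n = _mm_loadu_si128((const __m128i *)neighbours);

    /* 0xFF where exactly three neighbours are alive. */
    __m128i three_neighbours = _mm_cmpeq_epi8(n, three);

    /* 0xFF where self + neighbours == 3: either a live cell with two
       neighbours, or a dead cell with three (already covered above). */
    __m128i total_is_three = _mm_cmpeq_epi8(_mm_add_epi8(s, n), three);

    /* Combine the two masks, then reduce 0xFF/0x00 to 1/0 per cell. */
    __m128i alive = _mm_or_si128(three_neighbours, total_is_three);
    alive = _mm_and_si128(alive, one);

    _mm_storeu_si128((__m128i *)next, alive);
}
```

The whole rule is two compares, one add, and one OR for 16 cells, with no conditional jump anywhere. The final AND is only there because this sketch stores one 0/1 byte per cell; the original code instead hands the mask to `pack`.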
