Bit Packing
Background
Conway's Game of Life is a cellular automaton where each cell in a grid is either 0 or 1. On or off.
Writing an implementation of this in software requires some form of storing the data in memory.
Rationale
Now adays, memory isn't an issue as far as size goes, but memory bandwidth is. Especially when the number of processors in a system is increasing exponentially (1->2->4->8->?) every couple years now.
When I was figuring out how I was going to do this I decided that I wanted to take advantage of every bit I had available, and made the application more CPU bound than memory bound. I wanted this application to be able to scale in performance almost linearly to the number of processors it was running on in a threaded implementation.
I also wanted the data stored in an easy form to unpack and pack where I could rely completely only just two or three to give me data I can manipulate.
XMM Registers are 128 Bit's so let's use them all
I put together a table showing the bit ordering I use. (I used a 32 bit register instead of 128 for simplicity's sake)
bit 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 byte 0 - - - - - - - 1 - - - - - - - 2 - - - - - - - 3 - - - - - - - packed bit 28 24 20 16 12 8 4 0 29 25 21 17 13 9 5 1 30 26 22 18 14 10 6 2 31 27 23 19 15 11 7 3
This is what the mask looks like I use
mask 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
This is optimal because I can extract 8 bytes to do math with very simply. We get the data by doing something like
temp_numbers = ( packed_data >> n ) & mask
If n is 0, then we will have 8-bit integer representations for the bits numbers 0, 1, 2, and 3 from the packed bits. If n is 1 then we get the next four values.
This is important because because with SIMD we can perform 16 8-bit integer operations on a 128bit XMM register.
Doing this with SSE intrinsics with 128bit XMM registers looks a little like this
static inline __m128i unpack( __m128i A, int row ) { const __m128i onemask = _mm_set1_epi8( 0x01 ); __m128i shift = _mm_srli_epi32( A, row ); return _mm_and_si128( shift, onemask ); }
and packing them back together looks like
// MAKE SURE A IS IN THE FORM (0xFF) not (0x01 ) // ( a | mask ) | ( ~mask & b ) static inline __m128i pack( __m128i A, int row ) { const __m128i onemask = _mm_set1_epi8( 0x01 ); A = _mm_and_si128( onemask, A ); return _mm_slli_epi32(A, row); }
I require A to be in the form of 0xFF for two reasons. 1, because the input is after a cmpeq instruction wich sets each byte to be 0x00 or 0xFF.
Also, if the bytes are either 0x00 or 0xFF I can just shift a mask that contains consecutive 0x01s and and with what I want to get the right bits.
Quickly being able to convert from packed to unpacked and back is very important for the speed of this program because it's done so many times and allow us to store 8 times as much data in memory than a conventional implementation that stores each cell as one byte.
Other Advantages
There's some other things that are really convenient. I can shift all 128 bits left or right (in the proper order) using just a few instructions.
I also have a method that uses two instructions to shift them left or right 127 bits so I can just put bit 0 where bit 127 is and vice versa.
for more information check out trunk/helpers.h
