I'm looking for a fast and memory efficient approach for implementing Conway's Game of Life.
Constraints: a 96x128 board, approximately 2kB RAM available and 52MHz processor (see the tech specs here: http://www.getinpulse.com/features).
My current naive solution that represents each cell as a single bit in a matrix (96*128/8=1,536 bytes) works but is too slow. What tricks can be used to improve performance?
Storing the coordinates of live cells (for example in this implementation http://dotat.at/prog/life/life.html) would use too much memory.
Looks like a fun piece of hardware. Storing 1 bit per pixel of a 96x128 pixel display gives 12,288 bits. That's already over half of the 16,384 bits you say are "available". If you can't even store 2 bits per cell, there's not a lot of room to do much.
A few ideas:
You have a 32-bit processor. There are several bit-twiddling tricks to take a bitmap and calculate the number of neighbors of several cells in parallel on such a processor.
It's often faster to store the neighbor counts, incrementing all 8 neighbors at birth and decrementing all 8 neighbors at death, rather than recalculating the number of neighbors from scratch every generation -- but it doesn't look like you have enough memory for this approach.
Perhaps you could use 2x2 pixels per cell (so only 3,072 cells visible) or 3x3 pixels per cell (so only 1,376 cells visible), so your software does less work and so gives the illusion of running faster. (This also frees up enough RAM that you can do the neighbor count mentioned above).
Since many life patterns have large areas of dead space, perhaps have a small bitmap of "live regions" -- say, a 12x16 array of bits, where each bit is "0" if the corresponding 8x8 cell region is entirely dead, and "1" if any of the cells in the corresponding region is alive. The next-generation update only needs to look at live regions and the 1-cell-wide border of dead regions; you can skip checking the 6x6 cell core of dead regions. Also, you can completely skip the entire 8x8 cell region if its 4 nearest neighboring regions are also dead.
Since many life patterns have large areas of static unchanging "still life" patterns, perhaps have a small bitmap of "dynamic regions" -- say, a 12x16 array of bits, where each bit is "0" if the corresponding 8x8 cell region had no changes in the last generation update, and "1" if any of the cells in the corresponding region changed in the last update. The next generation update only needs to look at dynamic regions and the 1-cell-wide border of static regions; you can skip checking the 6x6 cell core of static regions, since it will be the same in the next generation.
If a pattern is "large enough", a representation based on Gosper's Hashlife may be able to store it in less RAM than storing a bitmap directly. Alas, I suspect you're far below the "large enough" threshold.