I'm currently considering an n x n matrix M of 64-bit integer elements stored in main memory in row-major order. I have an L1 data cache of 16KB split into 64B blocks (no L2 or L3). My code is meant to print out each element of the array one at a time, traversing the matrix in either row-first or column-first order.
In the case where n = 16 (i.e. 16 x 16 matrix), I've counted 0 cache misses using both row-first order and column-first order since the matrix M fits entirely in the 16KB cache (it never needs to jump to main memory to fetch an element). How would I deal with the case of, say, n = 256 (256 x 256 matrix of 64-bit ints); i.e. when M doesn't fully fit in the cache? Do I count all the ints that don't fit as misses, or can spatial locality be leveraged somehow? Assume the cache is initially empty.
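The "0 cache misses" figure only holds if M is already in the cache when you start, which contradicts your assumption that the cache is initially empty. Starting cold, the 16 x 16 matrix occupies 16 * 16 * 8 B = 2 KB = 32 blocks (assuming M is block-aligned), so either traversal order incurs 32 compulsory misses and every other access hits.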
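For the 256x256 case, you need to simulate how the cache behaves. The matrix is 256 * 256 * 8 B = 512 KB, far larger than the cache, so you must take misses to bring in the missing entries. But you don't count one miss per int that doesn't fit: each miss brings in a whole 64 B block, i.e. the requested int plus the 7 other ints that share its block. In row-first order those are exactly the elements you read next, so you miss once per 8 elements, for 65536 / 8 = 8192 misses. In column-first order consecutive accesses are 2 KB apart, so whether a block is still cached when you come back to it depends on the cache's associativity and replacement policy, which you haven't specified.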
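To make "simulate how the cache behaves" concrete, here is a minimal sketch in Python. The question doesn't state the associativity or replacement policy, so this assumes LRU replacement, a cold cache, and a matrix that starts at address 0 (block-aligned), and it takes the associativity as a parameter. The name `count_misses` and its parameters are made up for illustration.

```python
from collections import OrderedDict


def count_misses(n, order, cache_bytes=16 * 1024, block_bytes=64,
                 elem_bytes=8, ways=1):
    """Count misses for one traversal of an n x n row-major matrix.

    Assumptions (not given in the question): cold cache, LRU replacement,
    matrix base address 0. ways=1 models a direct-mapped cache;
    ways = number of blocks models a fully associative one.
    """
    num_sets = (cache_bytes // block_bytes) // ways
    # One OrderedDict per set, holding resident block numbers in LRU order.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for a in range(n):
        for b in range(n):
            # Row-first: inner loop walks a row. Column-first: walks a column.
            i, j = (a, b) if order == "row" else (b, a)
            block = (i * n + j) * elem_bytes // block_bytes
            lines = sets[block % num_sets]
            if block in lines:
                lines.move_to_end(block)       # hit: mark most recently used
            else:
                misses += 1
                if len(lines) == ways:
                    lines.popitem(last=False)  # evict least recently used
                lines[block] = True
    return misses


print(count_misses(16, "row"), count_misses(16, "col"))  # 32 32
print(count_misses(256, "row"))                          # 8192
print(count_misses(256, "col"))                          # 65536 (direct-mapped)
print(count_misses(256, "col", ways=256))                # 8192 (fully associative)
```

Under these assumptions, column-first order on a direct-mapped cache misses on every access, because rows 8 apart map to the same set and evict each other before the neighbouring column is read. With a fully associative LRU cache the 256 blocks touched by one column just fit, so the 7 neighbouring ints in each block are still there when you reach them.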