Fastest use of a dataset of just over 64 bytes?

79 views Asked by At

Structure: I have 8 64-bit integers (512 bits = 64 bytes, the assumed cache line width) that I would like to compare to another, single 64-bit integer, in turn, without cache misses. The data set is, unfortunately, absolutely inflexible -- it's already as small as possible.

Access pattern: Each uint64_t is in fact an array of 4x4x4 bits, each bit representing the presence or absence of a voxel. This means sometimes I will be using half of one chunk and half of another, or even corners of 8 different 64-bit chunks.... I guess what this means is there is a high likelihood of a lack of alignment.

How can I do this as fast as possible i.e. without thrashing the cache?

P.S. The idea is that this code will ultimately run on a fairly wide range of architectures of at least a 64B cache line width, so I'd prefer this were absolutely as fast as possible. This also means I can't rely on MOVNTDQA, which anyway may incur a performance hit of it's own inspite of loading the 9th element directly to the CPU.

P.P.S. My knowledge of this area is fairly limited so please take it easy on me. But please spare me the premature optimisation comments; be sure that this is the 3% of this application that really counts.

2

There are 2 answers

1
vlad_tepesch On BEST ANSWER

Are you sure you get cache-misses? Even if the comparing value is not in an register, i think your first uint64 array should be on one cache stage (or what ever it is called) and your other data in another. Your cache surely has some n-way associativity, that prevents your data row from being removed from the cache just by accessing your compare value.

Do not lose your time on Micro Optimizations. Improve your algorithms and data structures.

3
skrrgwasme On

I wouldn't worry about it. If your dataset is really only 9 integers, most of it will likely be stored in registers anyway. Also, there isn't really any way to optimize cache usage without specifying an architecture, since cache structure is architecture dependent. If you can list several target architectures you may be able to find some commonalities that you can optimize toward, but without knowing those architectures, I don't think there's much we can do for you.

Lastly, this seems like a good example of optimizing too early. I would suggest you take the following steps:

  1. Decide what your maximum acceptable run time is
  2. Finish your program in C
  3. Compile for all of your target architectures
  4. For those platforms that don't meet your speed spec, hand-optimize the intermediate assembly files and recompile until you meet your spec.