I have some highly performance-sensitive code. A SIMD implementation using SSEn and AVX takes about 30 instructions, while a version that uses a 4096-byte lookup table takes about 8 instructions. In a microbenchmark, the lookup table is faster by 40%. If I microbenchmark while trying to invalidate the cache every 100 iterations, they appear about the same. In my real program, the non-loading (SIMD) version appears to be faster, but it's really hard to get a provably good measurement, and I've had measurements go both ways.
I'm just wondering if there are some good ways to think about which one would be better to use, or standard benchmarking techniques for this type of decision.
Look-up tables are rarely a performance win in real-world code, especially when they're as large as 4 KB. Modern processors can execute computations so quickly that it is almost always faster to just do the computations as needed, rather than trying to cache them in a look-up table. The only exception is when the computations are prohibitively expensive, and that's clearly not the case here, where you're talking about a difference of 30 vs. 8 instructions.
The reason your microbenchmark suggests that the LUT-based approach is faster is that the entire LUT gets loaded into cache and is never evicted. That makes its use effectively free, so you're really comparing executing 8 instructions against 30. Well, you can guess which one will be faster. :-) In fact, you did guess it, and confirmed it with explicit cache invalidation.
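To make that concrete, here is a minimal sketch (GCC/Clang on x86; `lut[]` and `process_lut()` are hypothetical stand-ins for your 4 KB table and your 8-instruction path) that times the same LUT routine with the table left hot in cache versus explicitly flushed with `_mm_clflush` before each call. The gap between the two numbers roughly reflects the cache-residency effect, though the flushed loop also pays for the flush itself, so treat it as an upper bound rather than a realistic penalty.

```c
/* Minimal sketch, not your actual kernel: GCC/Clang on x86.  lut[] and
 * process_lut() are hypothetical stand-ins for the 4 KB table and the
 * 8-instruction lookup path. */
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint8_t lut[4096];                  /* stand-in for the 4 KB table */

static uint32_t process_lut(uint32_t x)    /* stand-in for the LUT-based path */
{
    return lut[x & 0xFFF] + lut[(x >> 12) & 0xFFF];
}

static void flush_lut(void)                /* evict the whole table from cache */
{
    for (size_t i = 0; i < sizeof(lut); i += 64)   /* 64-byte cache lines */
        _mm_clflush(&lut[i]);
    _mm_mfence();
}

int main(void)
{
    volatile uint32_t sink = 0;
    uint64_t t0, t1;

    /* Table stays hot: this is what a naive microbenchmark measures. */
    t0 = __rdtsc();
    for (uint32_t i = 0; i < 1000000; i++)
        sink += process_lut(i);
    t1 = __rdtsc();
    printf("hot cache: %llu cycles\n", (unsigned long long)(t1 - t0));

    /* Table flushed before every call: crude model of a cold cache.
     * The flush itself is counted too, so this overstates the penalty. */
    t0 = __rdtsc();
    for (uint32_t i = 0; i < 1000000; i++) {
        flush_lut();
        sink += process_lut(i);
    }
    t1 = __rdtsc();
    printf("flushed:   %llu cycles\n", (unsigned long long)(t1 - t0));

    (void)sink;
    return 0;
}
```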
In real-world code, unless you're in a very short, tight loop, the LUT will inevitably be evicted from the cache (especially if it's as large as this one, or if a lot of code executes between calls to the routine being optimized), and you'll pay the penalty of re-loading it. You don't appear to have enough independent operations in flight for that penalty to be hidden by speculative or prefetched loads.
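For completeness, this is the situation where prefetching can help, sketched below with hypothetical names: if you had a batch of independent lookups, a `_mm_prefetch` issued a few iterations ahead lets the table loads overlap with other work. With only one or two lookups per call, as in your case, there's nothing to overlap.

```c
/* Illustration only, hypothetical names: with a batch of independent
 * lookups, a prefetch issued a few iterations ahead lets the table loads
 * overlap other work.  With one or two lookups per call there's nothing
 * to overlap. */
#include <xmmintrin.h>   /* _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

static uint8_t lut[4096];   /* stand-in for the 4 KB table */

void process_batch(const uint32_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)   /* fetch the line we'll need ~8 iterations from now */
            _mm_prefetch((const char *)&lut[in[i + 8] & 0xFFF], _MM_HINT_T0);
        out[i] = lut[in[i] & 0xFFF];
    }
}
```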
The other hidden cost of (large) LUTs is that they risk evicting code from the cache, since the outer cache levels (L2 and L3) on most modern processors are unified between instructions and data. Thus, even if the LUT-based implementation is slightly faster in isolation, it runs a very real risk of slowing everything else down. A microbenchmark won't show this. (Actually benchmarking your real code will, so that's always a good thing to do when feasible. If not, read on.)
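If you're on Linux, one way to do that in-situ measurement, sketched below, is to wrap the relevant region of your real program with a hardware cache-miss counter via `perf_event_open(2)` and compare the counts for the two implementations (`run_workload()` is a placeholder for your actual code path). Running `perf stat -e cache-misses` on the whole program gives you a coarser version of the same thing.

```c
/* Linux-only sketch: count last-level cache misses across a region of the
 * real program instead of relying on a microbenchmark.  run_workload() is
 * a placeholder for your actual code path. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void run_workload(void)
{
    /* ... call the LUT-based or SIMD-based implementation here ... */
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* last-level cache misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    run_workload();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```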
My rule of thumb is: if the LUT-based approach is not a clear performance win over the alternative in real-world benchmarks, don't use it. It sounds like that is the case here. If the benchmark results are too close to call, it doesn't matter, so pick the implementation that doesn't bloat your code with a 4 KB table.