3D FFT with data larger than cache

319 views Asked by At

I have searched for an answer to this question but have not found anything that can directly help me.

I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.

My arrays consist of 2^30 data points which is much much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead accessing memory.

Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?

Any help would be much appreciated.

Thanks,

Dylan

3

There are 3 answers

2
TheCppZoo On

I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.

It does not matter that your arrays are very large, 50% misses are excessive. The processor should avoid misses by detecting you are iterating over an array and loading ahead of time the data elements you are likely to use.

Either the pattern of array accesses is not regular and thus the prefetcher in the processor does not figure out a pattern to prefetch, or you have a cache associativy problem, that is, elements in your iteration might be matched to the same cache slots.

For example, assume a cache size of 1Mb and a set associativy of 4. In this example, the cache will map memory using the lower 20 bits to an internal slot. If you stride by 1Mb, that is, your iterations are exactly 1Mb, then the lower 20 bits are always the same and go to the same cache slot, the new element shares the same cache slot as the old one. When you get to the fifth element, all four positions are used up and from then on it is only misses, in such case your cache size is effectively one single slot; if you stride by half the cache size, then the effective number of slots is 2, which might be enough to not have any misses at all or have 100% or anything in between depending on whether your access pattern requires both slots simultaneously or not.

To convince yourself of this, make a toy program with varying stride sizes and you'll see that those that divide or are multiples of the cache sizes increase misses, you can use valgrind --tool=cachegrind

2
bazza On

2^30 data points in a single FFT counts as being quite big!

The data plus the exponentials and the output array is several thousand times bigger than the L3 cache, and millions times bigger than L1.

Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.

I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.

You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.

Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.

GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (32 bit complex values = 2gbytes, same again for the output array, plus exponentials, etc).

3
AudioBubble On

You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.

The rest of this post assumes that the FFT is really at fault and we need to optimize.

The standard trick to get data locality out of an FFT is to

  • Arrange the data in a two-dimensional array
  • Do an FFT along each row
  • Apply twiddle factors
  • Do a matrix transpose
  • Do an FFT along each row

This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.

The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.

Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.

The library you're using really should be doing this already. If it's not, then some options are:

  • Maybe it exposes enough lower level functionality that you can tell it to use Cooley-Tukey in an efficient way even though the high level routines aren't
  • You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.