Can multiple processes hide latency of SSE instructions?


I need high-performance merging and came across "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture" by Jatin Chhugani et al.

Their aim is to get the most performance out of one CPU. One part of their solution is to use a bitonic sorting network at the SIMD level. To hide the latency of the min/max and shuffle operations, they run 4 sorting networks simultaneously (though I think they mean interleaved). They claim this gives up to a 3.25x performance increase.
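To make the interleaving idea concrete, here is a minimal sketch of it, not the paper's actual code. It merges two sorted 4-float runs with a three-level bitonic network in SSE registers and applies each level to 4 independent pairs before moving on, so the min/max/shuffle of one network can overlap with the latency of the others. The function name `bitonic_merge_4x`, the constant `K`, and the memory layout (K runs of 4 floats in `a` and `b`, K runs of 8 floats in `out`) are assumptions made for illustration. Whether the instructions actually stay interleaved also depends on the compiler's scheduling and the CPU's out-of-order engine.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_min_ps, _mm_max_ps, shuffles */

#define K 4  /* number of independent merge networks (hypothetical choice) */

/* Merge K independent pairs of sorted 4-float runs.
 * a[4*k .. 4*k+3] and b[4*k .. 4*k+3] must each be sorted ascending;
 * out[8*k .. 8*k+7] receives the merged, ascending run for pair k. */
void bitonic_merge_4x(const float *a, const float *b, float *out)
{
    __m128 lo[K], hi[K];

    /* Level 1: reverse b so that (a, reversed b) is bitonic, then
     * compare-exchange element i with element i+4. */
    for (int k = 0; k < K; k++) {
        __m128 va = _mm_loadu_ps(a + 4 * k);
        __m128 vb = _mm_loadu_ps(b + 4 * k);
        vb = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(0, 1, 2, 3)); /* b3 b2 b1 b0 */
        lo[k] = _mm_min_ps(va, vb);
        hi[k] = _mm_max_ps(va, vb);
    }

    /* Level 2: compare-exchange at distance 2 inside each half. */
    for (int k = 0; k < K; k++) {
        __m128 p = _mm_movelh_ps(lo[k], hi[k]); /* L0 L1 H0 H1 */
        __m128 q = _mm_movehl_ps(hi[k], lo[k]); /* L2 L3 H2 H3 */
        lo[k] = _mm_min_ps(p, q);
        hi[k] = _mm_max_ps(p, q);
    }

    /* Level 3: compare-exchange at distance 1 (neighbouring elements). */
    for (int k = 0; k < K; k++) {
        __m128 ul = _mm_unpacklo_ps(lo[k], hi[k]);
        __m128 uh = _mm_unpackhi_ps(lo[k], hi[k]);
        __m128 p = _mm_movelh_ps(ul, uh);
        __m128 q = _mm_movehl_ps(uh, ul);
        lo[k] = _mm_min_ps(p, q); /* elements 0, 2, 4, 6 of the result */
        hi[k] = _mm_max_ps(p, q); /* elements 1, 3, 5, 7 of the result */
    }

    /* Re-interleave and store the K merged runs. */
    for (int k = 0; k < K; k++) {
        _mm_storeu_ps(out + 8 * k,     _mm_unpacklo_ps(lo[k], hi[k]));
        _mm_storeu_ps(out + 8 * k + 4, _mm_unpackhi_ps(lo[k], hi[k]));
    }
}
```

Within a single network every level depends on the result of the previous one, so the dependency chain is serial. The K networks share no data, which is what lets a core overlap them without any help from the OS scheduler.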

My problem is somewhat more relaxed: I have multiple pairs of arrays that need to be processed (read: independently), so I can simply run multiple processes and easily gain higher throughput.

But if I oversubscribe, running more processes than there are available cores, does this hide latency as well, only at a higher level? Or are we in the realm of hyperthreading here, so that I'll never get past the limit of 2 processes sharing the same functional units in a CPU core?

I could of course try, but changing the existing code is rather involved and I'd like to hear theories first.


There are 2 answers

2
Sneftel On BEST ANSWER

No, threading is not an effective solution to pipeline bubbles. The granularity doesn't fit: context switching takes hundreds of cycles, whereas the stalls caused by a naive implementation of bitonic sorting last only 2-4 cycles each.

With that said, it's not clear what your use-case is, or where the bottleneck will turn out to be, so multiprocessing could help. Only one way to find out.

0
Paul R On

I've done some experiments with this, and the benefit of HT seems to be marginal - on the one hand you see some small improvements from hiding latency, but on the other hand you double the pressure on cache usage and FSB bandwidth (and double the memory contention too). In some cases I've seen a small gain, in others a small reduction in performance - it all depends on memory access pattern and cache footprint, but from what I've seen HT doesn't really help much overall.

Having said that, there may be cases where code that isn't particularly well optimised as far as memory access patterns are concerned gets some benefit from HT, but if you haven't optimised usage of the cache/memory hierarchy then SSE optimisation is probably premature anyway.