I'm in need of high-performance merging and came accross: Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture by Jatin Chhugani et al.
Their aim is to get the most performance out of 1 CPU, one part of their solution is to use a bitonic sorting network on SIMD level. To hide latency of the min/max and shuffle operations they perform 4 sorting networks simultaneously (though I think they meant interleaved.). This gives up to a claimed 3.25x increase of performance.
My problem is somewhat relaxed, I have multiple pairs of arrays which need to be processed (read independent) so I can simply run multiple processes and thus easily gain higher throughput.
Though if I oversubscribe the amount of processes to available cores, does this hide latency as well? but induced on a higher level? Or are we treading here in the realm of hyperthreading and I'll never pass the limit of 2 processes sharing the same functional units in a CPU-core?
I could of course try, but changing the existing code is rather involved and I'd like to hear theories first.
No, threading is not an effective solution to pipeline bubbles. The granularity doesn't fit: Context switching takes hundreds of cycles, whereas the sort of stall caused by a naive implementation of bitonic sorting is in 2-4 cycle pieces.
With that said, it's not clear what your use-case is, or where the bottleneck will turn out to be, so multiprocessing could help. Only one way to find out.