I was recently asked about a piece of code that decimates/downsamples an array "in place". This decimate function takes an array of ints and, for every even index i, stores the entry at index i at index i/2. It does this for all entries in the array.
This moves all even-indexed entries of the original array into its first half; the rest of the array can then be set to 0. The overall result is an array whose first half holds all the even-indexed entries of the original (in order) and whose second half is 0. For example, [a, b, c, d, e, f] becomes [a, c, e, 0, 0, 0]. This is apparently used to downsample signals in signal processing.
The code looks something like this:
void decimate(vector<int>& a) {
    int sz = a.size();
    for (int i = 0; i < sz; i++) {
        if (i % 2 == 0) {
            a[i/2] = a[i];
        }
    }
    for (int i = (sz-1)/2; i < sz; i++) a[i] = 0;
}
After suggesting basic improvements that keep certain variables in registers, I can't find any further way to optimize it, but I'm not sure whether that means it can't be done.
Are there ways to optimize the memory access pattern in the loop for better cache performance? Or any other way to speed up the main copy operation that compresses/downsamples the array into its first half (e.g. by vectorization on platforms that support it)?
for (int i = 0; i < sz; i++) {
    if (i % 2 == 0) {
        a[i/2] = a[i];
    }
}
Are there any loop transformations (such as tiling/strip-mining) that can lead to highly efficient code for such a decimate loop?
EDIT: A few of the answers below take advantage of memset/std::fill or pointer arithmetic to gain speed. This question is mainly focused on whether there are well-defined loop transformations that can significantly improve locality or reduce cache misses (e.g. if this were a loop nest of two loops, one could look into loop tiling to optimize cache misses).
Here is a version using pointer arithmetic and placement new, which exploits the fact that std::vector uses a contiguous memory layout internally:
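(The original code block appears to have been lost; the following is a sketch of the idea, reconstructed from the description above, so the exact original may differ.)

```cpp
#include <new>
#include <vector>

// Compact every second element to the front with raw pointers,
// then zero the tail using placement new.
void decimate(std::vector<int>& a) {
    int* src = a.data();                  // reads every second element
    int* dst = a.data();                  // writes the compacted front
    int* const end = a.data() + a.size();
    for (; src < end; src += 2)
        *dst++ = *src;                    // move even-indexed entries forward
    for (; dst < end; ++dst)
        new (dst) int(0);                 // construct 0 in the remaining slots
}
```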
On my machine this code runs 3 times as fast as yours with optimizations disabled, and about 30% faster than your version when compiled with -O3 on GCC 7.2. I tested this with a vector of 20,000,000 elements.
And I think that in your version the line

    for (int i = (sz-1)/2; i < sz; i++) a[i] = 0;

should be

    for (int i = (sz+1)/2; i < sz; i++) a[i] = 0;

otherwise too many elements will be set to zero (for odd sz, index (sz-1)/2 still holds a copied entry).
Taking into account John Zwinck's question, I did some quick tests with memset and std::fill instead of placement new. memset turns out to be a little faster on large vectors and std::fill a little faster on smaller ones, but the difference is very small.
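For reference, the tail-zeroing variants I compared look roughly like this (a reconstruction, not the exact benchmark code; the compaction loop is identical in both, only the zeroing differs):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Variant 1: zero the second half with memset.
void decimate_memset(std::vector<int>& a) {
    std::size_t sz = a.size();
    for (std::size_t i = 0; i < sz; i += 2)
        a[i / 2] = a[i];                  // compact even indices to the front
    std::size_t half = (sz + 1) / 2;      // number of elements kept
    std::memset(a.data() + half, 0, (sz - half) * sizeof(int));
}

// Variant 2: zero the second half with std::fill.
void decimate_fill(std::vector<int>& a) {
    std::size_t sz = a.size();
    for (std::size_t i = 0; i < sz; i += 2)
        a[i / 2] = a[i];
    std::fill(a.begin() + (sz + 1) / 2, a.end(), 0);
}
```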