Simple speed up of C++ OpenMP kernel

212 views Asked by At

I have never worked with OpenMP or optimization of C++, so all help is welcome. I'm probably doing some very stupid things that slow down the process drastically. It doesn't need to be the fastest, but I think some easy tricks will significantly speed it up. Anyone? Thanks a lot!

This function calculates the standard deviation of a patch, given a kernel size and greyscale OpenCV image. The middle pixel of the patch is kept if it is below the given threshold, else it is rejected. This is done for each pixel except the border.

#include "stdafx.h"
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/photo/photo.hpp"
#include <stdlib.h>
#include <stdio.h>
#include "utils.h"
#include <windows.h>
#include <string.h>
#include <math.h>
#include <numeric>

using namespace cv;
using namespace std;
Mat low_pass_filter(Mat img, int threshold, int kernelSize)
{
    unsigned char *input = (unsigned char*)(img.data);
    Mat output = Mat::zeros(img.size(), CV_8UC1);
    unsigned char *output_ptr = (unsigned char*)(output.data);

    #pragma omp parallel for
    for (int i = (kernelSize - 1) / 2; i < img.rows - (kernelSize - 1) / 2; i++){
        for (int j = (kernelSize - 1) / 2; j < img.cols - (kernelSize - 1) / 2; j++){
            double sum, m, accum, stdev;
            vector<double> v;
            // Kernel Patch
            for (int kx = i - (kernelSize - 1) / 2; kx <= i + (kernelSize - 1) / 2; kx++){
                for (int ky = j - (kernelSize - 1) / 2; ky <= j + (kernelSize - 1) / 2; ky++){
                    v.push_back((double)input[img.step * kx + ky]);//.at<uchar>(kx, ky));
                }
            }
            sum = std::accumulate(std::begin(v), std::end(v), 0.0);
            m = sum / v.size();

            accum = 0.0;
            std::for_each(std::begin(v), std::end(v), [&](const double d) {
                accum += (d - m) * (d - m);
            });

            stdev = sqrt(accum / (v.size() - 1));
            if (stdev < threshold){
                output_ptr[img.step * i + j] = input[img.step * i + j];
            }
        }
    }
    return output;
}
1

There are 1 answers

5
Toby Speight On BEST ANSWER

Vector v is not required. Instead of adding items to it, maintain accumulators of d and d*d, and then use variance = E(v²) / E(v)² so that your inner code becomes:

        double sum = 0;
        double sum2 = 0;
        int n = kernelSize * kernelSize;
        // Kernel Patch
        for (int kx = ...) {
            for (int ky = ...) {
                sum += d;
                sum2 += d*d;
            }
        }

        double mean = sum/n;
        double stddev = sqrt(sum2/n - mean*mean);
        if (stddev < threshold) {
            ...;
        }

After that, consider that the sum of elements centred around (x+1,y) can be found from the result for (x,y) simply by subtracting all the elements in the previous left-hand column, and adding all the elements in the new right-hand column. An analogous operation works vertically.

Also, check your compiler options - are you auto-vectorizing loops, and using SIMD instructions (if available)?