OpenCV FAST corner detection SSE implementation walkthrough

682 views Asked by At

Could someone help me understanding the SSE implementation of the FAST corner detection in OpenCV? I understand the algorithm but not the implementation. Could somebody walk me through the code?

The code is long, so thank you in advance.

I am using OpenCV 2.4.11 and the code goes like this:

__m128i delta = _mm_set1_epi8(-128);
__m128i t = _mm_set1_epi8((char)threshold);
__m128i m0, m1;
__m128i v0 = _mm_loadu_si128((const __m128i*)ptr);

I think the following have something to do with threshold checking, but can't understand the use of delta

__m128i v1 = _mm_xor_si128(_mm_subs_epu8(v0, t), delta);
v0 = _mm_xor_si128(_mm_adds_epu8(v0, t), delta);

Now it checks the neighboring 4 pixels, but again, what is the use of delta?

__m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta);
__m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta);
__m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta);
__m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta);
m0 = _mm_and_si128(_mm_cmpgt_epi8(x0, v0), _mm_cmpgt_epi8(x1, v0));
m1 = _mm_and_si128(_mm_cmpgt_epi8(v1, x0), _mm_cmpgt_epi8(v1, x1));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x1, v0), _mm_cmpgt_epi8(x2, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x1), _mm_cmpgt_epi8(v1, x2)));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x2, v0), _mm_cmpgt_epi8(x3, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x2), _mm_cmpgt_epi8(v1, x3)));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x3, v0), _mm_cmpgt_epi8(x0, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x3), _mm_cmpgt_epi8(v1, x0)));
m0 = _mm_or_si128(m0, m1);

Here it checks the continuity of the neighboring pixels. (Right?)

int mask = _mm_movemask_epi8(m0);
if( mask == 0 )
    continue;

This is another puzzle for me. Why shifting 8 bytes to the left? I assume the mask tells me the location of the corner candidate, but why 8 bytes?

if( (mask & 255) == 0 )
{
    j -= 8;
    ptr -= 8;
    continue;
}

I gave up at this point...

__m128i c0 = _mm_setzero_si128(), c1 = c0, max0 = c0, max1 = c0;
for( k = 0; k < N; k++ )
{
    __m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(ptr + pixel[k])), delta);
    m0 = _mm_cmpgt_epi8(x, v0);
    m1 = _mm_cmpgt_epi8(v1, x);

    c0 = _mm_and_si128(_mm_sub_epi8(c0, m0), m0);
    c1 = _mm_and_si128(_mm_sub_epi8(c1, m1), m1);

    max0 = _mm_max_epu8(max0, c0);
    max1 = _mm_max_epu8(max1, c1);
}

max0 = _mm_max_epu8(max0, max1);
int m = _mm_movemask_epi8(_mm_cmpgt_epi8(max0, K16));

for( k = 0; m > 0 && k < 16; k++, m >>= 1 )
    if(m & 1)
    {
        cornerpos[ncorners++] = j+k;
        if(nonmax_suppression)
            curr[j+k] = (uchar)cornerScore<patternSize>(ptr+k, pixel, threshold);
    }
2

There are 2 answers

2
akarsakov On BEST ANSWER

As harold said, delta is used to make unsigned comparsion.

Let's describe this implementation by steps:

  1. __m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta); __m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta); __m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta); __m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta); m0 = _mm_and_si128(_mm_cmpgt_epi8(x0, v0), _mm_cmpgt_epi8(x1, v0)); m1 = _mm_and_si128(_mm_cmpgt_epi8(v1, x0), _mm_cmpgt_epi8(v1, x1)); m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x1, v0), _mm_cmpgt_epi8(x2, v0))); ......

Here it's not checking of 4 neighboring pixels. It checks 4 points, for example, like this: enter image description here

  1. Here they check that "corner condition" is true for this 4 points, because if it's not true there are no 8 neighboring pixels that satisfy "corner condition", so it's not corner pixel. If mask is zero it means that all pixels in vector can't be corner so we shift left for 16 pixels.
int mask = _mm_movemask_epi8(m0);
if( mask == 0 )
    continue;
  1. If mask is not zero, but for first 8 pixels "corner condition" is not true they shift left only for 8 pixels to check remain pixels on next iteration.
if( (mask & 255) == 0 )
{
    j -= 8;
    ptr -= 8;
    continue;
}
  1. And final step. Here they count number of neighboring pixels which are greater than x + threshold to c0 counter and which are less than x - threshold to c1 counter.

Here generating mask for such conditions:

__m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(ptr + pixel[k])), delta);
m0 = _mm_cmpgt_epi8(x, v0);
m1 = _mm_cmpgt_epi8(v1, x);

Note that if condition is true for element of vector his value set to 0xFF or -1 since we treat him as signed char.

c0 = _mm_and_si128(_mm_sub_epi8(c0, m0), m0); 
c1 = _mm_and_si128(_mm_sub_epi8(c1, m1), m1);

If element of mask is -1 it accumulates to c0 or c1 counter since of substraction (for example c0 - (-1)) . But if it equal to zero they reset counter to zero (_mm_and_si128).

Than they need to store maximum value of counters:

max0 = _mm_max_epu8(max0, c0);
max1 = _mm_max_epu8(max1, c1);

So they store maximum number of neighboring pixels which satisfy "corner condition".

Here they determine which pixels are actually corners and which are not:

max0 = _mm_max_epu8(max0, max1);
int m = _mm_movemask_epi8(_mm_cmpgt_epi8(max0, K16));

for( k = 0; m > 0 && k < 16; k++, m >>= 1 )
    if(m & 1)
    {
        cornerpos[ncorners++] = j+k;
        if(nonmax_suppression)
            curr[j+k] = (uchar)cornerScore<patternSize>(ptr+k, pixel, threshold);
    }

I hope it will help. I'm sorry for my bad English.

1
harold On

delta is a mask in which only the signbits are set. They use it because they want to compare for greater than unsigned, but there is only a signed comparison.

Adding 128 (or subtracting it, because -128 == 128) and xoring with it do the same (if you're working with bytes), because

a + b == (a ^ b) + ((a & b) << 1)

and if b only has the top bit set, the ((a & b) << 1) term must be zero (a & b can have the top bit set, but it's shifted out).

Then as you can see in the diagram below, subtracting 128 "shifts" the entire range down in such a way that a signed comparison will give the same result as an unsigned comparison would have given on the original range.

         |0 ... 127 ... 255|  unsigned
|-128 ... 0 ... 127|          signed

I don't know about the rest, I hope someone else can answer that.